Monitoring of NetBackup engineering software development infrastructure

NetBackup 8.1 was announced at the Veritas Vision conference and released near the end of September 2017, with many important new features. In this article, we’ll take a look behind the scenes at one part of the development effort: the monitoring of NetBackup’s internal software development (build and test) infrastructure.

The NetBackup engineering business unit is one of the largest business units of Veritas. A few scrum teams work together to develop and maintain the software build and test infrastructure and tools that the larger engineering group uses to develop NetBackup.

The build and test scrum teams provide many automated services that are expected to run on regular schedules or be highly available – 24 hours a day, 7 days a week. To support the development of NetBackup 8.1 so that it could ship on schedule and with high quality, the various teams improved the monitoring of this key infrastructure.

The goals for monitoring include the following:

  • Send out a notification when an automated service fails to function, to ensure the issue can be resolved early, with minimal downtime.
  • Gather more metrics data to know what is failing and how often failures affect our processes and infrastructure, which allows for better debugging and prevention of future errors.

Without such monitoring, the build and test teams must rely on engineers to report problems as they encounter them. Proactive monitoring by the build and test teams increases development productivity.

Monitoring Technical Implementation

The four main teams that worked on monitoring will be referred to by these pseudonyms for the purpose of this article:

  • Team A
  • Team B
  • Team C
  • Team D

These are the main software applications for monitoring used across the build and test teams:

  • Slack, our favorite real-time messaging solution.
  • Prometheus, a time-series database for monitoring data.
  • Grafana, a dashboard tool for displaying time-series data.
  • AlertManager, which sends alerts when conditions are met, e.g. on outages.
  • Docker, which deploys software via “containers”, a lightweight virtualization mechanism.

These individual software applications are configured and deployed into our intranet to provide the monitoring and alerting services. Some teams had additional or different software applications that they used, as detailed below. In particular, some of the teams used significantly different techniques for deploying the live monitoring services.
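To show how these components fit together, here is a minimal, hypothetical Prometheus alerting rule of the kind AlertManager would route to a Slack channel. The metric name, threshold, and labels are invented for illustration and are not taken from any team’s actual configuration.

```yaml
groups:
  - name: build-infrastructure
    rules:
      # Hypothetical rule: fire if the last rsync of third-party sources
      # has been exiting non-zero for 15 minutes on any build machine.
      - alert: ExtsrcRsyncFailed
        expr: extsrc_rsync_last_exit_code != 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "rsync of third-party sources is failing on {{ $labels.instance }}"
```

AlertManager then matches labels such as severity in its own configuration to decide which Slack channel (or other receiver) gets the notification.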

Note that some of the data in the following images are not real and are shown as examples only. For example, Torvalds does not work at this company; we put his name in to anonymize the names of other people who might not want to be as famous.

Team A

Example of Team A’s Slack integration for an alert using AlertManager.

image001.png

Deployment: Previously, deploying the monitoring infrastructure was a manual process, with the steps documented on an intranet wiki page. Team A developed automated deployment in the IAC (Infrastructure as Code) style using Ansible.

Additional software components: Ansible.

Development effort: Mid-to-high priority, a few team members worked on this for more than one sprint.

Services monitored

  • rsync extsrc: Monitors the synchronization (via rsync) of third-party dependencies to software build machines and sends alerts on rsync errors.
  • Scheduled regression testing launch: Monitors the launching of all scheduled regression tests.
  • NFS failures: Monitors software build machines for NFS access failures.
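As an illustration of the rsync monitoring above, here is a minimal Python sketch. It is not Team A’s actual implementation (which was deployed with Ansible); the function names are ours, while the exit-code classification follows rsync’s documented exit values.

```python
import subprocess

# rsync exit codes that usually indicate transient problems rather than a
# broken configuration: 23/24 = partial transfer, 30 = I/O timeout.
TRANSIENT_EXIT_CODES = {23, 24, 30}

def classify_rsync_exit(code):
    """Map an rsync exit code to an alert severity."""
    if code == 0:
        return "ok"
    if code in TRANSIENT_EXIT_CODES:
        return "warning"
    return "critical"

def sync_extsrc(src, dest):
    """Run one rsync of third-party sources and classify the result.

    A real monitor would record the severity as a metric or forward it to
    AlertManager instead of simply returning it.
    """
    result = subprocess.run(["rsync", "-a", "--delete", src, dest])
    return classify_rsync_exit(result.returncode)
```

Distinguishing transient from hard failures keeps a single flaky transfer from paging anyone, while repeated or configuration-level errors still raise a critical alert.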

Results

  • Development status: Minimum viable product. The original requirements were satisfied, but the implementation had room for improvement.
  • The rsync monitor has proven to be very useful in preventing job failures.
  • The scheduled regression test launch monitor has prevented a number of launch failures that would otherwise have produced no regression test results report.
  • With the help of the monitoring data, the system administration team and Team A were able to identify system configurations that caused NFS issues.

Team B

Deployment: IAC style using Ansible.

Additional software components: Ansible.

Development effort: Low priority; roughly one team member worked on this for one sprint.

Services monitored

  • Jenkins CI/CD server.
  • Ansible controller machine, used for executing IAC.
  • The monitoring server itself.

Results

  • Development status: Team B only reached a proof-of-concept with their monitoring.

Team C

Example of Team C’s monitoring dashboard using Grafana.

image002.png

Deployment: Zero-downtime deployments with Fabric; the service runs as a microservice packaged in Docker container images.

Additional software components: Fabric.

Development effort: Advanced; all team members participated. An excellent development/testing environment and documentation were developed.

Custom software components

  • The testing environment for the monitoring infrastructure: custom software was written for fault injection within a Docker development environment so that the monitoring environment could be tested without waiting for real failure data.
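To make the fault-injection idea concrete, here is a hypothetical, much-simplified sketch in Python. Team C’s actual tooling ran inside a Docker development environment; this only shows the core trick of flipping healthy records into synthetic failures so alert rules can be exercised without waiting for real outages.

```python
import random

def inject_faults(records, failure_rate, seed=None):
    """Flip a fraction of passing test-suite records to synthetic failures.

    `records` is a list of dicts with a "verdict" key; `failure_rate` is the
    probability that a passing record becomes a "fail-injected" one. A seed
    makes the injection reproducible in a test environment.
    """
    rng = random.Random(seed)
    faulty = []
    for record in records:
        verdict = record["verdict"]
        if verdict == "pass" and rng.random() < failure_rate:
            verdict = "fail-injected"
        faulty.append({**record, "verdict": verdict})
    return faulty
```

Feeding the injected records into the monitoring stack lets a team confirm that dashboards change and alerts fire before any real failure ever occurs.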

Services monitored

  • Time-aggregated test suite pass/fails, categorized by verdict.
  • Build/test steps running for over 24 hours.
  • Build/test steps queued for too long.
  • Source checkout failures.
  • Bullseye code coverage runs.
  • Above-average queue times for the nightly scheduled regression tests.

Results

  • Development status: Well-designed and complete.
  • The monitoring development and testing environment streamlined the upgrade to Prometheus 2, which included breaking changes. Issues could be debugged in Docker containers, and the upgrade was pushed to production with no disruption to monitoring.
  • The entire team was able to work on and learn the monitoring.
  • The resulting monitoring helped detect and fix some issues before tickets were filed, mainly database problems in the status of build/test jobs.

Team D

Example of Team D’s Slack integration for an alert using AlertManager.

image003.png

Deployment: Docker container images are pushed to Artifactory and deployed in a Docker Swarm using Rancher.

Additional software components: Ansible, Artifactory, Rancher.

Development effort: Basic to medium; a few team members worked on a few stories to set up minimum viable product monitoring.

Custom software components

  • A Prometheus exporter written in Python, packaged in a Docker container.
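Team D’s exporter itself is internal, but a minimal stdlib-only Python exporter in the same spirit might look like the following sketch. The metric name, mount points, and port are assumptions for illustration, not the team’s actual values.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical NFS mount points to report on.
NFS_MOUNTS = ["/mnt/extsrc", "/mnt/builds"]

def render_metrics(mounts):
    """Render free-space gauges in the Prometheus text exposition format."""
    lines = [
        "# HELP nfs_free_bytes Free space on an NFS share in bytes.",
        "# TYPE nfs_free_bytes gauge",
    ]
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        lines.append('nfs_free_bytes{mount="%s"} %d' % (mount, usage.free))
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(NFS_MOUNTS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run the exporter (Prometheus would then scrape http://host:9101/metrics):
#   HTTPServer(("", 9101), MetricsHandler).serve_forever()
```

Packaging a script like this in a Docker container, as Team D did, makes the exporter easy to deploy alongside the rest of the monitoring stack.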

Services monitored

  • Free space on NFS shares.
  • Build/test machine resource pool capacity.

Results

  • Development status: Minimum viable product.
  • The NFS free space alerting helped proactively address infrastructure issues that would have caused outages in the build and test environment.
  • The resource pool capacity monitor alerted the team to resource pool health anomalies, helping the team to proactively address resource issues.

NetBackup 8.1 End Game

The monitoring initiatives had a positive impact on finalizing the NetBackup 8.1 release. Some initiatives were small, others very helpful; nevertheless, every bit of help counted, especially with the engineers working hard and placing an extra load on our build and test infrastructure and tools for this release. NetBackup 8.1 shipped on schedule and with excellent quality, and a bonus was awarded to NetBackup Engineering for this important release. The monitoring initiatives truly helped the build and test teams achieve one of their culture goals:

"We are partners with engineering, with our skin in the game, in all situations."

This article covered mainly the technical details of our monitoring implementation leading up to the NetBackup 8.1 release. There were a few complications and lessons learned from these monitoring initiatives, especially along the cultural dimension, and monitoring didn’t stop at the end of the NetBackup 8.1 release. These additional points will be covered in more depth in a follow-up article, for which this article sets the stage.

Thanks to our contributors and the many people who have worked on the monitoring implementation.

Also special thanks to the people who have worked to make this article possible. A few names are worth mentioning for posterity:

Writers/Editors: Carlos Fitts, Andrew Makousky, Dinesh Shenoy.

Reviewers: Christopher Engesser, Carlos Fitts, Michael Hauglid, Brad Krusemark, Mitchell Then, Ingrit Tota, Jou Vang, VOX community organizers.

External links