NetBackup 8.1 was announced at the Veritas Vision conference and released near the end of September 2017, with many important new features. In this article, we look at one small part of the development effort behind that release: monitoring of the internal NetBackup software development (build and test) infrastructure.
The NetBackup engineering business unit is one of the largest business units of Veritas. A few scrum teams work together to develop and maintain the software build and test infrastructure and tools that the larger engineering group uses to develop NetBackup.
The build and test scrum teams provide many automated services that are expected to run on regular schedules or be highly available – 24 hours a day, 7 days a week. In order to support the development of NetBackup 8.1 so that it could be shipped on schedule and with quality, the various teams improved the monitoring of this key infrastructure.
The goals for monitoring include the following:
Without such monitoring, the build and test teams rely on engineers to report problems when they are encountered. Monitoring by the build and test teams increases development productivity.
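The two expectations described above (services that run on a regular schedule, and services that must stay available) correspond to two simple kinds of checks. The following is a minimal sketch of that idea in Python, not the teams' actual implementation. The `ScheduledJob` model, the function names, the job names, and the 15-minute grace period are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


@dataclass
class ScheduledJob:
    """A service expected to complete on a regular schedule (hypothetical model)."""
    name: str
    expected_interval: timedelta
    last_success: Optional[datetime]  # None means the job has never succeeded


def overdue_jobs(jobs: List[ScheduledJob], now: datetime,
                 grace: timedelta = timedelta(minutes=15)) -> List[str]:
    """Return the names of jobs whose last success is older than interval + grace."""
    late = []
    for job in jobs:
        if job.last_success is None:
            late.append(job.name)
        elif now - job.last_success > job.expected_interval + grace:
            late.append(job.name)
    return late


def unavailable_services(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each availability probe; a probe that raises counts as unavailable."""
    down = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False
        if not healthy:
            down.append(name)
    return down
```

A scheduler could call both functions periodically and send the returned names to an alerting channel, so that the build and test teams learn about a problem before an engineer has to report it.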
Monitoring Technical Implementation
The four main teams that worked on monitoring will be referred to by these pseudonyms for the purpose of this article:
These are the main software applications for monitoring used across the build and test teams:
These individual software applications are configured and deployed into our intranet to provide the monitoring and alerting services. Some teams had additional or different software applications that they used, as detailed below. In particular, some of the teams used significantly different techniques for deploying the live monitoring services.
Note that some of the data in the following images is not real and serves as an example only. For example, Torvalds does not work at this company; we used his name to anonymize the names of other people who might not want to be as famous.
Team A
Example of Team A’s Slack integration for an alert using AlertManager.
Deployment: Previously, deploying the monitoring infrastructure was a manual process, with the steps documented on an intranet wiki page. Team A developed automated deployment in the IAC (Infrastructure As Code) style using Ansible.
Additional software components: Ansible.
Development effort: Mid-to-high priority, a few team members worked on this for more than one sprint.
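To illustrate the data flow behind a Slack alert like the one mentioned above, here is a minimal Python sketch that turns an Alertmanager-style webhook payload into the JSON body accepted by a Slack incoming webhook. This is an assumption-laden illustration, not Team A's configuration: Alertmanager ships with a built-in Slack receiver, so custom code like this is usually unnecessary. The function name and the alert name `BuildQueueStalled` are hypothetical.

```python
from typing import Any, Dict


def slack_message_from_alerts(payload: Dict[str, Any]) -> Dict[str, str]:
    """Build a Slack incoming-webhook body ({"text": ...}) from an alert payload.

    The payload is assumed to follow the Alertmanager webhook shape: a top-level
    "alerts" list whose items carry "status", "labels", and "annotations".
    """
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        summary = annotations.get("summary", "no summary provided")
        lines.append(f"[{status}] {name}: {summary}")
    if not lines:
        lines.append("Alert notification received with no alerts attached.")
    return {"text": "\n".join(lines)}


# Hypothetical example payload, for illustration only.
example_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "BuildQueueStalled"},
            "annotations": {"summary": "No builds finished in 2 hours"},
        }
    ],
}
message = slack_message_from_alerts(example_payload)
```

The resulting dictionary could be serialized with `json.dumps` and sent in an HTTP POST to a Slack incoming-webhook URL.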
Services monitored
Results
Team B
Deployment: IAC style using Ansible.
Additional software components: Ansible.
Development effort: Low priority, only about one team member worked on this for one sprint.
Services monitored
Results
Team C
Example of Team C’s monitoring dashboard using Grafana.
Deployment: Zero-downtime deployments with Fabric. Microservice via Docker container images.
Additional software components: Fabric.
Development effort: Advanced, all team members participated. Excellent development/testing environment and documentation developed.
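The general idea behind a zero-downtime deployment of a containerized service can be sketched as follows: start the new version next to the old one, wait until it reports healthy, switch traffic, and only then stop the old version. The Python below is a hedged illustration of that ordering only. It does not use Fabric's API and is not Team C's deployment script; every function and parameter name here is hypothetical, and the actual steps are passed in as callables.

```python
import time
from typing import Callable


def zero_downtime_deploy(
    start_new: Callable[[], None],
    is_healthy: Callable[[], bool],
    switch_traffic: Callable[[], None],
    stop_old: Callable[[], None],
    stop_new: Callable[[], None],
    max_checks: int = 5,
    pause: Callable[[], None] = lambda: time.sleep(1),
) -> bool:
    """Deploy a new version without interrupting the running one.

    Returns True if traffic was switched to the new version, or False if the
    new version never became healthy and was rolled back.
    """
    start_new()  # the old version keeps serving while the new one starts
    for _ in range(max_checks):
        if is_healthy():
            switch_traffic()  # route requests to the new version
            stop_old()        # retire the old version only after the switch
            return True
        pause()               # wait before the next health check
    stop_new()  # roll back: the old version was never touched
    return False
```

Because the old version is stopped only after the health check passes and traffic has moved, a failed deployment leaves the running service as it was.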
Custom software components
Services monitored
Results
Team D
Example of Team D’s Slack integration for an alert using AlertManager.
Deployment: Docker container images are pushed to Artifactory and deployed in a Docker Swarm using Rancher.
Additional software components: Ansible, Artifactory, Rancher.
Development effort: Basic/medium, a few team members worked on a few stories to set up minimum viable product monitoring.
Custom software components
Services monitored
Results
NetBackup 8.1 End Game
The monitoring initiatives had a positive impact on finalizing the NetBackup 8.1 release. Some initiatives were small, while others were very helpful. Every bit of help counted, especially while the engineers were working hard and placing an extra load on our build and test infrastructure and tools for this release. NetBackup 8.1 was shipped on schedule and with excellent quality, so a bonus was awarded to NetBackup Engineering for this important release. The monitoring initiatives truly helped the build and test teams achieve one of their culture goals:
"We are partners with engineering, with our skin in the game, in all situations."
This article mainly covered the technical details of our monitoring implementation leading up to the NetBackup 8.1 release. There were a few complications and lessons learned from these monitoring initiatives, especially in the cultural dimension, and monitoring didn’t stop with the NetBackup 8.1 release. These additional points will be covered in more depth in a follow-up article, and this article sets the stage for that further discussion.
Thanks to our contributors and the many people who have worked on the monitoring implementation.
Also special thanks to the people who have worked to make this article possible. A few names are worth mentioning for posterity:
Writers/Editors: Carlos Fitts, Andrew Makousky, Dinesh Shenoy.
Reviewers: Christopher Engesser, Carlos Fitts, Michael Hauglid, Brad Krusemark, Mitchell Then, Ingrit Tota, Jou Vang, VOX community organizers.
External links