After upgrading to NetBackup 6.5, a customer began to notice that his backup disk was filling up too quickly and backup jobs were failing. The customer contacted Symantec tech support because he thought the problem was a result of the upgrade.
After talking with the customer, Rich, a senior tech support engineer, began to troubleshoot the problem and quickly identified what was causing the failures. The backup disk was filling up too quickly because expired backup images were not being cleaned up. The expired images included both pre-upgrade and post-upgrade backups. Initially, Rich thought the expired version 5 backups weren't being cleaned up because of a known issue he had dealt with in the past, so he spent several hours testing his theory in the lab. However, that issue was not present in the customer's system, so Rich was able to clean up the expired files successfully.
Rich then turned his attention back to the image cleanup failures. Every time a cleanup job failed, it produced a specific status code. Upon researching the status code, Rich discovered that a technical article about to be published on the Symantec site addressed the cause of that specific failure: corruption in the policy database. An extra file that didn't belong in the policy database was causing the image cleanup process to fail. Once Rich identified the file and removed it, everything worked perfectly.
Second problem, second solution
While troubleshooting the backup failures, Rich discovered another problem that had to do with legacy version 5 images. A number of these images were stored on the customer's disk, but they didn't have any corresponding entries in the image database. NetBackup was seeing these "rogue fragments" as files on the disk, but couldn't identify them. Rich designed a method to determine which files were rogue fragments that didn't have corresponding database entries. He then documented the procedure so that the customer could delete the fragments. "This issue has been around as long as we've had disk storage units," says Rich. "These fragments should have been removed long ago, but the customer had to spend some time locating the rogue fragment files and deleting them." Rich explained to the customer that several things can cause rogue fragments to be created (the disk being offline when images expire, permissions issues related to stored files, or other disk issues), but they are not created during normal operation of NetBackup.
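The article does not describe Rich's actual procedure, but the underlying idea, comparing the files on disk against the fragments the image database knows about, can be sketched in a few lines. The following is a minimal illustration only, not the documented Symantec method. It assumes the known fragment file names have already been exported to a plain text file, one per line; the function names and the listing format are hypothetical.

```python
from pathlib import Path


def load_known_fragments(listing_file):
    """Read one fragment file name per line, skipping blank lines.

    The listing format is an assumption made for this sketch; it stands in
    for whatever export of the image database is available.
    """
    with open(listing_file, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def find_rogue_fragments(storage_dir, known_fragments):
    """Return files under storage_dir whose names are absent from known_fragments.

    Only file names are compared, so entries in known_fragments may be bare
    names or full paths. Nothing is deleted; the result is a report only.
    """
    known = {Path(name).name for name in known_fragments}
    rogue = []
    for path in sorted(Path(storage_dir).rglob("*")):
        if path.is_file() and path.name not in known:
            rogue.append(path)
    return rogue
```

A sketch like this only reports candidates. Each file on the resulting list would still need to be reviewed before anything is removed, since a file missing from the export is not necessarily safe to delete.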
Prior to upgrading, many customers run a NetBackup catalog consistency (NBCC) check to iron out inconsistencies among the image database, volume database, and media database so that the upgrade goes smoothly. The customer asked Rich why the NBCC didn't catch these rogue images. Rich explained that the NBCC isn't designed to catch them, but at the time of this service call a patch was already under development to include this check in the NBCC.
The entire process of troubleshooting the problem and resolving it took Rich less than 48 hours. After the problems were taken care of, he continued to monitor the backups for the next three weeks to be sure the cleanup was happening as it was supposed to. The customer was very happy that Rich was able to provide crisp, concise explanations for the various problems and develop a plan to work through each issue. Rich received high praise from the customer for his communication and troubleshooting capabilities, and there have been no backup failures since.