Backup & Recovery

In some cases, frontline technical support encounters an issue it just doesn’t have the resources to solve. In a recent case, a large, multinational electronics manufacturing customer called in with NetBackup problems. A number of its backup processes had stalled, leaving running backups unfinished and new backups unable to start. The two environments affected were extensive—literally thousands of backup jobs executing across the enterprise through more than a dozen media servers. The affected backup jobs did not appear to hang at any particular point in the process nor were the same jobs affected every time. After collecting data, frontline support escalated the issue to the backline support team.

Hung processes

Debbie, a Symantec NetBackup support specialist, took the lead in tracking down the intermittent stalled backups. One of the first things that caught her attention was the customer’s use of numerous scripts tracking media allocations on several of their servers, but Debbie did not see a consistent pattern that isolated the hangs to one particular server or job. “Frontline support did an excellent job collecting data, though,” says Debbie. “By running PSTACK to trace running processes, they captured all the needed thread data.” Debbie sent the PSTACK thread results to engineering, and the fault was quickly isolated.

“Some of the NBRB processes were definitely hung,” recalls Debbie. “You would expect these processes to be actively processing, but a number of them were inactive for over an hour.” Debbie compiled the data and her notes on the case and forwarded everything to the engineering support team. Engineering suspected that the large number of commands run by the customer’s scripts was driving the NBRB process into a thread deadlock, which in turn hung the application.
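
For readers unfamiliar with the term, a thread deadlock of the general kind engineering suspected can be sketched in a few lines of Python. This is a generic illustration only, not NetBackup or NBRB code: the function `run_workers`, the two locks, and the worker names are all hypothetical. Two workers each need both locks. When they take them in opposite order, each ends up holding the lock the other needs, and neither can proceed. The timeout exists only so that the demonstration terminates; a real deadlocked thread would wait indefinitely.

```python
import threading


def run_workers(consistent_order: bool, timeout: float = 0.2) -> dict:
    """Run two workers that each need both locks and report how each one fared."""
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    # The barrier is used only in the opposite-order case, to make sure both
    # workers hold their first lock before either one asks for its second.
    rendezvous = None if consistent_order else threading.Barrier(2)
    outcome = {}

    def worker(name, first, second):
        with first:
            if rendezvous is not None:
                rendezvous.wait()
            # A real deadlock would block here forever; the timeout is demo-only.
            got_second = second.acquire(timeout=timeout)
            if rendezvous is not None:
                # Keep holding the first lock until the other worker has also
                # finished its attempt, as both would in a genuine deadlock.
                rendezvous.wait()
            if got_second:
                second.release()
            outcome[name] = "completed" if got_second else "stalled"

    # Worker 2 either mirrors worker 1's lock order or reverses it.
    second_order = (lock_a, lock_b) if consistent_order else (lock_b, lock_a)
    threads = [
        threading.Thread(target=worker, args=("worker-1", lock_a, lock_b)),
        threading.Thread(target=worker, args=("worker-2", *second_order)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print("opposite lock order:", run_workers(consistent_order=False))
    print("consistent lock order:", run_workers(consistent_order=True))
```

Acquiring locks in one agreed order is a common way to rule out this particular pattern. The article does not say what the actual fix in the new binary was, so the consistent-order variant here should be read as a textbook remedy rather than a description of the patch.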

To Debbie’s delight, engineering developed new code within hours.

“This was an extraordinarily fast development of new binary code,” explains Debbie. “Only six hours after I passed the information to engineering, they delivered new code to our test labs.” Debbie set up the test bed and ran the binary under configurations and circumstances similar to those at the customer site, loading the test servers with a similar volume of the commands that had been triggering the condition. The new binary worked in the lab, and an initial install on a single customer server worked as well. After applying the fix to its affected master servers, the customer was able to run without fear of an application hang and to keep protecting the data of thousands of its employees.

Fast resolution

“It worked,” Debbie says. No NBRB processes or backup jobs hung after the patch was installed, and data backup across the enterprise continued successfully.

The entire process, from the handoff by the frontline support staff to the installation of new binary code, took just two days. “We are lucky to have the engineers we do,” says Debbie. There is no denying that engineering did a tremendous job, yet every step in the fast resolution depended on teamwork. Frontline support knew when it needed to escalate, and it provided the right set of data to the backline team so that the issue could be escalated quickly to the engineering Customer Focus Team (CFT).

Debbie’s examination of the resource traces before the handoff to engineering gave them a head start, and her thorough testing of the new code in the lab enabled the customer to install the patch with confidence. “The customer was extremely pleased,” Debbie remembers. “All backup systems were working again.”

After this case was resolved, another customer contacted technical support with the same backup-locking problem. Thanks to the collaboration among Debbie, frontline support, and engineering, this customer and perhaps many others now have a quick fix for stalled backups.