Many servers fail at once
A financial company suddenly started having trouble with its servers—a large number of them, all at once. Instead of starting up normally, they would begin loading and just hang. The company was using Veritas Storage Foundation 4.0 to manage its data storage and maximize use of its hardware, and Veritas Cluster Server to guard against disruptions. But with so many servers failing at once, clustering was no help. There simply weren't enough operational servers for failover to work.
The company called Symantec for support, and because the problem was business critical, it was escalated quickly. It soon came to the attention of John, a Symantec engineer with more than eight years' experience providing support. Because he had also provided support at IBM, he was especially familiar with the AIX operating system running this company's servers.
John began by having the customer perform a dump of the system and upload it to a Symantec FTP site. A review of the log revealed the problem. The customer had just upgraded the operating system on its servers from IBM AIX 5L for POWER Version 5.2 to Version 5.3, and had run into trouble with the Storage Foundation daemon vxconfigd.
Veritas Volume Manager (a component of Storage Foundation) uses vxconfigd to keep track of the hardware configuration the application manages. But because the operating system upgrade changed some file system paths in AIX, vxconfigd was no longer linking to the hardware correctly, and that's what was causing the servers to hang.
Back to normal in 10 minutes
There were three ways to deal with the problem, John says, and the first two would have avoided it entirely. First, the customer's IT staff could have uninstalled Storage Foundation, completed the operating system upgrade, and then reinstalled the application. In that case, vxconfigd would automatically have linked to the hardware correctly. Second, they could have installed a patch to Storage Foundation to update the links before performing the upgrade.
The third solution, and the only one that would work without starting the upgrade over from scratch, was to use an ln command (a file linking command in UNIX and its variants) to restore the hardware links. With John's help, that's what this customer did. "I gave them instructions on how to link vxconfigd correctly, using the ln command," he says. "Everything was back to normal within 10 minutes."
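The exact commands John gave the customer are not recorded here, so the following is only a minimal sketch of the general technique: pointing an old path at a relocated resource with ln. It assumes a symbolic link (ln -s), which the account above does not specify. Every path in it is a hypothetical stand-in created in a scratch directory, and none is a real vxconfigd or AIX device path.

```sh
#!/bin/sh
# Illustration only: every path below is hypothetical and lives in a scratch
# directory. None of them is a real vxconfigd or AIX device path.
set -e
workdir=$(mktemp -d)

# Stand-in for the location a resource moved to after the upgrade.
mkdir -p "$workdir/new_location"
echo "device stand-in" > "$workdir/new_location/resource"

# Stand-in for the old location that a daemon still expects to find.
mkdir -p "$workdir/old_location"

# Create a link at the old path that points to the new one.
# -s makes the link symbolic; -f replaces a stale entry if one exists.
ln -sf "$workdir/new_location/resource" "$workdir/old_location/resource"

# Confirm that the old path now resolves to the relocated resource.
ls -l "$workdir/old_location/resource"
cat "$workdir/old_location/resource"
```

Because the link only redirects a path, nothing has to be reinstalled, which fits a repair that took minutes rather than a repeat of the upgrade.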
The entire process—from when the customer first contacted Symantec Support Services, through getting the system dump, diagnosing the problem, and providing instructions for fixing it—took less than eight hours. "They were very happy with the support," John says. "And they have more trust in our product, too. They saw that if a problem arises, the support team is really behind them."