A mobile phone company had a problem with its billing database locking up. The company was running Oracle Real Application Clusters with Storage Foundation. The storage device would stop processing, and IT would have to restart the machine, a process that could take an hour and a half. Meanwhile, billing processing would remain at a standstill. This was especially problematic because the Christmas shopping season was approaching, when mobile phone companies do most of their business.
Dave, a Symantec support engineer with more than 20 years' experience, immediately suspected a problem with I/O (input and output). "It looked like we were losing [data] packets that were meant to go to the disk," he says.
From an hour and a half to two minutes
His first action was to instruct the company's IT staff on some tuning modifications. Changing the write-throttling parameters reduced the number of I/O packets that could remain unaccounted for in the switches before new packets were sent. This adjustment produced a marked improvement. Rather than needing a restart, the system would hang for about two minutes, then continue as usual. "That was a way better situation than when we started, but we knew it wasn't fixed," Dave says. "We knew we still had an underlying problem."
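The effect of that kind of throttling can be pictured with a toy model. The sketch below is illustrative only: `ThrottledWriter` and `max_outstanding` are hypothetical names, not the tunables Dave actually changed. It assumes a simple rule in which a new write goes out only while the number of unacknowledged writes is below a limit.

```python
from collections import deque


class ThrottledWriter:
    """Toy model of write throttling (hypothetical, not a vendor API).

    At most `max_outstanding` writes may be unacknowledged at once;
    further writes wait in a queue until an acknowledgement arrives.
    """

    def __init__(self, max_outstanding):
        if max_outstanding < 1:
            raise ValueError("max_outstanding must be at least 1")
        self.max_outstanding = max_outstanding
        self.in_flight = set()   # writes sent but not yet acknowledged
        self.pending = deque()   # writes held back by the throttle

    def submit(self, write_id):
        """Send the write if the window has room; otherwise queue it.

        Returns True if the write was sent immediately.
        """
        if len(self.in_flight) < self.max_outstanding:
            self.in_flight.add(write_id)
            return True
        self.pending.append(write_id)
        return False

    def acknowledge(self, write_id):
        """Record an acknowledgement and release one queued write, if any.

        Returns the id of the write that was released, or None.
        """
        self.in_flight.discard(write_id)
        if self.pending and len(self.in_flight) < self.max_outstanding:
            released = self.pending.popleft()
            self.in_flight.add(released)
            return released
        return None
```

In this model, a smaller limit means fewer writes are exposed to loss at any moment, which is consistent with the shorter hangs the team saw. It also shows why the change was only a workaround: a write that is never acknowledged still occupies a slot.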
Since changing throttling parameters had helped, Dave was now convinced that missing I/O packets were the likely culprit. But why were they going missing? It's the kind of thing that just shouldn't happen, Dave says. "Either you have a connection or you don't. You should not have this situation where it's working partially. But I was seeing empirical evidence that it was working partially."
The only way to find out would be to "instrument" the Storage Area Network—that is, hook it up to another device that would show precisely how things were working inside. When the mobile company hesitated, other vendors helped make the case. The company's Sun storage tech confirmed Dave's observations. Even better, HP, which supported the storage device, brought in a pair of fibre channel analyzers—$80,000 pieces of equipment. "It allows you to crack the fibre channel packet and look at the SCSI commands and protocol inside," Dave says.
Lost in the switches
HP began capturing traces during the two-minute hangs and discovered that the problem occurred when hosts in the SAN failed to receive acknowledgement from the disk array that a packet had been received. Meanwhile, the array failed to see certain write-read sequences requested by the hosts. "It was almost like asking about someone you'd never heard of: you'd get the same kind of response, a blank stare for a bit until conversation picked back up," Dave says.
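The kind of check such a trace makes possible can be sketched in a few lines. The record layout here (`host`, `exchange`, `kind`) is invented for illustration; real analyzers decode fibre channel frames in their own formats, and this is not HP's procedure. The idea is simply to pair each command with its acknowledgement and report whatever is left over.

```python
def unacknowledged(trace):
    """Return the commands in `trace` that never received an acknowledgement.

    `trace` is a list of dicts with hypothetical keys:
      host     - name of the host that issued the command
      exchange - identifier the host assigned to the exchange
      kind     - "command" or "ack"
    """
    outstanding = {}
    for event in trace:
        key = (event["host"], event["exchange"])
        if event["kind"] == "command":
            outstanding[key] = event      # waiting for an acknowledgement
        elif event["kind"] == "ack":
            outstanding.pop(key, None)    # acknowledged, no longer outstanding
    return list(outstanding.values())
```

Run against captures taken on both sides of the switches, a comparison like this would show commands that left a host but never appeared at the array, which is the pattern described above.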
The problem appeared to be in the switched network itself, so HP called in Brocade, which had supplied the switches for the SAN. Brocade brought in even more instrumentation and tested each of the two redundant SAN fabrics separately, then performed a firmware upgrade on each. With that, the hangs stopped.
An active partner
Fixing the hangup problem freed up about 30 of the client's tech staff, all of whom had been spending at least some of their time on the problem. And, impressed that Dave worked so hard to help solve a problem that did not originate with Symantec software, the company has begun treating both him and Symantec more as active partners. "I get to look at systems before they upgrade them," he says. "And they're asking for me specifically—it's kind of scary."