MSDP status 2074, but volume is not down
(Reposting here, didn't mean for this to be posted under PureDisk...)
Does anyone here have any experience yet using Cisco UCS blades as NBU media servers?
We are migrating our existing Oracle x86 and Solaris SPARC media server hardware, which connects to Data Domains over NFS, to Cisco UCS blades running Red Hat Linux with Fibre Channel attached storage, using NBU's Media Server Deduplication Pools. We are running NBU 7.5.0.5 on all servers, and most clients are running 7.5.0.4.
We have been experiencing sporadic status 2074 errors during heavy backup loads on one MSDP. But the MSDP is not actually down, and the jobs auto-retry 1 to 2 hours later and succeed. I am not manually upping the MSDP, and I can't find any indication that anything else is cycling the services or rebooting the server.
I do not find a storaged.log or spoold.log on the server. A little while ago I uncommented the logging line in /usr/openv/lib/ost-plugins/pd.conf to enable PureDisk logging and cycled NBU on that server, so I don't have any logs from that yet.
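For reference, enabling the OST plugin logging amounts to uncommenting lines like these in pd.conf (the path and log level here are examples from my setup; check the comments in your own pd.conf for the correct values):

```
# /usr/openv/lib/ost-plugins/pd.conf on the media server
# Uncomment to enable PureDisk plugin debug logging (example values)
DEBUGLOG = /var/log/puredisk/pdplugin.log
LOGLEVEL = 10
```

NBU has to be cycled on the media server for the change to take effect, which is why I don't have logs yet.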
If you have worked with this combination of hardware/OS/NBU, do you have any advice on UCS tuning or settings? We're also having other issues with TCP tuning (socket reads and writes failing).
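In case it helps frame the TCP question, the kind of kernel socket buffer tuning we have been experimenting with on the Red Hat media servers looks like this (these values are illustrative starting points for a 10 GbE network, not a recommendation):

```
# /etc/sysctl.conf -- example TCP socket buffer tuning (illustrative values)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Apply without a reboot: sysctl -p
```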
This was written to handle things exactly like this:
http://www.symantec.com/docs/TECH156743
The reason for this, if the errors are intermittent, is that the Disk Polling Service (DPS) checks the status of the disk and then marks it up or down depending on whether or not it gets a response. The default timeout is 60 seconds. So when the disk is under load, the network interface is overwhelmed, or the system is in any other way experiencing a performance issue that delays the response to DPS, the disk gets marked as down (status 2074, although other status codes can be reported as well).
If this is due to performance issues you can easily tell, because the device will come back up within a few minutes and backups will start running again. That of course makes sense: as soon as all the backups fail, performance returns to normal, the server responds to DPS within 60 seconds again, and NetBackup marks the device as up.
I'm not going to get into infrastructure issues that can cause this, but anything that has a shared architecture, like blade servers, VMs, etc., is going to be more susceptible to this type of behavior unless it has dedicated resources (dedicated bandwidth, network interface, raw mapped devices, etc.). Also, as noted, the manufacturer of the network interface you are using can make a huge difference. It has been known for some time that some cards require you to shut off offloading or other features to get the expected performance.
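As a sketch of what shutting off those features looks like on Red Hat: offloads can be disabled persistently in the interface config file, or on the fly with ethtool (the interface name and which offloads to disable depend on your NIC and driver, so treat these as examples):

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 (interface name is an example)
# Disable segmentation/receive offloads when the interface comes up
ETHTOOL_OPTS="-K eth0 tso off gso off gro off"

# Or apply immediately, without a restart:
#   ethtool -K eth0 tso off gso off gro off
```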
This is why the DPS timeouts work so well. When the server is under load and not returning a response as quickly as you would like, a longer timeout allows more time to get a response from the server before marking it as down.
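The DPS timeout settings from the technote go in bp.conf on the media server; a sketch (800 seconds is the value the technote suggests, but tune it for your environment):

```
# /usr/openv/netbackup/bp.conf on the MSDP media server
# Extend the Disk Polling Service proxy timeouts (in seconds)
DPS_PROXYDEFAULTSENDTMO = 800
DPS_PROXYDEFAULTRECVTMO = 800
```

NetBackup services on the media server need to be restarted for the new timeouts to take effect.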
The above technote tries to cover a large amount of territory in a small amount of space, but the solutions at the bottom cover quite a few, if not all, of the reasons you may see this. I need to update it, as it is a little out of date. We do now allow iSCSI for MSDP as long as it's a dedicated 10 Gb iSCSI network. I also need to add 2074 to the list of status codes that can cause this.
Thanks
Let me know if you have further questions.