Forum Discussion

RonCaplinger
12 years ago

MSDP status 2074, but volume is not down

(Reposting here, didn't mean for this to be posted under PureDisk...)

 

Does anyone here have any experience yet using Cisco UCS blades as NBU media servers?

We are migrating from our existing Oracle x86 and Solaris SPARC media servers, which connect to Data Domains over NFS, to Cisco UCS blades running Red Hat Linux with Fibre Channel-attached storage and NBU's Media Server Deduplication Pools (MSDP).  We are running NBU 7.5.0.5 on all servers, and most clients are running 7.5.0.4.

We have been experiencing sporadic status 2074 errors during heavy backup loads on one MSDP.  But the MSDP is not actually down, and the jobs auto-retry 1-2 hours later and succeed.  I am not manually bringing the MSDP back up, and I can't find any indication that anything else is cycling the services or rebooting the server. 

I cannot find a storaged.log or spoold.log on the server.  I just uncommented the logging line in /usr/openv/lib/ost-plugins/pd.conf to enable PureDisk logging (shown below) and cycled NBU on that server a little while ago, so I don't have any logs from that yet.
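
For anyone searching later: enabling the OST plugin log just meant uncommenting the debug-log entries in pd.conf and cycling NetBackup.  Roughly like this on our servers (the DEBUGLOG/LOGLEVEL entry names are my reading of the standard pd.conf template; your log path and level may differ):

    # Check which logging entries are present / still commented out
    grep -iE "DEBUGLOG|LOGLEVEL" /usr/openv/lib/ost-plugins/pd.conf

    # After uncommenting DEBUGLOG (log file path) and LOGLEVEL, restart NetBackup on this server
    /usr/openv/netbackup/bin/bp.kill_all
    /usr/openv/netbackup/bin/bp.start_all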

If you have worked with this combination of hardware/OS/NBU, do you have any advice on UCS tuning or settings?  We're also having other issues with TCP tuning (socket reads/writes failing).

 

  • This was written to handle things exactly like this:

    http://www.symantec.com/docs/TECH156743

    If the errors are intermittent, the reason is that the Disk Polling Service (DPS) checks the status of the disk and then marks it up or down depending on whether or not it gets a response. The default timeout is 60 seconds. So when the disk is under load, the network interface is overwhelmed, or the system is in any other way experiencing a performance issue that delays the response to DPS, the disk gets marked as down (status 2074, though other status codes can be reported as well).

    If this is due to performance issues you can easily tell, because the device will come back up within a few minutes and backups will start running again. That of course makes sense: as soon as all the backups fail, performance returns to normal, the server answers DPS within 60 seconds again, and NetBackup marks the device as up. (A quick way to check and reset the volume state from the command line is sketched at the end of this reply.)

    I'm not going to get into every infrastructure issue that can cause this, but anything with a shared architecture, such as blade servers, VMs, etc., is going to be more susceptible to this type of behavior unless it has dedicated resources (dedicated bandwidth, network interface, raw-mapped devices, etc.). Also, as noted, the manufacturer of the network interface you are using can make a huge difference, and it has been known for some time that some cards require you to turn off offloading or other features to get them to perform as expected.

    This is why the DPS timeouts work so well: when the server is under load and not returning a response as quickly as you would like, they allow a longer period of time to get a response before marking the device as down. 

    The above technote tries to cover a large amount of territory in a small amount of space, but the solutions at the bottom cover most if not all of the reasons you may see this. I need to update it, as it is a little out of date: we do now allow iSCSI for MSDP as long as it is a dedicated 10 Gb iSCSI network, and I also need to add 2074 to the list of status codes this can produce.

    Thanks 

    Let me know if you have further questions.
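
    For reference, a rough sketch of how to confirm from the command line whether NetBackup really considers the disk volume down, and how to bring it back up manually. The disk pool name below is a placeholder; substitute your own (the MSDP volume is normally called PureDiskVolume).

        # Show the current state of all PureDisk (MSDP) disk volumes
        /usr/openv/netbackup/bin/admincmd/nbdevquery -listdv -stype PureDisk -U

        # If a volume shows DOWN but the storage itself is healthy, set it UP manually
        /usr/openv/netbackup/bin/admincmd/nbdevconfig -changestate -stype PureDisk \
            -dp your_msdp_pool -dv PureDiskVolume -state UP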

  • I've not worked on such a system, but I have seen this behaviour on heavily loaded systems.

    The All Log Entries report tends to show it clearly, with entries showing the disk pool going up and down like a yo-yo!

    Mine have all been sorted out by adding the following to the Master and MSDP Media Servers:

    /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

    or for Windows:

    <install path>\veritas\netbackup\db\config\DPS_PROXYDEFAULTRECVTMO

    and entering a value of 800 into the file (sketched at the end of this reply)

    Hope this helps
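
    For completeness, a minimal sketch of creating that touch file on the Linux side, assuming the default /usr/openv install path:

        # Create the DPS receive-timeout touch file on the master and the MSDP media servers
        echo 800 > /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO

        # Verify the value
        cat /usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO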

  • Great suggestion, but I already set the DPS_* touch files when these servers were built.  Oddly, those files are not being backed up on this particular media server when it backs itself up; the SIZE_DATA_BUFFERS* and NUMBER_DATA_BUFFERS* files, in that same directory, are being backed up, though. 

    Those touch files *are* being backed up on the other new media servers, just not this one.  And last night's backups did not encounter any 2074s.  Something's odd here.

  • Do you just have that one DPS_* touch file or the other two as well?

    I have found it works best when only the one exists

    Any exclusions set on the server that doesn't back them up?

  • I have all three DPS_* touch files added.  I'll delete the other two (roughly as sketched at the end of this reply); but I didn't have any jobs fail with 2074s last night, so I don't know exactly what happened there or whether I can confirm this has fixed the issue.

    I don't see anything in the exclude list that skips just those files but backs up the buffer files.
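
    For what it's worth, the cleanup I have in mind looks roughly like this (default /usr/openv paths assumed; review the listing before removing anything):

        # See which DPS_* touch files exist on this media server
        ls -l /usr/openv/netbackup/db/config/DPS_*

        # Keep only DPS_PROXYDEFAULTRECVTMO and remove the rest
        for f in /usr/openv/netbackup/db/config/DPS_*; do
            [ "$(basename "$f")" != "DPS_PROXYDEFAULTRECVTMO" ] && rm -v "$f"
        done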

  • OK - I guess you won't be able to tell until you have had a clean run with the full backups running while the system is at maximum load.

    Let it run with just the one DPS file in place and see how it goes.

    Not sure why it doesn't back up those files.

  • We've been testing some of the various TCP settings available and think we may have a workaround, since we couldn't pinpoint the actual source.  We have increased the TCP ring buffers to their maximum value, 4096, and disabled all TCP offloading.  It appears that both are required, and once the changes are made on the UCS side, the blades *have* to be rebooted for the changes to take effect.  (The equivalent host-side checks are sketched at the end of this reply.)

    We have seen almost no status 24s or 14s in NetBackup since these changes, and no more dropped packets are showing up when we run "ifconfig" on the hosts. 

    Unfortunately, we still have one other issue with the UCS blades and will be reverting to regular physical x86 Red Hat boxes.  Our Quantum i2000 library and tape drives had to be connected to these servers through our SAN fabric, and every time the Fabric Interconnects on the SAN or the UCS enclosure have to be reset, the tape drives change connection paths to the media server.  The OS then sees the drives on the wrong paths, which requires me to delete them from NBU and re-discover them in both the OS and NBU, along with a stop/start of LTID.  This FI reset has had to happen about a dozen times in the past month, and since it requires manual intervention on my part to correct, the other team decided it would be easier for both of us to move my servers back and direct-connect my tape drives. 
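
    For anyone hitting the same thing: the actual changes were made in the UCS service profile, but a rough sketch of how to verify the same settings from the Linux side (eth0 is a placeholder interface name; support depends on the NIC driver) would be:

        # Check current and maximum ring buffer sizes for the interface
        ethtool -g eth0

        # Raise RX/TX ring buffers to the 4096 maximum, if the driver allows it
        ethtool -G eth0 rx 4096 tx 4096

        # Turn off the common offload features
        ethtool -K eth0 tso off gso off gro off lro off rx off tx off

        # Confirm that the dropped-packet counters are no longer climbing
        ifconfig eth0 | grep -i drop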

  • Do you not have the facility to set persistent bindings for your fibre connections?

    A lot of SANs do this when they get resets, so I always use the fibre card software (HBAnyware, elxcfg, etc.) to set up persistent bindings on each client to ensure that after any sort of reset or reboot the drives and robotics come back with the same LUNs etc.
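
    If the HBA tools aren't an option, one general thing worth checking on the Linux side (just a suggestion, not NetBackup-specific) is whether the OS already exposes persistent, serial-number-based names for the drives that survive a path change, and comparing that with what NetBackup currently sees:

        # udev's stable links for tape devices (survive device renumbering)
        ls -l /dev/tape/by-id/

        # NetBackup's view of the attached tape drives and robots
        /usr/openv/volmgr/bin/scan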

  • None that I can find.  The UCS system is a blade setup, with a single chassis and multiple blades inside it.  None of the blades has a direct connection to the network or fibre, but it is different from something like VMware: you can dedicate bandwidth to specific systems as you wish. 

    As a result, the UCS system no longer seems to have a setting for persistent binding of the vHBAs (it apparently used to, but my Unix Ops guy can't find it on our current version of Cisco's OS). 

    Just as well.  I wasn't fully expecting the blades to be perfect for the media server role, but my Unix Ops guys were pretty gung-ho about the idea.  Now that we have verified this, we won't be trying any sort of virtualization for my backup servers.

  • Thanks, Brook!  That's what I was trying to find before.  I was sure there would be something better to explain it.