MSDP status 2074, but volume is not down

Question

(Reposting here, didn't mean for this to be posted under PureDisk...)
	&nbsp;
	Does anyone here have any experience yet using Cisco UCS blades as NBU media servers?
	We are migrating our existing Oracle x86 and Solaris Sparc media server hardware connecting to Data Domains over NFS, to Cisco UCS blades running Red Hat Linux and fiber channel attached storage using NBU's Media Server Deduplication Pools.&nbsp; We are running NBU 7.5.0.5 on all servers, and most clients are running 7.5.0.4.
	We have been experiencing sporadic status 2074's during heavy backup loads on one MSDP.&nbsp; But the MSDP is not actually down, and the jobs auto-retry 1-2 hours later and are successful.&nbsp; I am not upping the MSDP, and I can't find any indication that anything else is cycling the services or rebooting the server.&nbsp;
	I do not find a storaged.log or spoold.log on the server, and I just uncommented the /usr/openv/lib/ost-plugins/pd.conf line to enable PureDisk logging and cycled NBU on that server a little while ago, so I don't have any logs from that yet.
	If you have worked with this combination of hardware/OS/NBU, do you have any advice on UCS tuning or settings?&nbsp; We're also having other issues with TCP tuning (socket reads/writes failing).

&nbsp;

brook_humphrey · Accepted Answer

This was written to handle things exactly like this:
http://www.symantec.com/docs/TECH156743
The reason for this if they are intermittent is because the Disk Polling Service checks the status of the disk and then marks it up or down depending on weather or not it gets a response. The default timeout is 60 seconds. So when the disk is under load, the network interface is overwhelmed, or the system is in any other way experiencing a performance issue that delays the response to DPS then it gets marked as down( status 2074 but can be other status codes reported as well).&nbsp;
If this is due to performance issues you can easily tell because the device will come back up within a few minutes and backups will start running again. Which of course makes sense. As soon as all the backups fail then the performance returns to normal and it returns the response to DPS within 60 seconds again and NetBackup marks the device as up again.
I'm not going to get into infrastructure issues that can cause this but anything that has a shared architecture like blade servers, VM's, etc. are going to be more suspect to this type of behavior unless they have dedicated resources(dedicated bandwidth, network interface, raw mapped devices, etc). Also as noted the manufacturer of the network interface you are using can make a huge difference. And it has been known for some time that some cards do require you to shut off offloading or other features to get it to perform as expected.
This is why the DPS timeouts work so well. When under a load and not returning a response as quickly as you would like it allows for a longer period of time to get a response from the server before marking it as down.&nbsp;
The above technote tries to cover a large amount of territory in a small amount of space but the solutions down at the bottom cover quite a few if not all the reasons you may see this. I need to update it as it is a little out of date. We do now allow iSCSI for MSDP as long as it's a 10GB dedicated iSCSI network. I also need to add 2074 to the list of status codes that can cause this.
Thanks&nbsp;
Let me know if you have further questions.

mark_solutions · Answer

Not worked on such a system but have seen such behavoir on heavily loaded systems
The All Log Entries report tends to show this up with entries showing the disk pool going up and down like a yo-yo!
Mine have all been sorted out by adding the following to the Master and MDSP Media Servers:
/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO
or for Windows:
&lt;install path&gt;\veritas
etbackup\db\config\DPS_PROXYDEFAULTRECVTMO
and entering a value of 800 into the file
Hope this helps

roncaplinger · Answer

Great suggestion, but I already set the DPS_* touchfiles&nbsp; when these servers were built.&nbsp; Oddly, those files are not being backed up on this particular media server when it backs itself up; the SIZE_DATA_BUFFERS* and NUMBER_DATA_BUFFERS*, also in that same directory, are being backed up, though.&nbsp;
Those touchfiles *are* being backed up on the other new media servers, just not this one.&nbsp; And last night's backups did not encounter any 2074's.&nbsp; Something's odd here.

mark_solutions · Answer

Do you just have that one DPS_* touch file or the other two as well?
I have found it works best when only the one exists
Any exclusions set on the server that doesn't back them up?

roncaplinger · Answer

I have all three DPS_* touch files added.&nbsp; I'll delete the other two; but I didn't have any jobs fail with 2074s last night, so I don't know exactly what happened there or if I can confirm this has fixed the issue.
I don't see anything in the exclude list that skips just those files but backs up the buffer files.

mark_solutions · Answer

OK - I guess until you have had a clean run with the full backups running when the system is at maximum load you wont be able to tell
Let is run with just the one DPS file in place and see how it goes
Not sure why it doesn't backup those files

Forum Discussion

MSDP status 2074, but volume is not down

10 Replies

Related Content

NBU status: 2074, EMM status: Disk volume is down

Disk volume is down(2074)

Error 2074 Disk volume is down

Status 2074 - Disk Volume Down

Disk Volume is Down: Status 2074 - Can't be UP

Recent Discussions

command: bperror

MS-SharePoint policy restore error (2804) .

How to restore a backup

How to configure RBAC

10 years old netbackup appliance database service down, ssl certification out date