Solved: Puredisk volume down alert (via opscenter)

phoenix24x1 · ‎12-06-2013

Hey All,

NBU 7.5.0.4 on all master/media server
5220 Appliance (2.5.1)

I am receiving a PureDisk volume down alert (via OpsCenter), but no backups fail in the window that OpsCenter reports. I believe the alert is bogus, and I am looking through the log directories via the CLISH to verify this. However, I'm not sure which specific logs to investigate.

Any input is greatly appreciated.

Mark_Solutions · ‎12-09-2013

It may well be that your volume is going up and down regularly

Run an All Log Entries report from the NBU Admin Console for a 24-hour period (or at least one covering the alert time)

This will tell you if it is really going up and down (and they often do but aren't noticed unless you actually run a report)

If they are then my advice is to add the DPS timeout file to your Master Server and the appliance

It is a flat file, located at /usr/openv/netbackup/db/config/ (or the Windows equivalent), and is named

DPS_PROXYDEFAULTRECVTMO

You may well find that it already exists, but check the value in the file. I have found 800 to work best; it often uses 3600, which is too high.

If you have to create or change the file, you need to restart NetBackup for it to take effect
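
As a rough illustration of the check described above, here is a minimal Python sketch that reads the value in the file and writes 800 if the file is missing or holds a different value. It assumes the file contains nothing but a single integer; the function names read_timeout and ensure_timeout are made up for this example and are not part of NetBackup.

```python
from pathlib import Path

# Directory and file name as given in the post (UNIX path; Windows differs).
CONFIG_DIR = Path("/usr/openv/netbackup/db/config")
TIMEOUT_FILE = "DPS_PROXYDEFAULTRECVTMO"
SUGGESTED_SECONDS = 800  # value suggested in the post


def read_timeout(config_dir=CONFIG_DIR):
    """Return the value stored in the file, or None if the file is missing.

    Assumption: the file holds a single integer and nothing else.
    """
    path = Path(config_dir) / TIMEOUT_FILE
    if not path.exists():
        return None
    return int(path.read_text().strip())


def ensure_timeout(config_dir=CONFIG_DIR, seconds=SUGGESTED_SECONDS):
    """Write the value if it is missing or different.

    Returns True when the file was created or changed, which is the case
    where a NetBackup restart is needed.
    """
    if read_timeout(config_dir) == seconds:
        return False
    (Path(config_dir) / TIMEOUT_FILE).write_text(f"{seconds}\n")
    return True
```

Run it on the Master Server and on the appliance; a True result from ensure_timeout() means the file was touched and the restart mentioned above still has to be done by hand.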

Hope this helps

D_Flood · ‎12-10-2013

If Mark's advice shows that it's only going "down" for one or two minutes it may be something I've seen with our 5220's running 2.5.3. The Master queries each Appliance once a minute and if the Appliance is too busy doing other things, that query fails and it gets marked "down". This especially seems to happen right at the end of the twice-a-day transaction log processing.

What I had to do to decrease the false alarms was to change the Media Server Connect timeout in the properties of the Master server (and Appliance Media Servers) to 240 seconds. That seems to let it tolerate the occasional loss of response to the regular "are you still there?" queries.
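
To separate these one- or two-minute blips from a real outage, something like the following Python sketch could be run over the up/down transitions taken from the report. The input format, a time-ordered list of (timestamp, state) pairs, is an assumption made for this example and not the report's actual layout; the function names are hypothetical too.

```python
from datetime import timedelta


def down_intervals(events):
    """Pair each DOWN with the next UP and return (start, duration) tuples.

    events: time-ordered (datetime, state) pairs, state being "UP" or "DOWN".
    This shape is assumed for illustration only.
    """
    intervals = []
    down_since = None
    for stamp, state in events:
        if state == "DOWN" and down_since is None:
            down_since = stamp
        elif state == "UP" and down_since is not None:
            intervals.append((down_since, stamp - down_since))
            down_since = None
    return intervals


def brief_outages(events, limit=timedelta(minutes=2)):
    """Keep only outages short enough to look like a missed status query."""
    return [(start, length)
            for start, length in down_intervals(events)
            if length <= limit]
```

Anything that shows up in down_intervals() but not in brief_outages() lasted longer than two minutes and is worth a closer look.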

Mark_Solutions · ‎12-11-2013

The DPS_PROXY*** files do a similar thing to that and help with these regular checks, which do get delayed during very busy backup periods

They also work with Enterprise Disk status so are handy for that too

phoenix24x1 · ‎12-23-2013

I did find http://www.symantec.com/docs/TECH176451 some time ago and verified that we do have the files Mark references, with the default values.

The connection timeout in the master server properties is above the default as well

By chance I actually saw it live. There were many failures with 2074s, but the jobs would re-attempt and complete successfully, which led me to believe this was happening only on isolated jobs. I was wrong! A very large number of jobs were starting at the same time, with the 2074s following shortly after. I have been working to better load balance the environment, and at this point we are not seeing the issue anymore.

If by chance the issue resurfaces, I will try changing the values in the files as Mark describes.

Thanks all!

PS - apologies it took so long to get back, I was over-filtering email
=)
