Forum Discussion

phoenix24x1
11 years ago

PureDisk volume down alert (via OpsCenter)

Hey All,

  • NBU 7.5.0.4 on all master/media servers
  • 5220 Appliance (2.5.1)

I am receiving a PureDisk volume down alert (via OpsCenter), but no backups fail in the window that OpsCenter reports. I believe the alert is bogus, and I am looking through the log directories via the CLISH to verify this. However, I'm not sure which specific logs to investigate.

Any input is greatly appreciated.

 

  • It may well be that your volume is going up and down regularly

    Run an All Log Entries report from the NBU Admin Console for a 24-hour period (or at least one covering the alert time)

    This will tell you if it is really going up and down (volumes often do, but it isn't noticed unless you actually run a report)

    If it is, then my advice is to add the DPS timeout file to your Master Server and the appliance

    It is a flat file, located at /usr/openv/netbackup/db/config/ (or Windows equivalent) and is named

    DPS_PROXYDEFAULTRECVTMO

    You may well find that it already exists, but check the value in the file. I have found 800 to work best; it often uses 3600, which is too high

    If you have to create or change the file, then you need a NetBackup restart to register it

    Hope this helps
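The file check described in this reply can be scripted. Below is a minimal Python sketch, assuming the file holds a single integer on one line (the reply only says it is a flat file with a value in it). The function name `check_timeout_file` and the status strings are made up for illustration; the path and the value 800 come from the reply. It only reads the file and does not change anything or restart NetBackup.

```python
from pathlib import Path

# Path taken from the reply above; the Windows location differs.
DPS_RECV_TMO = Path("/usr/openv/netbackup/db/config/DPS_PROXYDEFAULTRECVTMO")
SUGGESTED_SECONDS = 800  # value the reply reports as working well


def check_timeout_file(path, suggested=SUGGESTED_SECONDS):
    """Return (status, value) for a flat file assumed to hold one integer.

    status is one of "missing", "unreadable", "ok", "differs"
    (hypothetical labels used only by this sketch).
    """
    path = Path(path)
    if not path.exists():
        return ("missing", None)
    try:
        value = int(path.read_text().strip())
    except ValueError:
        # The file exists but does not contain a plain integer.
        return ("unreadable", None)
    if value == suggested:
        return ("ok", value)
    return ("differs", value)


if __name__ == "__main__":
    status, value = check_timeout_file(DPS_RECV_TMO)
    print(f"{DPS_RECV_TMO}: {status} (value={value}, suggested={SUGGESTED_SECONDS})")
```

Run it on the Master Server and on the appliance; a "missing" or "differs" result is the case where the reply suggests creating or editing the file and then restarting NetBackup.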

  • If Mark's advice shows that it's only going "down" for one or two minutes, it may be something I've seen with our 5220s running 2.5.3.  The Master queries each Appliance once a minute, and if the Appliance is too busy doing other things, that query fails and the volume gets marked "down".  This especially seems to happen right at the end of the twice-a-day transaction log processing.

     

    What I had to do to decrease the false alarms was to change the Media Server Connect timeout in the properties of the Master server (and Appliance Media Servers) to 240 seconds.  That seems to let it tolerate the occasional loss of response to the regular "are you still there?" queries.
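To tell short "blips" like these apart from real outages, the up/down entries from the report can be paired into intervals. The sketch below is only an illustration: it assumes you have already extracted the entries into `(timestamp, state)` tuples by hand or with your own parser. That tuple format, the function names, and the two-minute threshold are assumptions, not a NetBackup log format.

```python
from datetime import datetime, timedelta


def down_intervals(events):
    """Pair each DOWN event with the next UP event.

    `events` is a time-sorted list of (timestamp, state) tuples where
    state is "DOWN" or "UP" (a hypothetical, simplified format).
    Returns a list of (down_start, duration) tuples.
    """
    intervals = []
    down_since = None
    for when, state in events:
        if state == "DOWN" and down_since is None:
            down_since = when
        elif state == "UP" and down_since is not None:
            intervals.append((down_since, when - down_since))
            down_since = None
    return intervals


def likely_false_alarms(events, threshold=timedelta(minutes=2)):
    """Return the down intervals that lasted no longer than `threshold`."""
    return [(start, length)
            for start, length in down_intervals(events)
            if length <= threshold]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        (datetime(2013, 1, 1, 2, 0), "DOWN"),
        (datetime(2013, 1, 1, 2, 1), "UP"),
        (datetime(2013, 1, 1, 14, 0), "DOWN"),
        (datetime(2013, 1, 1, 14, 30), "UP"),
    ]
    for start, length in likely_false_alarms(sample):
        print(f"short outage at {start}, lasted {length}")
```

If nearly every interval falls under the threshold, that points toward the missed once-a-minute query described above rather than a volume that is genuinely unavailable.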

     

  • The DPS_PROXY*** files do a similar thing to that and help with these regular checks, which do get delayed during very busy backup periods

    They also work with Enterprise Disk status, so they are handy for that too

  • I did find http://www.symantec.com/docs/TECH176451 some time ago and verified that we do have the files Mark references, with the default values.

    The connection timeout in the master server properties is above the default as well

     

    By chance I actually saw it live, and there were many jobs failing with status 2074. However, they would retry and complete successfully, which led me to believe this was happening only on isolated jobs. I was wrong! A very large number of jobs were starting at the same time, and the 2074s followed shortly after. I have been working to better load balance the environment, and at this point we are not seeing the issue anymore.

    If by chance the issue resurfaces, I will try changing the values in the files as Mark describes.
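The burst of simultaneous job starts described above can be spotted by counting starts per minute. This is a rough sketch only: it assumes the job start times have already been exported as Python `datetime` values, and the function names and the `limit` parameter are hypothetical. It does not query NetBackup itself.

```python
from collections import Counter
from datetime import datetime


def starts_per_minute(start_times):
    """Count how many jobs started within each clock minute."""
    return Counter(t.replace(second=0, microsecond=0) for t in start_times)


def busiest_minutes(start_times, limit):
    """Return (minute, count) pairs with more than `limit` starts, busiest first."""
    counts = starts_per_minute(start_times)
    return [(minute, n) for minute, n in counts.most_common() if n > limit]


if __name__ == "__main__":
    # Made-up start times, for illustration only.
    sample = [datetime(2013, 1, 1, 20, 0, s) for s in range(5)]
    sample.append(datetime(2013, 1, 1, 20, 5, 0))
    for minute, n in busiest_minutes(sample, limit=3):
        print(f"{minute}: {n} jobs started")
```

Minutes that stand out here are candidates for staggering policy start windows, which is the kind of load balancing the reply describes.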

     

     

    Thanks all!

    PS - apologies it took so long to get back, I was over-filtering my email
    =)