Forum Discussion

Alberto_Colombo
13 years ago

Random Problem with N5020 (backup/restore)

 

Problem
NetBackup deduplication disk pool/disk volume intermittently goes to a DOWN state. 
Backups, restores, and duplications can fail with status 213 (no storage units available) and status 2074.
 
Here is our setup:
 
One NetBackup Master Server 7.1.0.3 with RHEL 5.6
One NetBackup Media Server 7.1.0.3 with Windows 2008
One N5020 with 1.4.1.1
 
bpstulist -L
 
Label:                DSU-5020
Storage Unit Type:    Disk
Media Subtype:        DiskPool (6)
Host Connection:      srvprdminbms01.siram.net
Concurrent Jobs:      99
On Demand Only:       yes
Robot Type:           (not robotic)
Max Fragment Size:    51200
Max MPX:              1
Block Sharing:        yes
File System Export:   yes
Ok On Root:           no
Disk Pool:            N5020_Dedupe_Pool
 
From time to time our backup environment goes down for no apparent reason: backups and restores using the N5020 hang,
and in the end we have to restart the NetBackup services on the NetBackup Master.
 
After that, everything looks OK, and backups and restores carry on as usual.
 
Can someone give us some hint about this random but quite dangerous problem?
 
thank you in advance,
Alberto

4 Replies

  • I think I saw something similar back in the early days of the 5000 appliances but thought the issue died out in v1.3.  If you look in the "Disk Logs" report and see messages about the disk pool/volume going down and then coming back up 2-5 mins (or so) later, tell your support guy to look up an internal doc of doc id 316680 (or have him contact me).  

    Long story short, IF this is the problem, NBU is pestering the appliance for an "are you alive" message every 60 seconds, and the appliance may be too busy to respond that quickly. We can adjust the timeout by creating the following touch file:

    Configuration file: <InstallPath>\VERITAS\NetBackup\db\config\DPS_PROXYDEFAULTRECVTMO 

    You'd want to put a value in the touch file that is larger than the time you're seeing the disk pool as offline in your disk logs.  I don't think I've ever seen one go longer than 5 mins, so if you put "600" (for 600 secs) in your touch file you should be more than good.  One thing about that is that NBU may not see the storage pool as down for up to 10 mins after it's actually offline, but while your backups will stop writing, it won't cause a problem for any kind of data consistency or anything.  NBU just won't see the disk as down for 10 minutes instead of 1 minute, but your false positives on the pool being down should also go to 0.
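As a rough illustration of the sizing rule above (pick a value larger than the longest outage seen in the disk logs), here is a minimal sketch. The helper name `suggest_timeout`, the sample durations, and the 2x safety margin are all assumptions made for this example, not anything NetBackup provides; the margin was chosen only so that a 300-second outage yields the 600 mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: suggest a DPS_PROXYDEFAULTRECVTMO value (in seconds)
# from the "pool down" durations observed in the Disk Logs report.
# Assumption: a 2x margin over the longest outage is "more than good".
suggest_timeout() {
    max=0
    for d in "$@"; do
        if [ "$d" -gt "$max" ]; then
            max=$d
        fi
    done
    echo $((max * 2))
}

# Made-up outage lengths in seconds (2, 5 and 3 minutes)
suggest_timeout 120 300 180
```

With these sample values the function prints 600. The durations have to be read off the disk logs by hand; the sketch does not parse them.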

    Hope this helps!

  •  

    Hi Chad,
    yesterday Symantec Support wrote me:
     
    Please see this technote:  http://www.symantec.com/docs/TECH176451  for a solution involving creating text files in order to increase the timeout.
     
    From the technote: (Note! These go on the media server)
     Increased DPS proxy timeouts to 3600 seconds (max):
        a. This one is just an empty file:
    C:\Program Files\Veritas\NetBackup\db\config\DPS_PROXYNOEXPIRE
        b. Create this file with the value of 1800 inside:
    C:\Program Files\Veritas\NetBackup\db\config\DPS_PROXYDEFAULTSENDTMO
        c. Create this file with the value of 1800 inside:
    C:\Program Files\Veritas\NetBackup\db\config\DPS_PROXYDEFAULTRECVTMO
        d. Restart nbrmms
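Steps a to c can be sketched as a small script. This is only an illustration: `CONFIG_DIR` and `apply_dps_timeouts` are hypothetical names, and the directory defaults to a scratch location so the sketch can be run safely. On the Windows media server in this thread the same three files would simply be created in the db\config folder listed above, and whether NetBackup cares about a trailing newline in the value files is not something the technote excerpt states.

```shell
#!/bin/sh
# CONFIG_DIR is an assumption: point it at the media server's NetBackup
# db/config directory. It defaults to a scratch directory so that running
# this sketch does not touch a real installation.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"

# Hypothetical helper: create the three touch files from the technote.
apply_dps_timeouts() {
    dir=$1
    tmo=$2
    : > "$dir/DPS_PROXYNOEXPIRE"                           # step a: empty file
    printf '%s\n' "$tmo" > "$dir/DPS_PROXYDEFAULTSENDTMO"  # step b: send timeout
    printf '%s\n' "$tmo" > "$dir/DPS_PROXYDEFAULTRECVTMO"  # step c: receive timeout
}

apply_dps_timeouts "$CONFIG_DIR" 1800
# Step d (restarting nbrmms) is deliberately not scripted here.
ls "$CONFIG_DIR"
```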
     
     
    I've checked "Disk logs" (my fault for not having done it before!) and I can see exactly the warnings you described: the PureDisk volume is marked down, and after some seconds or minutes it is marked up again.
    it happened quite a few times in the last 3 days...
     
    Last night our backup environment did not stop at all, but it's also true that I have significantly modified the environment itself (as you suggested in a different post, I'm now pulling optimized duplicated data using a different media server), so I cannot be sure that implementing these new parameters is what solved the issue.
     
    i'll wait for one more day just to be sure that this problem won't re-appear, and then i'll mark it as solved.
    thank you very much,
    Alberto 
  • I have had similar issues on a customer's site with a 5200 unit (so slightly different).

    It was resolved by three things - it has not had an issue for months since doing this:

    1. The DPS_PROXYDEFAULTRECVTMO with a value of 800 but the other two removed - this needs a full service re-start to take effect (I usually reboot)

    2. The SIZE_DATA_BUFFERS and NUMBER_DATA_BUFFERS files removed

    3. The keep alive setting changed:

    # echo 510 > /proc/sys/net/ipv4/tcp_keepalive_time

    # echo 3 > /proc/sys/net/ipv4/tcp_keepalive_intvl

    # echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes

    These need to be made persistent, though. An addition such as the following to /etc/sysctl.conf would do that:

    ## Keepalive at 8.5 minutes
    # start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
    net.ipv4.tcp_keepalive_time=510
    # close connection after 3 unanswered probes (default 9)
    net.ipv4.tcp_keepalive_probes=3
    # wait 3 seconds for response to each probe (default 75)
    net.ipv4.tcp_keepalive_intvl=3

    They don't need a restart to take effect. To have them loaded at boot, then run:

    chkconfig boot.sysctl on
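To see what these three keepalive values add up to, the arithmetic can be sketched as follows. On Linux the first probe is sent after tcp_keepalive_time seconds of idle time, and the connection is dropped after tcp_keepalive_probes unanswered probes sent tcp_keepalive_intvl seconds apart. The function name `keepalive_detect_seconds` is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: seconds of silence before the kernel declares an
# idle TCP connection dead: idle time + (probe interval * probe count).
keepalive_detect_seconds() {
    idle=$1
    intvl=$2
    probes=$3
    echo $((idle + intvl * probes))
}

keepalive_detect_seconds 510 3 3      # values from this post
keepalive_detect_seconds 7200 75 9    # Linux defaults
```

The first call prints 519 (about 8 minutes 39 seconds) and the second prints 7875 (a little over 2 hours 11 minutes), which shows how much sooner a dead connection is noticed with the tuned values.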

    Hope this helps

  • Hi Mark, thank you for your reply. After applying the parameters with the values specified by Symantec Support I have experienced NO MORE stops (even if we still have some quite painful performance problems), so I have not applied your values yet.

    Once our environment is up and running with no more performance problems, I think I'll try them.

     

    regards,

    Alberto