04-22-2015 07:19 AM
Last week we upgraded our Master and Media servers to Netbackup 7.6.0.4 from 7.5.0.7 completing on the 17th April. Everything worked fine until 19th April when the we noticed Oracle and SQL backups were failing with an error 6 and an Exchange backup which normally takes 2.5 hours was still running after 2 days. Client servers are running a variety of versions from 7.1 to 7.5.0.7. We have two media servers and we only have this problem on one media server and the problem only affects some of the servers.
Error codes we are getting are 6, 41 and 196.
Investigating the problem the we found that the problem is the backups to the MSDP storage times out or is very slow in the case of the Exchange backup. Looking in the logs I have seen the following errors:
bptm
00:00:56.003 [7068.8068] <16> 86851:bptm:7068:hpwp-med01-bkp.hcuk.pri: [ERROR] PDVFS: pdvfs_lib_log: mtstrm_is_server_started: Cr_ErrnoException: Get Process Status command failed: Cr_ErrnoException: Timed out after waiting 1200s to send command Get Process Status to mtstrmd
00:00:56.003 [7068.8068] <16> 86851:bptm:7068:hpwp-med01-bkp.hcuk.pri: [ERROR] PDVFS: PdvfsMtstrmStartDaemon: The multi-threaded streaming daemon is not ready to receive requests even after 4 attempts. Check if mtstrmd is running and look for errors in mtstrmd.log.
00:00:56.003 [7068.8068] <16> 86851:bptm:7068:hpwp-med01-bkp.hcuk.pri: [ERROR] PDSTS: start_mtstrm_session: PdvfsMtstrmStartDaemon failed (18:Improper link)
00:30:28.676 [8116.13800] <16> 84077:bptm:8116:hpwp-med01-bkp.hcuk.pri: libsts openp() 15/04/21 00:30:28: stsm_open_module C:\Program Files\Veritas\NetBackup\bin\\ost-plugins\libstspizapi.dll failed err 2060053 platerr 193: LoadLibrary
dbclient
00:45:29.702 [11258] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/11258.0.1429572612
01:00:30.755 [11258] <16> readCommFile: ERR - timed out after 900 seconds while reading from /usr/openv/netbackup/logs/user_ops/dbext/logs/11258.0.1429572612
01:00:30.756 [11258] <32> serverResponse: ERR - could not read from comm file </usr/openv/netbackup/logs/user_ops/dbext/logs/11258.0.1429572612>
01:00:30.756 [11258] <16> CreateNewImage: ERR - serverResponse() failed
01:00:30.756 [11258] <16> VxBSACreateObject: ERR - Could not create new image with file /arch_ultq4t6kk_s17085_p1_t877566612.
bpdm
00:00:17.860 [900.10140] <16> 87153:bptm:900:hpwp-med01-bkp.hcuk.pri: libsts openp() 15/04/22 00:00:17: stsm_open_module C:\Program Files\Veritas\NetBackup\bin\\ost-plugins\libstspizapi.dll failed err 2060053 platerr 193: LoadLibrary failed
These errors seem to show the backups are unable to write to the storage as the backups seem to go to the media server ok and then fail because they cannot write to the storage. Apart from the upgrade the Netbackup configuration has not changed.
My gut feeling is that the failures are not caused by the upgrade and possibly not by Netbackup as these errors seem to be a consquence rather than a cause and we had at least a days worth of successful backups beform we started to have problems but as always for a Netbackup Administrator the problem is proving it.
Is anyone able give me any ideas? I've worked on this for 2 days and I am out of them.
04-22-2015 07:37 AM
The err 2060053 looks like:
https://www-secure.symantec.com/connect/forums/nbu-7602-windows-media-server-zero-length-libstspizapidll
04-22-2015 07:55 AM
In the first line of bptm error messages we see:
PDVFS: pdvfs_lib_log: mtstrm_is_server_started
...which appears to be bptm checking to see whether the CR (Content Router) is up and started, but it timed out after 20 minutes.
Maybe check your MSDP pool on the problematic media server - it could be struggling for some reason.
Is there a CR queue processing back log?
Run the 'queueinfo' type commands from here:
https://www-secure.symantec.com/connect/forums/netbackup-7104-msdp-storaged-critical-error
04-22-2015 08:24 AM
Thanks sdo
That is what I am seeing. The output of the queueinfo commando is:
c:\Program Files\Veritas\pdde>crcontrol --queueinfo
total queue size : 3781148
creation date of oldest tlog : Wed Apr 22 12:29:24 2015
04-22-2015 08:30 AM
What does this show:
crcontrol --dsstat
...which is the quick version of the command (which I believe is effectively what it showed last time that it ran the longer version itself - or if you ran the longer version).
.
Don't run this next command yet - which is sometimes referred to as teh detailed version, but really it just means get an upto date picture:
crcontrol --dsstat 1
04-22-2015 08:34 AM
Here is the crcontrol --dsstat output:
c:\Program Files\Veritas\pdde>crcontrol --dsstat
************ Data Store statistics ************
Data storage Raw Size Used Avail Use%
58.2T 55.8T 41.2T 14.7T 74%
Number of containers : 248264
Average container size : 183904950 bytes
Space allocated for containers : 45656978603092
Reserved space : 2558972882944 b
Reserved space percentage : 4.0%
Use "--dsstat 1" to get more accurate statistics
04-22-2015 09:07 AM
We raised a support call and an old fashioned reboot seemed to have got the backups working. Will need to now investigate why the problem occurred.