Master server marking media server disk pool as offline

LouiseD · ‎03-19-2015

On a daily basis, our master server marks a media server disk pool as offline for a period of 30 minutes, causing backups to fail with:

Error nbjm(pid=5516) NBU status: 2106, EMM status: Storage Server is down or unavailable
Disk storage server is down(2106)

When the disk pool is marked as down, the server does not seem to be under any great load, and communication and networking between the master and media servers are also OK.

The following errors appear in disk pool logs several times.

bptm image copy is not ready, retry attempt: 0 of 500 object is busy, cannot be closed
nbemm The usage of one or more system resources has exceeded a warning level. Operations will or could be suspended. Please take action immediately to remedy this situation.

Any advice on how to troubleshoot this issue would be appreciated.

Thanks in advance.

Master and media servers: Windows Server 2008 R2 running NetBackup 7.6

sdo · ‎03-19-2015

What do these commands show:

nbdevquery -liststs -stype PureDisk -U

nbdevquery -listdp -stype PureDisk -U

nbdevquery -listdv -stype PureDisk -U

LouiseD · ‎03-19-2015

Output of commands below.

>nbdevquery -liststs -stype PureDisk -U

Storage Server : A
Storage Server Type : PureDisk
Storage Type : Formatted Disk, Network Attached
State : UP
Flag : OpenStorage
Flag : CopyExtents
Flag : AdminUp
Flag : InternalUp
Flag : LifeCycle
Flag : CapacityMgmt
Flag : FragmentImages
Flag : Cpr
Flag : FT-Transfer
Flag : OptimizedImage

Storage Server : B
Storage Server Type : PureDisk
Storage Type : Formatted Disk, Network Attached
State : UP
Flag : OpenStorage
Flag : CopyExtents
Flag : AdminUp
Flag : InternalUp
Flag : LifeCycle
Flag : CapacityMgmt
Flag : FragmentImages
Flag : Cpr
Flag : FT-Transfer
Flag : OptimizedImage

>nbdevquery -listdp -stype PureDisk -U

Disk Pool Name : DU_PD_01
Disk Pool Id : DU_PD_01
Disk Type : PureDisk
Status : UP
Flag : Patchwork
Flag : Visible
Flag : OpenStorage
Flag : SingleStorageServer
Flag : CopyExtents
Flag : AdminUp
Flag : InternalUp
Flag : LifeCycle
Flag : CapacityMgmt
Flag : FragmentImages
Flag : Cpr
Flag : FT-Transfer
Flag : OptimizedImage
Raw Size (GB) : 62857.43
Usable Size (GB) : 62857.43
Num Volumes : 1
High Watermark : 98
Low Watermark : 80
Max IO Streams : -1
Comment :
Storage Server : A (UP)

Disk Pool Name : AB_PD_01
Disk Pool Id : AB_PD_01
Disk Type : PureDisk
Status : UP
Flag : Patchwork
Flag : Visible
Flag : OpenStorage
Flag : SingleStorageServer
Flag : CopyExtents
Flag : AdminUp
Flag : InternalUp
Flag : LifeCycle
Flag : CapacityMgmt
Flag : FragmentImages
Flag : Cpr
Flag : FT-Transfer
Flag : OptimizedImage
Raw Size (GB) : 62857.43
Usable Size (GB) : 62857.43
Num Volumes : 1
High Watermark : 98
Low Watermark : 80
Max IO Streams : -1
Comment :
Storage Server : B (UP)

>nbdevquery -listdv -stype PureDisk -U

Disk Pool Name : DU_PD_01
Disk Type : PureDisk
Disk Volume Name : PureDiskVolume
Disk Media ID : @aaaac
Total Capacity (GB) : 62857.43
Free Space (GB) : 12878.01
Use% : 79
Status : UP
Flag : ReadOnWrite
Flag : AdminUp
Flag : InternalUp
Num Read Mounts : 0
Num Write Mounts : 1
Cur Read Streams : 0
Cur Write Streams : 1
Num Repl Sources : 0
Num Repl Targets : 0

Disk Pool Name : AB_PD_01
Disk Type : PureDisk
Disk Volume Name : PureDiskVolume
Disk Media ID : @aaaaf
Total Capacity (GB) : 62857.43
Free Space (GB) : 28140.18
Use% : 55
Status : UP
Flag : ReadOnWrite
Flag : AdminUp
Flag : InternalUp
Num Read Mounts : 0
Num Write Mounts : 1
Cur Read Streams : 0
Cur Write Streams : 8
Num Repl Sources : 0
Num Repl Targets : 0

It is storage server B disk pool AB_PD_01 we are having issues with.
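
To catch the moment AB_PD_01 changes state, the `-U` output above could be polled and parsed rather than read by eye. This is only a minimal sketch, assuming the "Key : Value" layout with blank lines between records shown in this post. The function names `parse_nbdevquery` and `volumes_not_up` are hypothetical, and the sample text is abridged from the `-listdv` output above.

```python
def parse_nbdevquery(text):
    """Split `nbdevquery ... -U` output into one dict per blank-line-separated record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if any.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key : Value" (e.g. the echoed command)
        key, value = key.strip(), value.strip()
        if key == "Flag":
            # "Flag" repeats within a record, so collect the values in a list.
            current.setdefault("Flag", []).append(value)
        else:
            current[key] = value
    if current:
        records.append(current)
    return records


def volumes_not_up(text):
    """Return the disk pool names of records whose Status field is not UP."""
    return [
        rec.get("Disk Pool Name")
        for rec in parse_nbdevquery(text)
        if "Status" in rec and rec["Status"] != "UP"
    ]


# Abridged from the -listdv output in this post.
SAMPLE = """\
Disk Pool Name : DU_PD_01
Disk Type : PureDisk
Status : UP
Flag : ReadOnWrite
Cur Write Streams : 1

Disk Pool Name : AB_PD_01
Disk Type : PureDisk
Status : UP
Flag : ReadOnWrite
Flag : AdminUp
Cur Write Streams : 8
"""

print(volumes_not_up(SAMPLE))  # prints [] because both volumes report UP
```

In practice the text would come from running `nbdevquery -listdv -stype PureDisk -U` on a schedule (for example through `subprocess.run` with `capture_output=True, text=True`) and logging a timestamp whenever the returned list is non-empty.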

Thanks.

Nicolai · ‎03-19-2015

Released from the anti-spam quarantine

Nicolai · ‎03-19-2015

Do you have McAfee installed?

This tech note explains how to keep McAfee from scanning NetBackup processes and files:

3RD PARTY: NetBackup Services are randomly shutting down on Windows servers after applying a patch for McAfee McShield 8.5 or 8.7i.

http://www.symantec.com/docs/TECH56658

LouiseD · ‎03-19-2015

No, McAfee isn't installed on any of the backup servers.

Thanks.

sdo · ‎03-19-2015

No other AV installed on the MSDP servers?

LouiseD · ‎03-19-2015

There is no AV on the MSDP servers.

sdo · ‎03-19-2015

This interests me: "nbemm The usage of one or more system resources has exceeded a warning level. Operations will or could be suspended. Please take action immediately to remedy this situation."

Have a look at EMM logs around the time of the event using:

1) If you can't remember which VxUL (Veritas Unified Logging) OID (Originator ID) is related to a given NetBackup component, I like to use this to first identify the OID that I need:

> for /f "tokens=1" %a in ('vxlogcfg -l -p 51216') do (vxlogcfg -l -p 51216 -o %a | findstr /i "oidnames")

2) Then you can identify which OIDs are related to EMM, with:

> for /f "tokens=1" %a in ('vxlogcfg -l -p 51216') do (vxlogcfg -l -p 51216 -o %a | findstr /i "oidnames" | findstr /i "emm")

...which should make the 'name' stand out in the list.

3) Now you know which OID(s) you are interested in, but first let's remind ourselves what the default logging levels are for any OID which does not have specific logging levels set:

> vxlogcfg -l -p 51216 -o Default | findstr /i "level"

4) Now let's compare the default logging level with the logging level of the OID that we're interested in. This time we're interested in the logging levels of the "nbemm" sub-system/component, so that's OID number 111:

> vxlogcfg -l -p 51216 -o 111 | findstr /i "level"

...yes we could have used the name, instead of the number:

> vxlogcfg -l -p 51216 -o nbemm | findstr /i "level"

...but who ever remembers all of the names and the numbers (though I bet some on here do), hence the listing above to identify the OID name or the OID number.

5) Now we know which OID to use, and we've reminded ourselves what the logging levels are, we can begin to query the logs:

> vxlogview -p 51216 -o 111 -b "01/03/2015 19:00:00" -e "19/03/2015 19:05:00"

...but this will most likely generate a huge listing...

...so we need to narrow down the time frame, or narrow down the severity...

...and until you start, you never know whether it's a narrowing of time, or a narrowing of severity which will help you 'spot' the problem(s)...

...so let's use the "-L" switch with another of these switches to look for progressively less severe events, i.e. I like to look for the most severe events first and then work down, e.g. first look for 'emergency' events:

> vxlogview -p 51216 -o 111 -b "01/03/2015 19:00:00" -e "19/03/2015 19:05:00" -L -M

...then critical messages:

> vxlogview -p 51216 -o 111 -b "01/03/2015 19:00:00" -e "19/03/2015 19:05:00" -L -C

...then errors:

> vxlogview -p 51216 -o 111 -b "01/03/2015 19:00:00" -e "19/03/2015 19:05:00" -L -E

...then warnings:

> vxlogview -p 51216 -o 111 -b "01/03/2015 19:00:00" -e "19/03/2015 19:05:00" -L -W

...sadly the vxlogview command doesn't appear to have a way of saying something like 'all warnings and above', e.g. you can't do this: '-L -W+'. Anyway, narrowing the severity can sometimes help narrow down the time frame, so that you have a smaller set of 'all entries' logs to look at. Narrowing the severity can also help you scan a wider period of time, because we don't yet know whether the root cause of your issue occurs at the time of the failure or is caused by a condition that occurred at some as yet unknown earlier time. So sometimes searching for emergency, critical, and error messages over a wider period of time can help you spot the pertinent messages.
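
Since there seems to be no single 'warnings and above' switch, one workaround is to generate the separate per-severity commands in one go. The following is only a sketch: the switches (-M, -C, -E, -W) and their ordering are taken as-is from this post and have not been checked against the vxlogview documentation, and the function name `vxlogview_commands` is hypothetical.

```python
# Severity switches in the order used in this post, most severe first:
# emergency, critical, error, warning.
SEVERITY_SWITCHES = ["-M", "-C", "-E", "-W"]


def vxlogview_commands(oid, begin, end, least_severe="-W", product_id="51216"):
    """Return one vxlogview argument list per severity, from -M down to least_severe."""
    stop = SEVERITY_SWITCHES.index(least_severe) + 1
    base = ["vxlogview", "-p", product_id, "-o", str(oid),
            "-b", begin, "-e", end, "-L"]
    return [base + [switch] for switch in SEVERITY_SWITCHES[:stop]]


def as_command_line(args):
    """Join arguments for display, double-quoting any that contain a space."""
    return " ".join(f'"{a}"' if " " in a else a for a in args)


# Print the 'errors and above' commands for the nbemm OID over the same window.
for cmd in vxlogview_commands(111, "01/03/2015 19:00:00", "19/03/2015 19:05:00", "-E"):
    print(as_command_line(cmd))
```

Each argument list can be passed straight to `subprocess.run` on the master server, and the outputs concatenated or compared by timestamp.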

...lastly you can pipe the output through find or findstr to filter out more and more useless messages, e.g.:

> vxlogview -p 51216 -o 111 -b "01/03/2015 19:00:00" -e "19/03/2015 19:05:00" -L -E | find /i /v "generic error"
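
Each `find /i /v` takes a single string, so excluding several noisy messages means chaining one pipe per phrase. If the output is saved to a file or captured by a script, the same case-insensitive exclusion can be applied for many phrases at once. This is a minimal sketch with a hypothetical helper name (`drop_noise`). The first and third sample lines are invented for illustration; only the middle one is quoted from this thread.

```python
def drop_noise(lines, phrases):
    """Keep only lines that contain none of the given phrases, ignoring case."""
    lowered = [p.lower() for p in phrases]
    return [ln for ln in lines if not any(p in ln.lower() for p in lowered)]


sample = [
    "nbemm Generic Error while polling",  # invented line
    "nbemm The usage of one or more system resources has exceeded a warning level.",
    "nbemm connection heartbeat ok",  # invented line
]

# Equivalent in spirit to: ... | find /i /v "generic error" | find /i /v "heartbeat"
for line in drop_noise(sample, ["generic error", "heartbeat"]):
    print(line)
```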

HTH.

So, if you can narrow down your logs, I'm sure others will be grateful to have less to look through.

LouiseD · ‎03-23-2015

Thanks for the detailed response.

We have resolved the immediate problem of the disk being marked as down by increasing the DPS proxy timeout.

http://www.symantec.com/business/support/index?page=content&id=HOWTO94399

I will follow the above steps to try to find out what is actually going wrong.

Thanks.
