Forum Discussion

Katiebee's avatar
Katiebee
Level 4
12 years ago

Vault failing Netbackup 7.1

Hi,

I use a vault to eject tapes at the end of the week, it failed last week and I ejected the tapes manually, I have restarted everything including the tape library, but the vault has failed this week as well. I am logging the issue with the tape library supplier as well asI'm not sure where the fault is, but is anyone able to give me any help?

The details from the jobs look like this:

26/07/2013 13:17:10 - vault waiting for global lock
26/07/2013 13:17:10 - requesting resource appnbu-v.NBVAULT.MAXJOBS
26/07/2013 13:17:22 - granted resource appnbu-v.NBVAULT.MAXJOBS
26/07/2013 13:17:24 - vault global lock acquired
26/07/2013 13:17:24 - vault waiting for session ID lock
26/07/2013 13:17:24 - requesting resource appnbu-v.VAULT_CREATE_SESSION_ID.LOCK_TLD(0)_ONSITETAPES
26/07/2013 13:17:28 - granted resource appnbu-v.VAULT_CREATE_SESSION_ID.LOCK_TLD(0)_ONSITETAPES
26/07/2013 13:17:29 - vault session ID lock acquired
26/07/2013 13:17:29 - vault session ID lock released
26/07/2013 13:32:53 - duplication skipped
26/07/2013 13:32:53 - vault waiting for assign slot lock
26/07/2013 13:32:53 - requesting resource appnbu-v.VAULT_ASSIGN_SLOT.LOCK_TLD(0)_OnsiteTapes
26/07/2013 13:32:58 - granted resource appnbu-v.VAULT_ASSIGN_SLOT.LOCK_TLD(0)_OnsiteTapes
26/07/2013 13:32:58 - vault assign slot lock acquired
26/07/2013 13:33:00 - vault assign slot lock released
26/07/2013 13:33:00 - catalog backup skipped
26/07/2013 13:33:00 - vault waiting for assign slot lock
26/07/2013 13:33:00 - requesting resource appnbu-v.VAULT_ASSIGN_SLOT.LOCK_TLD(0)_OnsiteTapes
26/07/2013 13:33:10 - granted resource appnbu-v.VAULT_ASSIGN_SLOT.LOCK_TLD(0)_OnsiteTapes
26/07/2013 13:33:11 - vault assign slot lock acquired
26/07/2013 13:33:11 - vault assign slot lock released
26/07/2013 13:33:12 - before eject, waiting for media to be unmounted; sleeping for 180 seconds
26/07/2013 13:36:12 - starting eject operation
26/07/2013 13:36:12 - begin Eject/Report
26/07/2013 13:36:12 - connecting
26/07/2013 13:36:12 - connected; connect time: 00:00:00
26/07/2013 13:36:12 - vault waiting for eject lock
26/07/2013 13:36:12 - requesting resource appnbu-v.VAULT_EJECT.LOCK_0
26/07/2013 13:36:17 - granted resource appnbu-v.VAULT_EJECT.LOCK_0
26/07/2013 13:36:18 - vault eject lock acquired
26/07/2013 13:36:18 - suspend media for this session: failed to suspend 1 of 52 media at eject time
26/07/2013 13:36:18 - starting eject of 52 media
26/07/2013 13:40:34 - eject complete with status 287.  0 of 52 media ejected
26/07/2013 13:40:34 - vault eject lock released
26/07/2013 13:58:35 - vault global lock released
vault eject failed(287)

And in the problem reports I see this:

 

26/07/2013 13:32:37 MasterCluster   Error 0 Retrieve <16> vltrun@main Vault Session FAILED [PRFL=0/OnsiteTapes/WeeklyEject SID=331 JID=2518878 EC=287]
26/07/2013 13:32:37 MasterCluster   Error 0 Retrieve <16> vltrun@main FAILed NB_EC=287 NB_MSG=vault eject failed
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@SuspendMedia^332: Suspend FAILed MEDIA=000726, HOST=appnbu-v. EMM Err=2001049 ( Media is already allocated )
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@SuspendMedia^332 FAILed MM_EC=199 MM_MSG=the media is allocated for use
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@SuspendMedia^332: Leaving with DMN=2 SC=199
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@VaultRobot::suspendMediaToEject()^332: suspend failed for media 000726, ignoring...
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@VaultRobot::suspendMediaToEject()^332 FAILed NB_EC=97 NB_MSG=requested media id is in use, cannot process request
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@VaultRobot::suspendMediaToEject()^332: Leaving with DMN=1 SC=97
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@VaultScsiRobot::doEject()^332: there was a problem in doing suspending media to eject returned 97
26/07/2013 13:36:18 MasterCluster   Error 0 Retrieve <16> vltrun@VaultScsiRobot::doEject()^332 FAILed NB_EC=335 NB_MSG=failure occurred while suspending media for eject
26/07/2013 13:36:32 MasterCluster   Error 0 Retrieve <16> vltrun@VaultScsiRobot::doStartEject()^332: Executing command: "C:\Program Files\Veritas\Volmgr\bin\vmchange.exe" -res -multi_eject -sc -rn 0 -rt tld -rh appnbu-v -vh appnbu-v -v --- -ml 000245:000256:000273:000276
26/07/2013 13:40:20 MasterCluster   Error 0 Retrieve <16> vltrun@VaultScsiRobot::doStartEject()^332: Eject Failed
26/07/2013 13:40:34 MasterCluster   Error 0 Retrieve <16> vltrun@VaultScsiRobot::doEject()^332: There are problems in doing eject, returned 287
26/07/2013 13:40:34 MasterCluster   Error 0 Retrieve <16> vltrun@VaultScsiRobot::doEject()^332: Leaving with DMN=1 SC=335
26/07/2013 13:40:34 MasterCluster   Error 0 Retrieve <16> vltrun@VltSession::lock_and_operate^332 OP_STEP=eject_media FAILED
26/07/2013 13:40:34 MasterCluster   Error 0 Retrieve <16> vltrun@VltSession::lock_and_operate^332 FAILed NB_EC=287 NB_MSG=vault eject failed
26/07/2013 13:40:34 MasterCluster   Error 0 Retrieve <16> vltrun@VltSession::lock_and_operate^332: Leaving with DMN=1 SC=287

Any help would be much appriciated

Thanks

Katie

  • One of the messages I saw in the full detail.log that you sent me suggested that the issue 'could' be with the robot inventory being out-of-date.

    I say could, as with NBU a given error may have multiple causes, however it is sensible to try the 'most obvious' first.

    From waat I saw, vault did try to eject a number of tapes using a standard NBU command, so for at least some of the tapes it would appear the issue is not vault itself.

    Vault really only runs NBU commands (to do the dups, catalog backup and ejects, and so providing it gets as far as these commands, it pretty much clears vault as being the cause.

    Martin

  • I managed to eject a tape that had failed to vault manually from the library itself, so I tried a vault again (just the eject, skipped the dups and catalog) and this worked. Loading tapes then failed part way through, but looking at the robot logs which I am now collecting, it does seem to be a problem with the robot itself. Managed to load tapes for overnight duplications, came in this morning to find them all frozen...

    To start the logs Martin (mph999) advised that I set the following:

    On the robot control host.

    Create dir.

    <install path>\veritas\volmgr\debug\robots

    Create empty file

    <install path>\veritas\volmgr\ROBOT_DEBUG

    Add VERBOSE to

    <install path>\veritas\volmgr\vm.conf

    (Restart services on robot control host)

    I'm going to harass the library support a bit more as it really looks to me that the robot is not communicating correctly.

  • With a lot of help from mph999 and Marianne it has been decided that the fault or error is definitely coming from the library itself.

    This is demonstrated in the robot log:

    17:22:45.372 [2700.1124] <5> tldcd:move_medium: TLD(0) initiating MOVE_MEDIUM from addr 16 to addr 4139

    17:22:48.043 [2700.1124] <3> tldcd:log_scsi_error: TLD(0) key = 0x5, asc = 0x53, ascq = 0x81, ATTEMPTED TO EXTEND EMPTY ENTRY/EXIT PORT

    17:22:48.043 [2700.1124] <3> tldcd:move_medium: TLD(0) Move_medium error

    17:22:48.043 [2700.1124] <5> tldcd:tld_main: TLD(0) closing/unlocking robotic path

    17:22:48.043 [2700.1124] <3> tldcd:ChildExit: TLD Child process 2700 has terminated abnormally: Cannot move from media access port to slot

    17:22:48.965 [2392.7552] <6> tldcd:check_unit_attention: command_init to check for UA on robot 0

     

    17:36:19.492 [6580.3276] <5> tldcd:move_medium: TLD(0) initiating MOVE_MEDIUM from addr 19 to addr 4176

    17:36:21.805 [6580.3276] <3> tldcd:log_scsi_error: TLD(0) key = 0x5, asc = 0x53, ascq = 0x81, ATTEMPTED TO EXTEND EMPTY ENTRY/EXIT PORT

    17:36:21.805 [6580.3276] <3> tldcd:move_medium: TLD(0) Move_medium error

    17:36:21.805 [6580.3276] <5> tldcd:tld_main: TLD(0) closing/unlocking robotic path

    17:36:21.820 [6580.3276] <3> tldcd:ChildExit: TLD Child process 6580 has terminated abnormally: Cannot move from media access port to slot

     

    17:38:34.953 [7356.3332] <5> tldcd:move_medium: TLD(0) initiating MOVE_MEDIUM from addr 20 to addr 4177

    17:38:37.250 [7356.3332] <3> tldcd:log_scsi_error: TLD(0) key = 0x5, asc = 0x53, ascq = 0x81, ATTEMPTED TO EXTEND EMPTY ENTRY/EXIT PORT

    17:38:37.250 [7356.3332] <3> tldcd:move_medium: TLD(0) Move_medium error

    17:38:37.250 [7356.3332] <5> tldcd:tld_main: TLD(0) closing/unlocking robotic path

    17:38:37.250 [7356.3332] <3> tldcd:ChildExit: TLD Child process 7356 has terminated abnormally: Cannot move from media access port to slot

    With the library error message specifically being

    key = 0x5, asc = 0x53, ascq = 0x81, ATTEMPTED TO EXTEND EMPTY ENTRY/EXIT PORT

    Martin offered the following as well,

     To get the exact meaning, you will probably have to ask the library vendor, as it is their hardware that is producing the error.

    I'll have another look, but I didn't find an exact match in out TNs.

    However, I can confirm ASC/ ASCQ errors are sent from the hardware, impossible for NBU to generate this error.

    ....

    Key Code Qualifier is an error-code returned by a SCSI device.

    When a SCSI target device returns a check condition in response to a command, the initiator usually then issues a SCSI Request Sense command. This process is part of a SCSI protocol called Contingent Allegiance Condition. The target will respond to the Request Sense command with a set of SCSI sense data which includes three fields giving increasing levels of detail about the error

    The vendor Quantum has not really offered an explanation yet but is sending me a bucket load of parts and an engineer to investigate further.

    I'm marking this as solved as we have concluded that the fault is not with Netbackup

    Thank you everyone for you help

    Katie