09-28-2020 09:55 AM
Our backups started failing a few days ago and I've found this in the logs.
Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) key = 0x4, asc = 0x40, ascq = 0x80, DIAGNOSTIC FAILURE ON COMPONENT ASCQ (80H-FFH)
Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) Move_medium error
Sep 22 12:13:35 s_local@db3 ltid[2497]: Operator/EMM server has DOWN'ed drive HP.Ultrium5-SCSI.000 (device 2)
I've found a few articles about this error referring to the tape library and NetBackup being out of sync, but it looks fine from what I can tell.
Could someone direct me on how to troubleshoot this further?
Thanks
09-28-2020 10:19 AM
09-29-2020 12:30 AM - edited 09-29-2020 01:48 AM
key = 0x4, asc = 0x40, ascq = 0x80
The sense key, additional sense code (ASC) and additional sense code qualifier (ASCQ) are an error reported by the library: it couldn't carry out what it was asked to do, either because of a physical restriction (e.g. emptying a drive that is already empty, or moving a tape to a location that is already full) or because of a hardware failure.
I recommend obtaining the vendor documentation to look up key = 0x4, asc = 0x40, ascq = 0x80, as each vendor has unique key code qualifiers. In other words, you cannot be sure the SCSI manual from vendor A will agree with the one from vendor B with regard to the key code qualifiers.
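To look the values up, it helps to pull them out of the log as numbers first. Here is a minimal Python sketch that does that for the tldcd line quoted in this thread. The function name `parse_sense` and the regular expression are my own, and the pattern is based only on this one log line, so other NetBackup versions may format it differently.

```python
import re

# Matches the sense triple as it appears in the tldcd line quoted in this thread.
# Assumption: other log lines use the same "key = 0x.., asc = 0x.., ascq = 0x.." layout.
SENSE_RE = re.compile(
    r"key = 0x([0-9a-fA-F]+), asc = 0x([0-9a-fA-F]+), ascq = 0x([0-9a-fA-F]+)"
)

def parse_sense(line):
    """Return (key, asc, ascq) as integers, or None if the line has no sense triple."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())

line = ("Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) key = 0x4, "
        "asc = 0x40, ascq = 0x80, DIAGNOSTIC FAILURE ON COMPONENT ASCQ (80H-FFH)")
print(parse_sense(line))
```

Lines without a sense triple, such as the `Move_medium error` line, simply return `None`.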
10-01-2020 07:05 AM
Along with the two previous recommendations, you can determine what the OS sees and compare it to what NetBackup sees for the library. From the connected media server, run:
/usr/openv/volmgr/bin/robtest
Select your problematic library
Run: s d
This will Show you the tapes in the Drives (s and d) - make note of the home slot for the media
From the NBU Admin Console, go to Device Monitor and look at the drives of the problematic robot. If the tapes in the drives match what was shown by robtest, you can check to make sure that the slots they're supposed to go back to aren't occupied.
From your media server running robtest, use: s s (this will Show you the tapes in the Slots)
Make sure that there aren't any tapes in the home slots for the tapes in the drives. If there are tapes in the slots, the robot can't return the tapes to their home slot, and the tape will be 'stuck' and the drive will be marked down. You can confirm this by looking at the NBU Admin Console > Media > sort by "Slot" > see if NetBackup matches what robtest showed.
If you see that there are tapes in the home slots for the tapes in the drive(s), you can manually move the tapes around using robtest to free up the slots so the robot can unload the tapes.
If you need to use robtest to move tapes, use: m s[slot number] s[empty slot] (this Moves the tape in the first slot to the second, empty slot).
Do this for tapes in the home slot of the stuck tape, then you can move the tape from the drive to its home slot. To do this, use: m d[affected drive] s[home slot]
This will return the tape to its appropriate home slot. Confirm that the tape has moved out of the drive (s d from robtest) and into its home slot (s s).
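If there are several stuck tapes, it can help to plan the moves before typing them into robtest. This is a rough Python sketch of that planning step only. It doesn't talk to robtest or NetBackup; you fill in the two dictionaries by hand from the s d and s s output. The function name `plan_moves` and the drive and slot numbers in the example are made up for illustration.

```python
def plan_moves(drives, slots):
    """Build robtest move commands that clear blocked home slots, then unload drives.

    drives: {drive_number: home_slot} for each drive holding a stuck tape
    slots:  {slot_number: True if occupied, False if empty}
    """
    slots = dict(slots)  # work on a copy so the caller's data is untouched
    homes = set(drives.values())
    # Empty slots we may park tapes in; never use another stuck tape's home slot.
    free = sorted(s for s, occupied in slots.items() if not occupied and s not in homes)
    commands = []
    for drive, home in sorted(drives.items()):
        if slots.get(home):
            if not free:
                raise RuntimeError(f"no empty slot available to clear slot {home}")
            spare = free.pop(0)
            commands.append(f"m s{home} s{spare}")  # move the blocking tape away
            slots[spare] = True
            slots[home] = False
        commands.append(f"m d{drive} s{home}")  # return the stuck tape home
        slots[home] = True
    return commands

# Hypothetical example: drive 2 holds a tape whose home slot 7 is occupied.
for command in plan_moves({2: 7}, {7: True, 8: True, 9: False}):
    print(command)
```

Check each printed command against what robtest actually shows before running it, since the sketch only knows what you typed into the dictionaries.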
Once that's done, exit robtest (be sure to fully exit, as the library will be unusable if robtest is still running).
Go to your Admin Console > Devices > Robots > Inventory Robot... > Preview Device Configuration Changes and see if NetBackup registers the moved tapes.
If the inventory moves the tapes, your library and NBU should be synced back up and the drive should return to functionality, assuming there aren't other underlying hardware issues.
10-01-2020 07:07 AM
Forgot to add that if the preview shows the tapes being moved, make sure you do an actual inventory afterwards.
10-03-2020 12:12 AM - edited 10-03-2020 12:18 AM
I agree with Nicolai, but would suggest that not all ASC/ASCQ errors are vendor specific, only some of them.
The ones listed here should be the same for everyone ...
https://www.t10.org/lists/asc-num.htm
(t10.org is the home of the Technical Committee that defines the SCSI standards.) It's the go-to place to look up ASC/ASCQ errors. If a code isn't listed there, then yes, it's vendor specific.
40h/NNh DZTPROMAEBKVF DIAGNOSTIC FAILURE ON COMPONENT NN (80h-FFh)
It seems in this case that ASC 40h with any ASCQ value in the range 80h-FFh means 'diagnostic failure on component NN', where NN is the ASCQ value.
The fact that it can be identified in the log also hints that this one is not vendor specific.
Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) key = 0x4, asc = 0x40, ascq = 0x80, DIAGNOSTIC FAILURE ON COMPONENT ASCQ (80H-FFH)
Hardware vendor time I think - NetBackup isn't causing this, and there is nothing that can be done from the NetBackup side to fix it.
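For anyone who wants to script the lookup described above, here is a small Python sketch. It only knows the sense key and the one T10 entry quoted in this thread (plus the T10 rule that ASC values of 80h and above are vendor specific), so it is not a replacement for the full list. The function name `describe` is mine.

```python
# Only the sense key seen in this thread; the full set is in the SCSI standards.
SENSE_KEYS = {0x4: "HARDWARE ERROR"}

def describe(key, asc, ascq):
    """Give a short description of a sense triple, covering only the cases in this thread."""
    key_name = SENSE_KEYS.get(key, f"sense key 0x{key:X} (not in this sketch)")
    if asc >= 0x80:
        detail = "vendor specific ASC, see the vendor's manual"
    elif asc == 0x40 and 0x80 <= ascq <= 0xFF:
        # T10 entry: 40h/NNh DIAGNOSTIC FAILURE ON COMPONENT NN (80h-FFh)
        detail = f"DIAGNOSTIC FAILURE ON COMPONENT {ascq:02X}h"
    else:
        detail = "not covered by this sketch, look it up in the T10 list"
    return f"{key_name}: {detail}"

print(describe(0x4, 0x40, 0x80))
```

For the values in the original post this prints a hardware error with a diagnostic failure on component 80h. What component 80h actually is on this library is something only the hardware vendor's documentation can tell you.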
10-05-2020 01:53 AM
Nice post Martin :)