Drives keep going DOWN

seasysadmin
Level 1

Our backups started failing a few days ago and I've found this in the logs. 

Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) key = 0x4, asc = 0x40, ascq = 0x80, DIAGNOSTIC FAILURE ON COMPONENT ASCQ (80H-FFH)
Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) Move_medium error
Sep 22 12:13:35 s_local@db3 ltid[2497]: Operator/EMM server has DOWN'ed drive HP.Ultrium5-SCSI.000 (device 2)

I've found a few articles about this error that refer to the tape library and NetBackup being out of sync, but everything looks fine from what I can tell.

Could someone direct me on how to troubleshoot this further? 

Thanks

6 REPLIES

jnardello
Moderator
VIP Certified
Have you tried moving a tape into the drive manually using robtest? Does robtest show a tape in the drive already? It could be something as simple as NBU having the wrong robotic definition for the drive, but it could also be a calibration issue with the library, a tape stuck in the drive, a library loaded with so many tapes that the drive cannot unload the existing tape to make room for your new one (no slots available), or something else hardware-related. The diagnostic error message makes me lean towards a hardware problem until proven otherwise, but robtest is usually a good place to start your troubleshooting.

Nicolai
Moderator
Partner VIP

 key = 0x4, asc = 0x40, ascq = 0x80

The sense key, additional sense code (ASC) and additional sense code qualifier (ASCQ) are an error report from the library saying that it couldn't carry out what it was asked to do, either because of a physical restriction (e.g. emptying a drive that is already empty, or moving a tape to a location that is already full) or because of a hardware failure.

I recommend obtaining the vendor documentation to look up key = 0x4, asc = 0x40, ascq = 0x80, as each vendor has unique key code qualifiers. In other words, you cannot be sure the SCSI manual from vendor A will match the one from vendor B with regard to the key code qualifiers.
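
To make the lookup less error-prone, here is a minimal Python sketch that pulls the three values out of a tldcd log line and names the sense key. The helper `parse_sense` and the `SENSE_KEYS` table are illustrative names, not part of NetBackup. The table holds a subset of the standard SCSI sense key names, and the regular expression assumes the log format shown in the original post.

```python
import re

# Subset of the standard SCSI sense key names; anything else prints as UNKNOWN.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Assumes the "key = 0x.., asc = 0x.., ascq = 0x.." layout seen in the log above.
SENSE_RE = re.compile(
    r"key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)


def parse_sense(line):
    """Return (key, asc, ascq) as ints, or None if the line has no sense data."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    return tuple(int(value, 16) for value in match.groups())


line = ("Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) key = 0x4, "
        "asc = 0x40, ascq = 0x80, DIAGNOSTIC FAILURE ON COMPONENT ASCQ (80H-FFH)")
key, asc, ascq = parse_sense(line)
print(f"{SENSE_KEYS.get(key, 'UNKNOWN')}: asc={asc:02X}h ascq={ascq:02X}h")
```

For the line from the original post this reports a HARDWARE ERROR sense key with ASC 40h and ASCQ 80h, which are the values to take to the library vendor's documentation.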

EthanH
Level 4

Along with the two previous recommendations, you can determine what the OS sees and compare it to what NetBackup sees for the library. From the connected media server, run:

/usr/openv/volmgr/bin/robtest

Select your problematic library

Run: s d

This will Show you the tapes in the Drives (s and d) - make note of the home slot for the media

From the NBU Admin Console, go to Device Monitor and look at the drives of the problematic robot. If the tapes in the robot match what was shown by robtest, you can check to make sure that the slots they're supposed to go back to aren't occupied.

From your media server running robtest, use: s s (this will Show you the tapes in the Slots)

Make sure that there aren't any tapes in the home slots for the tapes in the drives. If there are tapes in the slots, the robot can't return the tapes to their home slot, and the tape will be 'stuck' and the drive will be marked down. You can confirm this by looking at the NBU Admin Console > Media > sort by "Slot" > see if NetBackup matches what robtest showed.

If you see that there are tapes in the home slots for the tapes in the drive(s), you can manually move the tapes around using robtest to free up the slots so the robot can unload the tapes.

If you need to use robtest to move tapes, use: m s[slot number] s[empty slot] (this Moves the tape from the first Slot to the second Slot).

Do this for any tape sitting in the home slot of a stuck tape, then you can move the stuck tape from the drive to its home slot. To do this, use: m d[affected drive] s[home slot]

This will return the tape to its appropriate home slot. Confirm that the tape has moved out of the drive (s d from robtest) and into its home slot (s s).
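
If several drives are affected, it can help to work out the moves on paper before typing them into robtest. The Python sketch below does that bookkeeping. It does not talk to the library or parse robtest output: you enter the drive and slot information by hand from `s d` and `s s`, and `plan_moves` (a hypothetical helper, not a NetBackup tool) prints the commands in the `m s[slot] s[slot]` and `m d[drive] s[slot]` form described above. Numbering slots from 1 is an assumption.

```python
def plan_moves(drives, occupied_slots, total_slots):
    """Build robtest move commands that unload each drive to its home slot.

    drives: {drive number: home slot of the tape loaded in that drive}
    occupied_slots: slot numbers that currently hold a tape
    total_slots: number of slots in the library (assumed numbered from 1)
    """
    occupied = set(occupied_slots)
    reserved = set(drives.values())  # home slots must stay free for their own tape
    commands = []
    for drive, home in sorted(drives.items()):
        if home in occupied:
            # Pick an empty slot that is not another stuck tape's home slot.
            spare = next(
                (slot for slot in range(1, total_slots + 1)
                 if slot not in occupied and slot not in reserved),
                None,
            )
            if spare is None:
                raise RuntimeError(f"no empty slot to clear home slot {home}")
            commands.append(f"m s{home} s{spare}")
            occupied.discard(home)
            occupied.add(spare)
        commands.append(f"m d{drive} s{home}")
        occupied.add(home)
    return commands


# Hypothetical library: drive 2 holds a tape whose home slot 5 is occupied.
for command in plan_moves({2: 5}, occupied_slots={1, 2, 5}, total_slots=8):
    print(command)
```

With the made-up values above, the plan is to move the blocking tape from slot 5 to slot 3 and then unload drive 2 into slot 5. Treat the output as a checklist to review, and confirm each move in robtest with `s d` and `s s` before going on to the next one.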

Once that's done, exit robtest (be sure to fully exit, as the library will be unusable if robtest is still running).

Go to your Admin Console > Devices > Robots > Inventory Robot... > Preview Device Configuration Changes and see if NetBackup registers the moved tapes.

If the inventory moves the tapes, your library and NBU should be synced back up and the drive should return to functionality, assuming there aren't other underlying hardware issues.

Forgot to add that if the preview shows the tapes being moved, make sure you do an actual inventory afterwards.

mph999
Level 6
Employee Accredited

I agree with Nicolai, but would suggest that not all ASC/ASCQ errors are vendor-specific, only some of them.

The ones listed here should be the same for everyone:

https://www.t10.org/lists/asc-num.htm

(t10.org is the home of the Technical Committee that defines the SCSI standards.) It's the go-to place to look up ASC/ASCQ errors. If a code is not listed there, then yes, it's vendor-specific.

40h/NNh DZTPROMAEBKVF DIAGNOSTIC FAILURE ON COMPONENT NN (80h-FFh)

It seems in this case that ASC 40h with any ASCQ value from 80h to FFh means 'diagnostic failure on component', with the ASCQ giving the component number.

The fact that NetBackup can identify it in the log also hints that this specific one is not vendor-specific.
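
As a rough illustration of the distinction, here is a small Python sketch that sorts an ASC/ASCQ pair into standard or vendor-specific territory. The range rules are my reading of the notes at the end of the T10 list (ASC 80h-FFh vendor-specific, ASCQ 80h-FFh a vendor-specific qualifier of a standard ASC, with 40h/NNh as a listed exception), so treat them as an assumption and check the list itself. `classify_asc_ascq` is a made-up name.

```python
def classify_asc_ascq(asc, ascq):
    """Say whether an ASC/ASCQ pair falls in a standard or vendor-specific range."""
    if not (0 <= asc <= 0xFF and 0 <= ascq <= 0xFF):
        raise ValueError("ASC and ASCQ are single bytes")
    if asc >= 0x80:
        return "vendor-specific ASC"
    if asc == 0x40 and ascq >= 0x80:
        # Listed entry: 40h/NNh DIAGNOSTIC FAILURE ON COMPONENT NN (80h-FFh)
        return f"standard: diagnostic failure on component {ascq:02X}h"
    if ascq >= 0x80:
        return "vendor-specific qualifier of a standard ASC"
    return "standard range: look it up in the T10 list"


print(classify_asc_ascq(0x40, 0x80))
```

For the values in this thread the sketch reports a standard diagnostic failure on component 80h. It cannot say which physical part of the library component 80h is, which is where the hardware vendor comes in.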

Sep 22 12:13:31 s_local@db3 tldcd[3666]: TLD(1) key = 0x4, asc = 0x40, ascq = 0x80, DIAGNOSTIC FAILURE ON COMPONENT ASCQ (80H-FFH)

Hardware vendor time I think - NetBackup isn't causing this, and there is nothing that can be done from the NetBackup side to fix it.

 

Nicolai
Moderator
Partner VIP

Nice post Martin :)