02-02-2012 08:01 AM
So two nights ago, media ID 0340L3 downed one of our drives. I wasn't positive this was the cause, so I upped the drive and let it go.
Last night, the same media ID was loaded into the same drive and downed it again. I get these errors right before it downs the drive:
02/01/2012 19:00:01 - granted resource 0340L3
02/01/2012 19:00:01 - granted resource HP.ULTRIUM3-SCSI.001
02/01/2012 19:00:01 - granted resource t2k-hcart3-robot-tld-0
02/01/2012 19:00:01 - estimated 85999065 kbytes needed
02/01/2012 19:00:02 - started process bpbrm (pid=7065)
02/01/2012 19:00:02 - connecting
02/01/2012 19:00:02 - connected; connect time: 0:00:00
02/01/2012 19:00:09 - Error bptm (pid=7084) error requesting media, TpErrno = Robot operation failed
02/01/2012 19:00:09 - Warning bptm (pid=7084) media id 0340L3 load operation reported an error
So I was going to pull the media out of the library... only to find that the second drive in the same robot used the media just fine. The other confusing thing is that the SL48 manager says both drives are OK (see attached image).
But NetBackup says the drive is down:
t2k# /usr/openv/volmgr/bin/tpconfig -d
Id  DriveName              Type    Residence
      Drive Path                                       Status
****************************************************************************
0   HP.ULTRIUM3-SCSI.000   hcart3  TLD(0)  DRIVE=1
      /dev/rmt/0cbn                                    UP
1   HP.ULTRIUM3-SCSI.001   hcart3  TLD(0)  DRIVE=2
      /dev/rmt/1cbn                                    DOWN
Currently defined robotics are:
  TLD(0)     robotic path = /dev/sg/c0t4l1
EMM Server = t2k
t2k# bpstulist -label t2k-hcart3-robot-tld-0 -L
Label: t2k-hcart3-robot-tld-0
Storage Unit Type: Media Manager
Host Connection: t2k
Number of Drives: 2
On Demand Only: no
Density: hcart3 (20)
Robot Type/Number: TLD (8) / 0
Max Fragment Size: 1048576
Max MPX/drive: 1
Also, when I select Drives under the Devices menu in NBU, I get a message saying "cannot connect on socket (25)", but it still lists the drives.
What would cause a tape to down the same drive twice but work just fine in the other?
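For anyone who wants to spot a downed drive without reading the listing by eye, here is a minimal sketch that parses `tpconfig -d` style output like the listing above. It assumes the two-line-per-drive layout shown in this post; the function names `parse_drives` and `down_drives` are made up for illustration and are not part of NetBackup.

```python
import re

# Sample text copied from the tpconfig -d listing in this post.
SAMPLE = """\
Id DriveName Type Residence
Drive Path Status
****************************************************************************
0 HP.ULTRIUM3-SCSI.000 hcart3 TLD(0) DRIVE=1
/dev/rmt/0cbn UP
1 HP.ULTRIUM3-SCSI.001 hcart3 TLD(0) DRIVE=2
/dev/rmt/1cbn DOWN
Currently defined robotics are:
TLD(0) robotic path = /dev/sg/c0t4l1
EMM Server = t2k
"""

# First line of a drive entry: index, name, type, residence, robot drive number.
DRIVE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+DRIVE=(\d+)\s*$")
# Second line of a drive entry: device path followed by the status word.
PATH_RE = re.compile(r"^\s*(/\S+)\s+(\S+)\s*$")


def parse_drives(text):
    """Return one dict per drive found in tpconfig -d style text."""
    drives = []
    current = None
    for line in text.splitlines():
        m = DRIVE_RE.match(line)
        if m:
            current = {
                "id": int(m.group(1)),
                "name": m.group(2),
                "type": m.group(3),
                "residence": m.group(4),
                "robot_drive": int(m.group(5)),
                "path": None,
                "status": None,
            }
            drives.append(current)
            continue
        m = PATH_RE.match(line)
        # Attach the path/status line to the most recent drive entry only.
        if m and current is not None and current["path"] is None:
            current["path"] = m.group(1)
            current["status"] = m.group(2)
    return drives


def down_drives(text):
    """Names of drives whose status is exactly DOWN."""
    return [d["name"] for d in parse_drives(text) if d["status"] == "DOWN"]


if __name__ == "__main__":
    print(down_drives(SAMPLE))
```

In practice the text could come from running `/usr/openv/volmgr/bin/tpconfig -d` and capturing its output; the sample string is used here so the sketch runs anywhere.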
02-02-2012 08:11 AM
Most likely it is an issue with that one drive, but are you sure it did not have any other errors that day?
A drive gets three errors before it goes down, so it could have had 2 errors from other tapes before this happened.
It is worth checking for any alerts on the library itself first, to see if that drive is consistently having errors.
This can be a bit annoying at times. Say, for example, a tape has write protection on - NetBackup will try to load the tape 3 times before giving up and freezing the tape - but it will also down the drive, as that drive then has 3 errors reported against it.
Are the drives the same model, and do they have the same firmware? The SCSI inquiry string or the library interface should tell you. It may be that one is more up to date and the other one needs upgrading - or it may just need cleaning.
02-02-2012 08:40 AM
I'll try the cleaning. I just realized that another policy just put the drive down with different media. But I don't think the three-strike rule applies here. It seems to put the drive down on the first error:
02/02/2012 11:07:06 - granted resource t2k.NBU_CLIENT.MAXJOBS.m4c
02/02/2012 11:07:06 - granted resource t2k.NBU_POLICY.MAXJOBS.m4c-popscan
02/02/2012 11:07:06 - granted resource 0333L3
02/02/2012 11:07:06 - granted resource HP.ULTRIUM3-SCSI.001
02/02/2012 11:07:06 - granted resource t2k-hcart3-robot-tld-0
02/02/2012 11:07:07 - estimated 38 kbytes needed
02/02/2012 11:07:07 - started process bpbrm (pid=7818)
02/02/2012 11:07:08 - connecting
02/02/2012 11:07:08 - connected; connect time: 0:00:00
02/02/2012 11:07:12 - mounting 0333L3
02/02/2012 11:07:14 - Error bptm (pid=7819) error requesting media, TpErrno = Robot operation failed
02/02/2012 11:07:14 - Warning bptm (pid=7819) media id 0333L3 load operation reported an error
That's the first error since I upped the drive this morning, and it is down again.
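To compare failed jobs side by side, the job detail lines can be reduced to "which resources were granted" and "what bptm complained about". The following is a rough sketch that assumes the `MM/DD/YYYY HH:MM:SS - message` layout seen in the logs in this thread; `summarize_job` is a hypothetical helper name, not a NetBackup tool.

```python
import re

# Lines copied from the job details posted above.
SAMPLE = [
    "02/02/2012 11:07:06 - granted resource 0333L3",
    "02/02/2012 11:07:06 - granted resource HP.ULTRIUM3-SCSI.001",
    "02/02/2012 11:07:06 - granted resource t2k-hcart3-robot-tld-0",
    "02/02/2012 11:07:12 - mounting 0333L3",
    "02/02/2012 11:07:14 - Error bptm (pid=7819) error requesting media, "
    "TpErrno = Robot operation failed",
    "02/02/2012 11:07:14 - Warning bptm (pid=7819) media id 0333L3 load "
    "operation reported an error",
]

LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) - (.*)$")
RESOURCE_RE = re.compile(r"^granted resource (\S+)$")
BPTM_RE = re.compile(r"^(Error|Warning) bptm \(pid=(\d+)\) (.*)$")


def summarize_job(lines):
    """Collect granted resources and bptm errors/warnings from job detail lines."""
    resources = []
    problems = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip anything that is not a timestamped detail line
        stamp, message = m.groups()
        granted = RESOURCE_RE.match(message)
        if granted:
            resources.append(granted.group(1))
            continue
        bptm = BPTM_RE.match(message)
        if bptm:
            # Keep timestamp, severity, and the message text after the pid.
            problems.append((stamp, bptm.group(1), bptm.group(3)))
    return {"resources": resources, "problems": problems}


if __name__ == "__main__":
    summary = summarize_job(SAMPLE)
    print("resources:", summary["resources"])
    for stamp, severity, text in summary["problems"]:
        print(stamp, severity, text)
```

Running this over both failed jobs in the thread makes the common factor easy to see: the media ID differs (0340L3 vs. 0333L3), while the drive HP.ULTRIUM3-SCSI.001 is the same.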
02-02-2012 08:48 AM
It may still be within its time limit of errors - I think it resets after 8 hours of no errors (dragging at the grey matter now!!)
It should raise an alert on the library to give you a clue and the bptm log will have the proper error in plain text or in hex (which I can translate for you if you post the hex error)
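To illustrate why a drive might go down again on the "first" error after being upped, here is a toy model of the behaviour as described in the replies above: three errors down the drive, and the count resets after 8 hours without errors. Both numbers are the posters' recollection, and the assumption that upping the drive does not clear the error history is mine; this is not NetBackup's actual implementation.

```python
from datetime import datetime, timedelta


class DriveErrorTracker:
    """Toy model: a drive is downed once it collects `threshold` errors,
    and the error history is dropped after `reset_after` with no errors.
    Threshold and window are assumptions taken from the discussion."""

    def __init__(self, threshold=3, reset_after=timedelta(hours=8)):
        self.threshold = threshold
        self.reset_after = reset_after
        self.errors = []   # timestamps of errors still being counted
        self.down = False

    def record_error(self, when):
        """Register an error at time `when`; return True if the drive is down."""
        # A long enough quiet period clears the earlier errors.
        if self.errors and when - self.errors[-1] >= self.reset_after:
            self.errors = []
        self.errors.append(when)
        if len(self.errors) >= self.threshold:
            self.down = True
        return self.down

    def bring_up(self):
        """Operator ups the drive; in this model the history is NOT cleared."""
        self.down = False


if __name__ == "__main__":
    start = datetime(2012, 2, 1, 19, 0)
    drive = DriveErrorTracker()
    for hours in (0, 1, 2):
        drive.record_error(start + timedelta(hours=hours))
    print("down after three errors:", drive.down)
    drive.bring_up()
    # One more error inside the window downs the drive immediately.
    print("down again:", drive.record_error(start + timedelta(hours=3)))
```

Under these assumptions, a single failure shortly after upping the drive is enough to down it again, which matches what was observed here.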
02-02-2012 09:36 AM
I feel a little dumb now after further digging. It turns out this drive had been down for a really long time before I noticed it yesterday.
Using the SL48 web management page, I noticed that drive 0 was empty and drive 1 was (n.a). Not sure what n.a meant, I dug a little further, only to find out that tape 0343 was loaded into drive 1 on 12/17/11... and never removed. So when NBU last inventoried the robot, it somehow did not account for this tape. I manually moved the tape back into the library using the SL48 web management console and re-inventoried in NBU.
Hopefully this will help.
02-02-2012 10:58 AM
You just need to run a full inventory every time you swap tapes.
Use robtest whenever you face an issue like this, to make sure there are no tapes stuck inside the drives.
Use the cleaning tapes frequently.
Run the diagnostic tools from your library to check the hardware status.
Check /var/messages or the Event Viewer and search for hardware-related events such as 7, 9, 11, 12 or 15.
Check the bptm log with the verbose level at its highest value (5) to find the main problem or root cause.
02-02-2012 11:08 AM
I inventory the robot using NBU and I do this anytime I remove, add, or move media. For some reason, it did not pick up on the tape in the drive.
I've been advised (on this forum) that cleaning frequently is not always good as it ruins the heads on the drive over time?
02-02-2012 11:41 AM
I agree with the robtest advice given by kernel_panic.
When I saw your initial post with 'Robot operation failed' and 'load operation reported an error' my first thought was - try robtest.
's d' would have shown the tape in the drive.
I do not agree with regular drive cleaning.
Firmware on drives these days will generate a TapeAlert when it needs cleaning.
02-02-2012 12:44 PM
I think your reply above is probably the one that should be marked as the solution...assuming things are working OK now?
(In case you were curious, we're all so hyped up on marking solutions now because we don't want to see them again in 2 months in the almost totally useless "Can you solve these?" box :) )
02-02-2012 12:44 PM
If you have a tape in a drive and did not know it, NetBackup will freeze your scratch tapes when it tries to load them and fails. You can go through all your scratch tapes this way and have backups fail with "no media available" errors.
Since you figured out your drive issue - you might want to check your tapes as well!