Solved: You just need to run a full

felonious_caper · ‎02-02-2012

So, two nights ago, media ID 0340L3 downed one of our drives. I wasn't sure whether the media was the cause, so I upped the drive and let it go.

Last night, the same media ID was loaded into the same drive and downed it again. I get these errors right before the drive goes down:

02/01/2012 19:00:01 - granted resource 0340L3

02/01/2012 19:00:01 - granted resource HP.ULTRIUM3-SCSI.001

02/01/2012 19:00:01 - granted resource t2k-hcart3-robot-tld-0

02/01/2012 19:00:01 - estimated 85999065 kbytes needed

02/01/2012 19:00:02 - started process bpbrm (pid=7065)

02/01/2012 19:00:02 - connecting

02/01/2012 19:00:02 - connected; connect time: 0:00:00

02/01/2012 19:00:09 - Error bptm (pid=7084) error requesting media, TpErrno = Robot operation failed

02/01/2012 19:00:09 - Warning bptm (pid=7084) media id 0340L3 load operation reported an error

So I was going to pull the media out of the library, only to find that the second drive in the same robot used the media just fine. The other confusing thing is that the SL48 manager says both drives are OK (see attached image).

But NetBackup says it is down:

t2k# /usr/openv/volmgr/bin/tpconfig -d
Id DriveName           Type   Residence
      Drive Path                                                       Status
****************************************************************************
0   HP.ULTRIUM3-SCSI.000 hcart3 TLD(0) DRIVE=1
      /dev/rmt/0cbn                                                    UP
1   HP.ULTRIUM3-SCSI.001 hcart3 TLD(0) DRIVE=2
      /dev/rmt/1cbn                                                    DOWN

Currently defined robotics are:
TLD(0)     robotic path = /dev/sg/c0t4l1

EMM Server = t2k
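
For anyone who wants to script this check rather than eyeball it, here is a minimal Python sketch that pulls the DOWN drives out of `tpconfig -d` text. It assumes the two-line-per-drive layout shown in this post (a drive line followed by an indented path line); other NetBackup versions or drive types may print a different layout, and the function names are made up for illustration.

```python
import re

# Sample copied from the `tpconfig -d` output in this post.
SAMPLE = """\
Id DriveName           Type   Residence
      Drive Path                                                       Status
****************************************************************************
0   HP.ULTRIUM3-SCSI.000 hcart3 TLD(0) DRIVE=1
      /dev/rmt/0cbn                                                    UP
1   HP.ULTRIUM3-SCSI.001 hcart3 TLD(0) DRIVE=2
      /dev/rmt/1cbn                                                    DOWN
"""

# Drive line: index, name, type, residence, robot drive number.
DRIVE_RE = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+DRIVE=(\d+)\s*$")
# Path line: indented device path followed by a status word.
PATH_RE = re.compile(r"^\s+(/\S+)\s+(\S+)\s*$")


def parse_tpconfig(text):
    """Return one dict per drive, each with a list of (path, status) tuples."""
    drives = []
    current = None
    for line in text.splitlines():
        m = DRIVE_RE.match(line)
        if m:
            current = {
                "id": int(m.group(1)),
                "name": m.group(2),
                "type": m.group(3),
                "residence": m.group(4),
                "robot_drive": int(m.group(5)),
                "paths": [],
            }
            drives.append(current)
            continue
        m = PATH_RE.match(line)
        if m and current is not None:
            current["paths"].append((m.group(1), m.group(2)))
    return drives


def down_drives(drives):
    """Names of drives that have at least one path marked DOWN."""
    return [d["name"] for d in drives
            if any(status == "DOWN" for _, status in d["paths"])]


print(down_drives(parse_tpconfig(SAMPLE)))  # ['HP.ULTRIUM3-SCSI.001']
```

In practice the text would come from running the command and capturing its output; the sketch only covers the parsing step.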

t2k# bpstulist -label t2k-hcart3-robot-tld-0 -L

Label:                t2k-hcart3-robot-tld-0
Storage Unit Type:    Media Manager
Host Connection:      t2k
Number of Drives:     2
On Demand Only:       no
Density:              hcart3 (20)
Robot Type/Number:    TLD (8) / 0
Max Fragment Size:    1048576
Max MPX/drive:        1

Also, when I select Drives under the Devices menu in NBU, I get a message saying "cannot connect on socket(25)", but it still lists the drives.

What would cause a tape to down the same drive twice but work just fine in the other?

Mark_Solutions · ‎02-02-2012

Most likely it is an issue with that one drive, but are you sure it did not have any other errors that day?

A drive gets three errors before it goes down, so it could have had two errors from other tapes before this happened.

It is worth checking for any alerts on the library itself first to see if that drive is consistently having errors.

This can be a bit annoying at times. Say, for example, a tape has write protection on: it will try to load the tape three times before giving up and freezing the tape, but it will also down the drive, as that drive has had three errors reported against it.

Are the drives the same model, and do they have the same firmware? The SCSI inquiry string or the library interface should tell you. It may be that one is more up to date and the other needs upgrading, or it may just need cleaning.

felonious_caper · ‎02-02-2012

I'll try the cleaning. I just realized that another policy just put the drive down with different media. But I don't think the three-strike rule applies here; it seems to put the drive down on the first error.

02/02/2012 11:07:06 - granted resource t2k.NBU_CLIENT.MAXJOBS.m4c
02/02/2012 11:07:06 - granted resource t2k.NBU_POLICY.MAXJOBS.m4c-popscan
02/02/2012 11:07:06 - granted resource 0333L3
02/02/2012 11:07:06 - granted resource HP.ULTRIUM3-SCSI.001
02/02/2012 11:07:06 - granted resource t2k-hcart3-robot-tld-0
02/02/2012 11:07:07 - estimated 38 kbytes needed
02/02/2012 11:07:07 - started process bpbrm (pid=7818)
02/02/2012 11:07:08 - connecting
02/02/2012 11:07:08 - connected; connect time: 0:00:00
02/02/2012 11:07:12 - mounting 0333L3
02/02/2012 11:07:14 - Error bptm (pid=7819) error requesting media, TpErrno = Robot operation failed
02/02/2012 11:07:14 - Warning bptm (pid=7819) media id 0333L3 load operation reported an error

That's the first error since I upped the drive this morning, and it is down again.
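
One way to separate "bad tape" from "bad drive" is to group the load errors by drive and see how many different media failed in each. Below is a minimal Python sketch, assuming job detail text in the format pasted in this thread. The sample logs are trimmed copies of the two jobs above, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

GRANTED_RE = re.compile(r"- granted resource (\S+)")
LOAD_ERR_RE = re.compile(r"media id (\S+) load operation reported an error")

# Trimmed copies of the two job logs posted in this thread.
JOB_0201 = """\
02/01/2012 19:00:01 - granted resource 0340L3
02/01/2012 19:00:01 - granted resource HP.ULTRIUM3-SCSI.001
02/01/2012 19:00:09 - Error bptm (pid=7084) error requesting media, TpErrno = Robot operation failed
02/01/2012 19:00:09 - Warning bptm (pid=7084) media id 0340L3 load operation reported an error
"""
JOB_0202 = """\
02/02/2012 11:07:06 - granted resource 0333L3
02/02/2012 11:07:06 - granted resource HP.ULTRIUM3-SCSI.001
02/02/2012 11:07:14 - Error bptm (pid=7819) error requesting media, TpErrno = Robot operation failed
02/02/2012 11:07:14 - Warning bptm (pid=7819) media id 0333L3 load operation reported an error
"""
# Drive names as listed by tpconfig -d earlier in the thread.
DRIVES = {"HP.ULTRIUM3-SCSI.000", "HP.ULTRIUM3-SCSI.001"}


def load_failures(job_log, drive_names):
    """Return (drive, media) for every load error in one job's log."""
    drive = None
    failures = []
    for line in job_log.splitlines():
        m = GRANTED_RE.search(line)
        if m and m.group(1) in drive_names:
            drive = m.group(1)  # remember which drive this job was granted
            continue
        m = LOAD_ERR_RE.search(line)
        if m:
            failures.append((drive, m.group(1)))
    return failures


def media_by_drive(job_logs, drive_names):
    """Map each drive to the sorted list of media that failed to load in it."""
    by_drive = defaultdict(set)
    for log in job_logs:
        for drive, media in load_failures(log, drive_names):
            by_drive[drive].add(media)
    return {drive: sorted(media) for drive, media in by_drive.items()}


print(media_by_drive([JOB_0201, JOB_0202], DRIVES))
# {'HP.ULTRIUM3-SCSI.001': ['0333L3', '0340L3']}
```

Two different tapes failing in the same drive points at the drive (or something about its state) rather than at either tape, which matches the earlier observation that 0340L3 worked fine in the other drive.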

Mark_Solutions · ‎02-02-2012

It may still be within its time limit for errors. I think the count resets after 8 hours of no errors (dragging at the grey matter now!!)

It should raise an alert on the library to give you a clue, and the bptm log will have the proper error in plain text or in hex (which I can translate for you if you post the hex error).
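
To make the rule as described in this thread concrete, here is a small Python sketch of an error counter that resets after a quiet period. The numbers (three errors, 8 hours) are only the figures recalled above, not verified NetBackup defaults, and the real thresholds may differ by version or configuration. The function names are made up for illustration.

```python
from datetime import datetime, timedelta


def errors_since_reset(error_times, quiet_period):
    """Count errors, restarting the count after any error-free gap
    longer than quiet_period."""
    count = 0
    previous = None
    for t in sorted(error_times):
        if previous is not None and t - previous > quiet_period:
            count = 0  # long enough without errors: start over
        count += 1
        previous = t
    return count


def would_down(error_times, threshold=3, quiet_period=timedelta(hours=8)):
    """True if the running count has reached the threshold (assumed values)."""
    return errors_since_reset(error_times, quiet_period) >= threshold


# The two load errors visible in the logs posted in this thread.
LOAD_ERRORS = [
    datetime(2012, 2, 1, 19, 0, 9),
    datetime(2012, 2, 2, 11, 7, 14),
]

print(errors_since_reset(LOAD_ERRORS, timedelta(hours=8)))  # 1
print(would_down(LOAD_ERRORS))  # False
```

This only considers the two errors shown in the thread. Any other errors logged against the drive in between would change the count, which is why checking the library alerts and the bptm log is still the next step.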

felonious_caper · ‎02-02-2012

I feel a little dumb now after further digging. It turns out this drive had been down for a really long time before I noticed it yesterday.

Using the SL48 web management page, I noticed that drive 0 was empty and drive 1 was (n.a). Not sure what "n.a" meant, I dug a little further, only to find that tape 0343 was loaded into drive 1 on 12/17/11 and never removed. So when NBU was last inventoried, it somehow did not account for this tape. I manually moved the tape back into the library using the SL48 web management console and re-inventoried NBU.

Hopefully this will help.

Kernel_Panic · ‎02-02-2012

You just need to run a full inventory every time you swap tapes.

Use robtest whenever you face an issue like this to make sure there are no tapes stuck inside the drives.

Use the cleaning tapes frequently.

Run the diagnostic tools from your library to check the hardware status.

Check the system messages file (for example /var/adm/messages on Solaris) or the Event Viewer on Windows, and search for hardware events such as 7, 9, 11, 12, or 15.

Check the bptm log with the verbose level at the highest value (5) to find the main problem or root cause.

felonious_caper · ‎02-02-2012

I inventory the robot using NBU and I do this anytime I remove, add, or move media. For some reason, it did not pick up on the tape in the drive.

I've been advised (on this forum) that cleaning frequently is not always good as it ruins the heads on the drive over time?

Marianne · ‎02-02-2012

I agree with the robtest advice given by kernel_panic.

When I saw your initial post with 'Robot operation failed' and 'load operation reported an error' my first thought was - try robtest.

's d' would have shown the tape in the drive.
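
The comparison behind that check can be sketched in a few lines of Python. This is only an illustration of the idea: the two inputs are hypothetical hand-built structures (what the library reports per drive, and which tapes the backup software currently has mounted), not parsed robtest output, and the drive number and tape label are the ones mentioned earlier in this thread.

```python
def stuck_tapes(drive_contents, active_mounts):
    """Return {drive: tape} for drives that hold a tape nobody asked for.

    drive_contents: dict of drive number -> tape label, or None if empty
                    (what the library itself reports)
    active_mounts:  set of tape labels the backup software has mounted now
    """
    return {
        drive: tape
        for drive, tape in drive_contents.items()
        if tape is not None and tape not in active_mounts
    }


# State described in this thread: drive 0 empty, tape 0343 left in drive 1,
# and no backup job using it.
library_view = {0: None, 1: "0343"}
print(stuck_tapes(library_view, active_mounts=set()))  # {1: '0343'}
```

Any non-empty result means a drive is occupied by a tape the backup software does not know about, so the next load into that drive will fail.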

I do not agree with regular drive cleaning.
Firmware on drives these days will generate a TapeAlert when it needs cleaning.

CRZ · ‎02-02-2012

I think your reply above is probably the one that should be marked as the solution...assuming things are working OK now?

(In case you were curious, we're all so hyped up on marking solutions now because we don't want to see them again in 2 months in the almost totally useless "Can you solve these?" box :) )

Genericus · ‎02-02-2012

If you have a tape in a drive and did not know it, NetBackup will freeze your scratch tapes when it tries to load them and fails. You can go through all your scratch tapes and have backups fail with "no media available" errors.

Since you figured out your drive issue - you might want to check your tapes as well!

NetBackup 9.1.0.1 on Solaris 11, writing to Data Domain 9800 7.7.4.0
duplicating via SLP to LTO5 & LTO8 in SL8500 via ACSLS

Media downs drive