Random drives DOWNed :(

Fred2010
Level 6
Hi all,

Wonder if anybody reading could help me with a problem.

During backups we sometimes have drives DOWNed unexpectedly (Different ones each time), causing the tape to remain in the drive.

I was wondering if there are some settings available to enable me to debug the source of this particular problem (i.e. verbose debug logging or something similar...)


Our Setup:

- StorageTek SL8500
- 7 x T10000 drives
- Symantec drivers used
- Servers/Drives connect to Brocade 4900 Switches
- Emulex LPe11002 Dualport HBA
- 3 servers (1 Master/Media, 1 Media, 1 SAN Media) all Windows 2003 R2 (x86)
- Netbackup 6.0 MP4, SSO, Vault option

To keep the drives streaming (We are in a setup phase...) I limit each server to use a maximum of 1 drive (So during backup runs, only 3 out of the 7 are actually active)

Any help will be appreciated!

Thanks!

Manfred
18 REPLIES

Alex_Vasquez
Level 6
Manfred,
I hope you're doing well. So you have random drives going down, eh? Here are some questions that I have:
1. How long has this been an issue? When did it start?
2. Is there anything that has changed with your library recently? Have you had to swap out drives at all?

But I do wonder why you limit the number of drives your media servers can access... Are you using multiple streams for your backups? Are you using multiplexing?

Fred2010
Level 6
Hi Alex,

Doing fine here :) Thanks for your answer!

1) Not exactly sure when it started, but we've had drives downed several times (I've seen at least 3 different ones going down)
2) This particular library is new (The old one did not have these problems, but can't be used anymore (Fire in unmanned room, long story ;) ))

I use multiple streams per drive and also use multiplexing for backups. I limit the number of drives at this time to keep the drives streaming.

I only have 3 servers doing backups now, and one server can just about keep 1 drive busy. So, the limit of 3 drives is done to keep the drives busy and streaming :)

Here are some errors I see in the logs:


1/4/2007 6:00:33 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00302
1/3/2007 6:22:13 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00281
1/2/2007 9:00:51 PM  app0101  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id S00268
1/3/2007 6:22:24 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00281
1/3/2007 6:39:05 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id
1/4/2007 6:19:05 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id
1/2/2007 9:00:40 PM  app0101  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id S00268
1/2/2007 1:46:51 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id
1/2/2007 1:47:02 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id
1/2/2007 9:10:45 PM  app0101  Error  0  Media Device  TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id
1/2/2007 11:04:25 PM  app0100  Error  0  Media Device  TapeAlert Code: 0x04, Type: Critical, Flag: MEDIA, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00267
1/2/2007 11:04:18 PM  app0100  db20002  Error  7706  Media Device  ioctl (MTWEOF) failed on media id S00267, drive index 1, No more data is on the tape. (1104) (bptm.c.25799)
1/2/2007 1:47:02 PM  app0100  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/2/2007 9:00:50 PM  app0101  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/2/2007 9:10:56 PM  app0101  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/3/2007 6:22:24 PM  app0100  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/3/2007 6:39:16 PM  app0100  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/4/2007 6:00:43 PM  app0100  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/4/2007 6:19:16 PM  app0100  Error  0  Media Device  error unloading media, TpErrno = Robot operation failed
1/2/2007 9:00:36 PM  app0101  dbs0062.infra.local  Error  7488  Media Device  error requesting media, TpErrno = Robot operation failed
1/3/2007 6:22:10 PM  app0100  app0100  Error  7744  Media Device  error requesting media, TpErrno = Robot operation failed
1/4/2007 6:00:30 PM  app0100  app0100  Error  8064  Media Device  error requesting media, TpErrno = Robot operation failed
1/2/2007 11:06:14 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x15, Type: Warning, Flag: CLEAN PERIODIC, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00267
1/4/2007 6:00:33 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00302
1/3/2007 6:22:13 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00281
1/3/2007 6:22:24 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00281
1/2/2007 11:04:25 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00267
1/3/2007 6:39:05 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id
1/4/2007 6:19:05 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id
1/2/2007 9:00:40 PM  app0101  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id S00268
1/2/2007 9:00:50 PM  app0101  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id S00268
1/2/2007 1:46:51 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id
1/2/2007 1:47:02 PM  app0100  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id
1/2/2007 9:10:45 PM  app0101  Warning  0  Media Device  TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id
1/4/2007 6:00:30 PM  app0100  app0100  Warning  8064  Media Device  media id S00302 load operation reported an error
1/3/2007 6:22:10 PM  app0100  app0100  Warning  7744  Media Device  media id S00281 load operation reported an error
1/2/2007 9:00:36 PM  app0101  dbs0062.infra.local  Warning  7488  Media Device  media id S00268 load operation reported an error
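With this many entries it can help to tally the TapeAlert flags per drive, so it is easier to see whether the errors follow particular drives or particular tapes. The following is only a rough sketch: the regular expression is inferred from the message text in the entries above, the function name `tally_tapealerts` is made up for illustration, and the sample strings contain just the message portion of a few entries.

```python
import re
from collections import Counter

# Pattern inferred from the TapeAlert messages quoted above (an assumption,
# not an official format). It searches within a line, so any leading
# date/server columns are ignored.
TAPEALERT_RE = re.compile(
    r"TapeAlert Code: (?P<code>0x[0-9A-Fa-f]+), "
    r"Type: (?P<type>\w+), "
    r"Flag: (?P<flag>[^,]+), "
    r"from drive (?P<drive>\S+) \(index (?P<index>\d+)\), "
    r"Media Id ?(?P<media>\S*)"
)


def tally_tapealerts(lines):
    """Count (drive, flag) pairs and collect the media IDs seen per drive."""
    per_drive = Counter()
    media_by_drive = {}
    for line in lines:
        match = TAPEALERT_RE.search(line)
        if not match:
            continue  # not a TapeAlert entry, e.g. "error unloading media"
        drive = match.group("drive")
        per_drive[(drive, match.group("flag"))] += 1
        if match.group("media"):  # some entries have an empty Media Id
            media_by_drive.setdefault(drive, set()).add(match.group("media"))
    return per_drive, media_by_drive


sample = [
    "TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, "
    "from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id S00302",
    "TapeAlert Code: 0x06, Type: Critical, Flag: WRITE FAILURE, "
    "from drive STK.T10000A.000.0.0.1.13 (index 5), Media Id S00268",
    "TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, "
    "from drive STK.T10000A.000.0.0.1.15 (index 1), Media Id",
    "error unloading media, TpErrno = Robot operation failed",
]

counts, media = tally_tapealerts(sample)
for (drive, flag), n in sorted(counts.items()):
    print(drive, flag, n)
```

If the same few drives account for most of the counts across many different media IDs, that would point more toward the drives than toward the tapes.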

Nathan_Kippen
Level 6
Certified
Have you looked at the media you are using? Perhaps FREEZE the few tapes that the robot seems to be grabbing each time and see if that fixes your problem. Perhaps it is just some bad media.

Good Luck.

Fred2010
Level 6
Hi Nathan,

Thanks for your suggestion...

There are actually quite a lot of frozen tapes:

Too many for it being an occasional bad tape however...

zippy
Level 6
Manfred,

Make a list of the frozen ones

/usr/openv/netbackup/bin/admincmd/bpmedialist -U -mlist

grep out the FROZEN ones, take the tape number and run

/usr/openv/netbackup/bin/admincmd/bpmedia -ev $FROZEN_TAPE -unfreeze

Keep the list handy and let your backups run for a month or two. Then re-run bpmedialist and compare the new list to the old list of FROZEN tapes. Remove the tapes that have FROZEN again and stick 'em in the closet!

JD
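The comparison step above could be sketched like this, and it would run on Windows as well. It assumes the two snapshots have already been saved by hand as plain text files with one media ID per line; that file layout and the function names are made up for illustration and are not the raw bpmedialist output format.

```python
def read_media_ids(path):
    """Read media IDs from a text file, taking the first token of each non-empty line."""
    with open(path) as handle:
        return [line.split()[0] for line in handle if line.strip()]


def repeat_offenders(old_frozen, new_frozen):
    """Return the media IDs that were frozen in both snapshots, sorted."""
    return sorted(set(old_frozen) & set(new_frozen))


# Illustrative media IDs only; the real lists would come from the saved files,
# e.g. read_media_ids("frozen_old.txt") and read_media_ids("frozen_new.txt").
old_list = ["S00267", "S00281"]
new_list = ["S00281", "S00302"]
print(repeat_offenders(old_list, new_list))
```

Tapes that show up in both lists are the candidates for the closet; tapes that froze only once may simply have been in a bad drive at the time.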

Fred2010
Level 6
Hi James,

Thanks: Good advice on the frozen tapes :)

You have any idea on how to set Netbackup to verbosely tell me why it has downed the drives though?

I'm looking for some good reasoning why they go down, so I can give those logs to StorageTek (So they can debug the root cause...)

Ta

Alex_Vasquez
Level 6
Hmmm. Well, did you need to recover the catalog, Manfred? Is the server new as well? I don't see any indication of it, but perhaps there could be a hardware mismatch on the new server? If so, deleting and re-adding your drives would do the trick. Did you get this info from the BPTM log? You could also try the manufacturer drivers... Veritas drivers should work, but whether to use manufacturer drivers or not varies depending on whom you ask.

zippy
Level 6
Manfred,

Tape drives go down because the drive is going bad or is bad or, in rare cases, because the tape in the drive is bad. I say rare because I see that type of occurrence about once a year, and then the tape gets FROZEN. When you have a rash of FROZEN tapes like this, 99.9% of the time it is caused by a bad tape drive.

Fred2010
Level 6
Hi James,

Thanks for the script; Unfortunately the servers are Windoze :(

Is some verbose logging possible though to see what's happening when Netbackup DOWNs a drive?

DavidParker
Level 6
Manfred,
Have you set all your logging levels to 5 and restarted your services?
That should give you a lot of info.
I would check the bptm logs first.

Fred2010
Level 6
Hi Alex,

Nope, no tape recovery has been done

The entire robot, tapes, drives and servers are new and shiny :)

I've re-added the hardware config several times, unfortunately to no avail, but your suggestion about drivers was also made by Symantec Support:

They say use the StorageTek/SUN driver if it is available. If not available, use the Veritas driver...

Where can I find the BPTM log???

All I need is more info (Like verbose logging) why Netbackup DOWNed the drive...

Dennis_Strom
Level 6
Take a look at your bptm and bpbkar logs. As a cautionary note, bpbkar can get quite large very quickly. bptm should tell you what you want to know. I had this quite a bit when I took over my current environment, and it was due to old, dirty drives. I was on support and slowly swapped out most of my drives, and now I do not see it too much anymore.

Another thing to check is the firmware level of your drives, robot and HBAs.

Fred2010
Level 6
Hi David,

How would I enable the BPTM logging on Windows?

Sounds like what I need!

DavidParker
Level 6
It should probably be on already.
Take a look in C:\Program Files\VERITAS\netbackup\logs
If there is a 'bptm' folder, then you already have it turned on.
If not, just create that folder and the logging is then on.
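If you need to do this on several media servers, the folder check could be scripted. This is only a sketch: `ensure_log_dirs` is a made-up helper, and the logs path is the one mentioned above, which may differ on your installation.

```python
from pathlib import Path


def ensure_log_dirs(logs_root, names=("bptm",)):
    """Create each named folder under logs_root if it is missing.

    Returns the names of the folders that were actually created.
    """
    created = []
    for name in names:
        folder = Path(logs_root) / name
        if not folder.is_dir():
            folder.mkdir(parents=True)
            created.append(name)
    return created


# Example call on a media server (path taken from the post above; adjust it
# if NetBackup is installed elsewhere):
# ensure_log_dirs(r"C:\Program Files\VERITAS\netbackup\logs", names=("bptm",))
```

An empty return value means the folder already existed, so logging for that process was already switched on.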

Here's a nice article to read through in regards to logging and what each one does for you:

http://seer.entsupport.symantec.com/docs/243778.htm
DOCUMENTATION: A comprehensive list of VERITAS NetBackup (tm) 3.4, 4.5, 5.x directories, touch files, and commands relating to logging

Oops, just saw that you're on 6.0
Try this link instead:
http://seer.support.veritas.com/docs/278572.htm
DOCUMENTATION: A comprehensive list of NetBackup (tm) 6.0 directories and commands relating to Unified Logging.

Fred2010
Level 6
Hi David,

Nope it wasn't on, but your answer was what I was searching for!

I've created the folder and hope tonight something goes wrong again (Gosh, never thought I'd say that!)

Thanks to all you guys for trying to help!

(Wish I could give you all some kudos!)

DavidParker
Level 6
No problem!

Don't forget to update us; I'm sure we'd all like to hear how things turn out. Plus it could help someone in the future who comes searching for answers!

:)

Nathan_Kippen
Level 6
Certified
Remember not to leave your verbose at a high setting... logs can grow pretty fast and use up a lot of space on your master.
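One way to keep an eye on that is to total up the size of each log folder now and then. A minimal sketch, assuming the logs live in per-process subfolders under one root directory; `dir_sizes` is a made-up helper name.

```python
from pathlib import Path


def dir_sizes(logs_root):
    """Return {subfolder name: total size in bytes of all files beneath it}."""
    sizes = {}
    for sub in Path(logs_root).iterdir():
        if sub.is_dir():
            sizes[sub.name] = sum(
                f.stat().st_size for f in sub.rglob("*") if f.is_file()
            )
    return sizes


# Example (path is an assumption; adjust to your installation):
# for name, size in sorted(dir_sizes(r"C:\Program Files\VERITAS\netbackup\logs").items()):
#     print(name, round(size / 1024 / 1024, 1), "MB")
```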

Fred2010
Level 6
Thanks for the reminder :)

Quick update on this issue:

Apparently there might be a firmware issue with the T10000 drives, which could possibly cause this error.

New firmware is being worked on, but not yet generally available...

At the moment we have the BPTM logging going on all media servers, trying to capture the error (The error is elusive since we started logging... Where are they when you need them?!?)