All drives randomly DOWN

thiagoabreu · ‎02-11-2014

Hello, we have an environment with 15 media servers, and for about a month all drives have been randomly going DOWN on all media servers.

It does throw a hardware flag, but I think either it is a false positive, or the HW fault flag is thrown because the drive is completely down (from master to media servers).
Also, under heavy load such as Vault or staging, all drives go down after some time, while drives that are not in use stay UP.

Unfortunately, our NBU is 6.5.4.
Master server is Solaris 10 SPARC.
Media Servers are Win 2003/2008.
Drives are IBM ULTRIUM-TD3 93G0
Currently we don't have any MISSING_PATH.

In the ltid debug log I only get the default "EMM/Operator down the drive" message and a lot of this:

10:06:33.001 [19492] <2> TAO: TAO (19492|1) - Transport_Cache_Manager::is_entry_idle_i, state is [0]

10:06:33.001 [19492] <2> TAO: TAO (19492|1) - Synch_Invocation::invoke_i, timeout on recv is <299999>

10:06:33.021 [19492] <2> TAO: TAO (19492|1) Synch_Invocation::invoke_i, timeout after recv is <299961> status <1>

10:11:33.976 [19492] <2> TAO: TAO (19492|1) Timeout is <300000>

10:11:33.976 [19492] <2> TAO: TAO (19492|1) Timeout is <0>

And a lot of this exact hardware flag:

TapeAlert Code: 0x1f, Type: Critical, Flag: HARDWARE B, from drive TLD.hcart3.3 (index 4), Media Id

Thank you.

revarooo · ‎02-11-2014

Tape alerts don't come from NetBackup - they come directly from the drive/driver.

Do you just up the drive and it continues working?

RamNagalla · ‎02-11-2014

0x1F: 'Hardware B: Tape drive has a problem not read/write related',

So what is the job failure status when the drives are being downed?

Did you try a tape clean?

Show us the file /usr/openv/netbackup/db/media/errors

thiagoabreu · ‎02-11-2014

After I UP the drive, it works without problems.
Jobs don't get any other error, only 196 or 830 (backup window closed and drives not available).

Our "/usr/openv/netbackup/db/media/errors" starts in 2007.

12/30/13 17:15:45 7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000

12/30/13 17:19:37 2 TAPE_ALERT TLD.hcart3.2 0x00000002 0x00000000

12/31/13 00:31:22 CE0731 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000

01/06/14 23:59:26 CE0556 6 TAPE_ALERT TLD.hcart3.4 0x00000002 0x00000000

01/12/14 06:44:35 CE0371 1 WRITE_ERROR TLD.hcart3.6

01/12/14 06:44:44 CE0371 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000

01/17/14 19:13:15 CH2676 4 TAPE_ALERT TLD.hcart3.3 0x00000000 0x02000000

01/18/14 21:33:07 CI3013 1 WRITE_ERROR TLD.hcart3.6

01/18/14 21:33:16 CI3013 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000

01/18/14 21:34:01 CI3013 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000

01/21/14 21:13:55 CH2380 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000

01/24/14 20:37:26 CH2244 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x02000000

01/25/14 07:01:04 CH2680 1 TAPE_ALERT TLD.hcart3.6 0x00000000 0x02000000

01/26/14 02:16:42 CH2380 8 TAPE_ALERT TLD.hcart3.9 0x00000002 0x00000000

01/26/14 04:09:30 CI3125 7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000

01/26/14 07:19:55 CI3072 1 WRITE_ERROR TLD.hcart3.6

01/26/14 07:20:00 CI3072 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000

01/26/14 07:21:03 CI3072 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000

01/26/14 08:53:08 CI3073 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000

01/26/14 21:04:20 CI3125 9 TAPE_ALERT TLD.hcart3.5 0x00000002 0x00000000

01/27/14 02:04:14 CH2380 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000

01/28/14 10:02:03 CI3097 8 TAPE_ALERT TLD.hcart3.9 0x00000002 0x00000000

01/29/14 08:32:03 CE0280 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000

01/30/14 00:34:35 CI3026 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000

01/30/14 01:46:54 CE0280 7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000

01/30/14 08:58:13 CI2895 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x02000000

01/30/14 09:05:08 CI2884 6 TAPE_ALERT TLD.hcart3.4 0x00000002 0x00000000

01/31/14 09:16:46 CI3111 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000

02/02/14 02:59:19 CE0280 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000

02/03/14 19:22:33 CI2861 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000

02/04/14 08:57:13 CI2862 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000

02/04/14 19:56:32 CI3122 6 TAPE_ALERT TLD.hcart3.4 0x00000002 0x00000000

02/05/14 01:55:09 CI3093 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000

02/06/14 16:06:19 CE0994 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000

02/07/14 14:49:45 CE0953 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000

02/09/14 01:25:01 CI3093 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000

02/11/14 01:49:59 CI3093 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000

02/11/14 09:23:45 CH2153 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000
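
To see which drives log the most events, an excerpt like the one above can be tallied per drive. This is a minimal Python sketch, assuming whitespace-separated fields with the drive name directly after the event keyword. The media ID is missing on some lines, so the fields are located by keyword rather than by position. `tally_events` and `EVENT_TYPES` are hypothetical names, and only the two event types visible in this excerpt are handled.

```python
from collections import Counter

# A few lines copied from the excerpt above; note the first one has no media ID.
SAMPLE = """\
12/30/13 17:15:45 7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000
01/12/14 06:44:35 CE0371 1 WRITE_ERROR TLD.hcart3.6
01/12/14 06:44:44 CE0371 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000
01/17/14 19:13:15 CH2676 4 TAPE_ALERT TLD.hcart3.3 0x00000000 0x02000000
"""

# Assumption: only the event types that appear in this thread.
EVENT_TYPES = {"TAPE_ALERT", "WRITE_ERROR"}


def tally_events(text):
    """Count (drive, event type) pairs in media errors text."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        for i, field in enumerate(fields):
            # The drive name is taken to be the field right after the keyword.
            if field in EVENT_TYPES and i + 1 < len(fields):
                counts[(fields[i + 1], field)] += 1
                break
    return counts


if __name__ == "__main__":
    for (drive, event), n in sorted(tally_events(SAMPLE).items()):
        print(drive, event, n)
```

Feeding the whole file to `tally_events` instead of `SAMPLE` would show whether one drive (for example TLD.hcart3.6) stands out or whether the alerts are spread evenly.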

thiagoabreu · ‎02-11-2014

Also, the drives are clean, but they are cleaned by the library, not by NBU. In NBU the state is NEEDS CLEANING.

SymTerry · ‎02-11-2014

When you up the drive, how long does it run without an issue?

As revaroo mentioned, the hardware/drivers are coming back with an error. NetBackup downs drives to protect you from making corrupted backups. The drives are having some sort of issue here, false positive or not. If it's not a false positive and something is wrong, you gamble with the integrity of the backups when you up the drives.

mph999 · ‎02-11-2014

TAPEALERTS = one thing, hardware problem.

It is, in fact, impossible for NBU to cause a tapealert.

The fact that NBU shows NEEDS CLEANING is irrelevant; it's simply a 'flag' that hasn't been cleared because the library cleaned the drive and NBU is unaware of this. It can be cleared with the tpclean command, and disabled with the NO_TAPEALRTS touch file.

Will_Restore · ‎02-11-2014

typo, that should be NO_TAPEALERT

mph999 · ‎02-11-2014

Thanks wr for the correction, NO_TAPEALERT is what I should have put.

I've come back to 'edit' my post a little - a few TAPE_ALERTS are not true hardware issues, you could argue clean tape is not a true hardware issue as it's user correctable.

One other is to do with encryption: if using KMS and the keys are incorrect (I think), a tape alert is issued. Again, this isn't quite a true hardware issue and can be caused by an issue with the KMS server. I'd forgotten about that one, and it would be fair to say that this is the only one that can be caused by a non-hardware issue.

Yogesh_Jadhav1 · ‎02-11-2014

This looks more like a connectivity or performance issue on the media servers. In one of my environments it used to happen with VTLs: all the drives would go down all of a sudden whenever the tape library had a performance issue. You should also check the system messages logged during this time and see if you have any issue there. The NBU logs alone may not show the seriousness of the issue, so please check the system errors for any relevant problems.

mph999 · ‎02-11-2014

Media server load / master to media comms can't cause tapealerts though ...

However, it is always possible there are two separate issues causing similar symptoms.

However, I would recommend concentrating on the one issue we know about, the tapealerts.

Once these are resolved, see what's left ...

That said, if the tapealerts appear in the logs within, say, a minute of the drives going down, it's pretty safe to say that will be the cause.

Marianne · ‎02-11-2014

Remember to check error logs on media servers as well. NBU will Down a drive after 3 I/O errors on the same drive in 12 hours. If you have VERBOSE entry in vm.conf on all media servers, Media Manager errors will be logged to /var/adm/messages on Solaris and to System and Application Event Viewer logs on Windows.
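
For illustration, the "3 I/O errors on the same drive in 12 hours" rule can be checked against timestamps from the media errors file with a simple sliding window. This is only a sketch of the rule as stated here, not NetBackup's actual logic. `would_down` and `parse` are hypothetical helpers, and counting a span of exactly 12 hours as inside the window is an assumption.

```python
from datetime import datetime, timedelta


def parse(stamp):
    """Parse a timestamp in the media errors file format, e.g. '01/12/14 06:44:35'."""
    return datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")


def would_down(error_times, limit=3, window=timedelta(hours=12)):
    """Return True if any `limit` errors fall within one `window`-long span."""
    times = sorted(error_times)
    for i in range(len(times) - limit + 1):
        # Compare the first and last error of each run of `limit` consecutive errors.
        if times[i + limit - 1] - times[i] <= window:
            return True
    return False


if __name__ == "__main__":
    # WRITE_ERROR timestamps for TLD.hcart3.6 taken from the excerpt posted earlier.
    hcart3_6 = [parse("01/12/14 06:44:35"),
                parse("01/18/14 21:33:07"),
                parse("01/26/14 07:19:55")]
    print(would_down(hcart3_6))
```

The three WRITE_ERROR entries posted for TLD.hcart3.6 are days apart, so under this reading they would not trip the rule on their own. The drives may be going down for another reason, such as the TapeAlert itself.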

Handy NetBackup Links

thiagoabreu · ‎02-12-2014

I already added the VERBOSE line to vm.conf on one of the media servers (Windows 2003), and I got this today during the normal schedule:

TapeAlert Code: 0x1f, Type: Critical, Flag: HARDWARE B, from drive TLD.hcart3.6 (index 4), Media Id CE0914

TapeAlert Code: 0x27, Type: Warning, Flag: DIAGNOSTICS REQ., from drive TLD.hcart3.6 (index 4), Media Id CE0914

Operator/EMM server has DOWN'ed drive TLD.hcart3.6 (device 4)
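
The two hex words on the TAPE_ALERT lines of the media errors file posted earlier appear to be a 64-bit mask of TapeAlert flags. The Python sketch below assumes flag 1 is the most significant bit of the first word. That layout is an assumption, but it agrees with this thread: `0x00000002 0x00000000` decodes to 0x1f (HARDWARE B) and `0x02000000` in the second word decodes to 0x27 (DIAGNOSTICS REQ.), the same two codes logged above. `decode_tapealert`, `describe` and `KNOWN_FLAGS` are hypothetical names, and the table holds only the flags needed for the masks seen here.

```python
# Partial table: only the flags that occur in the masks posted in this thread.
KNOWN_FLAGS = {
    3: "Hard Error",
    6: "Write Failure",
    20: "Clean Now",
    31: "Hardware B",
    39: "Diagnostics Required",
}


def decode_tapealert(word1, word2):
    """Return the TapeAlert flag numbers set in the two hex words.

    Assumption: flag 1 is the most significant bit of word1 and
    flag 64 is the least significant bit of word2.
    """
    mask = (int(word1, 16) << 32) | int(word2, 16)
    return [flag for flag in range(1, 65) if mask & (1 << (64 - flag))]


def describe(word1, word2):
    """Format each set flag as its hex code plus a name, where known."""
    return ["0x%02x %s" % (flag, KNOWN_FLAGS.get(flag, "not in this table"))
            for flag in decode_tapealert(word1, word2)]


if __name__ == "__main__":
    print(describe("0x00000002", "0x00000000"))
    print(describe("0x24001000", "0x02000000"))
```

Under this reading, the `0x24001000 0x02000000` lines that follow each WRITE_ERROR on TLD.hcart3.6 carry Hard Error, Write Failure, Clean Now and Diagnostics Required together.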

Marianne · ‎02-12-2014

Probably time to log a call with your hardware vendor for that tape drive?

NBU probably DOWN'ed the drive because of 3 I/O errors in 12 hours (e.g. status 84).
bptm log will confirm.


thiagoabreu · ‎02-13-2014

Hello, what does this mean?

"Error bptm (pid=3780) FREEZING media id CE0997, External event caused rewind during write, all data on media is lost"

mph999 · ‎02-13-2014

It means in effect that there has been a position error on the drives.

The tape 'might' have been overwritten - unknown

It can be caused by multiple things:

1/ Firmware issue / driver issue / hardware fault

2/ SAN issue

or 3/ scsi reservation mis-match

If you have the drives shared, and the different servers or devices that see them have different types of SCSI reservation set, then that will almost certainly cause the message at some point. Very often, when the cause is a reservation mismatch, it will have caused data loss.

Usually, this happens when the tape drives are shared with NetBackup and a NDMP device (for example, NetApp) - and NBU is using one type of reservation, and the filer is using a different one.

Question is, do you have NDMP devices seeing your tape drives?

Are your tape drives shared with anything? (SSO)

thiagoabreu · ‎03-06-2014

Hi, the drives are for NBU only.
I checked the SAN and can't find any errors on the ports.

How can I identify a SCSI reservation mismatch?

This drive went DOWN minutes ago.

Drive	Device	Control	Port	Bus	Target	LUN	NDMP	Drive Index
TLD.hcart3.6	MASTER	DOWN-TLD						1
TLD.hcart3.6	MEDIA1	DOWN-TLD	3	0	7	0		7
TLD.hcart3.6	MEDIA2	DOWN-TLD	4	0	1	0		0
TLD.hcart3.6	MEDIA3	DOWN-TLD	2	0	7	0		7
TLD.hcart3.6	MEDIA4	DOWN-TLD	2	0	7	0		7
TLD.hcart3.6	MEDIA5	DOWN-TLD	3	0	6	0		6
TLD.hcart3.6	MEDIA6	DOWN-TLD	3	0	7	0		6
TLD.hcart3.6	MEDIA7	DOWN-TLD	3	0	7	0		7
TLD.hcart3.6	MEDIA8	DOWN-TLD	3	0	7	0		7
TLD.hcart3.6	MEDIA9	DOWN-TLD	4	0	7	0		7
TLD.hcart3.6	MEDIA10	DOWN-TLD	3	0	7	0		5
TLD.hcart3.6	MEDIA11	DOWN-TLD	3	0	15	0		4
TLD.hcart3.6	MEDIA12	DOWN-TLD	4	0	7	0		7
TLD.hcart3.6	MEDIA13	DOWN-TLD	2	0	15	0		4

Currently, I have scheduled an "UP" command on the media servers instead of using NO_TAPEALERT.

Mark_Solutions · ‎03-06-2014

Your previous errors were down to cleaning needed - perhaps you need to look at your environment and tape handling?

Where do your tapes get stored, and how are they transported to and from your site? They can be very sensitive to temperature and humidity changes, so if the drives go down on a cold wet day that could be a clue!

Also, as your media servers are Windows based have you set the AutoRun key to zero ( see method 1 in this tech note: http://support.microsoft.com/kb/842411) and stopped and disabled the removable storage service on your media servers?

When the drive went down today what was logged in the windows application / system event logs of the servers just before it noted that EMM had downed the drive?

The alert codes you get tend to be 0x02 (write error) but I also see a 0x24 which is a drive temperature warning which will cause an immediate media freeze and hence drive down after three of these. 0x1f you mention in the opening thread means hardware error which again will down the drive.

Where is your tape library, and could it be having temperature issues?

Finally, for now, I see you are using firmware version 93G0 on the drives. It is well worth getting them up to date to take advantage of all bug fixes, which can get rid of spurious alerts (there is a bug in your version related to tape cleaning if not done by the library itself - shouldn't affect you, but who knows!)

Hope this helps

Marianne · ‎03-06-2014

"This drive went DOWN minutes ago."

What is logged in media servers' messages files and/or Event Viewer logs?


Ron_Cohn · ‎03-06-2014

How are these drives connected? Are they Fibre Channel?

If so, how many Windows Media Servers do you have?
