
All drives randomly DOWN

thiagoabreu
Level 3

Hello, we have an environment with 15 media servers, and starting about a month ago, all drives randomly go down on all media servers.

It does throw a hardware flag, but I think it is either a false positive or the HW fault flag is raised because the drive is completely down (from the master to the media servers).
Also, under heavy load such as Vault or staging, all drives go down after some time, whereas when the drives are not in use they stay UP.

Unfortunately our NBU is 6.5.4.
Master server is Solaris 10 SPARC.
Media Servers are Win 2003/2008.
Drives are IBM ULTRIUM-TD3 93G0
Currently we don't have any MISSING_PATH.

In the ltid debug log I only get the default "EMM/Operator down the drive" message and a lot of this:

10:06:33.001 [19492] <2> TAO: TAO (19492|1) - Transport_Cache_Manager::is_entry_idle_i, state is [0]
10:06:33.001 [19492] <2> TAO: TAO (19492|1) - Synch_Invocation::invoke_i, timeout on recv is <299999>
10:06:33.021 [19492] <2> TAO: TAO (19492|1) Synch_Invocation::invoke_i, timeout after recv is <299961> status <1>
10:11:33.976 [19492] <2> TAO: TAO (19492|1) Timeout is <300000>
10:11:33.976 [19492] <2> TAO: TAO (19492|1) Timeout is <0>
 
 
And a lot of this exact hardware flag:
TapeAlert Code: 0x1f, Type: Critical, Flag: HARDWARE B, from drive TLD.hcart3.3 (index 4), Media Id
 
Thank you.
34 REPLIES

revarooo
Level 6
Employee

Tape alerts don't come from NetBackup - they come directly from the drive/driver.

Do you just up the drive and it continues working?

RamNagalla
Moderator
Partner    VIP    Certified

0x1F: 'Hardware B: Tape drive has a problem not read/write related',

So what is the job failure status when the drives are going down?

Did you try cleaning the drives?

Show us the file /usr/openv/netbackup/db/media/errors

thiagoabreu
Level 3

After I UP the drive, it works without a problem.
Jobs don't get any other errors, only 196 or 830 (backup window closed and drives not available).

 

Our "/usr/openv/netbackup/db/media/errors" starts in 2007.

12/30/13 17:15:45  7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000
12/30/13 17:19:37  2 TAPE_ALERT TLD.hcart3.2 0x00000002 0x00000000
12/31/13 00:31:22 CE0731 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000
01/06/14 23:59:26 CE0556 6 TAPE_ALERT TLD.hcart3.4 0x00000002 0x00000000
01/12/14 06:44:35 CE0371 1 WRITE_ERROR TLD.hcart3.6
01/12/14 06:44:44 CE0371 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000
01/17/14 19:13:15 CH2676 4 TAPE_ALERT TLD.hcart3.3 0x00000000 0x02000000
01/18/14 21:33:07 CI3013 1 WRITE_ERROR TLD.hcart3.6
01/18/14 21:33:16 CI3013 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000
01/18/14 21:34:01 CI3013 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000
01/21/14 21:13:55 CH2380 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000
01/24/14 20:37:26 CH2244 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x02000000
01/25/14 07:01:04 CH2680 1 TAPE_ALERT TLD.hcart3.6 0x00000000 0x02000000
01/26/14 02:16:42 CH2380 8 TAPE_ALERT TLD.hcart3.9 0x00000002 0x00000000
01/26/14 04:09:30 CI3125 7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000
01/26/14 07:19:55 CI3072 1 WRITE_ERROR TLD.hcart3.6
01/26/14 07:20:00 CI3072 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000
01/26/14 07:21:03 CI3072 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000
01/26/14 08:53:08 CI3073 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000
01/26/14 21:04:20 CI3125 9 TAPE_ALERT TLD.hcart3.5 0x00000002 0x00000000
01/27/14 02:04:14 CH2380 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000
01/28/14 10:02:03 CI3097 8 TAPE_ALERT TLD.hcart3.9 0x00000002 0x00000000
01/29/14 08:32:03 CE0280 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000
01/30/14 00:34:35 CI3026 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000
01/30/14 01:46:54 CE0280 7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000
01/30/14 08:58:13 CI2895 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x02000000
01/30/14 09:05:08 CI2884 6 TAPE_ALERT TLD.hcart3.4 0x00000002 0x00000000
01/31/14 09:16:46 CI3111 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000
02/02/14 02:59:19 CE0280 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000
02/03/14 19:22:33 CI2861 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000
02/04/14 08:57:13 CI2862 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000
02/04/14 19:56:32 CI3122 6 TAPE_ALERT TLD.hcart3.4 0x00000002 0x00000000
02/05/14 01:55:09 CI3093 1 TAPE_ALERT TLD.hcart3.6 0x00000002 0x00000000
02/06/14 16:06:19 CE0994 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000
02/07/14 14:49:45 CE0953 4 TAPE_ALERT TLD.hcart3.3 0x00000002 0x00000000
02/09/14 01:25:01 CI3093 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000
02/11/14 01:49:59 CI3093 5 TAPE_ALERT TLD.hcart3.8 0x00000002 0x00000000
02/11/14 09:23:45 CH2153 3 TAPE_ALERT TLD.hcart3.7 0x00000002 0x00000000
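
For anyone who wants to summarise an errors file like the one above, here is a minimal Python sketch. It makes two assumptions that are not confirmed anywhere in this thread: the field layout (date, time, optional media ID, drive index, event type, drive name, two optional hex words) is inferred only from the excerpt, and the two hex words are treated as a TapeAlert bitmask read most-significant bit first (flag 1 is the top bit of the first word). Under that assumption 0x00000002 in the first word decodes to flag 31 (Hardware B), which matches the 0x1f alert quoted in the first post. The function names are hypothetical; the flag names are the standard TapeAlert ones for the flags that appear here.

```python
from collections import Counter
from datetime import datetime

# A few lines copied from the errors file excerpt above.
SAMPLE = """\
12/30/13 17:15:45  7 TAPE_ALERT TLD.hcart3.10 0x00000002 0x00000000
01/12/14 06:44:35 CE0371 1 WRITE_ERROR TLD.hcart3.6
01/12/14 06:44:44 CE0371 1 TAPE_ALERT TLD.hcart3.6 0x24001000 0x02000000
01/17/14 19:13:15 CH2676 4 TAPE_ALERT TLD.hcart3.3 0x00000000 0x02000000
"""

EVENT_TYPES = {"TAPE_ALERT", "WRITE_ERROR"}  # only the types seen in this thread

# Standard TapeAlert flag names, limited to the flags decoded below.
FLAG_NAMES = {3: "Hard error", 6: "Write failure", 20: "Clean now",
              31: "Hardware B", 39: "Diagnostics required"}


def parse_line(line):
    """Parse one errors-file line; the layout is inferred from the excerpt."""
    tokens = line.split()
    kind_pos = next((i for i, t in enumerate(tokens) if t in EVENT_TYPES), None)
    # The event type is the 4th token (no media ID) or the 5th (with media ID).
    if kind_pos not in (3, 4) or kind_pos + 1 >= len(tokens):
        return None
    return {
        "when": datetime.strptime(tokens[0] + " " + tokens[1], "%m/%d/%y %H:%M:%S"),
        "media_id": tokens[2] if kind_pos == 4 else None,
        "drive_index": int(tokens[kind_pos - 1]),
        "kind": tokens[kind_pos],
        "drive": tokens[kind_pos + 1],
        "words": [int(t, 16) for t in tokens[kind_pos + 2:]],
    }


def decode_flags(words):
    """ASSUMPTION: each word holds 32 flags, most-significant bit first."""
    flags = []
    for word_no, word in enumerate(words):
        for bit in range(32):
            if word & (1 << (31 - bit)):
                flags.append(word_no * 32 + bit + 1)
    return flags


events = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
per_drive = Counter(e["drive"] for e in events if e["kind"] == "TAPE_ALERT")

for e in events:
    names = [FLAG_NAMES.get(f, f"flag {f}") for f in decode_flags(e["words"])]
    print(e["when"], e["drive"], e["kind"], names)
print(per_drive.most_common())
```

Pointing `SAMPLE` at the full file contents would give a per-drive count, which makes it easier to see whether one drive (TLD.hcart3.6 in the excerpt) stands out or whether the alerts are spread evenly.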
 

thiagoabreu
Level 3

Also, the drives are clean, but they are cleaned by the library, not by NBU. In NBU the state is NEEDS CLEANING.

SymTerry
Level 6
Employee Accredited

When you up the drive, how long does it run without an issue?

As revarooo mentioned, the hardware/drivers are coming back with an error. NetBackup downs drives to protect you from making corrupted backups. The drives are having some sort of issue here, false positive or not. If it is not a false positive and something really is wrong, you gamble with the integrity of the backups when you up the drives.

mph999
Level 6
Employee Accredited

TAPEALERTS = one thing, hardware problem.

It is, in fact, impossible for NBU to cause a tapealert.

The fact that NBU shows cleaning is irrelevant; it's simply a 'flag' that hasn't been cleared because the library cleaned the drive, and NBU will be unaware of this.  It can be cleared with the tpclean command, and disabled with the NO_TAPEALRTS touch file.

Will_Restore
Level 6

typo, that should be  NO_TAPEALERT

mph999
Level 6
Employee Accredited

Thanks WR for the correction, NO_TAPEALERT is what I should have put.

I've come back to 'edit' my post a little - a few TAPE_ALERTS are not true hardware issues, you could argue clean tape is not a true hardware issue as it's user correctable.

Another has to do with encryption: if you are using KMS and the keys are incorrect (I think), a tape alert is issued. Again, this isn't quite a true hardware issue, and it can be caused by a problem with the KMS server. I'd forgotten about that one, and it would be fair to say that it is the only one that can be caused by a non-hardware issue.

Yogesh_Jadhav1
Level 5

This looks more like a connectivity or performance issue on the media servers. In one of my environments it used to happen with VTLs, where all the drives would suddenly go down whenever the tape library had a performance issue. You should also check the system messages thrown during this time and see if you have any issue there. Checking only the NBU logs may not show how serious the issue is, so please check the system errors to find any relevant problems.

mph999
Level 6
Employee Accredited

Media server load / master to media comms can't cause tapealerts though ...

However, it is always possible there are two separate issues causing similar symptoms.

However, I would recommend concentrating on the one issue we know about, the tapealerts.

Once these are resolved, see what's left ...

That said, if the tapealerts appear in the logs within, say, a minute of the drives going down, it's pretty safe to say that will be the cause.
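
The "within a minute" check can be sketched roughly like this in Python. The function name and the timestamps are made up for illustration; in practice the alert times would come from the media errors file and the down time from the ltid or system log.

```python
from datetime import datetime, timedelta


def alerts_before_down(alert_times, down_time, window=timedelta(minutes=1)):
    """Return alerts logged at or before down_time, at most `window` earlier."""
    return [t for t in alert_times if timedelta(0) <= down_time - t <= window]


# Hypothetical timestamps, for illustration only.
alerts = [datetime(2014, 2, 11, 9, 23, 45), datetime(2014, 2, 11, 6, 0, 0)]
down = datetime(2014, 2, 11, 9, 24, 10)

close = alerts_before_down(alerts, down)
print(len(close), "alert(s) within one minute before the drive went down")
```

If every drive-down event has a matching alert inside the window, the tapealerts are the likely trigger; drive-downs with no nearby alert would point to a second, separate problem.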

Marianne
Level 6
Partner    VIP    Accredited Certified
Remember to check error logs on media servers as well. NBU will DOWN a drive after 3 I/O errors on the same drive in 12 hours. If you have a VERBOSE entry in vm.conf on all media servers, Media Manager errors will be logged to /var/adm/messages on Solaris and to the System and Application Event Viewer logs on Windows.
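
To see whether a drive would have hit the "3 I/O errors in 12 hours" rule, a sliding-window count is enough. This is only a sketch of that rule as stated above, not NetBackup's actual logic: the function name is hypothetical, and treating the WRITE_ERROR entries from the media errors file as the I/O errors that count is an assumption.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def first_threshold_breach(errors, limit=3, window=timedelta(hours=12)):
    """Map each drive to the time it first reached `limit` errors within `window`.

    `errors` is an iterable of (timestamp, drive_name) pairs.
    """
    recent = defaultdict(deque)  # per-drive error times still inside the window
    tripped = {}
    for when, drive in sorted(errors):
        times = recent[drive]
        times.append(when)
        while when - times[0] > window:  # drop errors older than the window
            times.popleft()
        if len(times) >= limit and drive not in tripped:
            tripped[drive] = when
    return tripped


FMT = "%m/%d/%y %H:%M:%S"
# The three WRITE_ERROR entries for TLD.hcart3.6 posted earlier in the thread.
write_errors = [
    (datetime.strptime("01/12/14 06:44:35", FMT), "TLD.hcart3.6"),
    (datetime.strptime("01/18/14 21:33:07", FMT), "TLD.hcart3.6"),
    (datetime.strptime("01/26/14 07:19:55", FMT), "TLD.hcart3.6"),
]
print(first_threshold_breach(write_errors))  # these are days apart, so nothing trips
```

With the entries posted so far the rule never fires, which suggests the errors file alone does not explain the downs; the bptm log and the Event Viewer entries on the media servers would be the next place to look.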

thiagoabreu
Level 3

I already added the VERBOSE line to vm.conf on one of the media servers (Windows 2003), and I got this today during the normal schedule:

 

TapeAlert Code: 0x1f, Type: Critical, Flag: HARDWARE B, from drive TLD.hcart3.6 (index 4), Media Id CE0914

TapeAlert Code: 0x27, Type: Warning, Flag: DIAGNOSTICS REQ., from drive TLD.hcart3.6 (index 4), Media Id CE0914

Operator/EMM server has DOWN'ed drive TLD.hcart3.6 (device 4)

 

 

Marianne
Level 6
Partner    VIP    Accredited Certified

Probably time to log a call with your hardware vendor for that tape drive?

NBU probably DOWN'ed the drive because of 3 I/O errors in 12 hours (e.g. status 84). 
bptm log will confirm.

thiagoabreu
Level 3

Hello, what does this mean?

"Error bptm (pid=3780) FREEZING media id CE0997, External event caused rewind during write, all data on media is lost"

mph999
Level 6
Employee Accredited

It means in effect that there has been a position error on the drives.

The tape 'might' have been overwritten - unknown

It can be caused by multiple things:

1/ Firmware issue / driver issue / hardware fault

2/ SAN issue

or 3/  scsi reservation mis-match

If you have the drives shared, and the different servers or devices that see them have different types of SCSI reservation set, then that will almost certainly cause the message at some point.  Very often, when the cause is a reservation mismatch, it will have caused data loss.

Usually, this happens when the tape drives are shared with NetBackup and a NDMP device (for example, NetApp) - and NBU is using one type of reservation, and the filer is using a different one.

Question is, do you have NDMP devices seeing your tape drives?

Are your tape drives shared with anything? (SSO)

thiagoabreu
Level 3

Hi, the drives are for NBU only.
Checked the SAN and can't find any error on ports.

How can I identify a SCSI reservation mismatch?

This drive DOWN minutes ago.

Drive         Device   Control   Port  Bus  Target  LUN  NDMP  Drive Index
TLD.hcart3.6  MASTER   DOWN-TLD                                1
TLD.hcart3.6  MEDIA1   DOWN-TLD  3     0    7       0          7
TLD.hcart3.6  MEDIA2   DOWN-TLD  4     0    1       0          0
TLD.hcart3.6  MEDIA3   DOWN-TLD  2     0    7       0          7
TLD.hcart3.6  MEDIA4   DOWN-TLD  2     0    7       0          7
TLD.hcart3.6  MEDIA5   DOWN-TLD  3     0    6       0          6
TLD.hcart3.6  MEDIA6   DOWN-TLD  3     0    7       0          6
TLD.hcart3.6  MEDIA7   DOWN-TLD  3     0    7       0          7
TLD.hcart3.6  MEDIA8   DOWN-TLD  3     0    7       0          7
TLD.hcart3.6  MEDIA9   DOWN-TLD  4     0    7       0          7
TLD.hcart3.6  MEDIA10  DOWN-TLD  3     0    7       0          5
TLD.hcart3.6  MEDIA11  DOWN-TLD  3     0    15      0          4
TLD.hcart3.6  MEDIA12  DOWN-TLD  4     0    7       0          7
TLD.hcart3.6  MEDIA13  DOWN-TLD  2     0    15      0          4

For now, I have scheduled an "UP" command on the media servers instead of using NO_TAPEALERT.

Mark_Solutions
Level 6
Partner Accredited Certified

Your previous errors were down to cleaning needed - perhaps you need to look at your environment and tape handling?

Where do your tapes get stored and how are they transported to and from your site? They can be very sensitive to temperature and humidity changes, so if the drives go down on a cold wet day that could be a clue!

Also, as your media servers are Windows based, have you set the AutoRun key to zero (see method 1 in this tech note: http://support.microsoft.com/kb/842411) and stopped and disabled the Removable Storage service on your media servers?

When the drive went down today, what was logged in the Windows application / system event logs of the servers just before it noted that EMM had downed the drive?

The alert codes you get tend to be 0x02 (write error) but I also see a 0x24 which is a drive temperature warning which will cause an immediate media freeze and hence drive down after three of these. 0x1f you mention in the opening thread means hardware error which again will down the drive.

Where is your tape library and could it be having temperature issues?

Finally, for now, I see you are using firmware version 93G0 on the drives. It is well worth getting them up to date to take advantage of all the bug fixes, which can get rid of spurious alerts (there is a bug in your version related to tape cleaning if not done by the library itself - shouldn't affect you but who knows!)

Hope this helps

Marianne
Level 6
Partner    VIP    Accredited Certified

This drive DOWN minutes ago.

 

What is logged in media servers' messages files and/or Event Viewer logs?

 

Ron_Cohn
Level 6

How are these drives connected? Are they Fibre Channel?

If so, how many Windows Media Servers do you have?