Hello, we have an environment with 15 media servers, and since about a month ago all drives randomly go down on all media servers.
It throws a hardware flag, but I think it is either a false positive or the HW fault flag is thrown because the drive is completely down (from master to media server).
Also, under heavy load such as Vault or staging, all drives go down after some time, whereas drives that are not in use stay UP.
Unfortunately our NBU is 6.5.4.
Master server is Solaris 10 SPARC.
Media Servers are Win 2003/2008.
Drives are IBM ULTRIUM-TD3 93G0
Actually we didn't have any MISSING_PATH.
In the ltid debug log I only get the default "EMM/Operator down the drive" message and a lot of this:
0x1F: 'Hardware B: Tape drive has a problem not read/write related',
So what is the job failure status when it takes the drives down?
Did you try a tape clean?
Show us the file /usr/openv/netbackup/db/media/errors
After I UP the drive, it works without a problem.
Jobs didn't get any other error, only 196 or 830 (backup window closed and drives are not available).
Our "/usr/openv/netbackup/db/media/errors" starts in 2007.
When you up the drive, how long does it run without an issue?
As revaroo mentioned, the hardware/drivers are coming back with an error. NetBackup downs drives to protect you from making corrupted backups. The drives are having some sort of issue here, false positive or not. If it is not a false positive and something really is wrong, you gamble with the integrity of your backups when you up the drives.
TAPEALERTS = one thing, hardware problem.
It is, in fact, impossible for NBU to cause a tapealert.
The fact that NBU shows cleaning is irrelevant; it's simply a 'flag' that hasn't been cleared because the library cleaned the drive, and NBU will be unaware of this. It can be cleared with the tpclean command, and disabled with the NO_TAPEALRTS touch file.
Thanks wr for the correction, NO_TAPEALERT is what I should have put.
I've come back to 'edit' my post a little - a few TAPE_ALERTS are not true hardware issues, you could argue clean tape is not a true hardware issue as it's user correctable.
One other is to do with encryption: if you are using KMS and the keys are incorrect (I think), a tape alert is issued. Again, this isn't quite a true hardware issue and can be caused by an issue with the KMS server. I'd forgotten about that one, and it would be fair to say that this is the only one that can be caused by a non-hardware issue.
This looks more like a connectivity or performance issue on the media servers. In one of my environments it used to happen with VTLs, where all the drives would suddenly go down whenever there was a performance issue with the tape library. You should also check the system messages logged around that time and see whether they show any problem. Checking only the NBU logs may not reveal how serious the issue is, so please check the system errors as well to find any relevant problems.
Media server load / master to media comms can't cause tapealerts though ...
However, it is always possible there are two separate issues causing similar symptoms.
Even so, I would recommend concentrating on the one issue we know about, the tapealerts.
Once these are resolved, see what's left ...
That said, if the tapealerts appear in the logs within, say, a minute of the drives going down, it is pretty safe to say that they are the cause.
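If you want to check that "within a minute" rule over a long log rather than by eye, something along these lines could do it. This is only a sketch: the event tuples `(timestamp, kind, drive)` and the function name `alerts_before_down` are made up for illustration, and you would have to extract the timestamps from your own logs first, since their exact format is not shown in this thread.

```python
from datetime import datetime, timedelta

def alerts_before_down(events, window=timedelta(minutes=1)):
    """For every 'down' event, list the 'alert' events on the same drive
    that occurred at most `window` earlier.

    `events` is an iterable of (timestamp, kind, drive) tuples, where kind
    is "alert" or "down". This layout is an assumption, not a NetBackup format.
    """
    events = sorted(events)  # order by timestamp
    results = []
    for ts, kind, drive in events:
        if kind != "down":
            continue
        recent = [
            t for t, k, d in events
            if k == "alert" and d == drive and timedelta(0) <= ts - t <= window
        ]
        results.append((drive, ts, recent))
    return results

# Hypothetical sample data: drive-A is downed 35 seconds after an alert,
# drive-B is downed half an hour after its last alert.
SAMPLE_EVENTS = [
    (datetime(2012, 1, 1, 10, 0, 5), "alert", "drive-A"),
    (datetime(2012, 1, 1, 10, 0, 40), "down", "drive-A"),
    (datetime(2012, 1, 1, 10, 30, 0), "alert", "drive-B"),
    (datetime(2012, 1, 1, 11, 0, 0), "down", "drive-B"),
]

for drive, down_ts, recent in alerts_before_down(SAMPLE_EVENTS):
    verdict = "tapealert just before" if recent else "no recent tapealert"
    print(drive, down_ts, verdict)
```

A drive that shows "no recent tapealert" here would be a hint that a second, separate issue is involved.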
I have already added the VERBOSE line to vm.conf on one of the media servers (Windows 2003), and I got this today during the normal schedule:
TapeAlert Code: 0x1f, Type: Critical, Flag: HARDWARE B, from drive TLD.hcart3.6 (index 4), Media Id CE0914
TapeAlert Code: 0x27, Type: Warning, Flag: DIAGNOSTICS REQ., from drive TLD.hcart3.6 (index 4), Media Id CE0914
Operator/EMM server has DOWN'ed drive TLD.hcart3.6 (device 4)
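With 15 media servers it may help to summarise these messages per drive instead of reading them one by one. Below is a rough sketch that parses lines shaped exactly like the three above. The regular expressions are written against this excerpt only, so other NetBackup versions or log sources may need adjustments, and the function name `summarize` is just an illustrative choice.

```python
import re
from collections import defaultdict

# Patterns modelled on the three log lines quoted in this post (an assumption
# about the format; adjust them if your logs differ).
ALERT_RE = re.compile(
    r"TapeAlert Code: (?P<code>0x[0-9a-fA-F]+), Type: (?P<type>[^,]+), "
    r"Flag: (?P<flag>.+?), from drive (?P<drive>\S+) \(index (?P<index>\d+)\), "
    r"Media Id (?P<media>\S+)"
)
DOWN_RE = re.compile(
    r"Operator/EMM server has DOWN'ed drive (?P<drive>\S+) \(device (?P<index>\d+)\)"
)

def summarize(lines):
    """Return (alerts_by_drive, downed_drives) for an iterable of log lines."""
    alerts = defaultdict(list)
    downed = []
    for line in lines:
        m = ALERT_RE.search(line)
        if m:
            # Store the code as an integer so 0x1f and 0x1F compare equal.
            alerts[m["drive"]].append(
                (int(m["code"], 16), m["type"], m["flag"], m["media"])
            )
            continue
        m = DOWN_RE.search(line)
        if m:
            downed.append(m["drive"])
    return dict(alerts), downed

SAMPLE = [
    "TapeAlert Code: 0x1f, Type: Critical, Flag: HARDWARE B, from drive TLD.hcart3.6 (index 4), Media Id CE0914",
    "TapeAlert Code: 0x27, Type: Warning, Flag: DIAGNOSTICS REQ., from drive TLD.hcart3.6 (index 4), Media Id CE0914",
    "Operator/EMM server has DOWN'ed drive TLD.hcart3.6 (device 4)",
]

alerts, downed = summarize(SAMPLE)
for drive in downed:
    print(drive, "downed; alerts seen:", alerts.get(drive, []))
```

Running it over the logs of every media server would show whether the same drives, the same media IDs, or the same alert codes keep coming up.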
It means in effect that there has been a position error on the drives.
The tape 'might' have been overwritten - unknown
It can be caused by multiple things:
1/ Firmware issue / driver issue / hardware fault
2/ SAN issue
or 3/ scsi reservation mis-match
If you have the drives shared, and the different servers or devices that see them have different types of SCSI reservation set, then that will almost certainly cause the message at some point. Very often, when the cause is a reservation mismatch, it will have caused data loss.
Usually, this happens when the tape drives are shared between NetBackup and an NDMP device (for example, NetApp), and NBU is using one type of reservation while the filer is using a different one.
Question is, do you have NDMP devices seeing your tape drives?
Are your tape drives shared with anything (SSO)?
Hi, the drives are for NBU only.
Checked the SAN and can't find any error on ports.
How can I identify a SCSI reservation mismatch?
This drive went DOWN minutes ago.
For now, I have scheduled an "UP" command on the media servers instead of using NO_TAPEALERT.
Your previous errors were down to cleaning being needed - perhaps you need to look at your environment and tape handling?
Where do your tapes get stored, and how are they transported to and from your site? They can be very sensitive to temperature and humidity changes, so if the drives go down on a cold wet day that could be a clue!
Also, as your media servers are Windows based, have you set the AutoRun key to zero (see method 1 in this tech note: http://support.microsoft.com/kb/842411) and stopped and disabled the Removable Storage service on your media servers?
When the drive went down today, what was logged in the Windows application / system event logs of the servers just before it noted that EMM had downed the drive?
The alert codes you get tend to be 0x02 (write error), but I also see a 0x24, which is a drive temperature warning; that will cause an immediate media freeze, and hence a drive down after three of these. The 0x1f you mention in the opening post means a hardware error, which again will down the drive.
Where is your tape library, and could it be having temperature issues?
Finally, for now, I see you are using firmware version 93G0 on the drives. It is well worth getting them up to date to take advantage of all the bug fixes, which can get rid of spurious alerts (there is a bug in your version related to tape cleaning if it is not done by the library itself - shouldn't affect you, but who knows!)
Hope this helps