
Multiple tapes being frozen, drives being down but Tape Library shows no hardware issues.

sungpillhan
Level 3

Hello, 

This problem seems to be getting serious. We have about 45 tapes in a Quantum Scalar i3 with 4 drives, and NetBackup 8.1 on RHEL 6.9.

Once in a while, 1 or 2 of the drives go down and tapes are marked frozen. I have opened multiple tickets with NetBackup and Quantum support, but they haven't found the root cause of the problem; they just recommend upgrading firmware, unfreezing the tapes, and bringing the drives back up. That's all either tech support team does. Now nearly half of the tapes are frozen and 2 drives are down.

Does anyone have experience tackling similar issues? If so, please share your thoughts.

Below are screenshots of NetBackup and the Quantum Scalar i3 tape library.

 

[Screenshots, 2021-02-19: NetBackup Administration Console Media view; NetBackup Administration Console Device Monitor; tape library status page]

 

4 REPLIES

StoneRam-Simon
Level 6
Partner    VIP    Accredited Certified

There are quite a few reasons for this...
1) Physical damage to tapes / drives
Have you visually inspected the Tapes that are being frozen to make sure the "leader" is still there?
One problem with drives I recall from many years ago was that if a tape "leader" snaps, the drive fails to tension and this causes the drive's leader to spin back on itself.
The problem here is that if you don't fix the drive and remove the damaged media, then putting a damaged tape into a good drive will result in a broken drive, and putting a good tape into a damaged drive can result in a broken tape...

2) Previous cross-mounting of media...
When a media is first written to by NetBackup, it writes the "label" to it. If this label doesn't match the external (barcode) label, the media will get frozen when it is mounted. If this is the case you should see some messages in the logs.
This can happen if a library was manually configured, or if someone has "crossed" cables over at some point (when doing some maintenance on the drives).
It shouldn't happen, but I would be tempted to load one of the media and see what NetBackup reads the label to be.

3) Preventing overwriting of data.
If NetBackup detects a "known" data format, it can freeze a tape to prevent overwriting it.
Are the media "new", or have they been used in another environment or with another backup product?


Given that the drives are also being downed I would think (1) is the most likely, and I would want to inspect all the frozen media before they are "un-frozen"

quebek
Moderator
   VIP    Certified

Hi

I want to add one more thing to what @StoneRam-Simon said - I would run

bperror -media -hoursago 24

to see recent media errors, and this on the media servers

grep -v -e "<2>" -e "<4>" /usr/openv/netbackup/logs/bptm/log_with_date_when_failure_was_seen

maybe this will be an eye opener... 
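
A rough sketch of the same filter wrapped in shell functions, so it can be pointed at whichever bptm log covers the failure. The function names are made up for illustration, and the only thing assumed about the log is that each line carries a severity tag such as <2>, <4>, <16> or <32> as a separate whitespace-delimited field.

```shell
# bptm_problems FILE
# Print the lines that are NOT tagged <2> or <4>
# (the same filter as the grep above).
bptm_problems() {
    grep -v -e "<2>" -e "<4>" "$1"
}

# bptm_severity_counts FILE
# Count lines per severity tag and print "tag count", one per line.
# Assumption: the tag is a separate field like <16>.
bptm_severity_counts() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^<[0-9]+>$/) { c[$i]++; break }
    }
    END { for (k in c) print k, c[k] }' "$1" | sort
}
```

Running bptm_severity_counts first gives a quick idea of how noisy the log is; bptm_problems then shows only the lines worth reading.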

mph999
Level 6
Employee Accredited

 

My experience, and I do a lot of tape cases ...

 

The majority of issues have nothing to do with NBU - contrary to popular belief, NBU does not write to or read from the tape; it's all done by the OS. At a high level, all NBU does is send data to the OS and request that it is written to the tape at a specified block size. Aside from that we do send various SCSI commands, but again, these are an ‘industry standard’ and nothing to do with NBU. There are exceptions, however ... I had an odd one a while back where it was actually something corrupted in the storage unit; recreating the STU fixed the issue ... There has also been a recent issue with WORM tape (LTO8 and above, I think) which was a code issue.

 

With all due respect to vendors, they run a standard set of tests, and if these don’t test the part where it is failing, they pass the drive as good. This tends to happen for the more unusual cases, as opposed to a simple write failure due to a drive fault or something. I have lost count of the number of times the vendor has blamed NBU and it turns out to be hardware - it seems to be their standard response.

 

NBU config issues can cause problems, though I would expect these to cause failure at the start of the job - but if your setup isn't that complex, just delete the drives/robot and reconfigure (nbemmcmd -deletealldevices -allrecords). It may not fix it, but it does prove as far as we can that the config is good - all too often, this is enough to rule out NBU as being part of the issue.

 

Activity Monitor / job details is where I would start - this may well show the error, or at least whereabouts in the job it is failing. Is it always failing in the same place? Is it always failing on the same few tapes? What is the frequency of failure? What status codes do you see? Is it always in the write (or read) part of the job? Is the tape even making it as far as the drive (robot load error), or if it is, is it mounting correctly? (Tape physically in drive does not equal correctly mounted; several other things have to happen.)

 

What does the bptm log show? (I'll upset Marianne, but I like VERBOSE = 5.) The messages log could be good as well; if you see TAPEALERT, ASC/ASCQ, ioctl or CRC errors, you're almost 100% certain to have a hardware issue.
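
A minimal sketch of scanning the system log for the hardware indicators named above. The function name is hypothetical, the -w option assumes GNU grep (as shipped with RHEL), and the exact wording of the messages varies by driver and NBU version, so treat the pattern as a starting point rather than a complete list.

```shell
# tape_hw_hits FILE
# Print log lines mentioning TapeAlert, ASC/ASCQ sense data,
# ioctl errors or CRC errors (case-insensitive, whole words only,
# so that e.g. "ASCII" is not reported as an ASC hit).
tape_hw_hits() {
    grep -i -w -E 'TAPE_?ALERT|ASC|ASCQ|ioctl|CRC' "$1"
}

# Example (on RHEL the system log is normally /var/log/messages):
#   tape_hw_hits /var/log/messages
```

If this prints anything around the time a drive went down or a tape was frozen, those lines are worth including in the support case.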

 

In fact, set up the volmgr logs as per this article, and include the various ‘touch files’:

 

https://vox.veritas.com/t5/Articles/Quick-Guide-to-Setting-up-logs-in-NetBackup/ta-p/811951

 

Half the battle with these issues is confirming whereabouts the failure is ... and the history behind the issue - when did it start, and were there any changes (e.g. firmware)?

Linux is great, as we have quite a few ‘tools’ to test tape drives: tar, dd, cpio, mt, sg3_utils (an optional package, but certainly worth installing), scsi_command (OK, this one is Veritas, but it is NOT anything to do with NBU), mtx and probably others I can’t think of right now. Tape should be massively reliable (in fact, more so than disk), but tapes can be damaged, and drives and tapes wear out over time, so Simon makes some good points, although a leader pin issue is rare; I've seen it once in 17 years. On the media server(s), the /usr/openv/netbackup/db/media/errors files can be useful, and will also show if it's the same few tapes causing the issue.
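
A rough sketch for spotting whether the same few tapes or the same drive keep turning up in the media errors file. It assumes each line is whitespace-separated as date, time, media ID, drive index, error type, drive name; check a few lines of your own file first and adjust the field numbers if the layout differs. The function names are hypothetical.

```shell
# media_error_tally FILE
# Count entries per media ID (assumed 3rd field);
# prints "count media_id", highest count first.
media_error_tally() {
    awk 'NF >= 5 { n[$3]++ } END { for (m in n) print n[m], m }' "$1" | sort -rn
}

# drive_error_tally FILE
# Count entries per drive index (assumed 4th field);
# prints "count drive_index", highest count first.
drive_error_tally() {
    awk 'NF >= 5 { n[$4]++ } END { for (d in n) print n[d], d }' "$1" | sort -rn
}

# Example:
#   media_error_tally /usr/openv/netbackup/db/media/errors
#   drive_error_tally /usr/openv/netbackup/db/media/errors
```

Errors that follow one drive across many tapes point at that drive; errors that follow one tape across several drives point at the tape.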

mph999
Level 6
Employee Accredited

Log a new case with Veritas (once you have the issue occurring/ have logs available) and post the case number up here, I'll take a look, or ask the TSE to contact me (Martin Holt).

Get a set of logs as I've described (at VERBOSE = 5), plus VERBOSE in vm.conf (all explained in the link). If the issue is very odd, additional logs may be needed, but in the majority of cases what I have suggested is more than sufficient.

If you're not too sure about gathering logs, just post Activity Monitor details up here first as it might be sufficient to narrow the issue down, without collecting everything, although the volmgr and bptm logs are easy to get.  You'll also need a new NBSU output for the case.