05-05-2013 05:19 PM
Hi,
Running NetBackup 7.1.0.4 on Solaris 10 SPARC/X86.
We have LTO4 tape drives in an SL48 robot and, if I'm correct, they're supposed to do read-after-write, i.e. verify on the fly.
Our normal backups run fine without any write errors.
After the backup is done we run a verify on the tapes, and it now fails a couple of times a week.
Verify ran almost error-free for some years, with only an occasional error, maybe 4 times a year.
The tapes are a bit old now: 4 years old, with up to 250 mounts.
When I was in contact with Symantec support (for totally different reasons), I got the impression that "nobody does verify anymore".
So I'm curious what is to blame here.
1. The tape drive?
2. The tapes are old and need replacement.
3. The LTO 4 read after write technology is not working properly.
4. The verify is not good enough, it doesn't read properly.
Any insights or ideas on where I should start looking?
- Roland
05-05-2013 09:31 PM
The bptm log on the media server(s) is the place to start; the exact error message will be logged there.
Another helpful log is /usr/openv/netbackup/db/media/errors on the media server(s). It gives an indication of errors associated with media and tape drives.
A VERBOSE entry in /usr/openv/volmgr/vm.conf (followed by an NBU restart) will ensure device-related errors are logged in /var/adm/messages.
If you post the above logs as file attachments, we can (hopefully) help pinpoint the problem.
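As a rough illustration of the vm.conf step, here is a minimal Python sketch that adds a VERBOSE line only if one is not already present. It assumes vm.conf is a plain text file with one directive per line; the function name `ensure_verbose` and the sample entry are hypothetical, and the sketch works on a string rather than touching the real file or restarting NBU.

```python
def ensure_verbose(conf_text):
    """Return conf_text with a VERBOSE line appended if none is present."""
    lines = conf_text.splitlines()
    if any(line.strip() == "VERBOSE" for line in lines):
        return conf_text  # already enabled, leave the text untouched
    lines.append("VERBOSE")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical existing content, used only for illustration.
    sample = "SOME_EXISTING_ENTRY = value\n"
    print(ensure_verbose(sample), end="")
```

Working on a string keeps the check idempotent and easy to try out before anything is written back to /usr/openv/volmgr/vm.conf.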
05-06-2013 12:17 AM
Hi,
I presume by verify you mean the NetBackup verify? I will base my post on this assumption.
LTO drives do read after write, you are correct; this is done within the drive and is invisible to the server performing the write operation.
When this operation fails, you would see a TAPE_ALERT such as:
05-06-2013 02:28 PM
Thanks Girls and Guys.
Helpful as always :D
I will start collecting some logs and I'll post them here.
A side note: we back up to 2 tapes (copies) and one goes offsite; the offsite copy gets verified by NetBackup (bpverify).
The verify takes place after my catalog backup, so the tapes are mounted again for the verify and could end up in either drive (we have 2).
- Roland
05-06-2013 02:36 PM
/usr/openv/netbackup/db/media/errors
03/04/13 17:07:47 NRT006 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/04/13 17:08:10 NRT006 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x34001000 0x00000000
03/26/13 20:09:42 DRT014 0 READ_ERROR HP.ULTRIUM4-SCSI.000
03/26/13 20:09:42 DRT014 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
04/26/13 00:26:30 DRT014 0 READ_ERROR HP.ULTRIUM4-SCSI.000
04/26/13 00:26:30 DRT014 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
04/26/13 19:12:12 NRT005 1 READ_ERROR HP.ULTRIUM4-SCSI.001
04/26/13 19:12:12 NRT005 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x80000000 0x00000000
04/28/13 00:01:59 DRT010 1 READ_ERROR HP.ULTRIUM4-SCSI.001
04/28/13 00:01:59 DRT010 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x80000000 0x00000000
04/29/13 19:46:40 NRT008 1 READ_ERROR HP.ULTRIUM4-SCSI.001
04/29/13 19:46:40 NRT008 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x80000000 0x00000000
05/03/13 19:08:54 DRT021 0 READ_ERROR HP.ULTRIUM4-SCSI.000
05/03/13 19:08:54 DRT021 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
Strangely, it is only the remote tapes, denoted by the second letter R.
00-17:00 is the actual backup time
17-17:30 is catalog
18-23:00 is verify (all R tapes catalog and data)
- Roland
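The two hex words after TAPE_ALERT in the listing above look like a 64-bit TapeAlert bitmask. The sketch below decodes them under the assumption that flag 1 is the most significant bit of the first word. That assumption is not confirmed anywhere in this thread, but it fits the entries: 0x34001000 appears next to the WRITE_ERROR and 0x80000000 next to the READ_ERRORs. The function names and the small flag table are illustrative, and the table covers only a few flags.

```python
# Partial table of TapeAlert flag names (a subset only, for illustration).
FLAG_NAMES = {
    1: "Read warning",
    2: "Write warning",
    3: "Hard error",
    4: "Media",
    5: "Read failure",
    6: "Write failure",
    20: "Clean now",
}


def decode_tape_alert(word1, word2):
    """Return the flag numbers (1-64) set in the two logged hex words.

    Assumption: flag 1 is the most significant bit of the first word.
    """
    value = (int(word1, 16) << 32) | int(word2, 16)
    return [n for n in range(1, 65) if value & (1 << (64 - n))]


def describe(flags):
    """Attach a name to each flag number where this sketch knows one."""
    return [f"{n}: {FLAG_NAMES.get(n, 'not named in this sketch')}" for n in flags]


if __name__ == "__main__":
    for words in (("0x34001000", "0x00000000"), ("0x80000000", "0x00000000")):
        print(words, describe(decode_tape_alert(*words)))
```

Under that assumption the write failure on NRT006 decodes to flags 3, 4, 6 and 20, and every read error decodes to flag 1 alone.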
05-06-2013 02:42 PM
When you find a tape that fails the verify, you could try it in the other drive (just down the drive you don't want to use).
You back up to two tapes; is this 'inline tape copy'? If so, it would be interesting to see whether the other tape also fails the verify.
If you do use inline tape copy, exactly the same data is sent to both drives: bptm sends it once to one drive, then again to the other (via the OS of course ...).
The other thing to look at, of course, apart from the media errors file, is the bptm log covering the time the verify fails.
Martin
05-06-2013 03:03 PM
Good idea, I will do some tests today.
Yes, inline tape copy.
I will try to verify a local copy from the last failed remote and see if it fails.
Is there some handy test script that will load a scratch tape and do some serious drive testing?
I have a 6-hour time slot when nothing is being backed up.
- Roland
05-06-2013 03:06 PM
root@pnms01:~# /var/tmp/tperr.sh -a
Errors File exists ....
DLT001 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
DLT011 has had errors in 1 different drives (Total occurrences (errors) of this volume is 4)
DLT002 has had errors in 1 different drives (Total occurrences (errors) of this volume is 4)
DRT010 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
DRT001 has had errors in 1 different drives (Total occurrences (errors) of this volume is 4)
DLT008 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
DRT021 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
DLT018 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
DRT031 has had errors in 1 different drives (Total occurrences (errors) of this volume is 5)
DRT014 has had errors in 1 different drives (Total occurrences (errors) of this volume is 4)
NLT002 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
DRT016 has had errors in 1 different drives (Total occurrences (errors) of this volume is 1)
NRT001 has had errors in 1 different drives (Total occurrences (errors) of this volume is 1)
NRT005 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
NRT006 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
NRT008 has had errors in 1 different drives (Total occurrences (errors) of this volume is 2)
NRT009 has had errors in 1 different drives (Total occurrences (errors) of this volume is 1)
HP.ULTRIUM4-SCSI.000 has had errors with 10 different tapes (Total occurrences (errors) for this drive is 30)
HP.ULTRIUM4-SCSI.001 has had errors with 7 different tapes (Total occurrences (errors) for this drive is 12)
root@pnms01:~#
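The tperr.sh script itself is not shown in this thread, so the following is only a guess at the kind of tally it produces, written as a small Python sketch. It assumes each line of the media errors file has the layout seen earlier (date, time, media ID, drive index, error type, drive name) and that every logged line, TAPE_ALERT included, counts as one occurrence. The name `tally` and the return shape are made up for illustration.

```python
from collections import defaultdict

# A few lines copied from the errors file posted earlier in the thread.
SAMPLE = """\
03/04/13 17:07:47 NRT006 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/04/13 17:08:10 NRT006 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x34001000 0x00000000
03/26/13 20:09:42 DRT014 0 READ_ERROR HP.ULTRIUM4-SCSI.000
03/26/13 20:09:42 DRT014 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
04/26/13 00:26:30 DRT014 0 READ_ERROR HP.ULTRIUM4-SCSI.000
04/26/13 00:26:30 DRT014 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
"""


def tally(text):
    """Return (per_tape, per_drive) summaries of the errors file text.

    per_tape maps media ID -> (distinct drives, total lines);
    per_drive maps drive name -> (distinct tapes, total lines).
    """
    tape_drives, drive_tapes = defaultdict(set), defaultdict(set)
    tape_count, drive_count = defaultdict(int), defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or unexpected lines
        media_id, drive = fields[2], fields[5]
        tape_drives[media_id].add(drive)
        drive_tapes[drive].add(media_id)
        tape_count[media_id] += 1
        drive_count[drive] += 1
    per_tape = {m: (len(tape_drives[m]), tape_count[m]) for m in tape_count}
    per_drive = {d: (len(drive_tapes[d]), drive_count[d]) for d in drive_count}
    return per_tape, per_drive


if __name__ == "__main__":
    per_tape, per_drive = tally(SAMPLE)
    for media_id, (drives, total) in per_tape.items():
        print(f"{media_id}: {drives} drive(s), {total} occurrence(s)")
    for drive, (tapes, total) in per_drive.items():
        print(f"{drive}: {tapes} tape(s), {total} occurrence(s)")
```

Counting every line this way gives 2 for NRT006 and 4 for DRT014 on the sample, which matches the totals the script reports for those two tapes.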
05-06-2013 03:22 PM
Ahh, you got the script working ...
It works on stats, so the more errors, the more accurate it becomes.
From what you have posted, HP.ULTRIUM4-SCSI.000 averages about 3 errors per tape (30 errors across 10 tapes), against about 1.7 for HP.ULTRIUM4-SCSI.001 (12 across 7), so I would suggest this is the more problematic drive.
(I have to presume the drives are used about equally.)
A few tapes have 4 or 5 errors, not massively high, but more than the others, so these may be the more problematic tapes.
I was hoping for a big difference between the numbers of errors on the drives (or tapes) - however, we can only go with what we have got.
There are no official tape test scripts. I have one that hammers the drives to check positioning, you are welcome to have a copy;
05-06-2013 05:32 PM
To me it looks more likely that we are starting to reach end of life on those tapes.
But before I change all the tapes, I just wanted to make sure I do the right thing.
Problems with the drives will not be solved by replacing tapes.
I'm happy to replace tapes as long as I know that is the problem.
And I would replace the whole lot of older tapes in one go instead of waiting for them to fail.
- Roland
05-06-2013 08:54 PM
I successfully verified the LOCAL copy of the last failed verify.
Last verify error was on a Remote tape DRT021.
/usr/openv/netbackup/bin/admincmd/bpverify -cn 1 -id DLT016 -s 05/03/2013 00:00:00 -e 05/03/2013 17:00:00
.
.
.
INF - Status = successfully verified 13 of 13 images.
- Roland
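For repeating this check on other tapes, here is a small Python sketch that only builds the same bpverify command line for a given media ID and time window. It uses nothing beyond the options that appear in the command above (-cn, -id, -s, -e); the helper name `bpverify_command` is made up, and the sketch deliberately does not run anything.

```python
BPVERIFY = "/usr/openv/netbackup/bin/admincmd/bpverify"


def bpverify_command(media_id, copy_number, start, end):
    """Build the argument list for a bpverify run on one media ID.

    start and end are "MM/DD/YYYY HH:MM:SS" strings; date and time are
    passed as separate arguments, as in the command shown above.
    """
    return [
        BPVERIFY,
        "-cn", str(copy_number),
        "-id", media_id,
        "-s", *start.split(),
        "-e", *end.split(),
    ]


if __name__ == "__main__":
    cmd = bpverify_command("DLT016", 1, "05/03/2013 00:00:00", "05/03/2013 17:00:00")
    print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd)` on the media server if you want to script the verify, with the copy number and media ID swapped to test the other copy.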
05-06-2013 11:12 PM
There is not really any way of telling whether it is the drive or the tapes. Sometimes, if you are lucky, you get a TapeAlert showing the media is degraded or similar, but sometimes you just get write / read errors with no clear cause.
Some companies just replace tapes every 3 or 4 years to avoid any issues.
The only way to tell for sure is to run specialist software such as Storsentry. It monitors both tapes and drives separately and is able to predict failures before they happen, on either tapes or drives, and yes, it does work very well. The downside is that it is not cheap.
Martin
05-07-2013 08:55 AM
It is always worth regularly cleaning your drives too, and making sure your environment is sound.
Tapes do not like changes in temperature or humidity, so if they write OK and are then taken from the library in a nice cool server room and transported at a higher temperature, they could get damaged.
NetBackup allows a certain amount of read / write errors without freezing a tape or downing a drive, but that doesn't mean the tape or drive will always be good, as other factors can affect it.
Sony once told us that a tape should only change temperature by 2 degrees in an hour - which I know is not the case for many of my customers!
Drive cleaning and keeping the drive firmware up to date can both help, but too much drive cleaning will wear the heads out too, as will shoeshining if you don't feed the tapes with data fast enough.
So many factors!
As you are seeing both read and write errors, a couple of cleaning cycles and a check of the firmware may help. Then see if the errors reduce.
Hope this helps
05-19-2013 10:46 PM
Hi,
I turned on debugging and, as always, I haven't seen any errors since.
I'll post again when I get an error in the logs.
- Roland