Netbackup Verify media

rsm_gbg · ‎05-05-2013

Hi,

Running NetBackup 7.1.0.4 on Solaris 10 SPARC/x86.

We have LTO4 tape drives in an SL48 robot, and if I'm correct they're supposed to do read after write, i.e. verify on the fly.
Our normal backups run fine, without any write errors.
After the backup is done we run a verify on the tapes, and that now fails a couple of times a week.
For some years verify ran almost cleanly, with only the occasional error, maybe 4 times a year.
The tapes are getting old, though: they are 4 years old and have up to 250 mounts.

When I was in contact with Symantec support (for totally different reasons), it seemed that "nobody does verify anymore",
so I'm curious what to blame here.

1. The tape drive?
2. The tapes are old and need replacement
3. The LTO 4 read after write technology is not working properly.
4. The verify is not good enough, it doesn't read properly.

Any insights or ideas on where I should start looking?

- Roland

Marianne · ‎05-05-2013

The bptm log on the media server(s) is the place to start; the exact error message will be logged there.

Another helpful log is /usr/openv/netbackup/db/media/errors on the media server(s). This gives an indication of errors associated with media and tape drives.

A VERBOSE entry in /usr/openv/volmgr/vm.conf (followed by an NBU restart) will ensure device-related errors are logged in /var/adm/messages.

If you post the above logs as file attachments, we can (hopefully) help pinpoint the problem.
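
As a rough sketch of the vm.conf step above: the helper below appends VERBOSE only when the entry is missing. The function name enable_vm_verbose is hypothetical, and the example runs against a scratch file rather than the real /usr/openv/volmgr/vm.conf. It also assumes that the legacy bptm log directory (/usr/openv/netbackup/logs/bptm) already exists on the media server, since bptm only writes its legacy log when that directory is present.

```shell
#!/bin/sh
# Append VERBOSE to a vm.conf-style file unless the entry is already there.
# The file path is passed in so the function can be tried on a copy first.
# (enable_vm_verbose is a hypothetical helper name, not a NetBackup command.)
enable_vm_verbose() {
    conf=$1
    if grep '^VERBOSE$' "$conf" >/dev/null 2>&1; then
        echo "VERBOSE already set in $conf"
    else
        echo "VERBOSE" >> "$conf"
        echo "VERBOSE added to $conf"
    fi
}

# Example on a scratch file. On a media server the real file would be
# /usr/openv/volmgr/vm.conf, followed by a NetBackup restart.
tmpconf=$(mktemp)
enable_vm_verbose "$tmpconf"
enable_vm_verbose "$tmpconf"   # second call leaves the file unchanged
rm -f "$tmpconf"
```

Running it twice on the same file is harmless, which makes it safe to repeat on each media server.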


mph999 · ‎05-06-2013

Hi,

I presume by verify you mean the NetBackup verify? I will base my post on this assumption.

LTO drives do read after write, you are correct. This is done within the drive and is invisible to the server that is making the write operation.

When this operation fails, you would see a TAPE_ALERT such as:

0x03: 'Uncorrectable read/write error'
0x04: 'Media Performance Degraded, Data Is At Risk'
0x05: 'Read Failure'
0x06: 'Write Failure'
0x07: 'Media has reached the end of its useful life'
0x08: 'Cartridge not data grade'

As we see, however, a successful read after write does not guarantee with 100% certainty that the media will be readable in the future. One possibility is that the drive used for the verify is not the drive that wrote the tape. As drives and tapes wear, they can approach the limits of the tolerances they were designed to work within. When this happens, you can get tapes that read and write quite happily on one drive, but fail on reads when moved to a different drive.

Have a go with this :

https://www-secure.symantec.com/connect/downloads/tperrsh-script-solaris-only

It will go through the media errors files and should pick out any drives or media that are having a particular problem. If you have any issues with it let me know, or just post the /usr/openv/netbackup/db/media/errors file from each media server and I'll take a look.

(Ahh, I see the outstanding post from Marianne has mentioned this file(s) already)

So, to your question:

1. The tape drive? Possibly.
2. The tapes are old and need replacement? Possibly.
3. The LTO 4 read after write technology is not working properly? Very unlikely.
4. The verify is not good enough, it doesn't read properly? No.

NetBackup does not carry out the read/write operations itself; the OS does all of this, so for these sorts of issues you need to look outside NBU (as I see you are doing).

Martin

rsm_gbg · ‎05-06-2013

Thanks Girls and Guys.

Helpful as always :D
I will start collecting some logs and I'll post them here.

A side note: we back up to 2 tapes (copies) and one goes offsite; the offsite copy gets verified by NetBackup (bpverify).
The verify takes place after my catalog backup, so the tapes to verify are mounted again and could end up in either drive (we have 2).

- Roland

rsm_gbg · ‎05-06-2013

/usr/openv/netbackup/db/media/errors

03/04/13 17:07:47 NRT006 1 WRITE_ERROR HP.ULTRIUM4-SCSI.001
03/04/13 17:08:10 NRT006 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x34001000 0x00000000
03/26/13 20:09:42 DRT014 0 READ_ERROR HP.ULTRIUM4-SCSI.000
03/26/13 20:09:42 DRT014 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
04/26/13 00:26:30 DRT014 0 READ_ERROR HP.ULTRIUM4-SCSI.000
04/26/13 00:26:30 DRT014 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000
04/26/13 19:12:12 NRT005 1 READ_ERROR HP.ULTRIUM4-SCSI.001
04/26/13 19:12:12 NRT005 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x80000000 0x00000000
04/28/13 00:01:59 DRT010 1 READ_ERROR HP.ULTRIUM4-SCSI.001
04/28/13 00:01:59 DRT010 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x80000000 0x00000000
04/29/13 19:46:40 NRT008 1 READ_ERROR HP.ULTRIUM4-SCSI.001
04/29/13 19:46:40 NRT008 1 TAPE_ALERT HP.ULTRIUM4-SCSI.001 0x80000000 0x00000000
05/03/13 19:08:54 DRT021 0 READ_ERROR HP.ULTRIUM4-SCSI.000
05/03/13 19:08:54 DRT021 0 TAPE_ALERT HP.ULTRIUM4-SCSI.000 0x80000000 0x00000000

Strangely, it is only the remote tapes (denoted by the second letter R in the media ID).

00-17:00 is the actual backup time
17-17:30 is catalog
18-23:00 is verify (all R tapes catalog and data)
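
For anyone reading the two hex words after TAPE_ALERT in the listing above, here is a rough sketch of turning them into TapeAlert flag numbers. It rests on an assumption that is not confirmed anywhere in this thread: that the most significant bit of the first word is flag 1 and that the second word continues with flags 33 to 64. decode_word is a hypothetical helper name. Check the drive vendor's TapeAlert documentation before relying on the result.

```shell
#!/bin/sh
# Print the flag numbers set in one 32-bit TAPE_ALERT word.
#   $1 = hex word (e.g. 0x80000000)
#   $2 = offset   (0 for the first word, 32 for the second)
# ASSUMPTION: the most significant bit maps to the lowest flag number.
decode_word() {
    flags=""
    bit=1
    while [ "$bit" -le 32 ]; do
        mask=$(( 1 << (32 - bit) ))
        if [ $(( $1 & mask )) -ne 0 ]; then
            flags="$flags $(( $2 + bit ))"
        fi
        bit=$(( bit + 1 ))
    done
    # Unquoted on purpose so the leading space is dropped
    echo $flags
}

# The two distinct first words that appear in the errors file above
echo "0x80000000 -> flags: $(decode_word 0x80000000 0)"
echo "0x34001000 -> flags: $(decode_word 0x34001000 0)"
```

If that bit order holds, the READ_ERROR entries carry only flag 1, and the WRITE_ERROR entry on NRT006 carries flags 3, 4, 6 and 20, where 0x06 is the 'Write Failure' entry in the list Martin posted.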

- Roland

mph999 · ‎05-06-2013

What you could do, when you find a tape that fails the verify, is try it in the other drive (just DOWN the drive you don't want to use).

You back up to two tapes; is this 'inline tape copy'? If so, it would be interesting to see whether the other tape also fails the verify.

If you do use inline tape copy, exactly the same data is sent to both drives: bptm sends it once to one drive, then again to the other (via the OS of course ...).

The other thing to look at, apart from the media errors file, is the bptm log covering the time the verify fails.
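
The drive-swap test suggested above could be sketched roughly as follows. This is a dry run by default: it only prints the commands it would issue. The drive index (0), the media ID and the dates are placeholders, the helper names run and verify_on_other_drive are hypothetical, and it assumes vmoprcmd is installed in /usr/openv/volmgr/bin and accepts -down and -up with a drive index.

```shell
#!/bin/sh
# Dry-run sketch: verify one media ID while one drive is DOWN, so the
# verify has to mount the tape in the other drive.
# ASSUMPTIONS: drive index, media ID and dates below are placeholders;
# set DRY_RUN=0 only after checking the commands for your environment.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = index of the drive to exclude, $2 = media ID,
# remaining arguments are passed straight through to bpverify
verify_on_other_drive() {
    drive_index=$1
    media_id=$2
    shift 2
    run /usr/openv/volmgr/bin/vmoprcmd -down "$drive_index"
    run /usr/openv/netbackup/bin/admincmd/bpverify -id "$media_id" "$@"
    # Bring the drive back up whether or not the verify succeeded
    run /usr/openv/volmgr/bin/vmoprcmd -up "$drive_index"
}

verify_on_other_drive 0 DRT021 -s 05/03/2013 00:00:00 -e 05/03/2013 17:00:00
```

Repeating the run with the other drive index excluded shows whether the failure follows the tape or the drive.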

Martin

rsm_gbg · ‎05-06-2013

Good idea, I will do some tests today.

Yes, inline tape copy.
I will try to verify the local copy matching the last failed remote tape and see if it fails.

Is there some handy test script that will load a scratch tape and do some serious drive testing?
I have a 6-hour time slot when nothing is being backed up.

- Roland

rsm_gbg · ‎05-06-2013

root@pnms01:~# /var/tmp/tperr.sh -a
Errors File exists ....
DLT001 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT011 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DLT002 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DRT010 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT001 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
DLT008 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT021 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DLT018 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT031 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 5)
DRT014 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 4)
NLT002 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
DRT016 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
NRT001 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)
NRT005 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
NRT006 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
NRT008 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 2)
NRT009 has had errors in 1 different drives   (Total occurrences (errors) of this volume is 1)

HP.ULTRIUM4-SCSI.000 has had errors with 10 different tapes   (Total occurrences (errors) for this drive is 30)
HP.ULTRIUM4-SCSI.001 has had errors with 7 different tapes   (Total occurrences (errors) for this drive is 12)
root@pnms01:~#

mph999 · ‎05-06-2013

Ahh, you got the script working ...

It works on stats, so the more errors, the more accurate it becomes.

From what you have posted, HP.ULTRIUM4-SCSI.000 has more errors per tape on average (30 errors across 10 tapes, so 3 per tape, against 12 across 7 for the other drive), so I would suggest this is the more problematic drive.

(I have to presume the drives are used about equally)

A few tapes have 4 or 5 errors, not massively high, but more than the others, so these may be the more problematic tapes.

I was hoping for a bigger difference between the numbers of errors on the drives (or tapes). However, we can only go with what we have got.
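
The per-drive comparison can be reproduced from the two summary lines of the tperr.sh output with a little awk. This is only a sketch: errors_per_tape is a hypothetical helper name, and it assumes the summary lines keep exactly the wording shown in the output posted above.

```shell
#!/bin/sh
# Errors per distinct tape for each drive, from the tperr.sh summary lines.
# The two input lines are copied from the output posted in this thread.
summary='HP.ULTRIUM4-SCSI.000 has had errors with 10 different tapes   (Total occurrences (errors) for this drive is 30)
HP.ULTRIUM4-SCSI.001 has had errors with 7 different tapes   (Total occurrences (errors) for this drive is 12)'

errors_per_tape() {
    awk '/has had errors with/ {
        tapes = $6          # number of distinct tapes with errors
        total = $NF + 0     # last field is e.g. "30)"; + 0 keeps the number
        printf "%s %.2f\n", $1, total / tapes
    }'
}

echo "$summary" | errors_per_tape
```

It prints one line per drive with the drive name and its errors-per-tape ratio, which is only meaningful if both drives see roughly the same amount of use.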

There are no official tape test scripts. I have one that hammers the drives to check positioning; you are welcome to have a copy:

#!/usr/bin/ksh
# Usage: tapetest <device path>
# Exercises tape positioning (fsf, bsf, asf, eom) on a preloaded tape.
# The tape needs existing filemarks (i.e. data on it) to position over.

if [[ -z "$1" ]]
then
    echo "Please rerun and specify drive device"
    exit 1
fi

# Define variables. mt falls back to $TAPE when no -f option is given.
TAPE=$1
export TAPE
RANDOM_COUNT=12

# Print the file number the drive currently reports
tape_position () {
    mt -f "$TAPE" stat | grep file | awk -F= '{print $2}' | awk '{print $1}'
}

# Space forward one file at a time until fsf fails, to find MAX_FILES
max_files () {
    echo "Rewinding tape in $TAPE"
    mt rew
    echo "Searching for end-of-media for $TAPE"
    while mt fsf
    do
        MAX_FILES=$(tape_position)
        echo "Tape at position $MAX_FILES"
    done
    echo "Rewinding tape"
    mt rew
}

# Step forward through every file from the beginning of the tape
fsf_sequential_forward () {
    echo "fsf_sequential_forward test"
    mt rew
    count=1
    echo "Maximum 'file' position = $MAX_FILES"
    while [[ $count -le $MAX_FILES ]]
    do
        echo "Positioning to file $count"
        mt fsf 1
        echo "Drive reports position $(tape_position)"
        let count+=1
    done
    echo "fsf_sequential_forward test complete"
}

# Step backward through every file from end of media
bsf_sequential_backwards () {
    echo "bsf_sequential_backwards test"
    mt eom
    count=$MAX_FILES
    while [[ $count -ne 0 ]]
    do
        echo "Positioning to file $(($count-1))"
        mt bsf 1
        echo "Drive reports position $(tape_position)"
        let count-=1
    done
    echo "bsf_sequential_backwards test complete"
}

# Jump to RANDOM_COUNT randomly chosen file positions
random_position () {
    echo "Starting random positioning test"
    count=1
    while [[ $count -le $RANDOM_COUNT ]]
    do
        position_value=$(( RANDOM % $MAX_FILES + 1 ))
        echo "Positioning to file $position_value"
        mt asf $position_value
        echo "Drive reports position $(tape_position)"
        let count+=1
    done
    echo "Random positioning test complete"
}

# Rewind, then position straight to end of media
eom_test () {
    echo "Checking position to eom"
    echo "Rewinding ..."
    mt rew
    echo "Positioning to EOM"
    mt eom
    echo "Drive reports position $(tape_position)"
    mt rew
    echo "eom test complete"
}

max_files

# Without filemarks there is nothing to position over (and % 0 would fail)
if [[ -z "$MAX_FILES" || "$MAX_FILES" -eq 0 ]]
then
    echo "No filemarks found on tape, nothing to test"
    exit 1
fi

fsf_sequential_forward
bsf_sequential_backwards
random_position
eom_test

Apologies, it's not written particularly well, I just needed it to work, not look pretty ...

Pass the device file on the command line, preload a tape using robtest.

If you change the RANDOM_COUNT variable, this will change how many times it will randomly position.

Regards,

Martin

rsm_gbg · ‎05-06-2013

To me it looks more likely that we are starting to reach end of life on those tapes.
But before I change all the tapes, I want to make sure I'm doing the right thing.
Replacing tapes will not solve a problem with the drives.

I'm happy to replace tapes as long as I know that they are the problem.
And I would replace the whole lot of older tapes in one go instead of waiting for them to fail.

- Roland

rsm_gbg · ‎05-06-2013

I successfully verified the LOCAL copy of the last failed verify.
Last verify error was on a Remote tape DRT021.

/usr/openv/netbackup/bin/admincmd/bpverify -cn 1 -id DLT016 -s 05/03/2013 00:00:00 -e 05/03/2013 17:00:00
.
.
.
INF - Status = successfully verified 13 of 13 images.

- Roland

mph999 · ‎05-06-2013

There is not really any way of telling whether it is the drive or the tapes. Sometimes, if you are lucky, you get a TapeAlert showing the media is degraded or similar, but sometimes you just get read/write errors with no clear cause.

Some companies just replace tapes every 3 or 4 years to avoid any issues.

The only way to tell for sure is to run specialist software such as Storsentry. This monitors both tapes and drives separately and is able to predict failures before they happen, on either tapes or drives, and yes, it does work very, very well. The downside is that it is not cheap.

Martin

Mark_Solutions · ‎05-07-2013

It is always worth cleaning your drives regularly too, and making sure your environment is sound.

Tapes do not like changes in temperature or humidity, so if they write OK and are then taken from the library in a nice cool server room and transported at a higher temperature, they could get damaged.

NetBackup allows a certain number of read/write errors without freezing a tape or downing a drive, but that doesn't mean the tape or drive will always be good, as other factors can affect it.

Sony once told us that a tape should only change temperature by 2 degrees in an hour, which I know is not the case for many of my customers!

Drive cleaning and keeping the drive firmware up to date can both help, but too much drive cleaning will wear the heads out too, as will shoe-shining if you don't feed the drives with data fast enough.

So many factors!

As you are seeing both read and write errors, a couple of cleaning cycles and a check of the firmware may help. Then see if the errors reduce.

Hope this helps

rsm_gbg · ‎05-19-2013

Hi,

I turned on the debugging and, as always, I haven't seen any errors since.

I'll post again when I get an error in the logs.

- Roland
