"io_position_for_write" failing causing error code 52 ( bptm timeout ), any clues ?

Hi Folks,
I have been getting a lot of backup jobs failing with error code 52, i.e. a
bptm timeout.
Going through logs/bptm/log.<mmddyy>, I noticed that the point where bptm
gave up was:

15:59:47.618 [21913] <2> write_backup: media id W425L2 mounted on drive index 2, drivepath /dev/rmt/3cbn, drivename HPUltrium2-SCSI2, copy 1
15:59:47.619 [21913] <2> io_read_media_header: drive index 2, reading media header, buflen = 65536, buff = 0x18b770, copy 1
15:59:47.619 [21913] <2> io_ioctl: command (5)MTREW 1 from (bptm.c.7487) on drive index 2
15:59:47.708 [21913] <2> io_ioctl: command (1)MTFSF 1 from (bptm.c.7740) on drive index 2
15:59:47.781 [21913] <2> io_position_for_write: position media id W425L2, copy 1, current number images = 3
15:59:47.781 [21913] <2> io_position_for_write: locating to absolute block number 431207, copy 1
16:15:40.641 [21913] <2> set_job_details: Sending Tfile jobid (116993)
16:15:40.641 [21913] <2> set_job_details: LOG 1260778540 8 bptm 21913 cannot locate on drive index 2, locate scsi command failed, key = 0x3, asc = 0x14, ascq = 0x0
16:15:40.697 [21913] <8> io_position_for_write: cannot locate on drive index 2, locate scsi command failed, key = 0x3, asc = 0x14, ascq = 0x0
16:15:40.697 [21913] <2> io_ioctl: command (5)MTREW 1 from (bptm.c.6431) on drive index 2

Apparently bptm mounted the media (W425L2) successfully, but when it tried to position the head
at block 431207, it failed, as if that block were not there!
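
For anyone who wants to measure how long the drive spent on the locate before giving up, here is a minimal Python sketch. It assumes the bptm log lines have been unwrapped to one entry per line and that the log does not cross midnight. The function names (`parse`, `locate_attempts`) are my own, not part of NetBackup, and the regular expressions are based only on the lines quoted above.

```python
import re
from datetime import datetime

# Two lines taken from the log excerpt above, unwrapped to one entry per line.
LOG = """\
15:59:47.781 [21913] <2> io_position_for_write: locating to absolute block number 431207, copy 1
16:15:40.697 [21913] <8> io_position_for_write: cannot locate on drive index 2, locate scsi command failed, key = 0x3, asc = 0x14, ascq = 0x0
"""

# timestamp, [pid], <level>, function name, message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] <(\d+)> (\w+): (.*)$")
LOCATE_RE = re.compile(r"locating to absolute block number (\d+)")


def parse(line):
    """Split one log line into (time, pid, level, function, message), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts, pid, level, func, msg = m.groups()
    return datetime.strptime(ts, "%H:%M:%S.%f"), int(pid), int(level), func, msg


def locate_attempts(text):
    """Return (pid, block, seconds) for each locate that ended in 'cannot locate'."""
    started = {}  # pid -> (start time, target block)
    results = []
    for line in text.splitlines():
        rec = parse(line)
        if rec is None:
            continue
        ts, pid, _level, func, msg = rec
        if func != "io_position_for_write":
            continue
        m = LOCATE_RE.search(msg)
        if m:
            started[pid] = (ts, int(m.group(1)))
        elif "cannot locate" in msg and pid in started:
            t0, block = started.pop(pid)
            results.append((pid, block, (ts - t0).total_seconds()))
    return results
```

Calling `locate_attempts(LOG)` on the two lines above gives one attempt for PID 21913, block 431207, lasting roughly 953 seconds, i.e. the drive tried for almost 16 minutes before the locate failed.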

W425L2 is a NEW tape which I unwrapped 2 weeks ago.

This has also happened to 7 other tapes (a mixture of old and new) over 3 weeks,
so I doubt it is a media issue.
Having said that, I don't think it is a drive problem either,
because I have another 2 drives which gave the same error when using the
same tapes.
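
To separate a drive fault from a media fault more systematically, the locate failures can be tallied per media id and per drive index. The sketch below is only an illustration: it assumes unwrapped bptm log lines in the format shown above, links the "mounted" line to the failure line through the bptm PID, and uses a helper name (`tally_locate_failures`) that I made up. The last two sample lines are hypothetical and are there only to show a second drive.

```python
import re
from collections import Counter

MOUNT_RE = re.compile(
    r"\[(\d+)\] <\d+> write_backup: media id (\w+) mounted on drive index (\d+)")
FAIL_RE = re.compile(
    r"\[(\d+)\] <\d+> io_position_for_write: cannot locate on drive index (\d+)")

SAMPLE = [
    # From the log excerpt above.
    "15:59:47.618 [21913] <2> write_backup: media id W425L2 mounted on drive index 2, "
    "drivepath /dev/rmt/3cbn, drivename HPUltrium2-SCSI2, copy 1",
    "16:15:40.697 [21913] <8> io_position_for_write: cannot locate on drive index 2, "
    "locate scsi command failed, key = 0x3, asc = 0x14, ascq = 0x0",
    # Hypothetical lines: the same tape failing on another drive.
    "17:02:10.000 [22001] <2> write_backup: media id W425L2 mounted on drive index 0, copy 1",
    "17:18:03.000 [22001] <8> io_position_for_write: cannot locate on drive index 0, "
    "locate scsi command failed, key = 0x3, asc = 0x14, ascq = 0x0",
]


def tally_locate_failures(lines):
    """Count 'cannot locate' failures per media id and per drive index."""
    mounted = {}  # pid -> media id most recently mounted by that bptm process
    by_media, by_drive = Counter(), Counter()
    for line in lines:
        m = MOUNT_RE.search(line)
        if m:
            mounted[m.group(1)] = m.group(2)
            continue
        m = FAIL_RE.search(line)
        if m:
            pid, drive = m.group(1), int(m.group(2))
            by_media[mounted.get(pid, "unknown")] += 1
            by_drive[drive] += 1
    return by_media, by_drive
```

If the failures cluster on a few media ids across several drives, the tapes are the more likely suspect; if they cluster on one drive index across many tapes, the drive is.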

More background info.

id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
--------------------------------------------------------------------------------
W425L2   5      3   12/04/2009 12:38  12/05/2009 11:01  hcart2   110386720     0
                3   03/08/2010 11:01        N/A          

Any clue as to what is happening here?

BTW, my L100 tape library was found to be hung on 7 Dec 2009. All jobs
failed with code 52, caused by bptm timing out because it was unable to mount media
(reasonable, because the robot was hung as well).

Assuming that the library had been hung since as early as 5 Dec, could that have anything to
do with the SCSI locate error?

Any insight or suggestion would be most appreciated.
Thanks in advance.





7 Replies

worn out tape drive.


This is a classic HP LTO symptom of a worn-out tape drive. I don't know why a "cannot locate" status is thrown; a more meaningful message would be "Read error".

Whenever we see a drive start to report key = 0x3, asc = 0x14, ascq = 0x0, we get it replaced. You may try cleaning the tape drive, but expect that only to postpone the inevitable.
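
For reference, the three numbers in that message are standard SCSI sense data and can be decoded by hand. The Python sketch below is a minimal lookup, not a complete table: it holds the standard sense keys 0x0 to 0x8 and a handful of ASC/ASCQ pairs from the SCSI standard that are relevant to tape positioning. The function name `decode_sense` is my own, and the regular expression assumes the message layout seen in the bptm log above.

```python
import re

# Standard SCSI sense keys (subset).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
}

# A few standard ASC/ASCQ pairs; deliberately incomplete.
ASC_ASCQ = {
    (0x00, 0x05): "END-OF-DATA DETECTED",
    (0x0C, 0x00): "WRITE ERROR",
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x14, 0x00): "RECORDED ENTITY NOT FOUND",
    (0x14, 0x03): "END-OF-DATA NOT FOUND",
}

SENSE_RE = re.compile(
    r"key\s*=\s*0x([0-9a-fA-F]+),\s*asc\s*=\s*0x([0-9a-fA-F]+),\s*ascq\s*=\s*0x([0-9a-fA-F]+)")


def decode_sense(message):
    """Return (sense key text, ASC/ASCQ text) for a bptm message, or None."""
    m = SENSE_RE.search(message)
    if m is None:
        return None
    key, asc, ascq = (int(g, 16) for g in m.groups())
    return (SENSE_KEYS.get(key, "unknown sense key"),
            ASC_ASCQ.get((asc, ascq), "not in this table"))
```

For the message in this thread, `decode_sense` returns `("MEDIUM ERROR", "RECORDED ENTITY NOT FOUND")`. In other words, the drive reports that it could not find the requested block on the medium, which by itself does not prove whether the tape or the drive heads are at fault.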




Hi Nicolai,
thanks for posting.

But I ran a control test by using a different tape drive, and it gave the same error.

It seems to me that the tape is somehow "corrupted". Is this possible?





Hello


Like Nicolai, my reading of the logs makes me think it is a drive issue.
Nevertheless, as you are quite sure the drive is OK, maybe you should try another operation on the tape.
Could you try a long erase on W425L2 and, if the erase is OK, try a test backup?

YES


It is possible. If there is no End of Data (EOD) marker on a tape, you will get a similar error. A missing EOD marker could be the result of a SAN or power failure during a write operation. An LTO tape has an index at the beginning of the tape, and that index gets updated on unload. That is why it is a bad idea to power cycle a tape drive while a tape is mounted.





Hi Nicolai,

"It is possible. If there is no End of Data (EOD) marker on a tape, you will get a similar error. A missing EOD marker could be the result of a SAN or power failure during a write operation. An LTO tape has an index at the beginning of the tape, and that index gets updated on unload. That is why it is a bad idea to power cycle a tape drive while a tape is mounted."

Come to think of it, my tape library did hang around the time the tape was last updated,
so that could have caused the corruption.

In that case I have another question.
Let's say a backup job is writing the nth image to a tape.
If the library hung during the backup operation, the backup job would have failed, right?

Consequently, the image catalogue should not have been updated.
And when the tape is next used for a backup, it should be positioned at the end
of the (n-1)th image instead of the nth, right?

Then this error should not have occurred, right?

Hi Tula,
I will try as you suggested. Thanks very much

powercycle before unload = bad


You are right, NetBackup should keep track of the number of images. But it is important to note that the tape drive updates its index during UNLOAD, so if NetBackup believes everything is OK and you power cycle the drive before the unload, the last EOD is missing from the tape index. The next time the tape is mounted, NetBackup asks the drive to locate the end of the last image, but the drive doesn't know where that is because the index was never updated.

In NBU 5.1 we often needed to suspend a tape after a write failure. Not the same as your case, but along the same lines.

Try checking the HCART type of the drive against that of the tape.


It is just a shot in the dark.