cancel
Showing results forΒ 
Search instead forΒ 
Did you mean:Β 

Warning bptm cannot locate on drive index , No more data is on the tape.

George_Dicu
Level 4

Hello there,

We have

  • IBM TS3500, TLD 3584 library with 5 LTO-5 Drives updated with the last microcode
  • We use original IBM tapes.
  • NBU 7.6.0.2, 1 master and 4 media servers.
  • Windows srv 2008 r2

Issue:"Warning bptm(pid=) cannot locate on drive index 5, No more data is on the tape. "

When NB is trying to write on tape, we get the above error.

This is what we get when SLP are running and doing Duplications.

I found in bptm log, withing the time the job was running:

16:20:06.824 [12680.8964] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1567)
16:20:06.824 [12680.8964] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1576)
16:20:07.791 [10424.13532] <2> tapelib: wait_for_ltid, Mount, timeout 0
16:20:29.475 [10424.13532] <2> io_ioctl: command (5)MTREW 1 0x0 from (bptm.c.8312) on drive index 4
16:20:29.506 [10424.13532] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.8564) on drive index 4
16:25:48.761 [10424.13532] <8> io_position_for_write: cannot locate on drive index 4, No more data is on the tape.
16:25:48.761 [10424.13532] <2> io_ioctl: command (5)MTREW 1 0x0 from (bptm.c.7146) on drive index 4
16:26:14.719 [10424.13532] <2> io_position_for_write: If header checks out, then overwrite it at LBA 0
16:26:14.813 [10424.13532] <2> io_position_for_write: Media at BOT

Please ignore LOT 3 handling, I only use LTO5, the focus is on tape: XO0354.

I attached the logs, please ask if smth is needed in order to gather more data for troubleshooting.

Thank You

//George

1 ACCEPTED SOLUTION

Accepted Solutions

mph999
Level 6
Employee Accredited

OK ...

So ... in summary ...

We have multiple tapes that are NOT full, that have no empty header at the end, at least accoding to scsi_command -map

The issue is seen when attempting to position to the end of data, which we cannot do as we can't find an empty header, which is what we look for.

The issue actually happens when the last backup to that tappe was written, so one thing we will have to do is look at the backup logs and see what they say - do we try and write the header and the tape drive 'ignores' the request, of do we just not even try.

Also, once I work out the syntax, we'lll try nt_ttu to position 'outsdie' fo NBU.

Regarding the logs, if the issue is happeneing very frequesntly as you suggest, it could be worth making some very small backups (just a few files) and then checking the tape to see if the header is missing.  If this works, we could get the logs and new_scsi map output within a short time.

It's not impossible that the issue happens only on 'big' backups - ok, I can't see how, but when t/shooting you need to consider everything no matter how unlikely - this would make it a bit more time consuming to reproduce.

The other way, is to just turn on bptm at verbose 5, and wait for a failure on a duplication, then look back at the logs - if you are a bit short on disk space, we might have to back the logs up and then delete them, then when we get a failure, restore the log we need.

That's all I can think of at the moment - very odd, as this issue hasn't been reported before - so I suspect it will come down to be a combination of factors that are unique to your environment, eg. driver version + firmware version + NBU version + maybe some OS patch or something ...

I'll also look to see if there is some scsi logging we can turn on, not sure in Windows, certainly you can in Linux/ Solaris - if it's available, Engineering would ask for that so might as well do it now .

So in summary, the next steps as I see them at the moment.

Reproduce issue and get bptm logs + scsi logs (if available) + event logs  +  mm logs

Get scsi_command map output for new failure (to clearly demonstrate the header is missing

nt_ttiu test (once I've worked out how, sorry, manually page is useless)

We'll also have a chat about the exact history (I know you have mentioned it but we'll confirm the exact details) and maybe look at trying a different tape driver.

Once we have this, if no solution can be found (I'll have a chat with my colleague Pete who is also BL and very very good at tape drive issues), there are also a couple of BL guys in the US who are very good at these sort of things as well.

If that provides no solution, then I'll be in a position to formally esclalate to Engineering.

Hopefully this is acceptable for you,

Regards,

Martin

View solution in original post

24 REPLIES 24

RamNagalla
Moderator
Moderator
Partner    VIP    Certified

other than the below log i do also see lot of robot operation failrues... does the Drives working fine for backups?

by any chance did you check the normal backup on these Drives?

 

 

    Line 613: 16:48:59.597 [10424.13532] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.7565) on drive index 4
    Line 614: 16:48:59.597 [10424.13532] <2> io_position_for_write: If header checks out, then overwrite it at LBA 19505160
    Line 615: 16:48:59.613 [10424.13532] <2> io_position_for_write: at file number 1485
    Line 616: 16:48:59.613 [10424.13532] <2> io_ioctl: command (1)MTFSF 3 0x0 from (bptm.c.7358) on drive index 4
    Line 617: 16:53:28.261 [10424.13532] <2> io_ioctl: MTFSF encountered EOD, normal return
    Line 618: 16:53:28.261 [10424.13532] <2> io_position_for_write: If header checks out, then overwrite it at LBA 19505729
    Line 619: 16:53:28.292 [10424.13532] <2> io_read_block: read error on media id XO0354, drive index 4 reading header block, len = 0; No more data is on the tape. (1104)
    Line 620: 16:53:28.292 [10424.13532] <16> io_position_for_write: cannot position media id XO0354 for write
    Line 621: 16:53:28.323 [10424.13532] <2> send_MDS_msg: DEVICE_STATUS 1 162081420 adm-master01 XO0354 4002628 IBM.ULT3580-TD5.Drive1 2000494 POSITION_ERROR 0 0
    Line 622: 16:53:28.385 [10424.13532] <2> log_media_error: successfully wrote to error file - 08/21/14 16:53:28 XO0354 4 POSITION_ERROR IBM.ULT3580-TD5.Drive1

SymTerry
Level 6
Employee Accredited

As Nagalla mentioned, this looks to be a hardware issue with either tape or drive. This could just be a bad tape. If you have not already, try other tapes with this drive. Positioning errors could be due to a dirty tape head, when was the last time this drive was cleaned?

mph999
Level 6
Employee Accredited
Looks to me like it can't find the empty header that should be at the end of the tape the signifies where to start wring the next backup. One of several causes, most likely being ... Tape issue Drive issue Driver or Friver fireware issue Tape corrupted in some way If you down the drive this happened in, do you get the same issue ? That will narrow it down. Positioning is all done outside NBU, so there is nothing that can be done in NBU to fix this unfortunately. You could try a bpverify on the tape and also an bpmedialist -mcontents see if the tape can be read - are you using any encryption, MSEO or hardware as a matter of interest ? As it's on windows we have pretty much no tools to investigate this (unix/ linux is great as you have dd / tar /od / mt / tcopy (Solars) which can sometimes help massively, windows - pretty much nothing. There is however, nt_ttu - I've never used it to read more than the media header at the beginning of the tape, but there is a help page (just run nt_ttu I think) and it may be possible to fsf 1487 (should be end of last image) and then read what is there ), should be an empty header . There is also scsi_command, in volmgr/bin dir (we have it on unix, not sure on windows) - it it does exist, it has a -map option so with tape loaded into drive (tpreq or robtest (down drive first if using robtest)) scsi_commmand -map -d with read through the tape and might given an indication. It looks like it can't position - and instead of finding the empty header it hits the end of data 'mark' which is after the empty header - the troubleshooting should start with IBM, it's their tape drive that can't position their media ... (Positioning tapes is down to the OS and hardware). To be honest though, chances are they'll just blame us (yep, been there before) and seeing we can't do anything about it I'd just park that tape and get on with life, if it's only affecting one tape, then can you re-run those duplications onto another, or try and copy that tape. Problem is, if it happens again, which I suspect it will it will need to be looked into. The only way this could be us, is if you are using MSEO (can cause problems sometimes) or if we've screwed up the empty header at the end (like it's really missing) - though given that we write a lot of tapes and I've never seen this before, that's very very very unlikely (as it would be a code issue and we'd have calls flooding in, + it would affect more than one of your tapes). However, it should be considered - but that would require a call to Symantec + probably BL + probably Engineering to investigate, and given the above, IBM as well as that is where the most likely cause is. Apart from MSEO, in 6 years I honestly can't think of a positioning error that's been our fault, and I do an awful lot of tape issues. I'll check the DB for any issues of missing empty header, but I don;t expect to find any. You need to have a look at scsi_command and nt_ttu as unless we can try and read the end of the tape, we have no way to investigate. I don't mind looking at those commands myself, but I know I'm on a sev1 call tomorrow so will have no time (funnily enough, it's a drive issue ...). If you decide to log a call, post the case number up here + log a call with IBM as well.

mph999
Level 6
Employee Accredited

Just checked the etracks - can't find any match.

There was an issue where we could hit EOM (end-of-media) which is not quite what you have.  I'd forgotten about that one, but it would only affect systems that had been upgraded from NBU 5.1 and had the forge_media script run to correct catalog inconsistency issues.

Using nt_ttu I can't see anyway to position the tape, you could install mt command for windows (3rd party)

http://www.holistech.co.uk/sw/mt/mt.html

 

This doesn't seem to have anyway to fsf the tape, so we're still stuck.

The only way I see of doing this is using nt_ttu to read all the way through the tape and output this to a file, I guess would have to try and read more bytes then are in the tape so we definately get to the end - it would error but that wouldn't matter.

 

Marianne
Level 6
Partner    VIP    Accredited Certified

Curious to know if these drives have been working fine previously?

If so, when did the problem start?
Before or after the new firmware?
Latest firmware is not always problem-free....

I agree with previous comments - way too many robot-related errors.

2014-08-21 04:27:51 - granted resource adm-master01-hcart2-robot-tld-1
2014-08-21 04:37:27 - Error bptm(pid=8784) error requesting media, TpErrno = No robot daemon or robotics are unavailable 
2014-08-21 07:31:47 - Error bptm(pid=12852) error requesting media, TpErrno = Robot operation failed     
2014-08-21 07:32:27 - Warning bptm(pid=12852) media id XO0354 load operation reported an error     

Just a check for my own sanity - master is a physical server, not a VM, right?

Have you check Windows Event Viewer System log for errors?

Nicolai
Moderator
Moderator
Partner    VIP   

Really good posts from MPH999 yes

Have you experienced power drop(s) ?

If power was cut off while tape drive was writing, no empty header is written and Netbackup will have troubles locating where it ended writing.

Suspend affected tapes, once data is expired you can re-use tapes.

George_Dicu
Level 4

The Problems are since ~2 years ago.

IBM suggested the firmware updates in order for them to investigate, Their Support suck..every time they`ve come, they changed the drives and ...done.

Before the updates some jobs runned, some not, we thought this was a tape issue.

Apparently with this new updates we can obviously see that all Robot is f.k up.

Master and Medias are physical.

Yes, I did check the event viewer and alot of NB Tape Manager or NB TLD Daemon

TapeAlert Code: 0x27, Type: Warning, Flag: DIAGNOSTICS REQ., from drive IBM.ULT3580-TD5.Drive4 (index 6), Media Id XO0338

TLD(0) [10056] timed out after waiting 844 seconds for ready, drive 1

Operator/EMM server has DOWN'ed drive IBM.ULT3580-TD5.Drive3 (device 2)

TapeAlert Code: 0x0c, Type: Informational, Flag: UNSUPPORTED FORMAT, from drive IBM.ULT3580-TD5.Drive3 (index 2), Media Id IBM047
TLD(0) [4156] timed out after waiting 849 seconds for ready, drive 2
Operator/EMM server has DOWN'ed drive IBM.ULT3580-TD5.Drive5 (device 5)

mph999
Level 6
Employee Accredited

If TAPE_ALERTS are involved, you certainly have tape drive / media issues of some sort.

However, if these are leading to the problem you describe I cannot say.  It would I think be reasonable to suggest 'they might be' and thus should be resolved to confirm (given we know that this uissue isn't haoppening to everyone else ...)

I'd escalate within IBM and get their higher levels support involved.  

Nicolai
Moderator
Moderator
Partner    VIP   

You must resolve the "Unsupported format" issues. Either you have LTO tapes in the robot the current generation can't wrute (write compability is one major version back). Or you have tape where servo track are damagesd.  remove them from the robot and dischard them. Tape with damaged servo track can't be re-initialized.

Following LTO tapes can be used with LTO5 drives:

LTO5: Read & Write

LTO4: Read & Write

LTO3: Read Only

LTO2: Unsupported

LTO1: Unsupported

From : 

http://www-01.ibm.com/support/knowledgecenter/STQRQ9/com.ibm.storage.ts4500.doc/opg_3584_xtault.html

about the message "UNSUPPORTED FORMAT"

Set when you load an unsupported cartridge type into the drive or when the cartridge format is corrupted.

 

George_Dicu
Level 4

How can this be? we have IBM tapes, the 5th generation. :(

Nicolai
Moderator
Moderator
Partner    VIP   

Inspect ALL tape if they really are type 5. If that not the case you have a LTO5 tape with damages servo track. If the amount of tape is relative small, mount every tape by hand and see if the LTO drives become ready for writing. Some red LED should be lit if the tape format is unsupported.

If you have large number of tapes, enable BPTM logging at verbose 5 and search for  "waiting 844 seconds for ready". Close to the text mention, you should be able to find the bar code id of the tape causing the issue. Multiple tapes could be affected.

3rd option is to shelf all tapes and run exclusive on new non-ibm tapes and see if the problem re-occur.

George_Dicu
Level 4

I`m trying to rediscover the drives again.

But the Administration Assisted Discovery has no TLD robot there, the ACS is the old Hp robot that we do not use.

Please see the attachment.Can I do smth to enable it?

Deb_Wilmot
Level 6
Employee Accredited Certified

You might want to reboot (sorry - windows thing...)

I was reviewing the data and noticed you have stale files in ...netbackup/db/media/drives (UNIX) ...\NetBackup\db\media\drives (Windows) on the media server the bptm log came from:

The CURTIME in each line is the current ctime, the TIME is the time the ctime file was created.  So it's easy to see that a majority of these are really7 old:

16:46:45.749 [11080.6104] <2> report_drives: DRIVE = HP.ULTRIUM3-SCSI.002 LOCK = FALSE CURTIME = 1408632405

16:46:45.749 [11080.6104] <2> report_drives: TIME = 1373016298
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = HP.ULTRIUM3-SCSI.005 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1383175540
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = HP.ULTRIUM3-SCSI.006 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1383412067
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.002 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1380696923
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.004 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1380696873
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.005 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1380699237
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.Drive1 LOCK = TRUE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1408630806

 

The only drive file that looks ok is the one with IBM.ULT3580-TD5.Drive in the name.  

When you have time - shutdown nbu on that server and just remove all the left over files after the processes are down.

 

 

mph999
Level 6
Employee Accredited

OK, just gone through this on a test server, scsi_command -map does work on windows ....

 

C:\Users\Administrator>scsi_command -map -d {4,0,1,2}


00000000: file 1: record 1: size 1024: NBU MEDIA header (E03002)
00000001: file 1: eof after 1 records: 1024 bytes
00000002: file 2: record 1: size 1024: NBU BACKUP header
          backup_id nbmedia1_1409139439: frag 1: file 1: copy 1
          expiration 1410349039: retention 1: block_size 65536
          flags 0x0: mpx_headers 0: resume_count 0: media E03002
00000003: file 2: record 2: size 11264
00000004: file 2: eof after 2 records: 12288 bytes
00000005: file 3: record 1: size 1024: NBU BACKUP header
          backup_id nbmedia1_1409139714: frag 1: file 2: copy 1
          expiration 1410349314: retention 1: block_size 65536
          flags 0x0: mpx_headers 0: resume_count 0: media E03002
00000006: file 3: record 2: size 11264
00000007: file 3: eof after 2 records: 12288 bytes
00000008: file 4: record 1: size 1024: NBU EMPTY header (file 3)
eot

The address for the drive can be seen from vmoprcmd.

As can be seen, scsi_command -map can id the different bits on the tape, and quite simply we look at the end of the the output to see if there is an EMPTY header.

 

George_Dicu
Level 4

Hello mph999,

 

Is it norman to have the same DrivePath for Drive1? :

 

Drive Name               Label   Ready  RecMID  ExtMID  Wr.Enbl.  Type
    Host                       DrivePath                            Status
=============================================================================
IBM.ULT3580-TD5.Drive1   Yes     Yes    XO0493  XO0493  Yes       hcart2-Clean
    adm-master01               {3,0,6,0}                            DOWN-TLD
    adm-media02                {3,0,6,0}                            DOWN-TLD
    adm-media04                {3,0,6,0}                            ACTIVE
    adm-media01                {3,0,4,0}                            DOWN-TLD
    adm-media03                {3,0,6,0}                            DOWN-TLD

IBM.ULT3580-TD5.Drive2   No      No                     No        hcart2-Clean
    adm-master01               {3,0,7,0}                            DOWN-TLD
    adm-media02                {3,0,7,0}                            DOWN-TLD
    adm-media04                {3,0,7,0}                            DOWN-TLD
    adm-media01                {3,0,5,0}                            DOWN-TLD
    adm-media03                {3,0,7,0}                            DOWN-TLD

IBM.ULT3580-TD5.Drive3   No      No                     No        hcart2-Clean
    adm-master01               {3,0,8,0}                            DOWN-TLD
    adm-media02                {3,0,8,0}                            DOWN-TLD
    adm-media04                {3,0,8,0}                            DOWN-TLD
    adm-media01                {3,0,6,0}                            DOWN-TLD
    adm-media03                {3,0,8,0}                            DOWN-TLD

IBM.ULT3580-TD5.Drive4   No      No                     No        hcart2-Clean
    adm-master01               {3,0,9,0}                            DOWN-TLD
    adm-media02                {3,0,9,0}                            DOWN-TLD
    adm-media04                {3,0,9,0}                            DOWN-TLD
    adm-media01                {3,0,7,0}                            DOWN-TLD
    adm-media03                {3,0,9,0}                            DOWN-TLD

IBM.ULT3580-TD5.Drive5   No      No                     No        hcart2-Clean
    adm-master01               {3,0,10,0}                           DOWN-TLD
    adm-media02                {3,0,10,0}                           DOWN-TLD
    adm-media04                {3,0,10,0}                           DOWN-TLD
    adm-media01                {3,0,8,0}                            DOWN-TLD
    adm-media03                {3,0,10,0}                           DOWN-TLD

George_Dicu
Level 4

Hi,

 

So I found a tape without the empty header at the end :(:

00000000: file 1: record 1: size 1024: NBU MEDIA header (XO0493)
00000001: file 1: eof after 1 records: 1024 bytes
00000002: file 2: record 1: size 1024: NBU BACKUP header
          backup_id esf-shrpnt-db.esf.se_1398474101: frag 1: file 1: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00000003: file 2: record 2: size 262144
00000021: file 2: record 20: size 229376
00000022: file 2: eof after 20 records: 4948992 bytes
00000023: file 3: record 1: size 1024: NBU BACKUP header
          backup_id esf-shrpnt-db.esf.se_1398474155: frag 1: file 2: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00000024: file 3: record 2: size 262144
00000890: file 3: record 868: size 229376
00000891: file 3: eof after 868 records: 227247104 bytes
00000892: file 4: record 1: size 1024: NBU BACKUP header
          backup_id esf-shrpnt-db.esf.se_1398474224: frag 1: file 3: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00000893: file 4: record 2: size 262144
00000992: file 4: record 101: size 32768
00000993: file 4: eof after 101 records: 25986048 bytes
00000994: file 5: record 1: size 1024: NBU BACKUP header
          backup_id esf-shrpnt-db.esf.se_1398474276: frag 1: file 4: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00000995: file 5: record 2: size 262144
00001030: file 5: record 37: size 98304
00001031: file 5: eof after 37 records: 9274368 bytes
00001032: file 6: record 1: size 1024: NBU BACKUP header
          backup_id p250-1.esf.se_1398468601: frag 1: file 5: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00001033: file 6: record 2: size 262144
00041033: file 6: eof after 40001 records: 10485761024 bytes
00041034: file 7: record 1: size 1024: NBU BACKUP header
          backup_id p250-1.esf.se_1398468601: frag 2: file 6: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00041035: file 7: record 2: size 262144
00081035: file 7: eof after 40001 records: 10485761024 bytes
00081036: file 8: record 1: size 1024: NBU BACKUP header
          backup_id p250-1.esf.se_1398468601: frag 3: file 7: copy 2
          expiration 2147483647: retention 9: block_size 262144
          flags 0x0: mpx_headers 0: resume_count 0: media XO0493
00081037: file 8: record 2: size 262144
eot

What do you think? IBM issues? Firmware issues? I think the last tape drive firmware didn`t wrote the Empty header at the end and it didn`t check fo it somehow..and  now, the new firmware is looking for it in the tapes. Can this be possible?

 

//George

mph999
Level 6
Employee Accredited
Yes, that is completely possible - a problem in a backup is only seen when the tape is next used. Is that tape marked as full. bpmedialist -m will show you. I think there is one time we don't write a header and that is when the tape is full, so if you picked a full tape we'll need to try again on one that isn't full (sorry, forgot to mention that), but if this is not a full tape then good, we have a problem. Given that we are responsible for the header, it could be our problem, or it could be we tried to write it and the drive didn't - could be anything ... You have a call logged with Symantec, I know this because I have it ... We need to confirm if this is a full tape (as above) If NOT, then fine ... If it is, we need to check a failing tape that is not full Likely hood is that I'll send this to Engineering, as I have no way to investigate this, we'd need to get an debug EEB to log whats going on - we'd also need to enable scsi logging. As a matter of interest, because I know what engineering will ask, are you able to upgrade to 7.6.0.3 Thanks, M

George_Dicu
Level 4

Hello, the tape is not full:

 

Server Host = adm-media01

 id     rl  images   allocated        last updated      density  kbytes restores
           vimages   expiration       last read         <------- STATUS ------->
           On Hold
--------------------------------------------------------------------------------
XO0493   9      4   04/26/2014 03:45  04/26/2014 04:07  hcart2      261184     0
                4   INFINITY              N/A
           0

or:

XO0493L5    XO0493    Active      3    adm-media01    ESF_for_ever    4        2038-01-19 04:14:07    HCART2    TLD      0    adm-master01    252    000_00000_TLD    2014-04-26 03:45:20     - NetBackup      -

I have other tape cheked please see the attachments, and its not full as well.

//George

mph999
Level 6
Employee Accredited
Super thx