08-21-2014 09:31 AM
Hello there,
We have
Issue:"Warning bptm(pid=) cannot locate on drive index 5, No more data is on the tape. "
When NB is trying to write on tape, we get the above error.
This is what we get when SLP are running and doing Duplications.
I found in bptm log, withing the time the job was running:
16:20:06.824 [12680.8964] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1567)
16:20:06.824 [12680.8964] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1576)
16:20:07.791 [10424.13532] <2> tapelib: wait_for_ltid, Mount, timeout 0
16:20:29.475 [10424.13532] <2> io_ioctl: command (5)MTREW 1 0x0 from (bptm.c.8312) on drive index 4
16:20:29.506 [10424.13532] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.8564) on drive index 4
16:25:48.761 [10424.13532] <8> io_position_for_write: cannot locate on drive index 4, No more data is on the tape.
16:25:48.761 [10424.13532] <2> io_ioctl: command (5)MTREW 1 0x0 from (bptm.c.7146) on drive index 4
16:26:14.719 [10424.13532] <2> io_position_for_write: If header checks out, then overwrite it at LBA 0
16:26:14.813 [10424.13532] <2> io_position_for_write: Media at BOT
Please ignore LOT 3 handling, I only use LTO5, the focus is on tape: XO0354.
I attached the logs, please ask if smth is needed in order to gather more data for troubleshooting.
Thank You
//George
Solved! Go to Solution.
08-29-2014 02:47 PM
OK ...
So ... in summary ...
We have multiple tapes that are NOT full, that have no empty header at the end, at least accoding to scsi_command -map
The issue is seen when attempting to position to the end of data, which we cannot do as we can't find an empty header, which is what we look for.
The issue actually happens when the last backup to that tappe was written, so one thing we will have to do is look at the backup logs and see what they say - do we try and write the header and the tape drive 'ignores' the request, of do we just not even try.
Also, once I work out the syntax, we'lll try nt_ttu to position 'outsdie' fo NBU.
Regarding the logs, if the issue is happeneing very frequesntly as you suggest, it could be worth making some very small backups (just a few files) and then checking the tape to see if the header is missing. If this works, we could get the logs and new_scsi map output within a short time.
It's not impossible that the issue happens only on 'big' backups - ok, I can't see how, but when t/shooting you need to consider everything no matter how unlikely - this would make it a bit more time consuming to reproduce.
The other way, is to just turn on bptm at verbose 5, and wait for a failure on a duplication, then look back at the logs - if you are a bit short on disk space, we might have to back the logs up and then delete them, then when we get a failure, restore the log we need.
That's all I can think of at the moment - very odd, as this issue hasn't been reported before - so I suspect it will come down to be a combination of factors that are unique to your environment, eg. driver version + firmware version + NBU version + maybe some OS patch or something ...
I'll also look to see if there is some scsi logging we can turn on, not sure in Windows, certainly you can in Linux/ Solaris - if it's available, Engineering would ask for that so might as well do it now .
So in summary, the next steps as I see them at the moment.
Reproduce issue and get bptm logs + scsi logs (if available) + event logs + mm logs
Get scsi_command map output for new failure (to clearly demonstrate the header is missing
nt_ttiu test (once I've worked out how, sorry, manually page is useless)
We'll also have a chat about the exact history (I know you have mentioned it but we'll confirm the exact details) and maybe look at trying a different tape driver.
Once we have this, if no solution can be found (I'll have a chat with my colleague Pete who is also BL and very very good at tape drive issues), there are also a couple of BL guys in the US who are very good at these sort of things as well.
If that provides no solution, then I'll be in a position to formally esclalate to Engineering.
Hopefully this is acceptable for you,
Regards,
Martin
08-21-2014 10:50 AM
other than the below log i do also see lot of robot operation failrues... does the Drives working fine for backups?
by any chance did you check the normal backup on these Drives?
Line 613: 16:48:59.597 [10424.13532] <2> io_ioctl: command (1)MTFSF 1 0x0 from (bptm.c.7565) on drive index 4
Line 614: 16:48:59.597 [10424.13532] <2> io_position_for_write: If header checks out, then overwrite it at LBA 19505160
Line 615: 16:48:59.613 [10424.13532] <2> io_position_for_write: at file number 1485
Line 616: 16:48:59.613 [10424.13532] <2> io_ioctl: command (1)MTFSF 3 0x0 from (bptm.c.7358) on drive index 4
Line 617: 16:53:28.261 [10424.13532] <2> io_ioctl: MTFSF encountered EOD, normal return
Line 618: 16:53:28.261 [10424.13532] <2> io_position_for_write: If header checks out, then overwrite it at LBA 19505729
Line 619: 16:53:28.292 [10424.13532] <2> io_read_block: read error on media id XO0354, drive index 4 reading header block, len = 0; No more data is on the tape. (1104)
Line 620: 16:53:28.292 [10424.13532] <16> io_position_for_write: cannot position media id XO0354 for write
Line 621: 16:53:28.323 [10424.13532] <2> send_MDS_msg: DEVICE_STATUS 1 162081420 adm-master01 XO0354 4002628 IBM.ULT3580-TD5.Drive1 2000494 POSITION_ERROR 0 0
Line 622: 16:53:28.385 [10424.13532] <2> log_media_error: successfully wrote to error file - 08/21/14 16:53:28 XO0354 4 POSITION_ERROR IBM.ULT3580-TD5.Drive1
08-21-2014 12:26 PM
As Nagalla mentioned, this looks to be a hardware issue with either tape or drive. This could just be a bad tape. If you have not already, try other tapes with this drive. Positioning errors could be due to a dirty tape head, when was the last time this drive was cleaned?
08-21-2014 12:41 PM
08-21-2014 03:10 PM
Just checked the etracks - can't find any match.
There was an issue where we could hit EOM (end-of-media) which is not quite what you have. I'd forgotten about that one, but it would only affect systems that had been upgraded from NBU 5.1 and had the forge_media script run to correct catalog inconsistency issues.
Using nt_ttu I can't see anyway to position the tape, you could install mt command for windows (3rd party)
http://www.holistech.co.uk/sw/mt/mt.html
This doesn't seem to have anyway to fsf the tape, so we're still stuck.
The only way I see of doing this is using nt_ttu to read all the way through the tape and output this to a file, I guess would have to try and read more bytes then are in the tape so we definately get to the end - it would error but that wouldn't matter.
08-21-2014 10:19 PM
Curious to know if these drives have been working fine previously?
If so, when did the problem start?
Before or after the new firmware?
Latest firmware is not always problem-free....
I agree with previous comments - way too many robot-related errors.
2014-08-21 04:27:51 - granted resource adm-master01-hcart2-robot-tld-1 2014-08-21 04:37:27 - Error bptm(pid=8784) error requesting media, TpErrno = No robot daemon or robotics are unavailable
2014-08-21 07:31:47 - Error bptm(pid=12852) error requesting media, TpErrno = Robot operation failed 2014-08-21 07:32:27 - Warning bptm(pid=12852) media id XO0354 load operation reported an error
Just a check for my own sanity - master is a physical server, not a VM, right?
Have you check Windows Event Viewer System log for errors?
08-21-2014 11:18 PM
Really good posts from MPH999
Have you experienced power drop(s) ?
If power was cut off while tape drive was writing, no empty header is written and Netbackup will have troubles locating where it ended writing.
Suspend affected tapes, once data is expired you can re-use tapes.
08-22-2014 01:35 AM
The Problems are since ~2 years ago.
IBM suggested the firmware updates in order for them to investigate, Their Support suck..every time they`ve come, they changed the drives and ...done.
Before the updates some jobs runned, some not, we thought this was a tape issue.
Apparently with this new updates we can obviously see that all Robot is f.k up.
Master and Medias are physical.
Yes, I did check the event viewer and alot of NB Tape Manager or NB TLD Daemon
TapeAlert Code: 0x27, Type: Warning, Flag: DIAGNOSTICS REQ., from drive IBM.ULT3580-TD5.Drive4 (index 6), Media Id XO0338
TLD(0) [10056] timed out after waiting 844 seconds for ready, drive 1
Operator/EMM server has DOWN'ed drive IBM.ULT3580-TD5.Drive3 (device 2)
TapeAlert Code: 0x0c, Type: Informational, Flag: UNSUPPORTED FORMAT, from drive IBM.ULT3580-TD5.Drive3 (index 2), Media Id IBM047
TLD(0) [4156] timed out after waiting 849 seconds for ready, drive 2
Operator/EMM server has DOWN'ed drive IBM.ULT3580-TD5.Drive5 (device 5)
08-22-2014 03:19 AM
If TAPE_ALERTS are involved, you certainly have tape drive / media issues of some sort.
However, if these are leading to the problem you describe I cannot say. It would I think be reasonable to suggest 'they might be' and thus should be resolved to confirm (given we know that this uissue isn't haoppening to everyone else ...)
I'd escalate within IBM and get their higher levels support involved.
08-22-2014 05:10 AM
You must resolve the "Unsupported format" issues. Either you have LTO tapes in the robot the current generation can't wrute (write compability is one major version back). Or you have tape where servo track are damagesd. remove them from the robot and dischard them. Tape with damaged servo track can't be re-initialized.
Following LTO tapes can be used with LTO5 drives:
LTO5: Read & Write
LTO4: Read & Write
LTO3: Read Only
LTO2: Unsupported
LTO1: Unsupported
From :
http://www-01.ibm.com/support/knowledgecenter/STQRQ9/com.ibm.storage.ts4500.doc/opg_3584_xtault.html
about the message "UNSUPPORTED FORMAT"
Set when you load an unsupported cartridge type into the drive or when the cartridge format is corrupted.
08-22-2014 05:49 AM
How can this be? we have IBM tapes, the 5th generation. :(
08-22-2014 06:07 AM
Inspect ALL tape if they really are type 5. If that not the case you have a LTO5 tape with damages servo track. If the amount of tape is relative small, mount every tape by hand and see if the LTO drives become ready for writing. Some red LED should be lit if the tape format is unsupported.
If you have large number of tapes, enable BPTM logging at verbose 5 and search for "waiting 844 seconds for ready". Close to the text mention, you should be able to find the bar code id of the tape causing the issue. Multiple tapes could be affected.
3rd option is to shelf all tapes and run exclusive on new non-ibm tapes and see if the problem re-occur.
08-22-2014 07:51 AM
I`m trying to rediscover the drives again.
But the Administration Assisted Discovery has no TLD robot there, the ACS is the old Hp robot that we do not use.
Please see the attachment.Can I do smth to enable it?
08-24-2014 11:27 AM
You might want to reboot (sorry - windows thing...)
I was reviewing the data and noticed you have stale files in ...netbackup/db/media/drives (UNIX) ...\NetBackup\db\media\drives (Windows) on the media server the bptm log came from:
The CURTIME in each line is the current ctime, the TIME is the time the ctime file was created. So it's easy to see that a majority of these are really7 old:
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = HP.ULTRIUM3-SCSI.002 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1373016298
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = HP.ULTRIUM3-SCSI.005 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1383175540
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = HP.ULTRIUM3-SCSI.006 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1383412067
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.002 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1380696923
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.004 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1380696873
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.005 LOCK = FALSE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1380699237
16:46:45.749 [11080.6104] <2> report_drives: DRIVE = IBM.ULT3580-TD5.Drive1 LOCK = TRUE CURTIME = 1408632405
16:46:45.749 [11080.6104] <2> report_drives: TIME = 1408630806
The only drive file that looks ok is the one with IBM.ULT3580-TD5.Drive in the name.
When you have time - shutdown nbu on that server and just remove all the left over files after the processes are down.
08-27-2014 07:24 AM
OK, just gone through this on a test server, scsi_command -map does work on windows ....
C:\Users\Administrator>scsi_command -map -d {4,0,1,2}
00000000: file 1: record 1: size 1024: NBU MEDIA header (E03002)
00000001: file 1: eof after 1 records: 1024 bytes
00000002: file 2: record 1: size 1024: NBU BACKUP header
backup_id nbmedia1_1409139439: frag 1: file 1: copy 1
expiration 1410349039: retention 1: block_size 65536
flags 0x0: mpx_headers 0: resume_count 0: media E03002
00000003: file 2: record 2: size 11264
00000004: file 2: eof after 2 records: 12288 bytes
00000005: file 3: record 1: size 1024: NBU BACKUP header
backup_id nbmedia1_1409139714: frag 1: file 2: copy 1
expiration 1410349314: retention 1: block_size 65536
flags 0x0: mpx_headers 0: resume_count 0: media E03002
00000006: file 3: record 2: size 11264
00000007: file 3: eof after 2 records: 12288 bytes
00000008: file 4: record 1: size 1024: NBU EMPTY header (file 3)
eot
The address for the drive can be seen from vmoprcmd.
As can be seen, scsi_command -map can id the different bits on the tape, and quite simply we look at the end of the the output to see if there is an EMPTY header.
08-29-2014 02:56 AM
Hello mph999,
Is it norman to have the same DrivePath for Drive1? :
Drive Name Label Ready RecMID ExtMID Wr.Enbl. Type
Host DrivePath Status
=============================================================================
IBM.ULT3580-TD5.Drive1 Yes Yes XO0493 XO0493 Yes hcart2-Clean
adm-master01 {3,0,6,0} DOWN-TLD
adm-media02 {3,0,6,0} DOWN-TLD
adm-media04 {3,0,6,0} ACTIVE
adm-media01 {3,0,4,0} DOWN-TLD
adm-media03 {3,0,6,0} DOWN-TLD
IBM.ULT3580-TD5.Drive2 No No No hcart2-Clean
adm-master01 {3,0,7,0} DOWN-TLD
adm-media02 {3,0,7,0} DOWN-TLD
adm-media04 {3,0,7,0} DOWN-TLD
adm-media01 {3,0,5,0} DOWN-TLD
adm-media03 {3,0,7,0} DOWN-TLD
IBM.ULT3580-TD5.Drive3 No No No hcart2-Clean
adm-master01 {3,0,8,0} DOWN-TLD
adm-media02 {3,0,8,0} DOWN-TLD
adm-media04 {3,0,8,0} DOWN-TLD
adm-media01 {3,0,6,0} DOWN-TLD
adm-media03 {3,0,8,0} DOWN-TLD
IBM.ULT3580-TD5.Drive4 No No No hcart2-Clean
adm-master01 {3,0,9,0} DOWN-TLD
adm-media02 {3,0,9,0} DOWN-TLD
adm-media04 {3,0,9,0} DOWN-TLD
adm-media01 {3,0,7,0} DOWN-TLD
adm-media03 {3,0,9,0} DOWN-TLD
IBM.ULT3580-TD5.Drive5 No No No hcart2-Clean
adm-master01 {3,0,10,0} DOWN-TLD
adm-media02 {3,0,10,0} DOWN-TLD
adm-media04 {3,0,10,0} DOWN-TLD
adm-media01 {3,0,8,0} DOWN-TLD
adm-media03 {3,0,10,0} DOWN-TLD
08-29-2014 04:17 AM
Hi,
So I found a tape without the empty header at the end :(:
00000000: file 1: record 1: size 1024: NBU MEDIA header (XO0493) 00000001: file 1: eof after 1 records: 1024 bytes 00000002: file 2: record 1: size 1024: NBU BACKUP header backup_id esf-shrpnt-db.esf.se_1398474101: frag 1: file 1: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00000003: file 2: record 2: size 262144 00000021: file 2: record 20: size 229376 00000022: file 2: eof after 20 records: 4948992 bytes 00000023: file 3: record 1: size 1024: NBU BACKUP header backup_id esf-shrpnt-db.esf.se_1398474155: frag 1: file 2: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00000024: file 3: record 2: size 262144 00000890: file 3: record 868: size 229376 00000891: file 3: eof after 868 records: 227247104 bytes 00000892: file 4: record 1: size 1024: NBU BACKUP header backup_id esf-shrpnt-db.esf.se_1398474224: frag 1: file 3: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00000893: file 4: record 2: size 262144 00000992: file 4: record 101: size 32768 00000993: file 4: eof after 101 records: 25986048 bytes 00000994: file 5: record 1: size 1024: NBU BACKUP header backup_id esf-shrpnt-db.esf.se_1398474276: frag 1: file 4: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00000995: file 5: record 2: size 262144 00001030: file 5: record 37: size 98304 00001031: file 5: eof after 37 records: 9274368 bytes 00001032: file 6: record 1: size 1024: NBU BACKUP header backup_id p250-1.esf.se_1398468601: frag 1: file 5: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00001033: file 6: record 2: size 262144 00041033: file 6: eof after 40001 records: 10485761024 bytes 00041034: file 7: record 1: size 1024: NBU BACKUP header backup_id p250-1.esf.se_1398468601: frag 2: file 6: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00041035: file 7: record 2: size 262144 00081035: file 7: eof after 40001 records: 10485761024 bytes 00081036: file 8: record 1: size 1024: NBU BACKUP header backup_id p250-1.esf.se_1398468601: frag 3: file 7: copy 2 expiration 2147483647: retention 9: block_size 262144 flags 0x0: mpx_headers 0: resume_count 0: media XO0493 00081037: file 8: record 2: size 262144 eot
What do you think? IBM issues? Firmware issues? I think the last tape drive firmware didn`t wrote the Empty header at the end and it didn`t check fo it somehow..and now, the new firmware is looking for it in the tapes. Can this be possible?
//George
08-29-2014 11:53 AM
08-29-2014 12:20 PM
Hello, the tape is not full:
Server Host = adm-media01
id rl images allocated last updated density kbytes restores
vimages expiration last read <------- STATUS ------->
On Hold
--------------------------------------------------------------------------------
XO0493 9 4 04/26/2014 03:45 04/26/2014 04:07 hcart2 261184 0
4 INFINITY N/A
0
or:
XO0493L5 XO0493 Active 3 adm-media01 ESF_for_ever 4 2038-01-19 04:14:07 HCART2 TLD 0 adm-master01 252 000_00000_TLD 2014-04-26 03:45:20 - NetBackup -
I have other tape cheked please see the attachments, and its not full as well.
//George
08-29-2014 12:58 PM