cancel
Showing results for 
Search instead for 
Did you mean: 

Error bptm(pid=5832) error requesting media, TpErrno = Robot operation failed

d3an_phillips
Level 2

Hi,

I have been having big problem with one of our Media Servers. We had an un-related problem with the OS, and had to rebuild the server. I've finally got our backups working again, but am unable to Vault images to tape.

Robtest is able to successfully move tapes from slots to drives and visa versa, but no images are mounted to tape. Our backup procedure is as follows:

Backup to Disk > Vault to ON site and OFF site copies > Set ON site copy to primary and delete the disk backup on Image Cleanup.

 

Here is the full Vault job status report:

20/12/2013 20:06:46 - begin Duplicate
20/12/2013 20:07:13 - requesting resource backup4-dlt2-robot-tld-2
20/12/2013 20:07:13 - requesting resource backup4-dlt2-robot-tld-2
20/12/2013 20:07:13 - granted resource ON2348
20/12/2013 20:07:13 - granted resource QUANTUM.DLT-S4.006
20/12/2013 20:07:13 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:07:13 - granted resource OF2562
20/12/2013 20:07:13 - granted resource QUANTUM.DLT-S4.009
20/12/2013 20:07:13 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:07:16 - started process bptm (5800)
20/12/2013 20:07:23 - started process bpdm (5920)
20/12/2013 20:07:26 - started process bptm (5800)
20/12/2013 20:07:26 - mounting ON2348
20/12/2013 20:07:44 - begin reading
20/12/2013 20:18:57 - Error bptm(pid=5800) error requesting media, TpErrno = Mount canceled, device daemon is terminating  
20/12/2013 20:18:57 - Warning bptm(pid=5800) media id ON2348 load operation reported an error     
20/12/2013 20:18:57 - current media ON2348 complete, requesting next resource Any
20/12/2013 20:19:01 - Error nbjm(pid=8264) NBU status: 98, EMM status: The media server reported a system error
20/12/2013 20:19:11 - Error nbjm(pid=8264) nbrb status: LTID reset media server resources      
20/12/2013 20:19:29 - Error bptm(pid=5800) INF - error requesting media (tpreq) (98), cannot continue with copy 2
20/12/2013 20:20:23 - Error bpduplicate(pid=8380) host backup4 backup id doradb04_1386981799 read failed, termination requested by administrator (150).
20/12/2013 20:20:23 - Error bpduplicate(pid=8380) host backup4 backupid doradb04_1386981799 write failed, error requesting media (tpreq) (98).  
20/12/2013 20:21:17 - Error bpduplicate(pid=8380) Duplicate of backupid doradb04_1386981799 failed, error requesting media (tpreq) (98).   
20/12/2013 20:21:18 - started process bptm (5832)
20/12/2013 20:21:18 - requesting resource backup4-dlt2-robot-tld-2
20/12/2013 20:21:18 - granted resource ON2348
20/12/2013 20:21:18 - granted resource QUANTUM.DLT-S4.007
20/12/2013 20:21:18 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:21:20 - requesting resource backup4-dlt2-robot-tld-2
20/12/2013 20:21:20 - granted resource OF2562
20/12/2013 20:21:20 - granted resource QUANTUM.DLT-S4.008
20/12/2013 20:21:20 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:21:24 - started process bpdm (4524)
20/12/2013 20:21:27 - started process bptm (5832)
20/12/2013 20:21:27 - mounting ON2348
20/12/2013 20:21:45 - begin reading
20/12/2013 20:21:50 - Error bptm(pid=5832) error requesting media, TpErrno = Robot operation failed     
20/12/2013 20:21:50 - Warning bptm(pid=5832) media id ON2348 load operation reported an error     
20/12/2013 20:21:51 - current media ON2348 complete, requesting next resource Any
20/12/2013 20:22:13 - started process bptm (5832)
20/12/2013 20:22:13 - mounting ON2348
20/12/2013 20:22:13 - granted resource ON2348
20/12/2013 20:22:13 - granted resource QUANTUM.DLT-S4.006
20/12/2013 20:22:13 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:37:37 - Error bptm(pid=5832) error requesting media, TpErrno = Robot operation failed     
20/12/2013 20:37:38 - Warning bptm(pid=5832) media id ON2348 load operation reported an error     
20/12/2013 20:37:38 - current media ON2348 complete, requesting next resource Any
20/12/2013 20:37:44 - granted resource ON2348
20/12/2013 20:37:44 - granted resource QUANTUM.DLT-S4.009
20/12/2013 20:37:44 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:37:46 - started process bptm (5832)
20/12/2013 20:37:46 - mounting ON2348
20/12/2013 20:53:12 - Error bptm(pid=5832) error requesting media, TpErrno = Robot operation failed     
20/12/2013 20:53:12 - Warning bptm(pid=5832) media id ON2348 load operation reported an error     
20/12/2013 20:53:12 - current media ON2348 complete, requesting next resource Any
20/12/2013 20:53:19 - granted resource ON2064
20/12/2013 20:53:19 - granted resource QUANTUM.DLT-S4.007
20/12/2013 20:53:19 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 20:53:20 - started process bptm (5832)
20/12/2013 20:53:20 - mounting ON2064
20/12/2013 20:54:48 - started process bptm (5832)
20/12/2013 20:54:48 - mounting OF2562
20/12/2013 21:10:14 - Error bptm(pid=5832) error requesting media, TpErrno = Robot operation failed     
20/12/2013 21:10:14 - Warning bptm(pid=5832) media id OF2562 load operation reported an error     
20/12/2013 21:10:15 - current media OF2562 complete, requesting next resource Any
20/12/2013 21:10:21 - granted resource OF2562
20/12/2013 21:10:21 - granted resource QUANTUM.DLT-S4.006
20/12/2013 21:10:21 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 21:10:23 - started process bptm (5832)
20/12/2013 21:10:23 - mounting OF2562
20/12/2013 21:24:16 - Error bptm(pid=5832) FREEZING media id OF2562, Medium identifiers do not match    
20/12/2013 21:24:16 - Warning bptm(pid=5832) media id OF2562 load operation reported an error     
20/12/2013 21:24:17 - current media OF2562 complete, requesting next resource Any
20/12/2013 21:45:53 - started process bptm (5832)
20/12/2013 21:45:53 - mounting OF2142
20/12/2013 21:45:53 - granted resource OF2142
20/12/2013 21:45:53 - granted resource QUANTUM.DLT-S4.009
20/12/2013 21:45:53 - granted resource backup4-dlt2-robot-tld-2
20/12/2013 22:01:19 - Error bptm(pid=5832) error requesting media, TpErrno = Robot operation failed     
20/12/2013 22:01:19 - Warning bptm(pid=5832) media id OF2142 load operation reported an error     
20/12/2013 22:01:20 - current media OF2142 complete, requesting next resource Any
20/12/2013 22:01:26 - started process bptm (5832)
20/12/2013 22:01:26 - mounting OF2142
20/12/2013 22:01:26 - granted resource OF2142
20/12/2013 22:01:26 - granted resource QUANTUM.DLT-S4.008
20/12/2013 22:01:26 - granted resource backup4-dlt2-robot-tld-2

I am able to inventory the robot successfully, and the robot is capable of emptying tapes from the media access port, and placing them in slots. We are using NetBackup 6.5.6. Up until last week the full procedure worked perfectly, so our OS, library and NetBackup version are all definitely cross-compatible.

Any help would be gratefully received!

1 ACCEPTED SOLUTION

Accepted Solutions

Mark_Solutions
Level 6
Partner Accredited Certified

There is a long delay between the mount and the failure

This could be any of a few reasons so please work through these:

1. The drives are detected in the wrong order - so it puts the tape in drive 1 but tries to read it from drive 2 - you could delete them all and re-run the device wizard to try and get this right - or use robtest to put a tape in drive 1 and see which drive device monitor shows that tape to be in and then start shuffling them around from there

2. Ensure you have the removable device manager service stopped and disabled

3. ensure you have created / set the AutoRun registry key to 0 to stop windows interferring with the tape drives :

http://support.microsoft.com/kb/842411

 

View solution in original post

9 REPLIES 9

StefanosM
Level 6
Partner    VIP    Accredited Certified

check your cleaning tapes.

One posible cause is that the drive is requesting clenaing and there are no cleaning tape available or the cleanings remaining is 0

d3an_phillips
Level 2

Thanks for your response. There is one cleaning tape in the library and it has 9 cleanings remaining so I don't believe this is the issue.

mph999
Level 6
Employee Accredited
Would need to see: VERBOSE = 5 in bp.conf VERBOSE in vm.conf DRIVE_DEBUG (empty file) in /usr/openv/volmgr ROBOT_DEBUG (empty file) in /usr/openv/volmgr (ltid will need restarting) then collect bptm log robots log (in volmgr/debug) ltid log (in volmgr/debug) system log Details from activity monitor for job (like above) All for the same job that shows the error I would also test ... tpreq -m -f /tmp/TAPE (use tpunmount -f /tmp/TAPE to unload) I guess this will fail, but worth checking - if so, could use this as the test and just get the logs above covering this command). I would personally be tempted to just remove all devices, including from os, powercycle everything and reconfig + reinventory. Just occasionally this fixes things, so might as well get it out of the way. If it doesn't, at least the config has been recreated and can be eliminated from the list of possibles. You didn't mention the os, but it looks like linux/ unix.

d3an_phillips
Level 2

Hi mph999,


Thanks for your response. We're running Windows 2003 R2 x64 Standard Edition.

I'm guessing your steps above are to enable advanced logging for our NetBackup environment? I can find the bp.conf file, but I have found vm.conf - do I just need to append the vm.conf file with the line 'VERBOSE'? Also, where should I create the bp.conf, DRIVE_DEBUG, and ROBOT_DEBUG files on a W2K3 environment?

When you say ltid will need restarting, is that the robot which will need restating?

Sorry I'm a bit of a novice with NetBackup, but our NetBackup admin recently left, so I'm trying to pick up where he left off!

Mark_Solutions
Level 6
Partner Accredited Certified

There is a long delay between the mount and the failure

This could be any of a few reasons so please work through these:

1. The drives are detected in the wrong order - so it puts the tape in drive 1 but tries to read it from drive 2 - you could delete them all and re-run the device wizard to try and get this right - or use robtest to put a tape in drive 1 and see which drive device monitor shows that tape to be in and then start shuffling them around from there

2. Ensure you have the removable device manager service stopped and disabled

3. ensure you have created / set the AutoRun registry key to 0 to stop windows interferring with the tape drives :

http://support.microsoft.com/kb/842411

 

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified
Error bptm(pid=5832) FREEZING media id OF2562, Medium identifiers do not match  

 

We need to see logs...

On Windows server, create bptm log folder under <install-path>\veritas\netbackup\logs.

Use Host Properties -> Master -> Logging to increase logging level for bptm process.

Add VERBOSE entry to vm.conf in <install-path>\veritas\volmgr.
Create empty DRIVE_DEBUG and ROBOT_DEBUG files in volmgr folder. Check that the files have no file extention.
Create debug folder under <install-path>\veritas\volmgr, then create robots and ltid folders under debug:
<install-path>\veritas\volmgr\debug\robots\
<install-path>\veritas\volmgr\debug\ltid\

Restart NetBackup Device Management service to enable these logs.

After next error, check bptm log, the logs in the debug folders as well as Event Viewer System and Application logs.

You can also post them here as follows:

Copy log files to txt files to reflect process name (e.g. bptm.txt) and upload here as File attachments.

Save Event Viewer logs as .txt (not .evt) files and upload as well.

mph999
Level 6
Employee Accredited

24/03/2014 11:46:04 - Info bptm(pid=10040) Waiting for mount of media id 000107 (copy 2) on server appnbu5.
24/03/2014 11:46:04 - started process bptm (10040)
24/03/2014 11:46:04 - mounting 000107
24/03/2014 11:46:04 - Info bptm(pid=10040) INF - Waiting for mount of media id 000107 on server appnbu5 for writing.
24/03/2014 11:46:05 - granted resource 000107
24/03/2014 11:46:05 - granted resource Drive002
24/03/2014 11:46:05 - granted resource Mediaserver5-hcart-robot-tld-0
24/03/2014 12:00:52 - Error bptm(pid=10040) error requesting media, TpErrno = Robot operation failed    
24/03/2014 12:00:52 - Warning bptm(pid=10040) media id 000107 load operation reported an error    
24/03/2014 12:00:53 - current media 000107 complete, requesting next resource Any
24/03/2014 12:01:06 - Info bptm(pid=10040) Waiting for mount of media id 000107 (copy 2) on server appnbu5.
24/03/2014 12:01:06 - started process bptm (10040)
24/03/2014 12:01:06 - mounting 000107
24/03/2014 12:01:06 - Info bptm(pid=10040) INF - Waiting for mount of media id 000107 on server appnbu5 for writing.
24/03/2014 12:01:06 - granted resource 000107
24/03/2014 12:01:06 - granted resource Drive002
24/03/2014 12:01:06 - granted resource Mediaserver5-hcart-robot-tld-0
24/03/2014 12:15:49 - Error bptm(pid=10040) error requesting media, TpErrno = Robot operation failed    
24/03/2014 12:15:49 - Warning bptm(pid=10040) media id 000107 load operation reported an error    
24/03/2014 12:15:50 - current media 000107 complete, requesting next resource Any

I wouldn't log any calls just yet, as per my usual comments on cases like this it's unlikely to be NBU (the vast majority of tape/ library issues are outside of NBU) but it's not impossible it's not us.


We would need to demonstrate it's not NBU before passing the issue to a 3rd party vendor (only polite ...)
Sometimes I request assistance from a 3rd part if I need their 'help' but we're not at that stage.

The problem here is tha the log messages generally don't get any better than what we're seen here - for whatever reason, NBU does't think the tape is in the drive, hence the error - it can't tell why, so we get no real details.

You've done som good work so far, so let's just confirm :

The fact the library console can move the tape is good - it shows the physical mechanics work.

The jobs fails as explained, but you see the tape loaded in the robot log:

 : <../../lib/MmDeviceMappings.cpp:974> search robot list (length=406) for ADIC Scalar i500, type 8
12:00:36.866 [9040.9048] <4> MmDeviceMappings::GetRobotAttributes
 : <../../lib/MmDeviceMappings.cpp:1227> found match: "ADIC Scalar i500" ADIC Scalar i500
12:00:36.866 [9040.9048] <5> tldcd:inquiry: inquiry() function processing library ADIC     Scalar i500      636G:
12:00:36.866 [9040.9048] <6> tldcd:read_element_status_drive: RES drive 1
12:00:37.773 [9040.9048] <6> tldcd:tape_in_drive: valid = 1, sel = 4127, barcode = (000107L4                        )
12:00:37.773 [9040.9048] <6> tldcd:read_element_status_drive: RES drive 1
12:00:39.085 [9040.9048] <6> tldcd:read_element_status_slot: RES storage element 32
12:00:40.382 [9040.9048] <5> tldcd:move_medium: TLD(0) initiating MOVE_MEDIUM from addr 256 to addr 4127
12:00:51.101 [9040.9048] <5> tldcd:tld_main: TLD(0) closing/unlocking robotic path
12:00:51.101 [9040.9048] <6> tldcd:ChildExit: TLD Child process 9040 has exited normally: STATUS_SUCCESS

Persoanlly my next step would be to run a job and then check this is true using robtest from the robot control host.

It's easy to do:

1.  Run job, what activity monitor until you see it mounting media, give it a minute to be sure
2.  Run robtest from the robot control host, select the correct robot
3.  Once in robtest, run the command s d

This shows the drives, and you can see the barcodes of the tapes in the drives, do you see the tape in a drive that is listed in Activity Monitor.

If you don't report back here ...
If you do, um, report back here ...

What should happen, is that once a tape is requested to be put in the drive by NBU, NBU has to determine when the tape is actually in the drive and ready for use.
It does this by constant asking the drive 'Are you ready' ...  it does this multiple times (quite quickly) until the drive responds, 'Yep I've got a tape loaded, I'm ready'.
The way it does this, is by sending the scsi command TUR (Test unit ready).

If TUR never returns with the correct response (or sometimes any responce) then NBU cannot determine the tape is loaded and you get a robot load failure, even though, the tape did actually get loaded - it's a bit of a false error in a way, but that's how it is.

I suspect that this is what is happening here, there is some issue on the TUR side (the reason is that this is the only thing I can think of, providing the tape is really in the drive, hence the robtest check).  Marks post is worth checking first, as if the tape is in the wrong drive, then we don;t get the TUR back.

It could also be worth watching the library physically if some one is close enough - is the tape fully in the drive, or does it pop out again in an 'ejected' position (not sure what robtest would report for that).

If I am correct, then the issue changes slightly - but you are most likely looking at one of:

Drive fault
Firmware issue
Driver issue
SAN issue (see below, if HBA firmware can cause thyis, then quite possible some other SAN issue can cause it (eg switch firmware)
HBA issue (I've seen HBA firmware cause this exact issue)

Of course if no drivers have been changed recently, this is unlikely (+ I would expect it to affect other drives)

We can kinda rule out NBU already as this affects only one drive, if it was NBU I would expect it to affect all drives, as the same code and commands (scsi commands) are sent for each drive.

One test I'm thinking of is loading te tape using robtest, and then sending a TUR command outside of NBU.   I can;t think if windows has the scsi_command (if it does it will be in volmgr\bin dir ) - ok this is a Symantec command, but not a NBU command as such and is useful (at least in Unix land) for issues like this.

Hope this helps.

Will_Restore
Level 6

Martin, think you meant to post this somewhere else. blush

mph999
Level 6
Employee Accredited

Ahh, I wondered where that post has disappeared to ...  Yes, it was meant for Katies robot issue ...