NetBackup config problem or bad tape drive?

MB-Rvbd
Level 2

I need some help with troubleshooting my NetBackup configuration. It was running fine until two weeks ago. Here is my setup:

    NetBackup 6.5.6
    Windows Server 2003 R2 Standard x64 Edition with SP2

    Exabyte Magnum 448 robot
    Driver version 2.1.9.0

    IBM Ultrium TD4 tape drive
    Driver version 6.2.1.7

    Master Server: windows-backup
    Media Server: windows-backup

Windows' Device Manager says the hardware is working properly.

NetBackup says the hardware is up and running.

The robot passes Drive Diagnostics. However, the tape drive fails diagnostics:

    Drive Information - Success
    Basic Test - Failed (Error on writing tape, aborting test)
    Locate Block - Failed (ditto)
    Error Checking - Failed (ditto)
    Label - Failed (ditto)
    Performance - Failed (ditto)

In Activity Monitor, the job's Detailed Status says "media manager failure(810)".

If I run any profile, it fails with the error "Error bptm(pid=2600) write error on media id {whatever tape was chosen}, drive index 0, writing header block, 1111". This has happened on 3 different tapes. Is this a configuration problem with NetBackup, or has my tape drive developed bad read/write heads?

Accepted Solutions

mph999
Level 6
Employee Accredited

 

Hi there,

NBU has very little to do with tape drives, or robots for that matter.  Although you have spent many $ on your hardware, I am truly sorry if I have just left your hopes and dreams lying shattered on the ground ...  ;0)

Put simply, NBU does not read or write to tape; that's all done by the OS.  Further, even with an inventory we wait to be 'told' what is where by the library, we don't 'look'.  Robtest - just a bunch of SCSI commands, scan - again a bunch of SCSI commands ....  you get the idea ...

Don't get me wrong, NBU can cause device issues, but it's rare.

Device config, well, this basically runs scan to poll whatever is seen by the OS and then configures it - so as long as it configures, all is good.  OK, sure, you can end up with config issues if drive paths get renamed due to not having persistent binding set.  What then happens is that a tape is loaded into drive 1, but the paths of the drives got swapped, so the path we think is drive 1 is now pointing to, say, drive 3. We then try to access drive 3, which of course either has no tape in it, or not the tape we think it is ...

What I would do is as follows:

Delete the devices. I would use nbemmcmd -deletealldevices -allrecords

NOTE: This removes ALL robots and ALL tape drives. I'm presuming you only have the h/ware mentioned.  If you have multiple robots etc... this might not be such a good idea.

And then reconfig and re-inventory (be sure to make a note of, and put back, the Media ID rules and barcode rules if set).

Why - this is the quickest way to see if the config is the issue, because once it is back, providing you don't reboot, we can say it's good with a high level of confidence.  If you don't have persistent binding (set on the HBA) to 'glue' an OS path to a particular drive, the paths can reorder on boot and cause bad things to happen.

Alternatively, we could load a tape into a drive using tpreq and then work out if it's gone into the right drive, or look at scan output and compare it to tpconfig -dl output to work out if the paths are all correct, but reconfig is quicker, sets your mind at rest, and either fixes the issue or doesn't.  If it does resolve it, then be sure to set persistent binding (as mentioned, this is done via the HBA utility - either SANsurfer for QLogic cards or HBAnyware for Emulex cards).
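As a rough illustration of the scan vs. tpconfig -dl comparison described above, here is a minimal Python sketch. It assumes you have already copied each drive's serial number and path out of both outputs by hand; the function name `find_swapped_paths`, the serial numbers, and the paths are all hypothetical and not part of any NetBackup tool.

```python
def find_swapped_paths(configured, discovered):
    """Return drives whose configured path differs from what the OS reports now.

    Both arguments map a drive serial number to a path string. The serial
    number is the key because it stays with the physical drive even if the
    OS renumbers its path after a reboot.
    """
    mismatches = {}
    for serial, configured_path in configured.items():
        current_path = discovered.get(serial)  # None if the drive is not seen at all
        if current_path != configured_path:
            mismatches[serial] = (configured_path, current_path)
    return mismatches


# Hypothetical values, for illustration only.
configured = {"SN-A": "{1,0,3,0}", "SN-B": "{1,0,4,0}"}  # what the device config recorded
discovered = {"SN-A": "{1,0,4,0}", "SN-B": "{1,0,3,0}"}  # what the OS shows now

for serial, (was, now) in find_swapped_paths(configured, discovered).items():
    print(f"{serial}: configured at {was}, currently seen at {now}")
```

An empty result would suggest the paths still line up; any entry means the path recorded for that serial number no longer matches what the OS sees.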

You could also look for errors in the system event logs, though you may need to increase volume manager logging (see the link in my signature for how to do this).

Worth mentioning that drive paths getting swapped round can also be caused by scsi resets on the SAN, though reboots are the more common cause, at least in my opinion.  If the machine hasn't rebooted between it working and not working, then it is less likely to be a config issue, but certainly not impossible.

If after reconfig it still fails, I'd start thinking more towards a drive or media issue.

Hope this helps,

Martin

MB-Rvbd
Level 2

NetBackup is managing just those 2 devices listed above.

The tape library and tape drive are SCSI devices; their IDs haven't changed.

I ran "nbemmcmd -deletealldevices -allrecords", then launched NetBackup and ran "Configure Storage Devices". It found the library and drive. The only change that I had to make was to move the drive from standalone to the library. Then I went to the robot inventory and ran "Update volume configuration". Afterward, I ran "tpconfig -dl":

    Currently defined drives and robots are:

        Drive Name              IBM.ULTRIUM-TD4.000
        Index                   0
        SCSI coordinates        {1,0,3,0}
        Type                    hcart
        Status                  UP
        SCSI Protection         SR (Global)
        Shared Access           No
        TLD(0) Definition DRIVE=1
        Serial Number           1310052583

    Currently defined robotics are:
      TLD(0)     SCSI coordinates = {1,0,3,1}

    EMM Server = windows-backup
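For anyone who wants to check this kind of output in a script rather than by eye, a minimal Python sketch follows. It assumes the layout is exactly as pasted above (a field name, two or more spaces, then the value); `parse_drives` is a hypothetical helper, not a NetBackup API, and other versions may format the output differently.

```python
import re

# The tpconfig -dl output from this post, used as sample input.
SAMPLE = """\
Currently defined drives and robots are:

    Drive Name              IBM.ULTRIUM-TD4.000
    Index                   0
    SCSI coordinates        {1,0,3,0}
    Type                    hcart
    Status                  UP
    SCSI Protection         SR (Global)
    Shared Access           No
    TLD(0) Definition DRIVE=1
    Serial Number           1310052583

Currently defined robotics are:
  TLD(0)     SCSI coordinates = {1,0,3,1}

EMM Server = windows-backup
"""

# Only the drive fields we care about; the robot line starts with "TLD(0)"
# and therefore does not match.
FIELD = re.compile(
    r"^\s*(Drive Name|Index|SCSI coordinates|Type|Status|Serial Number)\s{2,}(.+?)\s*$"
)


def parse_drives(text):
    """Return one dict per drive, keyed by the field names shown in the output."""
    drives = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Drive Name":  # each drive record starts with its name
            drives.append({})
        if drives:
            drives[-1][key] = value
    return drives


for drive in parse_drives(SAMPLE):
    print(drive["Drive Name"], drive["Status"], drive["SCSI coordinates"], drive["Serial Number"])
```

The serial number is the useful field to hold on to, since it identifies the physical drive regardless of which path the OS assigns.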

I have a regularly scheduled backup for tonight. I'll check its status on Monday morning.

mph999
Level 6
Employee Accredited

OK, so with only one tape drive, it's unlikely the path has changed.

Sounds in that case like the drive may have gone bad.  Let's see what has happened on Monday.

MB-Rvbd
Level 2

My weekend backup jobs failed again, same as before. I tried to clean the tape drive's heads and was informed that my cleaning cartridge had expired.

I bought a new cartridge, cleaned the drive, and then ran another backup profile. No change:

Error bptm(pid=3356) write error on media id A00012, drive index 0, writing header block, 1111

That's the same error on 4 different tapes. The only conclusion that I have is the tape drive's read/write heads are damaged, and the drive needs to be replaced.
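The reasoning here (several different tapes all failing in the same drive points at the drive rather than at any one tape) can be tallied from the job details with a small Python sketch. It assumes the error lines look like the one quoted above; `failures_by_drive` is a hypothetical helper, and every media ID and pid other than A00012 / 3356 is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the bptm write-error line as quoted in this thread.
BPTM_WRITE_ERROR = re.compile(
    r"bptm\(pid=(?P<pid>\d+)\) write error on media id (?P<media>\w+), "
    r"drive index (?P<drive>\d+)"
)


def failures_by_drive(lines):
    """Map each drive index to the set of media IDs that failed in it."""
    failed = defaultdict(set)
    for line in lines:
        match = BPTM_WRITE_ERROR.search(line)
        if match:
            failed[int(match.group("drive"))].add(match.group("media"))
    return dict(failed)


log = [
    # Real line from this thread:
    "Error bptm(pid=3356) write error on media id A00012, drive index 0, writing header block, 1111",
    # Hypothetical lines standing in for the other failed tapes:
    "Error bptm(pid=2600) write error on media id A00007, drive index 0, writing header block, 1111",
    "Error bptm(pid=2612) write error on media id A00009, drive index 0, writing header block, 1111",
]

for drive, media in failures_by_drive(log).items():
    print(f"drive index {drive}: {len(media)} distinct tapes failed ({', '.join(sorted(media))})")
```

With a single drive this only confirms what is already visible, but in a library with several drives the same tally would show whether the failures follow one drive or one tape.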

mph999
Level 6
Employee Accredited

I would agree ....