Drive Path keeps on getting down

rsakimoto
Level 5

Hi,

One of our tape drives was recently replaced. The hardware vendor replaced it and updated the firmware on both the tape drive and the library, and said the new drive would be detected automatically. It wasn't, so on the master server I deleted the old tape drive in the admin console and then re-scanned and re-added the drive to pick up its new serial number. We have 4 tape drives.

Since then, the drive path of the replaced drive keeps going down, and duplication jobs on several media pools keep failing whenever the robot assigns or loads a tape into the replaced drive. On the media server that owns the drive path (bdnbu07), the Windows event log shows event ID 14021 - NetBackup TLD Daemon (TLD(0) [7924] timed out after waiting 841 seconds for ready, drive 4) and event ID 2636 - NetBackup (Operator/EMM server has DOWN'ed drive HP.ULTRIUM5-SCSI.000 (device 0)) every time a job reads from or duplicates to the affected drive.

Is this a configuration step I missed after the drive was replaced, or is it still a hardware issue? For now I have left the drive path in a DOWN state so that my duplication jobs are not affected and tapes do not get frozen, since the other 3 tape drives are working.

I would appreciate any known solution or expert help with this scenario.

Thank You! :)

 

 

2 ACCEPTED SOLUTIONS

mph999
Level 6
Employee Accredited

I've never tested whether NBU follows the drives even if they are swapped round.

Until today, and my testing shows that it doesn't seem to ...  Excuse the odd serial numbers, it's a VTL.

 


From changer in scan

Drive 1 Serial Number      : "XYZZY_C1"
Drive 2 Serial Number      : "XYZZY_C2"


From tpconfig -dl

        Drive Name              IBM.ULT3580-TD4.001
        Index                   3
        NonRewindDrivePath      /dev/nst5
        TLD(0) Definition DRIVE=1              (this is wrong, the drive is actually in 'physical' position 2)
        Serial Number           XYZZY_C2

        Drive Name              IBM.ULT3580-TD4.002
        Index                   5
        NonRewindDrivePath      /dev/nst1
        TLD(0) Definition DRIVE=2  (this is wrong, the drive is actually in 'physical' position 1)
        Serial Number           XYZZY_C1


So here, we can see my drives in NBU config are 'swapped' round with what the robot reports.
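The comparison above can also be scripted. The following is only a minimal sketch, assuming the output layouts match the excerpts quoted in this post; the function names are hypothetical, and the text is pasted in as strings rather than obtained by running scan or tpconfig.

```python
import re

# Sample text copied from the excerpts in this post (inline comments removed).
SCAN_CHANGER = '''\
Drive 1 Serial Number      : "XYZZY_C1"
Drive 2 Serial Number      : "XYZZY_C2"
'''

TPCONFIG_DL = '''\
        Drive Name              IBM.ULT3580-TD4.001
        Index                   3
        NonRewindDrivePath      /dev/nst5
        TLD(0) Definition DRIVE=1
        Serial Number           XYZZY_C2

        Drive Name              IBM.ULT3580-TD4.002
        Index                   5
        NonRewindDrivePath      /dev/nst1
        TLD(0) Definition DRIVE=2
        Serial Number           XYZZY_C1
'''


def changer_serials(text):
    """Map robot drive position -> serial number as the changer reports it."""
    pattern = re.compile(r'Drive (\d+) Serial Number\s*:\s*"([^"]*)"')
    return {int(pos): serial for pos, serial in pattern.findall(text)}


def configured_drives(text):
    """Return (drive name, configured robot drive number, serial) per drive."""
    drives = []
    # Drive entries are separated by blank lines in the quoted output.
    for block in re.split(r'\n\s*\n', text.strip()):
        name = re.search(r'Drive Name\s+(\S+)', block)
        number = re.search(r'DRIVE=(\d+)', block)
        serial = re.search(r'Serial Number\s+(\S+)', block)
        if name and number and serial:
            drives.append((name.group(1), int(number.group(1)), serial.group(1)))
    return drives


def find_mismatches(changer_text, tpconfig_text):
    """List drives whose configured position holds a different serial."""
    by_position = changer_serials(changer_text)
    mismatches = []
    for name, number, serial in configured_drives(tpconfig_text):
        actual = by_position.get(number)
        if actual != serial:
            mismatches.append((name, number, serial, actual))
    return mismatches


for name, number, serial, actual in find_mismatches(SCAN_CHANGER, TPCONFIG_DL):
    print(f"{name}: configured as robot drive {number} with serial {serial}, "
          f"but the changer reports {actual} in that position")
```

With the sample text, both drives are reported, which matches the swap described here.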

I restarted ltid and ran a job:

Job details shows

02/16/2016 03:32:04 - granted resource  E03004
02/16/2016 03:32:04 - granted resource  IBM.ULT3580-TD4.001   (This is in position 2 in the library, but NBU thinks it is in position 1)


From robtest we see the tape was loaded into the drive in position 1

drive 1 (addr 1) access = 1 Contains Cartridge = yes
Source address = 1027 (slot 4)
Barcode = E03004L4

vmoprcmd shows the tape in the drive at /dev/nst1

IBM.ULT3580-TD4.001      No      No                     No        hcart
    nbmaster2                  /dev/nst5                            ACTIVE

IBM.ULT3580-TD4.002      Yes     Yes    E03004          Yes       hcart
    nbmaster2                  /dev/nst1                            TLD


The job hangs as the tape never mounts; in bptm we can see that we're accessing the wrong drive.

03:32:06.049 [23119] <4> create_tpreq_file: symlink to path /dev/nst5   <<<<<< WRONG, tape is in drive with path /dev/nst1
03:32:06.088 [23119] <2> manage_drive_before_load: SCSI RESERVE
03:32:06.090 [23119] <2> manage_drive_before_load: report_attr, fl1 0x00000001, fl2 0x00000000
03:32:06.090 [23119] <4> expandpath: /usr/openv/netbackup/db/media/tpreq/drive_IBM.ULT3580-TD4.001
03:32:06.172 [23119] <2> tapelib: wait_for_ltid, Mount, timeout 0

(NOTE: the reason the tape is found in the correct drive in the vmoprcmd output is that NBU continuously scans the drives and will identify any tape that is inserted into a drive, provided it has an NBU media header.)

(Edited to add: the job eventually failed with a Robot load error. I reconfigured the drives so they were in sync with the robot drive positions, and the same job ran successfully.)

 

Suggest you delete and reconfig drives.

 

rsakimoto
Level 5

Hi,

 

Martin's isolation is more or less what I did. I found an incorrect robot drive number assignment by comparing the admin console\devices\drives view against the tape drive numbers and serial numbers shown on the tape library console. The serial numbers did not match: for example, the serial number listed for robot drive number 1 in the GUI did not match the serial number assigned on the tape library. Two drives were affected. I changed the robot drive number under admin console\devices\drives for the affected drives and then restarted services. My duplication jobs are now running smoothly again, and the drive paths no longer go down as before.

Thanks and Regards

 

 

13 REPLIES

sdo
Moderator
Partner    VIP    Certified

From the robot control host, please show us output of:

scan -changer

scan -tape

tpautoconf -report_disc

Michael_G_Ander
Level 6
Certified

Things to check, starting from the "beginning":

1. Is the tape drive online and ready on the robot?

2. Is the tape drive logged into the SAN switch (in the case of an FC-attached tape drive)?

3. Is the tape drive shown correctly on the HBA or SAS card in the OS?

4. Is the tape drive driver loaded properly in the OS? Windows often requires a reboot for this.

5. Is the correct tape drive serial number shown in NetBackup after a rescan of the tape drives?

If 5. is the issue, there is a technote about updating the serial number.

 

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

rsakimoto
Level 5

Hi sdo,

:\Netbackup\Volmgr\bin>scan -changer
************************************************************
*********************** SDT_CHANGER ************************
************************************************************
------------------------------------------------------------
Device Name  : "Changer0"
Passthru Name: "Changer0"
Volume Header: ""
Port: 3; Bus: 0; Target: 5; LUN: 1
Inquiry    : "HP      MSL G3 Series   8.80"
Vendor ID  : "HP      "
Product ID : "MSL G3 Series   "
Product Rev: "8.80"
Serial Number: "MXA131Z0DK"
WWN          : ""
WWN Id Type  : 0
Device Identifier: "HP      MSL G3 Series   MXA131Z0DK"
Device Type    : SDT_CHANGER
NetBackup Robot Type: 8
Removable      : Yes
Device Supports: SCSI-5
Number of Drives : 4
Number of Slots  : 48
Number of Media Access Ports: 0
Drive 1 Serial Number      : "HU1128HBRL"
Drive 2 Serial Number      : "HU1133HWPB"
Drive 3 Serial Number      : "HU1133HWP7"
Drive 4 Serial Number      : "HUE44216G9"
Flags : 0x0
Reason: 0x0

- Comparing the serial number per drive number in the scan -changer output against admin console\devices\drives, I noticed that drives 2 and 4 do not match the order in the scan -changer output: they are interchanged. Is this the cause of the conflict? I also checked our other NetBackup machines, and there they match, meaning each drive number has the same serial number in both the admin console and the scan -changer output.

:\Netbackup\Volmgr\bin>scan -tape
************************************************************
*********************** SDT_TAPE    ************************
************************************************************
------------------------------------------------------------
Device Name  : "Tape0"
Passthru Name: "Tape0"
Volume Header: ""
Port: 2; Bus: 0; Target: 4; LUN: 0
Inquiry    : "HP      Ultrium 5-SCSI  Y6NW"
Vendor ID  : "HP      "
Product ID : "Ultrium 5-SCSI  "
Product Rev: "Y6NW"
Serial Number: "HUE44216G9"
WWN          : ""
WWN Id Type  : 0
Device Identifier: ""
Device Type    : SDT_TAPE
NetBackup Drive Type: 10
Removable      : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name  : "Tape1"
Passthru Name: "Tape1"
Volume Header: ""
Port: 2; Bus: 0; Target: 5; LUN: 0
Inquiry    : "HP      Ultrium 5-SCSI  Y6NW"
Vendor ID  : "HP      "
Product ID : "Ultrium 5-SCSI  "
Product Rev: "Y6NW"
Serial Number: "HU1133HWP7"
WWN          : ""
WWN Id Type  : 0
Device Identifier: ""
Device Type    : SDT_TAPE
NetBackup Drive Type: 10
Removable      : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name  : "Tape2"
Passthru Name: "Tape2"
Volume Header: ""
Port: 3; Bus: 0; Target: 4; LUN: 0
Inquiry    : "HP      Ultrium 5-SCSI  Y6NW"
Vendor ID  : "HP      "
Product ID : "Ultrium 5-SCSI  "
Product Rev: "Y6NW"
Serial Number: "HU1133HWPB"
WWN          : ""
WWN Id Type  : 0
Device Identifier: ""
Device Type    : SDT_TAPE
NetBackup Drive Type: 10
Removable      : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name  : "Tape3"
Passthru Name: "Tape3"
Volume Header: ""
Port: 3; Bus: 0; Target: 5; LUN: 0
Inquiry    : "HP      Ultrium 5-SCSI  Y6NW"
Vendor ID  : "HP      "
Product ID : "Ultrium 5-SCSI  "
Product Rev: "Y6NW"
Serial Number: "HU1128HBRL"
WWN          : ""
WWN Id Type  : 0
Device Identifier: ""
Device Type    : SDT_TAPE
NetBackup Drive Type: 10
Removable      : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0

How do I get a result from tpautoconf -report_disc? There is no output when I run it; it just returns to the prompt.

Thank You!

Marianne
Moderator
Partner    VIP    Accredited Certified

.... timed out after waiting 841 seconds for ready, drive 4....

Still a drive problem. 

The robot loaded the tape in the drive, but the drive never reported 'ready' state back to OS and NBU.

NBU will DOWN a drive after 3 failures in 12 hours and Freeze a tape after 3 errors on the same media-id in 12 hours.
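To illustrate the "3 in 12 hours" rule, here is a minimal sketch of a rolling error window. It only models the rule as stated in this reply; the class name is hypothetical, and how NetBackup tracks these errors internally is not shown anywhere in this thread.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorWindow:
    """Trips once `threshold` errors fall within a rolling `window`."""

    def __init__(self, threshold=3, window=timedelta(hours=12)):
        self.threshold = threshold
        self.window = window
        self.errors = deque()

    def record(self, when):
        """Record an error at `when`; return True if the threshold is reached."""
        self.errors.append(when)
        # Drop errors that are older than the window relative to this one.
        while when - self.errors[0] > self.window:
            self.errors.popleft()
        return len(self.errors) >= self.threshold


# Three errors within 10 hours: the third one trips the threshold.
drive = ErrorWindow()
start = datetime(2016, 2, 16, 3, 0)
print([drive.record(start + timedelta(hours=h)) for h in (0, 5, 10)])
```

The same object could be kept per drive or per media ID, since both rules quoted above use the same count and window.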

sdo
Moderator
Partner    VIP    Certified

What I just got you to do is check that the actual OS view (shown by the scan commands) matches the NetBackup device configuration view (i.e. no discrepancies reported by tpautoconf -report_disc) for just one host.

What I would do next, is issue the same commands on every other NetBackup Server (in the SSO configuration) which also has access to the same tape library.

N.B: The tpautoconf -report_disc command means "show me the discrepancies". If there is no output, then there are no discrepancies for that particular NetBackup Server, which is a good thing.

But you now need to check the other NetBackup Servers which also have access to the same tape library.

.

If you/we can identify that the other hosts are able to scan/detect devices OK, AND that they have no discrepancies either, then something else is wrong. The checks that I'm getting you to do are just simple, standard bread-and-butter checks. These are some of the most basic checks that one would do in this type of situation.

rsakimoto
Level 5

Hi sdo, Michael, Marianne,

There is also no output from tpautoconf -report_disc on the other NBU servers connected to the library. So no discrepancies? I did notice, though, that in the OS Device Manager on the NBU master server, under Storage Controllers, a driver is missing: the Microsoft Multi-Path Bus Driver is not present. It was already missing when the tape drive failed, and it is still missing after the replacement. On the other attached NBU servers, such as the media servers, that driver is present. It is also present on my other NBU master servers and their media servers. Is this what is causing the problem? Is this a hardware issue, or does the driver just need to be reinstalled?

When I run scan -changer, the lower part of the output shows:

Drive 1 Serial Number      : "HU1128HBRL"
Drive 2 Serial Number      : "HU1133HWPB"
Drive 3 Serial Number      : "HU1133HWP7"
Drive 4 Serial Number      : "HUE44216G9" --------- this is the new serial number of tape drive that was replaced

Comparing this against the NetBackup device configuration, I noticed a discrepancy on robot drives 2 and 4:

Drive Name            Robot Drive Number    Serial Number
HP.ULTRIUM-SCI.003    1                     HU1128HBRL
HP.ULTRIUM-SCI.002    2                     HUE44216G9 --------- this is the new serial number of tape drive that was replaced
HP.ULTRIUM-SCI.001    3                     HU1133HWP7
HP.ULTRIUM-SCI.000    4                     HU1133HWPB
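As a cross-check, the robot drive number each configured drive would need can be derived from the two listings by matching on serial number. This is only a sketch using the values quoted in this post; the function name is hypothetical and nothing here talks to NetBackup.

```python
# Robot drive position -> serial, from the scan -changer output in this post.
changer = {
    1: "HU1128HBRL",
    2: "HU1133HWPB",
    3: "HU1133HWP7",
    4: "HUE44216G9",
}

# Drive name -> (configured robot drive number, serial), from the table above.
configured = {
    "HP.ULTRIUM-SCI.003": (1, "HU1128HBRL"),
    "HP.ULTRIUM-SCI.002": (2, "HUE44216G9"),
    "HP.ULTRIUM-SCI.001": (3, "HU1133HWP7"),
    "HP.ULTRIUM-SCI.000": (4, "HU1133HWPB"),
}


def corrected_numbers(changer, configured):
    """Return {drive name: (configured number, number the changer implies)}
    for every drive whose configured number disagrees with the changer."""
    position_of = {serial: pos for pos, serial in changer.items()}
    changes = {}
    for name, (number, serial) in configured.items():
        actual = position_of.get(serial)
        if actual is None:
            raise ValueError(f"serial {serial} of {name} not reported by changer")
        if actual != number:
            changes[name] = (number, actual)
    return changes


for name, (old, new) in corrected_numbers(changer, configured).items():
    print(f"{name}: robot drive number {old} -> {new}")
```

For these values, only the drives configured as robot drive 2 and 4 are flagged, each needing the other's number.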

When the hardware vendor installed the new tape drive, he interchanged the drive assignments of the tape drives. He said the assignment is universal and has no effect on NBU once the drive is re-scanned and detected. Is that right?

sdo
Moderator
Partner    VIP    Certified

You have no discrepancies, which is OK.

The "Robot Drive Number" vs. number 'N' in the "Drive N Serial Number" should not matter, because NetBackup uses 'serialization' to manage access to tape drives.

My current thought, is...

Did you also remember to restart 'ltid' (in Windows this is "NetBackup Device Manager") on all NetBackup Servers (masters and medias) which have access to this library?

This needs to be done after a tape drive replacement.   But this can only be done when all tape drives are free and not being used.

.

My procedure would be:

1) Down all tape drive paths on all NetBackup Servers.

2) Double check that all paths are down.  (it can take a few minutes sometimes).

3) Then restart ltid on all master/media servers (N.B: you should not need to restart all services, you should only need to restart 'ltid'.)

4) Then up all tape drive paths on all NetBackup Servers.

5) Run your test backup policy (which is configured in such a way (usually by not using multi-plexing on the schedule) as to use all tape drives)
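The ordering in steps 1 to 4 can be sketched as a small model. This is only an illustration of the sequence and of the "all paths must be down before restarting ltid" precondition; the class, the method names, and the second server name are hypothetical, and no NetBackup commands are run.

```python
class MediaServer:
    """Toy model of one NetBackup server and its tape drive paths."""

    def __init__(self, name, drives):
        self.name = name
        self.path_up = {drive: True for drive in drives}
        self.ltid_restarts = 0

    def down_all_paths(self):
        for drive in self.path_up:
            self.path_up[drive] = False

    def up_all_paths(self):
        for drive in self.path_up:
            self.path_up[drive] = True

    def restart_ltid(self):
        # ltid covers every drive on the server, so refuse while any path is up.
        still_up = [d for d, up in self.path_up.items() if up]
        if still_up:
            raise RuntimeError(f"{self.name}: paths still up: {still_up}")
        self.ltid_restarts += 1


def restart_device_manager_everywhere(servers):
    for server in servers:                      # step 1: down all paths
        server.down_all_paths()
    for server in servers:                      # step 2: double check
        if any(server.path_up.values()):
            raise RuntimeError(f"{server.name}: not all paths are down")
    for server in servers:                      # step 3: restart ltid only
        server.restart_ltid()
    for server in servers:                      # step 4: up all paths
        server.up_all_paths()


drives = ["drive1", "drive2", "drive3", "drive4"]
servers = [MediaServer("bdnbu07", drives), MediaServer("media02", drives)]
restart_device_manager_everywhere(servers)
print([(s.name, s.ltid_restarts) for s in servers])
```

Step 5, the test backup across all drives, is left out because it depends on the policy configuration.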

HTH.

rsakimoto
Level 5

Hi sdo,

I think I restarted the NBU servers that are connected to the library. That should have restarted the services as well, right?

OK, I'm going to restart only the ltid service. Should I down only the drive path of the affected (replaced) drive, or the drive paths of the other tape drives as well? I have 4 tape drives.

To up the drive path, is it faster or more effective to use tpconfig or the admin console GUI, or does it not matter which way?

Thank you!

sdo
Moderator
Partner    VIP    Certified

When replacing a tape drive, one did not normally need to restart all services, but I could be wrong.  Maybe one has to with the newer versions of NetBackup.  But you are on NetBackup v7.5.x.x, right?

I would recommend downing all tape drive paths on a NetBackup Server when restarting ltid. Remember, "ltid" is not per tape drive; it is the "NetBackup Device Manager" for all tape drives on a NetBackup Server.

You could do the down and up via CLI if you want, I just thought you'd find it easier in the GUI.

Marianne
Moderator
Partner    VIP    Accredited Certified

I have always done tape drive replacements as: delete old drive, restart ltid, add new drive, restart ltid.

The bptm log file on the media server will show whether the drive serial number matches the NBU device database.
From what I can see, this is not the problem; the problem is the drive failing to perform the normal load, rewind, and position actions and to notify the OS and NBU that the drive is ready.

I don't quite understand what the following means:

The hardware vendor when he was installing the new tape drive, interchanged the drive assignment of the tape drive which he said is universal and no effect on nbu once re-scan and detected?

You may want to let the vendor know - this is not working, and tape drive is giving this error in Event Viewer:

 .... timed out after waiting 841 seconds for ready, drive 4....

About your query about Microsoft Multi-Path Bus Driver - this is for disks, not tape drives.

 

I have just thought about something - maybe a good idea to look for 'ghost drives' at OS level and delete them.

DOCUMENTATION: How to delete "ghost" or "phantom" devices from the Windows Device Manager when using Symantec NetBackup (tm)
http://www.veritas.com/docs/000042456

Marianne
Moderator
Partner    VIP    Accredited Certified
This points back to a drive replacement procedure that was not done correctly.