02-15-2016 12:55 AM
Hi,
One of our tape drives was recently replaced. The hardware vendor replaced it and then updated the firmware of both the tape drive and the library. He said the new drive would be detected automatically, but it wasn't, so I had to delete the old tape drive via the admin console on the master server and then re-scan / re-add the drive to introduce its new serial number. We have 4 tape drives.
Since then, the drive path of the replaced drive keeps going down, and duplication jobs keep failing on several media pools whenever the robot auto-assigns or feeds a tape into the replaced drive.
When I check the media server that hosts the drive path (bdnbu07), its Windows event logs show event id 14021 - Netbackup TLD Daemon (TLD(0) [7924] timed out after waiting 841 seconds for ready, drive 4) and event id 2636 - Netbackup (Operator/EMM server has DOWN'ed drive HP.ULTRIUM5-SCSI.000 (device 0)). These are flagged every time a job reads or duplicates on the affected drive.
Is this a lack of configuration on my part after the drive was replaced, or is it still a hardware issue? For now I have left the drive path in a down state so that my duplication jobs are not affected and tapes do not get frozen, since the other 3 tape drives are working.
I would appreciate any known solution or expert help with my scenario.
Thank You! :)
Solved! Go to Solution.
02-16-2016 11:03 PM
I had never tested whether NBU follows the drives even if they are swapped round.
Until today, that is, and my testing shows that it doesn't seem to ... Excuse the odd serial numbers, it's a VTL.
From changer in scan
Drive 1 Serial Number : "XYZZY_C1"
Drive 2 Serial Number : "XYZZY_C2"
From tpconfig -dl
Drive Name IBM.ULT3580-TD4.001
Index 3
NonRewindDrivePath /dev/nst5
TLD(0) Definition DRIVE=1 (this is wrong, the drive is actually in 'physical' position 2)
Serial Number XYZZY_C2
Drive Name IBM.ULT3580-TD4.002
Index 5
NonRewindDrivePath /dev/nst1
TLD(0) Definition DRIVE=2 (this is wrong, the drive is actually in 'physical' position 1)
Serial Number XYZZY_C1
So here, we can see my drives in NBU config are 'swapped' round with what the robot reports.
I restarted ltid and ran a job:
Job details shows
02/16/2016 03:32:04 - granted resource E03004
02/16/2016 03:32:04 - granted resource IBM.ULT3580-TD4.001 (This is in position 2 in the library, but NBU thinks it is in position 1)
From robtest we see the tape was loaded into the drive in position 1
drive 1 (addr 1) access = 1 Contains Cartridge = yes
Source address = 1027 (slot 4)
Barcode = E03004L4
vmoprcmd shows the drive in /dev/nst1
IBM.ULT3580-TD4.001 No No No hcart
nbmaster2 /dev/nst5 ACTIVE
IBM.ULT3580-TD4.002 Yes Yes E03004 Yes hcart
nbmaster2 /dev/nst1 TLD
The job hangs as the tape never mounts; in bptm we can see that we're accessing the wrong drive.
03:32:06.049 [23119] <4> create_tpreq_file: symlink to path /dev/nst5 <<<<<< WRONG, tape is in drive with path /dev/nst1
03:32:06.088 [23119] <2> manage_drive_before_load: SCSI RESERVE
03:32:06.090 [23119] <2> manage_drive_before_load: report_attr, fl1 0x00000001, fl2 0x00000000
03:32:06.090 [23119] <4> expandpath: /usr/openv/netbackup/db/media/tpreq/drive_IBM.ULT3580-TD4.001
03:32:06.172 [23119] <2> tapelib: wait_for_ltid, Mount, timeout 0
(NOTE: the reason the tape is found in the correct drive in the vmoprcmd output is that NBU continuously scans the drives and will identify any tape that is inserted into a drive, provided it has an NBU media header.)
(Edited to add: the job eventually failed with a Robot load error. I reconfigured the drives so they were in sync with the robot drive positions, and the same job ran successfully.)
Suggest you delete and reconfig drives.
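The failure in the test above can be traced with a small sketch. This is only an illustration of the reasoning, not how NetBackup works internally: the `DriveConfig` class and `trace_mount` function are hypothetical names, and the data is copied from the tpconfig -dl and scan output in this post.

```python
from dataclasses import dataclass


@dataclass
class DriveConfig:
    name: str                # drive name as shown by tpconfig -dl
    robot_drive_number: int  # DRIVE=N in the TLD(0) definition
    path: str                # NonRewindDrivePath
    serial: str              # serial number stored in the NBU config


# NBU device configuration from the test (drive numbers are swapped).
CONFIG = [
    DriveConfig("IBM.ULT3580-TD4.001", 1, "/dev/nst5", "XYZZY_C2"),
    DriveConfig("IBM.ULT3580-TD4.002", 2, "/dev/nst1", "XYZZY_C1"),
]

# What the robot itself reports: physical position -> drive serial.
ROBOT_POSITIONS = {1: "XYZZY_C1", 2: "XYZZY_C2"}


def trace_mount(granted_name):
    """Return (path NBU opens, path of the drive that really holds the tape)."""
    granted = next(d for d in CONFIG if d.name == granted_name)
    # The robot is asked to load the configured position number ...
    loaded_serial = ROBOT_POSITIONS[granted.robot_drive_number]
    # ... which physically belongs to whichever drive has that serial.
    actual = next(d for d in CONFIG if d.serial == loaded_serial)
    return granted.path, actual.path


opened, loaded = trace_mount("IBM.ULT3580-TD4.001")
print(f"NBU waits on {opened}, but the tape is in the drive at {loaded}")
```

With the swapped configuration the two paths differ (/dev/nst5 versus /dev/nst1), which matches the bptm log and the hung mount. Once the drive numbers are in sync with the robot, both paths are the same.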
02-29-2016 10:04 PM
Hi,
Martin's isolation is more or less what I did. I found an incorrect robot drive number assignment by comparing the tape drive numbers and serial numbers in the admin console GUI (devices\drives) with those on the tape library console. The serial numbers did not match: for example, the serial number shown for robot drive number 1 in the GUI did not match the serial number assigned to that drive on the tape library. Two drives were affected. I changed the robot drive number in the GUI (devices\drives) for the affected drives and then restarted services. My duplication jobs are running smoothly again, and the drive paths no longer go down as before.
Thanks and Regards
02-15-2016 01:01 AM
From the robot control host, please show us the output of:
scan -changer
scan -tape
tpautoconf -report_disc
02-15-2016 01:25 AM
Things to check, starting from the "beginning":
1. Is the tape drive online and ready on the robot?
2. Is the tape drive logged into the SAN switch (in the case of an FC-attached tape drive)?
3. Is the tape drive shown correctly on the HBA or SAS card in the OS?
4. Is the tape drive driver loaded properly in the OS? Windows often requires a reboot for this.
5. Is the correct tape drive serial number shown in NetBackup after a rescan of the tape drives?
If 5. is the issue, there is a technote about updating the serial number.
02-15-2016 01:49 AM
Hi sdo,
:\Netbackup\Volmgr\bin>scan -changer
************************************************************
*********************** SDT_CHANGER ************************
************************************************************
------------------------------------------------------------
Device Name : "Changer0"
Passthru Name: "Changer0"
Volume Header: ""
Port: 3; Bus: 0; Target: 5; LUN: 1
Inquiry : "HP MSL G3 Series 8.80"
Vendor ID : "HP "
Product ID : "MSL G3 Series "
Product Rev: "8.80"
Serial Number: "MXA131Z0DK"
WWN : ""
WWN Id Type : 0
Device Identifier: "HP MSL G3 Series MXA131Z0DK"
Device Type : SDT_CHANGER
NetBackup Robot Type: 8
Removable : Yes
Device Supports: SCSI-5
Number of Drives : 4
Number of Slots : 48
Number of Media Access Ports: 0
Drive 1 Serial Number : "HU1128HBRL"
Drive 2 Serial Number : "HU1133HWPB"
Drive 3 Serial Number : "HU1133HWP7"
Drive 4 Serial Number : "HUE44216G9"
Flags : 0x0
Reason: 0x0
- I noticed, comparing the serial number assigned to each drive number in the scan -changer output with the admin console\devices\drives view, that drives 2 and 4 do not match the order in the scan -changer output. They are interchanged. Is this the cause of the conflict? I ask because I also checked our other NetBackup machines and there it matches, i.e. the serial number assigned to each drive number is the same in both the admin console and the scan -changer output in the command window.
:\Netbackup\Volmgr\bin>scan -tape
************************************************************
*********************** SDT_TAPE ************************
************************************************************
------------------------------------------------------------
Device Name : "Tape0"
Passthru Name: "Tape0"
Volume Header: ""
Port: 2; Bus: 0; Target: 4; LUN: 0
Inquiry : "HP Ultrium 5-SCSI Y6NW"
Vendor ID : "HP "
Product ID : "Ultrium 5-SCSI "
Product Rev: "Y6NW"
Serial Number: "HUE44216G9"
WWN : ""
WWN Id Type : 0
Device Identifier: ""
Device Type : SDT_TAPE
NetBackup Drive Type: 10
Removable : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name : "Tape1"
Passthru Name: "Tape1"
Volume Header: ""
Port: 2; Bus: 0; Target: 5; LUN: 0
Inquiry : "HP Ultrium 5-SCSI Y6NW"
Vendor ID : "HP "
Product ID : "Ultrium 5-SCSI "
Product Rev: "Y6NW"
Serial Number: "HU1133HWP7"
WWN : ""
WWN Id Type : 0
Device Identifier: ""
Device Type : SDT_TAPE
NetBackup Drive Type: 10
Removable : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name : "Tape2"
Passthru Name: "Tape2"
Volume Header: ""
Port: 3; Bus: 0; Target: 4; LUN: 0
Inquiry : "HP Ultrium 5-SCSI Y6NW"
Vendor ID : "HP "
Product ID : "Ultrium 5-SCSI "
Product Rev: "Y6NW"
Serial Number: "HU1133HWPB"
WWN : ""
WWN Id Type : 0
Device Identifier: ""
Device Type : SDT_TAPE
NetBackup Drive Type: 10
Removable : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
------------------------------------------------------------
Device Name : "Tape3"
Passthru Name: "Tape3"
Volume Header: ""
Port: 3; Bus: 0; Target: 5; LUN: 0
Inquiry : "HP Ultrium 5-SCSI Y6NW"
Vendor ID : "HP "
Product ID : "Ultrium 5-SCSI "
Product Rev: "Y6NW"
Serial Number: "HU1128HBRL"
WWN : ""
WWN Id Type : 0
Device Identifier: ""
Device Type : SDT_TAPE
NetBackup Drive Type: 10
Removable : Yes
Device Supports: SCSI-6
Flags : 0x0
Reason: 0x0
How do I get a result for tpautoconf -report_disc? There is no output when I run it. It just returns to the prompt.
Thank You!
02-15-2016 01:49 AM
.... timed out after waiting 841 seconds for ready, drive 4....
Still a drive problem.
The robot loaded the tape in the drive, but the drive never reported 'ready' state back to OS and NBU.
NBU will DOWN a drive after 3 failures in 12 hours and freeze a tape after 3 errors on the same media-id in 12 hours.
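To make the "3 failures in 12 hours" rule concrete, here is a minimal sliding-window sketch. It is not NetBackup's actual implementation: the `ErrorWindow` class is a hypothetical name, the threshold and window are simply the values quoted above, and timestamps are assumed to arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorWindow:
    """Counts errors for one drive (or one media-id) inside a time window."""

    def __init__(self, threshold=3, window=timedelta(hours=12)):
        self.threshold = threshold
        self.window = window
        self.events = deque()  # timestamps of recent errors, oldest first

    def record(self, when):
        """Record an error; return True if the threshold is now reached."""
        self.events.append(when)
        cutoff = when - self.window
        # Discard errors that are older than the window.
        while self.events and self.events[0] < cutoff:
            self.events.popleft()
        return len(self.events) >= self.threshold


start = datetime(2016, 2, 15, 0, 0)

# Three errors within 12 hours: the third one trips the threshold.
drive = ErrorWindow()
results = [drive.record(start + timedelta(hours=h)) for h in (0, 5, 11)]

# Three errors spread over 14 hours: the first has expired by the third.
spaced = ErrorWindow()
spaced_results = [spaced.record(start + timedelta(hours=h)) for h in (0, 7, 14)]

print(results, spaced_results)
```

This is why leaving the path of a failing drive down, as the original poster did, keeps the remaining tapes from accumulating errors and being frozen.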
02-15-2016 02:17 AM
What I just got you to do is check that the actual OS view (shown by the scan commands) matches the NetBackup device configuration view (i.e. no discrepancies reported by tpautoconf -report_disc) for just one host.
What I would do next is issue the same commands on every other NetBackup Server (in the SSO configuration) which also has access to the same tape library.
N.B: The tpautoconf -report_disc command means "show me the discrepancies". If there is no output, then there are no discrepancies for that particular NetBackup Server, which is a good thing.
But you now need to check the other NetBackup Servers which also have access to the same tape library.
.
If you/we can identify that the other hosts are able to scan/detect devices ok, AND that they have no discrepancies either, then something else is wrong. These checks that I'm getting you to do are just simple, standard, bread-and-butter checks. They are some of the most basic checks that one would do in this type of situation.
02-15-2016 04:50 PM
Hi sdo, Michael, Marianne,
There is also no output for tpautoconf -report_disc on the other NBU servers connected to the library. So no discrepancies? I did notice, though, that in the OS Device Manager on the NBU master server, under Storage Controllers, the Microsoft Multi-Path Bus Driver is not present. It was already missing when the tape drive failed, and it is still missing after the replacement. On the other attached NBU servers, such as its media servers, that driver is present. It is also present on my other NBU master servers and their media servers. Could this be the cause? Is this a hardware issue, or does the driver just need to be re-installed?
When I run scan -changer, the lower part of the output shows:
Drive 1 Serial Number : "HU1128HBRL"
Drive 2 Serial Number : "HU1133HWPB"
Drive 3 Serial Number : "HU1133HWP7"
Drive 4 Serial Number : "HUE44216G9" --------- this is the new serial number of tape drive that was replaced
Comparing this with the NetBackup device configuration, I noticed a discrepancy on robot drives 2 and 4:
Drive Name           Robot Drive Number   Serial Number
HP.ULTRIUM-SCI.003   1                    HU1128HBRL
HP.ULTRIUM-SCI.002   2                    HUE44216G9 --------- this is the new serial number of tape drive that was replaced
HP.ULTRIUM-SCI.001   3                    HU1133HWP7
HP.ULTRIUM-SCI.000   4                    HU1133HWPB
The hardware vendor when he was installing the new tape drive, interchanged the drive assignment of the tape drive which he said is universal and no effect on nbu once re-scan and detected?
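The comparison described in this post can be sketched as a short script. This is only an illustration: the function names are hypothetical, the scan -changer lines and the configuration table are pasted in from this post, and nothing here calls NetBackup itself.

```python
import re

# Lower part of the scan -changer output, as posted.
SCAN_CHANGER_OUTPUT = '''
Drive 1 Serial Number : "HU1128HBRL"
Drive 2 Serial Number : "HU1133HWPB"
Drive 3 Serial Number : "HU1133HWP7"
Drive 4 Serial Number : "HUE44216G9"
'''

# Robot drive number -> (drive name, serial) from the NetBackup device
# configuration table, as posted.
NBU_CONFIG = {
    1: ("HP.ULTRIUM-SCI.003", "HU1128HBRL"),
    2: ("HP.ULTRIUM-SCI.002", "HUE44216G9"),
    3: ("HP.ULTRIUM-SCI.001", "HU1133HWP7"),
    4: ("HP.ULTRIUM-SCI.000", "HU1133HWPB"),
}


def parse_changer_serials(text):
    """Map robot drive position -> serial number reported by the robot."""
    pattern = re.compile(r'Drive\s+(\d+)\s+Serial Number\s*:\s*"([^"]*)"')
    return {int(num): serial.strip() for num, serial in pattern.findall(text)}


def find_mismatches(robot, config):
    """List drive numbers whose configured serial differs from the robot's."""
    mismatches = []
    for number, (name, serial) in sorted(config.items()):
        reported = robot.get(number)
        if reported != serial:
            mismatches.append((number, name, serial, reported))
    return mismatches


robot_serials = parse_changer_serials(SCAN_CHANGER_OUTPUT)
mismatches = find_mismatches(robot_serials, NBU_CONFIG)
for number, name, configured, reported in mismatches:
    print(f"Robot drive {number} ({name}): "
          f"NetBackup has {configured}, robot reports {reported}")
```

On the posted data this flags robot drives 2 and 4, the same two drives identified by hand above.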
02-15-2016 05:14 PM
You have no discrepencies, which is ok.
The "Robot Drive Number" vs. number 'N' in the "Drive N Serial Number" should not matter, because NetBackup uses 'serialization' to manage access to tape drives.
My current thought, is...
Did you also remember to restart 'ltid' (in Windows this is "NetBackup Device Manager") on all NetBackup Servers (masters and medias) which have access to this library?
This needs to be done after a tape drive replacement. But this can only be done when all tape drives are free and not being used.
.
My procedure would be:
1) Down all tape drive paths on all NetBackup Servers.
2) Double check that all paths are down. (it can take a few minutes sometimes).
3) Then restart ltid on all master/media servers (N.B: you should not need to restart all services, you should only need to restart 'ltid'.)
4) Then up all tape drive paths on all NetBackup Servers.
5) Run your test backup policy (which is configured in such a way (usually by not using multi-plexing on the schedule) as to use all tape drives)
HTH.
02-15-2016 05:49 PM
Hi sdo,
I think I restarted the NBU servers that are connected to the library. That should have restarted the services as well, right?
OK, I'm going to restart only the ltid service. Should I down only the drive path of the affected (replaced) drive, or the drive paths of the other tape drives as well? I have 4 tape drives.
To up the drive path, is it more effective or faster to use tpconfig or the admin console GUI, or does it not matter which way?
Thank You!
02-15-2016 11:50 PM
When replacing a tape drive, one does not normally need to restart all services, but I could be wrong. Maybe one has to with the newer versions of NetBackup. But you are on NetBackup v7.5.x.x, right?
I would recommend downing all tape drive paths on a NetBackup Server when restarting ltid. Remember, "ltid" is not per tape drive; it is the "NetBackup Device Manager" for all tape drives on a NetBackup Server.
You could do the down and up via CLI if you want, I just thought you'd find it easier in the GUI.
02-16-2016 12:11 AM
I have always done tape drive replacements as: delete old drive, restart ltid, add new drive, restart ltid.
The bptm log file on the media server will show whether the drive serial number matches the NBU device database.
From what I can see, this is not the problem. The problem is the drive failing to perform the normal load, rewind and position actions and to notify the OS and NBU that the drive is ready.
I don't quite understand what the following means:
The hardware vendor when he was installing the new tape drive, interchanged the drive assignment of the tape drive which he said is universal and no effect on nbu once re-scan and detected?
You may want to let the vendor know - this is not working, and tape drive is giving this error in Event Viewer:
.... timed out after waiting 841 seconds for ready, drive 4....
About your query about Microsoft Multi-Path Bus Driver - this is for disks, not tape drives.
I have just thought about something - maybe a good idea to look for 'ghost drives' at OS level and delete them.
DOCUMENTATION: How to delete "ghost" or "phantom" devices from the Windows Device Manager when using Symantec NetBackup (tm)
http://www.veritas.com/docs/000042456
02-29-2016 10:30 PM