03-05-2018 03:54 PM
A few weeks ago we had to power down the machine room. Upon power-up, I can only see 4 of the six LTO7 drives.
I removed the robot and drives and we re-ran the wizard.
It finds all six drives but says 2 of them are under the control of a remote host.
I am really confused how that happened and even more confused how I fix it.
03-05-2018 04:28 PM
03-05-2018 11:00 PM
Seems this particular server (nbu2?) has 'lost' 2 tape drives at OS-level.
Other media server(s) sharing these tape drives can still see them.
So, troubleshooting needs to start at OS-level.
You selected Linux as OS - so, if this server is Linux, then you can use this command to list tape drives seen at OS-level:
cat /proc/scsi/scsi
NetBackup's 'scan' command is also good: it not only lists the OS devices, but reports only on devices that respond to SCSI commands.
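To compare hosts quickly, the /proc/scsi/scsi listing can be reduced to a count per device type. This is a minimal sketch; the function name count_scsi_type is made up for illustration, and it only counts the "Type:" lines that the kernel prints for each attached device.

```shell
# Count devices of a given SCSI type in /proc/scsi/scsi-style output.
# Usage: count_scsi_type <type> [file]
# <type> is the text after "Type:", e.g. "Sequential-Access" for tape drives
# or "Medium Changer" for the robot. [file] defaults to /proc/scsi/scsi.
count_scsi_type() (
    type="$1"
    file="${2:-/proc/scsi/scsi}"
    # Each attached device has exactly one "Type:" line; -c counts matches.
    grep -c "Type:[[:space:]]*${type}" "$file"
)
```

With all six LTO7 drives visible, `count_scsi_type "Sequential-Access"` should print 6 on each server. Anything lower means the drives are missing at OS level, before NetBackup is involved.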
03-06-2018 04:12 AM
This may be a SCSI reservation that was not released, which matches the power-drop scenario. If a tape drive has an active SCSI reservation, it will look like it's attached to a remote host. Depending on the OS, there are tools to break the SCSI reservation; alternatively, reboot the tape drives themselves.
03-06-2018 09:13 AM - edited 03-06-2018 09:21 AM
Even more fun now.
The switch (QLogic SANbox 5800) can see all the drives.
Each host (NBU1, the master, and NBU2, the media server) only sees 3 drives.
I've tried rescanning the scsi bus, I've tried new cables, new optics, new ports on the switch, different HBAs.
I'm completely at a loss and now at the point of trying a new switch. But I don't think it's the switch.
In the attached picture of the switch GUI, ports 1 and 2 are the hosts and the rest are the tape drives.
[root@nbu2 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: HP Model: P440ar Rev: 6.06
Type: RAID ANSI SCSI revision: 05
Host: scsi1 Channel: 01 Id: 00 Lun: 00
Vendor: HP Model: LOGICAL VOLUME Rev: 6.06
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 01
Vendor: QUANTUM Model: Scalar i6000 Rev: 745Q
Type: Medium Changer ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 02 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
[root@nbu2 ~]#
[root@nbu1 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: HP Model: P440ar Rev: 6.06
Type: RAID ANSI SCSI revision: 05
Host: scsi1 Channel: 01 Id: 00 Lun: 00
Vendor: HP Model: LOGICAL VOLUME Rev: 6.06
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi1 Channel: 01 Id: 00 Lun: 01
Vendor: HP Model: LOGICAL VOLUME Rev: 6.06
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 01
Vendor: QUANTUM Model: Scalar i6000 Rev: 745Q
Type: Medium Changer ANSI SCSI revision: 03
[root@nbu1 ~]#
03-06-2018 09:31 AM
So it seems now it's settled on 3 drives showing up on each machine.
I'm really and utterly completely lost right now. Nothing I can try seems to make any more drives show up.
The switch looks OK. I believe this is an OS thing, but I just don't know what to do.
03-06-2018 11:05 AM
Another reboot.. now it's 4 drives.. and same error as before. Robot has 6 drives but 2 are in use on another system.
*puts head on desk*
03-07-2018 12:06 AM - edited 03-07-2018 12:13 AM
I remember some years ago that devices on a SAN had to be booted up in a certain order.
Not sure what this order is and if it's still applicable.
Hopefully others with more current experience will advise.
Have you tried to reboot the tape library?
As long as /proc/scsi/scsi and 'scan' do not see all tape drives, forget about NBU device wizard.
*** Edit ***
I notice that you have tried many things:
"I've tried rescanning the scsi bus, I've tried new cables, new optics, new ports on the switch, different HBAs."
Did you redo zoning when making switch and/or HBA changes?
Are all tape drives connected to the same HBA on the server? Did you rescan all host adapter IDs?
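For reference, rescanning every host adapter on Linux can be done through sysfs. The following is a minimal sketch, not a tested procedure for this site: the function name rescan_all_scsi_hosts is made up, it must run as root against the real sysfs tree, and the optional directory argument exists only so it can be tried against a scratch directory first.

```shell
# Ask every SCSI host adapter to rescan all channels, targets and LUNs.
# Usage: rescan_all_scsi_hosts [sysfs_dir]   (default: /sys/class/scsi_host)
rescan_all_scsi_hosts() (
    root="${1:-/sys/class/scsi_host}"
    for host in "$root"/host*; do
        [ -e "$host/scan" ] || continue   # skip entries without a scan node
        # "- - -" is the wildcard triple: channel, target id, LUN
        echo "- - -" > "$host/scan" && echo "rescanned ${host##*/}"
    done
)
```

Looping over every hostN entry avoids rescanning only the adapter that already shows drives, which is easy to do by accident when a server has more than one HBA port.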
03-07-2018 01:17 AM - edited 03-07-2018 01:22 AM
Have you tried using tools like QLogic SANsurfer to list which WWNs the Fibre Channel adapter can see on the master and media servers? There may be a difference between what the OS and the HBA can see.
Is it possible that some sort of robot partition has become active after the power down?
In the screen dump you attached from the switch there is a column named "active zones", but none are listed. Is this intended?
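If a vendor tool is not at hand, the Linux FC transport class exposes the remote ports each HBA has logged in to under sysfs. This is a rough sketch that assumes the standard /sys/class/fc_remote_ports layout; the function name list_fc_remote_ports is made up for illustration.

```shell
# Print "<rport> <port_name> <port_state>" for every FC remote port
# known to the HBAs on this server.
# Usage: list_fc_remote_ports [sysfs_dir]   (default: /sys/class/fc_remote_ports)
list_fc_remote_ports() (
    root="${1:-/sys/class/fc_remote_ports}"
    for rport in "$root"/rport-*; do
        [ -r "$rport/port_name" ] || continue   # no remote ports, or unreadable
        printf '%s %s %s\n' "${rport##*/}" \
            "$(cat "$rport/port_name")" \
            "$(cat "$rport/port_state" 2>/dev/null)"
    done
)
```

Comparing the port names printed here with the WWNs the switch reports shows whether a drive is missing at HBA level or only at SCSI level.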
03-07-2018 03:49 PM
The switch is unzoned as it's the ONLY thing ever going to be on this switch so I was told leaving it unzoned is fine.
But I've also tried all the things above with it zoned (and still no go).
After removing the robot and re-adding it for the 123098129th time, it found all six drives. I could back up 10-15GB and restore with no problem.
So we restarted the 25TB backup. And after 30GB....
[91561.348896] rport-2:0-6: blocked FC remote port time out: removing target and saving binding
[91615.066704] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:5:0 -- 1 2002.
[91615.067870] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:5:0 -- 1 2002.
[91615.067880] qla2xxx [0000:08:00.1]-8009:2: DEVICE RESET ISSUED nexus=2:5:0 cmd=ffff8807e269d340.
[91635.258200] qla2xxx [0000:08:00.1]-800c:2: do_reset failed for cmd=ffff8807e269d340.
[91635.259153] qla2xxx [0000:08:00.1]-800f:2: DEVICE RESET FAILED: Task management failed nexus=2:5:0 cmd=ffff8807e269d340.
[91635.259157] qla2xxx [0000:08:00.1]-8009:2: TARGET RESET ISSUED nexus=2:5:0 cmd=ffff8807e269d340.
[91635.261900] qla2xxx [0000:08:00.1]-800e:2: TARGET RESET SUCCEEDED nexus:2:5:0 cmd=ffff8807e269d340.
[91635.262600] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:5:0 -- 1 2002.
[91635.262603] qla2xxx [0000:08:00.1]-8012:2: BUS RESET ISSUED nexus=2:5:0.
[91635.286515] qla2xxx [0000:08:00.1]-802b:2: BUS RESET SUCCEEDED nexus=2:5:0.
[91645.288262] qla2xxx [0000:08:00.1]-8018:2: ADAPTER RESET ISSUED nexus=2:5:0.
[91645.288269] qla2xxx [0000:08:00.1]-00af:2: Performing ISP error recovery - ha=ffff88085b88c000.
[91645.814515] qla2xxx [0000:08:00.1]-500a:2: LOOP UP detected (8 Gbps).
[91645.996339] qla2xxx [0000:08:00.1]-8017:2: ADAPTER RESET SUCCEEDED nexus=2:5:0.
[91651.340789] rport-2:0-6: blocked FC remote port time out: removing target and saving binding
[91655.998258] scsi 2:0:5:0: scsi scan: 70 byte inquiry failed. Consider BLIST_INQUIRY_36 for this device
[91655.998266] scsi 2:0:5:0: rejecting I/O to offline device
[91661.325667] rport-2:0-1: blocked FC remote port time out: removing target and saving binding
[91661.326500] st 2:0:0:0: rejecting I/O to offline device
[91661.327157] rport-2:0-2: blocked FC remote port time out: removing target and saving binding
[91661.327192] st 2:0:0:0: [st0] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328023] st 2:0:1:0: rejecting I/O to offline device
[91661.328835] st 2:0:1:0: [st1] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328838] rport-2:0-3: blocked FC remote port time out: removing target and saving binding
[91661.328841] st 2:0:2:0: rejecting I/O to offline device
[91661.328849] rport-2:0-4: blocked FC remote port time out: removing target and saving binding
[91661.328851] st 2:0:3:0: rejecting I/O to offline device
[91661.328855] st 2:0:2:0: [st2] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328859] rport-2:0-5: blocked FC remote port time out: removing target and saving binding
[91661.328862] st 2:0:4:0: rejecting I/O to offline device
[91661.328890] st 2:0:3:0: [st3] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328959] st 2:0:4:0: [st4] Error 10000 (driver bt 0x0, host bt 0x1).
[91725.356397] scsi 2:0:5:0: [st5] Error on write filemark.
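A log like the one above is easier to read once it is reduced to the remote ports that timed out. The sketch below only matches the "blocked FC remote port time out" message text seen in this log; the function name timed_out_rports is made up for illustration.

```shell
# Summarise which FC remote ports hit "blocked FC remote port time out".
# Reads kernel log text on stdin and prints "<rport> <count>" per port.
# Usage: dmesg | timed_out_rports
timed_out_rports() {
    sed -n 's/.*\(rport-[0-9:-]*\): blocked FC remote port time out.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

In the log above, rport-2:0-6 times out first, and about 100 seconds later every other remote port on the same HBA follows. A single misbehaving port would not take down all drives at once, so this pattern points at the fabric or the link.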
03-07-2018 04:18 PM
I've now decided I'm done with this Qlogic Sanbox 5800.
Replacing it with a spare Brocade 300 we have.
I will update progress.
03-08-2018 06:34 AM - edited 03-08-2018 06:35 AM
Looks like FC issues to me - device lost.
blocked FC remote port time out: removing target and saving binding
I strongly recommend "one HBA to one drive in one zone", as this will prevent an FC LIP or SCSI bus reset from propagating from one drive to another. An HBA can be a member of multiple zones.
Not following this practice has caused issues for me on even the simplest configurations.
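As an illustration of that layout, the sketch below prints one zone per HBA/drive pair. It assumes Brocade Fabric OS zoning commands (zonecreate, cfgcreate, cfgsave, cfgenable); other switches use different syntax. The function name, the zone names and any WWPNs passed in are hypothetical placeholders. The function only prints the commands and never touches a switch.

```shell
# Emit Brocade FOS zoning commands: one zone per (HBA, drive) pair.
# Usage: single_initiator_zones <cfg_name> <hba_wwpn> <drive_wwpn>...
single_initiator_zones() (
    cfg="$1"; hba="$2"; shift 2
    n=1; zones=""
    for drive in "$@"; do
        zone="z_hba_drive${n}"                # hypothetical naming scheme
        printf 'zonecreate "%s", "%s; %s"\n' "$zone" "$hba" "$drive"
        zones="${zones:+$zones; }$zone"       # build "z1; z2; ..." member list
        n=$((n + 1))
    done
    printf 'cfgcreate "%s", "%s"\n' "$cfg" "$zones"
    printf 'cfgsave\ncfgenable "%s"\n' "$cfg"
)
```

Six drives on one HBA port give six two-member zones, with the HBA present in all of them. A second server gets its own six zones in the same way.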
Best Regards
Nicolai
03-08-2018 11:16 AM
Another question related to FC setup.
I'm doing your zoning advice.
Port type... GL or FL... Right now they are set to FL... is this right?
03-09-2018 12:14 AM
F-port is best. Try rebooting a tape drive with the FC switch ready. I recall that LTO tape drives try F-port login first, and if that fails they revert to NL. A loop is the "worst" kind of port, since all devices attached to a loop will get interrupted by a LIP (loop initialization procedure) or SCSI bus reset.
FL = fabric loop port
G = generic port
03-09-2018 08:39 AM
Thanks!!!
So, to follow up: I replaced the QLogic SANbox switch with a Brocade 300, and successfully put 25TB on tape last night.
03-12-2018 01:01 AM
Thanks for the update - glad you solved the problem.