Problem after a power down and up.

o1crimson1o · ‎03-05-2018

A few weeks ago we had to power down the machine room. After powering back up, I can only see 4 of the six LTO7 drives.

I removed the robot and drives and we reran the device wizard.

It finds all six drives but says 2 of them are under control of a remote host.

I am really confused how that happened and even more confused how I fix it.

Amol_Nair · ‎03-05-2018

What is the OS on the media server?

Can you run the "scan" command on the media server and verify that it shows all the drives at the OS level? If scan does not show all of them, you need to check the zoning of these drives.

If scan lists all the drives, we could then try running the tpconfig -dev_ping command on the drive path.

The screenshot says shared drive for all other drives, so please make sure you are checking on the correct media server.

Marianne · ‎03-05-2018

Seems this particular server (nbu2?) has 'lost' 2 tape drives at OS-level.
Other media server(s) sharing these tape drives can still see them.

So, troubleshooting needs to start at OS-level.
You selected Linux as OS - so, if this server is Linux, then you can use this command to list tape drives seen at OS-level:
cat /proc/scsi/scsi

'scan' is good, as it not only lists the OS devices but also reports only on devices that respond to SCSI commands.
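
To compare what each server sees at OS level, the tape drives in the /proc/scsi/scsi listing can be counted rather than read by eye. This is only a minimal sketch: the function names count_tape_drives and list_tape_drives are made up for illustration, and it assumes tape drives are reported with the type "Sequential-Access", as in the listings posted in this thread.

```shell
# Count tape drives (type "Sequential-Access") in a /proc/scsi/scsi style listing.
# The optional argument is a file to read; it defaults to /proc/scsi/scsi.
count_tape_drives() {
    grep -c 'Type:[[:space:]]*Sequential-Access' "${1:-/proc/scsi/scsi}"
}

# Print the "Host: ... Channel: ... Id: ... Lun: ..." line of each tape drive,
# so the SCSI addresses on two servers can be compared side by side.
list_tape_drives() {
    awk '/^[[:space:]]*Host:/ { host = $0 }
         /Type:[[:space:]]*Sequential-Access/ { print host }' "${1:-/proc/scsi/scsi}"
}
```

Running count_tape_drives on each server after every change (cabling, zoning, reboot) gives a quick number to compare against the six drives expected.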


Nicolai · ‎03-06-2018

This may be a SCSI reservation that was not released, which matches the power drop scenario. If a tape drive has an active SCSI reservation, it will look like it's attached to a remote host. Depending on the OS, there are tools to break the SCSI reservation; alternatively, reboot the tape drives themselves.

o1crimson1o · ‎03-06-2018

Even more fun now.

The switch (QLogic SANbox 5800) can see all the drives.

Each host, NBU1 (master) and NBU2 (media), only sees 3 drives.

I've tried rescanning the SCSI bus, and I've tried new cables, new optics, new ports on the switch, and different HBAs.

I'm completely at a loss and now at the point of trying a new switch. But I don't think it's the switch.

In the attached picture of the switch GUI, ports 1 and 2 are the hosts and the rest are the tape drives.

[root@nbu2 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: HP Model: P440ar Rev: 6.06
Type: RAID ANSI SCSI revision: 05
Host: scsi1 Channel: 01 Id: 00 Lun: 00
Vendor: HP Model: LOGICAL VOLUME Rev: 6.06
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 01
Vendor: QUANTUM Model: Scalar i6000 Rev: 745Q
Type: Medium Changer ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 02 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
[root@nbu2 ~]#

[root@nbu1 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: HP Model: P440ar Rev: 6.06
Type: RAID ANSI SCSI revision: 05
Host: scsi1 Channel: 01 Id: 00 Lun: 00
Vendor: HP Model: LOGICAL VOLUME Rev: 6.06
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi1 Channel: 01 Id: 00 Lun: 01
Vendor: HP Model: LOGICAL VOLUME Rev: 6.06
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: ULTRIUM-TD7 Rev: H5B2
Type: Sequential-Access ANSI SCSI revision: 06
Host: scsi2 Channel: 00 Id: 01 Lun: 01
Vendor: QUANTUM Model: Scalar i6000 Rev: 745Q
Type: Medium Changer ANSI SCSI revision: 03
[root@nbu1 ~]#

o1crimson1o · ‎03-06-2018

So it seems now it's settled on 3 drives showing up on each machine.

I'm really and utterly completely lost right now. Nothing I can try seems to make any more drives show up.

The switch looks OK. I believe this is an OS thing but I just don't know what to do.

o1crimson1o · ‎03-06-2018

Another reboot, and now it's 4 drives, with the same error as before: the robot has 6 drives but 2 are in use on another system.

*puts head on desk*

Marianne · ‎03-07-2018

I remember some years ago that devices on a SAN had to be booted up in a certain order.
Not sure what this order is and if it's still applicable.
Hopefully others with more current experience will advise.

Have you tried to reboot the tape library?

As long as /proc/scsi/scsi and 'scan' do not see all tape drives, forget about NBU device wizard.

*** Edit ***

I notice that you have tried many things:

I've tried rescanning the scsi bus, I've tried new cables, new optics, new ports on the switch, different HBAs.

Did you redo zoning when making switch and/or HBA changes?

Are all tape drives connected to the same HBA on the server? Did you rescan all host adapter IDs?
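
For the rescan question, a minimal sketch of rescanning every SCSI host adapter on Linux is shown here. It relies on the sysfs "scan" file under /sys/class/scsi_host, where writing "- - -" asks for all channels, targets and LUNs to be rescanned. The function name rescan_scsi_hosts and its optional directory argument are made up for illustration, and it needs to be run as root.

```shell
# Trigger a rescan on every SCSI host adapter that exposes a writable "scan" file.
# The optional argument is the sysfs directory to walk; it defaults to
# /sys/class/scsi_host and exists mainly so the function can be tried on a copy.
rescan_scsi_hosts() {
    local base="${1:-/sys/class/scsi_host}"
    local host
    for host in "$base"/host*; do
        # Skip adapters without a writable scan file (or an unmatched glob).
        [ -w "$host/scan" ] || continue
        # "- - -" is a wildcard for channel, target and LUN.
        echo "- - -" > "$host/scan"
        echo "rescanned ${host##*/}"
    done
}
```

After calling rescan_scsi_hosts, check /proc/scsi/scsi again to see whether the missing drives appeared. A rescan only helps if the HBA can already see the drives on the fabric.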


Nicolai · ‎03-07-2018

Have you tried using tools like "QLogic SANsurfer" to list which WWNs the fibre channel adapter can see on the master and media servers? There may be a difference between what the OS and the HBA can see.

Is it possible some sort of robot partition has become active after the power down?

In the screen dump you attached from the switch there is a column named "active zones", but none are listed. Is this intended?

o1crimson1o · ‎03-07-2018

The switch is unzoned as it's the ONLY thing ever going to be on this switch so I was told leaving it unzoned is fine.

But I've also tried all the things above with it zoned (and still no go).

After removing the robot and re-adding it for the 123098129th time, it found all six drives. I could back up 10-15GB and restore with no problem.

So we restarted the 25TB backup. And after 30GB....

[91561.348896] rport-2:0-6: blocked FC remote port time out: removing target and saving binding
[91615.066704] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:5:0 -- 1 2002.
[91615.067870] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:5:0 -- 1 2002.
[91615.067880] qla2xxx [0000:08:00.1]-8009:2: DEVICE RESET ISSUED nexus=2:5:0 cmd=ffff8807e269d340.
[91635.258200] qla2xxx [0000:08:00.1]-800c:2: do_reset failed for cmd=ffff8807e269d340.
[91635.259153] qla2xxx [0000:08:00.1]-800f:2: DEVICE RESET FAILED: Task management failed nexus=2:5:0 cmd=ffff8807e269d340.
[91635.259157] qla2xxx [0000:08:00.1]-8009:2: TARGET RESET ISSUED nexus=2:5:0 cmd=ffff8807e269d340.
[91635.261900] qla2xxx [0000:08:00.1]-800e:2: TARGET RESET SUCCEEDED nexus:2:5:0 cmd=ffff8807e269d340.
[91635.262600] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:5:0 -- 1 2002.
[91635.262603] qla2xxx [0000:08:00.1]-8012:2: BUS RESET ISSUED nexus=2:5:0.
[91635.286515] qla2xxx [0000:08:00.1]-802b:2: BUS RESET SUCCEEDED nexus=2:5:0.
[91645.288262] qla2xxx [0000:08:00.1]-8018:2: ADAPTER RESET ISSUED nexus=2:5:0.
[91645.288269] qla2xxx [0000:08:00.1]-00af:2: Performing ISP error recovery - ha=ffff88085b88c000.
[91645.814515] qla2xxx [0000:08:00.1]-500a:2: LOOP UP detected (8 Gbps).
[91645.996339] qla2xxx [0000:08:00.1]-8017:2: ADAPTER RESET SUCCEEDED nexus=2:5:0.
[91651.340789] rport-2:0-6: blocked FC remote port time out: removing target and saving binding
[91655.998258] scsi 2:0:5:0: scsi scan: 70 byte inquiry failed. Consider BLIST_INQUIRY_36 for this device
[91655.998266] scsi 2:0:5:0: rejecting I/O to offline device
[91661.325667] rport-2:0-1: blocked FC remote port time out: removing target and saving binding
[91661.326500] st 2:0:0:0: rejecting I/O to offline device
[91661.327157] rport-2:0-2: blocked FC remote port time out: removing target and saving binding
[91661.327192] st 2:0:0:0: [st0] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328023] st 2:0:1:0: rejecting I/O to offline device
[91661.328835] st 2:0:1:0: [st1] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328838] rport-2:0-3: blocked FC remote port time out: removing target and saving binding
[91661.328841] st 2:0:2:0: rejecting I/O to offline device
[91661.328849] rport-2:0-4: blocked FC remote port time out: removing target and saving binding
[91661.328851] st 2:0:3:0: rejecting I/O to offline device
[91661.328855] st 2:0:2:0: [st2] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328859] rport-2:0-5: blocked FC remote port time out: removing target and saving binding
[91661.328862] st 2:0:4:0: rejecting I/O to offline device
[91661.328890] st 2:0:3:0: [st3] Error 10000 (driver bt 0x0, host bt 0x1).
[91661.328959] st 2:0:4:0: [st4] Error 10000 (driver bt 0x0, host bt 0x1).
[91725.356397] scsi 2:0:5:0: [st5] Error on write filemark.
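
A log like the one above is easier to read once the remote port timeouts are tallied per port. The following is a rough sketch only: the function name summarize_fc_timeouts is made up, and it assumes the kernel messages use the "rport-H:C-N: blocked FC remote port time out" wording seen in this thread.

```shell
# Count "blocked FC remote port time out" messages per remote port.
# Reads kernel log text from the files given as arguments, or from stdin
# if no file is given (for example: dmesg | summarize_fc_timeouts).
summarize_fc_timeouts() {
    cat "$@" |
        grep 'blocked FC remote port time out' |
        # Keep only the remote port name, e.g. rport-2:0-6.
        sed -E 's/.*(rport-[0-9]+:[0-9]+-[0-9]+):.*/\1/' |
        sort | uniq -c |
        # Print "port count" instead of uniq's "count port".
        awk '{ print $2, $1 }'
}
```

If every remote port on the adapter shows up in the tally at about the same timestamp, that points at something shared (switch, HBA, cabling to the switch) rather than at one tape drive.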

o1crimson1o · ‎03-07-2018

I've now decided I'm done with this QLogic SANbox 5800.

Replacing it with a spare Brocade 300 we have.

I will update progress.

Nicolai · ‎03-08-2018

Looks like FC issues to me - device lost.

blocked FC remote port time out: removing target and saving binding

I strongly recommend "One HBA to one drive in one zone", as this will prevent an FC LIP or SCSI bus reset from propagating from one drive to the others. An HBA can be a member of multiple zones.

Not following this practice has caused issues for me on even the simplest configurations.

Best Regards

Nicolai
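
As a rough illustration of the "one HBA to one drive in one zone" layout described above, the sketch below prints one zone per drive, each containing the same HBA port and exactly one drive port. The function name print_single_drive_zones, the zone naming scheme and the output format are all made up for illustration; this is a neutral listing, not the command syntax of any particular switch.

```shell
# Print one zone per (HBA, drive) pair: each zone holds exactly one
# initiator (the HBA port) and one target (a tape drive port).
# Usage: print_single_drive_zones HBA_NAME HBA_WWPN DRIVE_WWPN...
print_single_drive_zones() {
    local hba_name="$1" hba_wwpn="$2"
    shift 2
    local n=1 drive_wwpn
    for drive_wwpn in "$@"; do
        # Hypothetical naming scheme: z_<hba>_drive<N>
        printf 'z_%s_drive%d: %s; %s\n' "$hba_name" "$n" "$hba_wwpn" "$drive_wwpn"
        n=$((n + 1))
    done
}
```

With six drives and two servers this gives twelve small zones. The HBA port appears in six of them, which is fine since an HBA can be a member of multiple zones.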

o1crimson1o · ‎03-08-2018

Another question related to FC setup.

I'm doing your zoning advice.

Port type... GL or FL... Right now they are set to FL... is this right?

Nicolai · ‎03-09-2018

F-port is best. Try rebooting a tape drive while the FC switch is up and ready. As I recall, LTO tape drives try an F-port login first, and if that fails they revert to NL. A loop is the "worst" kind of port, since all devices attached to a loop will get interrupted by a LIP (loop initialization procedure) or SCSI bus reset.

FL = fabric loop port
G = generic port

o1crimson1o · ‎03-09-2018

Thanks!!!

So, to follow up: I replaced the QLogic SANbox switch with a Brocade 300 and successfully put 25TB on tape last night.

Nicolai · ‎03-12-2018

Thanks for the update - glad you solved the problem.
