Solved: master and media servers lose tape drives after library reboot

tythecellist · ‎08-26-2015

This sounds like a dumb question, but I have looked high and low for a solution here in the forums as well as on various vendor sites. Nothing. The basic problem is that our master and media servers lose connectivity to the robot and tape drives within a library if the library goes offline (say, for a reboot or even a hot-swap replacement of a controller board).

Our NBU 7.6.0.4 environment consists of a master server and 22 media servers, all running RHEL 6.6. A number of the media servers, and the master server, are zoned to a Quantum i6000 tape library with 26 LTO5 drives. We're using Emulex 8Gb HBAs throughout.

When loss of connectivity to our drives and robot occurs, the only effective resolution we've found is to reboot all 23 of those hosts, which is obviously disruptive to our users (particularly DBAs, whose frequent backup jobs take a hit when there's no server to back up to) and time-consuming for our server team as well.

Upon reboot, all the drives return to normal operational status and are visible from the hosts we expect to see them ... device paths remain the same, etc.

Our belief is that the servers should automatically see the devices when the library (and its drives) return to normal operational status, but this isn't happening. Is this expected behavior? Seems to us that we ought to be capable of replacing a hot-swap controller board (or even rebooting the library) without bringing the entire environment down.

Thoughts?

RiaanBadenhorst · ‎08-27-2015

Hi,

Have you looked into the OS / SCSI layer to see what is happening there? Where is the issue, with the robot / library connection or the drive connection? How have you zoned the drives? Is everything visible to everything, or have you assigned drives to groups of servers?

NetBackup relies on what the OS provides, so if the OS loses the connection, NetBackup is pretty much stuck.

tythecellist · ‎08-27-2015

Thanks Riaan. Yeah, we've looked some at the OS/SCSI layer (though we'll be doing more in the coming days...) and yes, the drives and robot are zoned to groups of servers.

Grateful for your reply though!

sdo · ‎08-27-2015

The topic for you to research is "persistent binding" - which should guarantee consistency across multiple resets/reboots at any point in the multiple (7?) layers of the device stack.

tythecellist · ‎08-27-2015

Thank you, sdo. I have wondered about that, though I've asked that question of my Linux team previously and been assured that this is in place. (It may be time to corroborate that with some digging of my own.)

sdo · ‎08-28-2015

A similar recent question, with some useful notes/links:

https://www-secure.symantec.com/connect/forums/missing-drive-path-3

Nicolai · ‎08-28-2015

How is the SAN zoning done?

You should have the SAN HBA and one drive in one zone.

Remember that a SAN HBA can be a member of multiple zones.

It does mean a lot of zones, I know. But that also prevents noise from one drive interfering with the other tape drives' work. That could be what you are seeing if all drives are members of one zone or the like.

sdo · ‎08-28-2015

FYI - This response/posting/entry is regarding SAN FC, and does not refer to SAN iSCSI...

.

IMO, usually, "single initiator zoning" is enough:

https://www.google.co.uk/?gws_rd=ssl#q=single+initiator+zoning

...which, as Nicolai has pointed out, can lead to several zones - but is best practice.

Personally, I have never had to create one zone per "one server initiator HBA port and one tape drive target" - i.e. I have never had a problem with zones containing one initiator and either multiple tape targets or multiple disk targets - but never mix disk and tape targets in the same zone. If you ever do mix tape and disk targets in the same zone, then one day you'll have a nasty surprise and then realise why no-one ever does that :\

.

Initiators are usually server HBA ports.

Targets are usually storage devices (tape drives, disk arrays).

.

However, use of NetBackup FT SAN Media Servers would generate a topology where a NetBackup FT Client has one or more SAN HBA ports in the default mode of "initiator mode", and a NetBackup FT SAN Media Server has one or more SAN HBA ports modified to be in "target mode".

sdo · ‎08-28-2015

Here's something else to try next time it happens, for one of the servers... instead of rebooting it... what you could try is to temporarily close the SAN switch port(s) facing the media server HBA port(s) that are zoned to the i6000 tape library IOB port(s) or drive(s) (directly), wait 60 seconds, and then re-open the SAN switch port(s). This should cause the FC driver on the media server to perform a fabric login (FLOGI) and port login (PLOGI) again and then re-discover targets.

What your problem appears to be is an inconsistency, or timing issue, whereby when the i6000 "drops" off the SAN, and later comes back, then a signal is being lost somewhere. One place to look is in the SAN switch "port logs" and look for SCN (State Change Notification) events - which is where an initiator, or a target, can inform each other of "events" within the zone.

Also, instead of rebooting, you may be able to issue an Emulex related command on the RHEL servers which performs a "reset" of particular FC HBA port(s) - and force the FC port to re-logon to the fabric.
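One generic way to attempt the HBA port reset described above on Linux is the issue_lip attribute that the FC transport class exposes under /sys/class/fc_host. This is only a sketch, and it is not the Emulex-specific command the post has in mind. The function name reset_fc_hosts and the SYSFS_ROOT override are hypothetical, added for illustration and dry runs. It needs root, and a LIP can disrupt in-flight I/O on the port, so assume it should only be run on a server with no active backups.

```shell
# Sketch: ask every FC HBA port to re-initialise its link and log back in.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
reset_fc_hosts() {
    root="${SYSFS_ROOT:-/sys}"
    for lip in "$root"/class/fc_host/host*/issue_lip; do
        # Skip the unexpanded pattern when no FC hosts are present.
        [ -e "$lip" ] || continue
        echo "Issuing LIP on ${lip%/issue_lip}"
        echo 1 > "$lip"
    done
}
```

Calling reset_fc_hosts with no arguments acts on every FC port in the server. To limit it to one port, write to that port's issue_lip file directly.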

.

And do check that single initiator zoning is in place - here's just one reason (of several) why... because some SCSI "reset" events issued by a server will cause all SCSI devices on the same SCSI bus to reset. Remember SCSI is carried over FC, and so multiple tape drives can appear, in SCSI terms, to be on the same bus. So, if you have multiple server initiator HBAs with visibility of other server initiator HBAs and/or other tape drives that they shouldn't have visibility of... then some SCSI reset events can interfere with other servers and other SCSI devices (i.e. other FC targets).

It is possible that Nicolai's advice of single initiator and single target zoning is the best advice - and that my advice of single initiator but multiple target zoning is not the best advice - because... if any one server HBA port, or any one SAN switch port, or any one Scalar i6000 IOB port, or any one tape drive is creating noise in the zone - then this could be causing drops and resets - and so the only way to isolate the problem may be to implement single initiator and single target zoning.

Also, check your SFP light levels from the SAN switch side. HP themselves recommend a signal loss level of no greater than -12dB.

Also, try clearing the SAN switch port error counters - and look for accumulating errors, and/or LLI (link loss indicator?) - i.e. unexpected events of port links going down.

Also, some sites like to ensure that their SAN switch ports do not speed negotiate - i.e. they set their SAN switch ports to a fixed speed - in an attempt to know sooner when a link is beginning to have a problem - and to avoid strange issues of speeds silently reducing.
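The advice above is about counters on the switch side. As a complement, the Linux FC transport class keeps similar per-port counters on the host under /sys/class/fc_host/hostN/statistics. A minimal sketch for dumping a few of them follows. The function name show_fc_link_errors and the SYSFS_ROOT override are hypothetical, and the values are printed raw, exactly as the kernel reports them.

```shell
# Sketch: print a few host-side FC link error counters for every HBA port.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
show_fc_link_errors() {
    root="${SYSFS_ROOT:-/sys}"
    for stats in "$root"/class/fc_host/host*/statistics; do
        [ -d "$stats" ] || continue
        host="${stats%/statistics}"
        host="${host##*/}"
        for counter in link_failure_count loss_of_sync_count \
                       loss_of_signal_count invalid_crc_count; do
            # Not every driver exposes every counter; skip missing ones.
            [ -r "$stats/$counter" ] || continue
            read -r value < "$stats/$counter"
            printf '%s\t%s\t%s\n' "$host" "$counter" "$value"
        done
    done
}
```

Running it before and after a library reboot, and comparing the two outputs, should show whether the host itself saw the link drop.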

.

You definitely should not need to reboot all servers - that's just plain nuts - so, in summary you may have an interoperability issue which is going to require all parties to work together, you the end-user... plus Emulex, RHEL, Brocade and Quantum.

Nicolai · ‎08-31-2015

LOL - we have a philosopher among us :)

tythecellist · ‎09-02-2015

Thanks, all. This is very helpful information to have.

Persistent binding sure seems to be the favored response here, so we'll dig into that more to understand what impact enabling that would have on our environment.

(For example, did I read somewhere above [or in another forum post?] that enabling persistent binding results in a different device path for each discovered tape drive, as well as the robotic control path?)

I'm grateful for everybody here who's chimed in ... thanks!

Nicolai · ‎09-03-2015

Also take a look at how SAN zoning is performed.

HBA -> Drive   (GOOD)

HBA -> Drive   (BAD)
    -> Drive

Zoning on the SAN should be one drive, one HBA in one zone.

sdo · ‎09-03-2015

You already have a different system OS path for each drive. Your problem appears to be that the tape drives keep hopping around the OS 'paths', i.e. the paths are not persistent across reboots, FC resets, path failover, SCSI resets, etc...

Yes, it is very likely that one final reboot will be required after configuring persistent binding, followed by 'path' discovery and possibly device re-configuration one last time - and then that should be the last of it.

If you still have problems after configuring persistent binding, then your issue would appear to be a multi-vendor interoperability issue.

Marianne · 09-04-2015 (accepted solution)

My thoughts...

Library maintenance should be few and far between - unless you have an old, 'flaky' robot that needs repairs way too often.

IMHO - it is perfectly normal that the OS will lose connectivity to devices when the library is rebooted.

A rescan at OS level should ideally be all that is needed to restore device access, not a reboot.

Persistent binding is only needed to ensure that device paths remain the same across reboots, but it seems that you do not have an issue with changed device paths:

"... device paths remain the same, etc."

Have you tried to rescan before rebooting media servers?

tythecellist · ‎09-04-2015

Thank you Marianne and others.

I'm not 100% sure that we know the correct process for doing the rescan, however in the last hour I've done some poking around and found that my devices are all visible from within

/sys/class/fc_transport/target2:0:0/device/

... and that

* cat'ing /sys/class/fc_transport/target2:0:0/device/2:0:0:0/vendor yields "QUANTUM", and

* cat'ing /sys/class/fc_transport/target2:0:0/device/2:0:0:0/model yields "Scalar i6000"

... and that cat'ing other devices such as 2:0:0:1/vendor and ./model yields "IBM" and "ULTRIUM-TD5" as expected.

Since this matches our library's vendor, model and tape drive vendor/model configuration, I'm led to believe I'm looking in the right place.
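The manual cat'ing above can be looped over every device under the same sysfs path. The following is a minimal sketch that assumes the layout quoted in this post (/sys/class/fc_transport/target*/device/H:C:T:L). The function name list_fc_tape_devices and the SYSFS_ROOT override are hypothetical, added for illustration.

```shell
# Sketch: list address, vendor and model for every SCSI device behind an FC target.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
list_fc_tape_devices() {
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$root"/class/fc_transport/target*/device/*:*:*:*; do
        [ -r "$dev/vendor" ] || continue
        # read strips the trailing padding that sysfs leaves on these fields.
        read -r vendor < "$dev/vendor"
        read -r model < "$dev/model"
        printf '%s\t%s\t%s\n' "${dev##*/}" "$vendor" "$model"
    done
}
```

Saving the output while everything is healthy gives a baseline to compare against after the library comes back, which should show exactly which targets the host has lost.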

If I need to rescan the host in question, what is the proper procedure for doing so? I've heard a lot of different recommendations from coworkers but I'm unsure of the authoritative answer.

Thanks!

RiaanBadenhorst · ‎09-04-2015

To initiate a SCSI bus rescan, type

echo "- - -" > /sys/class/scsi_host/hostX/scan

where X stands for the number of the SCSI host (HBA port) you want to scan.
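If it is not obvious which hostX belongs to the tape HBA, the same write can be looped over every SCSI host. This is a minimal sketch. The function name rescan_scsi_hosts and the SYSFS_ROOT override are hypothetical, added for illustration, and it has to run as root.

```shell
# Sketch: ask every SCSI host to rescan all channels, targets and LUNs.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
rescan_scsi_hosts() {
    root="${SYSFS_ROOT:-/sys}"
    for scan in "$root"/class/scsi_host/host*/scan; do
        # Skip the unexpanded pattern when no SCSI hosts are present.
        [ -e "$scan" ] || continue
        echo "Rescanning ${scan%/scan}"
        # The three dashes are wildcards for channel, target and LUN.
        echo "- - -" > "$scan"
    done
}
```

After it finishes, the drives and robot should be visible again in the OS device listing if the rescan was enough. If they are not, the problem probably sits lower down, at the FC link or fabric login.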

areznik · ‎09-04-2015

Are you on HP servers? You might already have the hp_rescan utility, which can help with the grunt work of rescanning HBAs on dozens of servers.

As someone already mentioned, you may need to disable and re-enable your switch ports to force a rescan. I think this has more to do with the FC switch than the OS or tape drive/library.

tythecellist · ‎09-04-2015

Thanks Riaan! I had previously heard it was

echo "- - -" > /sys/class/fc_host/hostX/scan (fc_host, not scsi_host as you've suggested)

... but I've never had any luck making the OS "re-see" the drives/robot using this command.

I'll give this a shot. I guess I'm unclear why I'd use scsi_host and not fc_host, given that they're FC tape drives plugged in to I/O blades that are then FC-connected into a switch.

But thanks for the info anyway!
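On the scsi_host versus fc_host question: as far as I know the scan file is an attribute of scsi_host, while fc_host exposes link-level controls such as issue_lip instead, which would explain why the fc_host version of the command never did anything. Rather than take that on trust, a host can be checked directly. A minimal sketch follows. The function name show_host_attrs and the SYSFS_ROOT override are hypothetical, added for illustration.

```shell
# Sketch: for each FC HBA port, report which rescan/reset attributes exist.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
show_host_attrs() {
    root="${SYSFS_ROOT:-/sys}"
    for fc in "$root"/class/fc_host/host*; do
        [ -d "$fc" ] || continue
        host="${fc##*/}"
        for attr in "class/fc_host/$host/issue_lip" \
                    "class/fc_host/$host/scan" \
                    "class/scsi_host/$host/scan"; do
            if [ -e "$root/$attr" ]; then state=present; else state=absent; fi
            printf '%s\t%s\n' "$attr" "$state"
        done
    done
}
```

Each FC port shows up under both classes with the same hostN number, so the scsi_host path is simply the SCSI-layer view of the same HBA port.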

tythecellist · ‎09-04-2015

hey Areznik, no, we're not on HP servers. They're Dell 710s and 910s.

RiaanBadenhorst · ‎09-04-2015

Each OS might have it listed slightly differently but if you explore the folder structure you'll find your current devices and therefore know which ones to target with the echo.
