Tape drives going down ...suspect SAN ???

Jim-90
Level 6

Why I think it's the SAN

Two identical libraries (STK SL3000, 5x HP LTO5) with all tape drives shared amongst the media servers.   Only one ATL has problems with drives being downed.

  • All media servers have HBA static binding ... no issues there.
  • Nothing wrong with the Windows OS tape or robotic drivers ... if there was, both ATLs would be affected.
  • No errors could be found in the library logs. Everything looked normal from a physical inspection of the ATL.
  • Cleaning of tape drives is automated and infrequent ... a manual clean didn't help.
  • Power cycling the tape drives didn't help. That would have cleared any SCSI locks.
  • Wasted a lot of time in the BPTM logs ... they are barely human readable ... leaving that one up to Support.
  • Only one tape library is affected. If the problem was on the media server then the second tape library would also be affected.

The ATL has been behaving itself for the last three years. First one tape drive became a problem, and now it's spreading to the other tape drives. With the ATL looking and reporting as OK, the media servers OK, and only one ATL affected, can I assume the problem is very likely to be the SAN?

Assumptions

  • The robotic control SAN path is responsible only for tape movements, library inventories and possibly library status reporting.
  • The tape drive SAN connection carries the data, tape control (positioning) and tape status reporting.

If these assumptions are true, then errors on each of the tape drive SAN connections lead me to assume a SAN issue. If all the reporting goes via the robotic SAN path and the tape drive connection is data only, then it could be a library problem, a SAN problem, or both.
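
The elimination argument above can be written down as a small check. This is only a sketch of the reasoning, not a diagnostic tool: every drive, switch, port and server name below is hypothetical, and the real SAN topology is not described in this thread. It lists the components that every failing drive depends on and that no healthy drive uses.

```python
def suspects(paths, failing):
    """Return components shared by all failing drives and used by no healthy drive.

    paths maps a drive name to the set of (kind, name) components on its path.
    """
    failing = set(failing)
    if not failing:
        return []
    healthy = set(paths) - failing
    # Components that sit on the path of every failing drive.
    common = set.intersection(*(set(paths[d]) for d in failing))
    # Components that at least one healthy drive also depends on.
    used_by_healthy = set().union(*(set(paths[d]) for d in healthy))
    return sorted(common - used_by_healthy)


# Hypothetical inventory: all names are made up for illustration.
paths = {
    "atl1-d1": {("library", "ATL1"), ("switch", "sw1"), ("port", "sw1/1"), ("server", "ms1")},
    "atl1-d2": {("library", "ATL1"), ("switch", "sw1"), ("port", "sw1/2"), ("server", "ms1")},
    "atl2-d1": {("library", "ATL2"), ("switch", "sw1"), ("port", "sw1/3"), ("server", "ms1")},
}

print(suspects(paths, ["atl1-d1", "atl1-d2"]))
```

The result is only as good as the inventory: a component that is left out (for example a shared inter-switch link or port blade) can never show up as a suspect.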

 

5 REPLIES

Michael_G_Ander
Level 6
Certified

What do the SAN switch logs show?

Have had similar issues that could be traced back to a faulty fiber cable or SFP.

 

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

mph999
Level 6
Employee Accredited

It's nice to see a reasoned argument for/against where the issue is, rather than the usual 'let's blame NBU'.

The bptm log may be helpful. You are just looking for the line(s) where the drive has some problem (providing we get that far), e.g.:

Does a tape get mounted successfully?

Do we start to write?

Do we finish writing? Is there an error on the eject?

Searching for 'error' or <16> can be helpful
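
A minimal sketch of that search, for anyone who would rather not scroll through bptm by hand. It assumes the log is a plain text file; the function names are made up here, and the path in the usage comment is a placeholder, not a real location. It simply returns every line that contains `<16>` or the word "error" (case-insensitive), together with its line number.

```python
import re

# Match the <16> marker or the word "error" in any letter case.
PATTERN = re.compile(r"<16>|error", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) for every line that matches PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return find_problem_lines(handle)


# Usage (placeholder path, substitute the bptm log you want to check):
#   for number, text in scan_file("path/to/bptm/log"):
#       print(number, text)
```

Matching on "error" will also pick up harmless lines that merely mention the word, so treat the output as a list of places to start reading, not as a list of faults.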

/usr/openv/netbackup/db/media/errors files (from each media server) can sometimes be useful.

Activity monitor usually shows the error, and if we're lucky, the pid of the log involved, along with the time.

Is the drive disappearing from the SAN? (The messages file or event logs should show this.)

Generally speaking, NBU doesn't cause drive issues. The majority of tape operations (reads, writes) are done by the OS, not NBU. We do send various SCSI commands (e.g. to position), but even then the underlying operations go via the OS. It's not impossible for NBU to cause an issue, but it's rare.

Marianne
Moderator
Partner    VIP    Accredited Certified

Have you checked Event Viewer System log?

Errors from Activity Monitor are now mostly "positioning" errors (Status code 96), occurring early in the backup.

Media servers with event logs show consistent IO errors of the form:

Log Name:      System
Source:        hplto
Date:          3/04/2017 5:10:30 PM
Event ID:      7
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      media_server_FQDN
Description:
The device, \Device\Tape0, has a bad block.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="hplto" />
    <EventID Qualifiers="49156">7</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2017-04-03T07:10:30.278322600Z" />
    <EventRecordID>720917</EventRecordID>
    <Channel>System</Channel>
    <Computer>media_server_FQDN</Computer>
REST DELETED.
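
To see whether the errors cluster on particular drives, the entries can be tallied per device. The sketch below assumes the System log has been exported as text in the same layout as the entry above, and only looks at the "has a bad block" description line; the function name is made up for illustration.

```python
import re
from collections import Counter

# Captures the device name from lines such as:
#   The device, \Device\Tape0, has a bad block.
DEVICE_RE = re.compile(r"The device, (\\Device\\\w+), has a bad block")


def count_bad_block_events(text):
    """Count 'bad block' entries per device in exported event log text."""
    return Counter(DEVICE_RE.findall(text))


# Usage, with exported_text holding the contents of one media server's export:
#   for device, count in count_bad_block_events(exported_text).most_common():
#       print(device, count)
```

Device names like \Device\Tape0 are local to each media server, so the same name on two servers is not necessarily the same physical drive. Map the names back to drive serial numbers before comparing counts across servers.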

The fix "appears" to be simple ... the library was rebooted. It had been running for ~4 yrs without a reboot. So far it hasn't reported any issues after 12 hrs. Giving it a couple of days will confirm the solution.