04-03-2017 03:48 AM
Tape drives going down ...suspect SAN ???
Why I think it's the SAN
Two identical libraries (STK SL3000, 5x HP LTO5) with all tape drives shared amongst the media servers. Only one ATL has problems with drives being downed.
The ATL has been behaving itself for the last three years. First one tape drive became a problem, and now it's spreading to the other tape drives. With the ATL looking and reporting as OK, the media servers OK, and only one ATL affected, can I assume the problem is very likely to be the SAN?
Assumptions
If these assumptions are true, then there are errors on each of the tape drive SAN connections, which leads me to assume a SAN issue. If all the reporting is via the robotic SAN path and the tape drive path is data only, then it could be a library problem, a SAN problem, or both.
04-03-2017 04:58 AM
What do the SAN switch logs show?
I have had similar issues that could be traced back to a faulty fiber cable or SFP.
04-03-2017 05:02 AM
It's nice to see a reasonable argument for/against where the issue is, apart from the usual 'let's blame NBU'.
The bptm log may be helpful. You are just looking for the line(s) where the drive has some problem (provided we get that far), e.g.:
Does a tape get mounted successfully?
Do we start to write?
Do we finish writing, or is there an error on the eject?
Searching for 'error' or <16> can be helpful
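The search suggested above can be scripted. The following is a minimal sketch in Python, assuming the bptm log is a plain text file that you pass in yourself. The function name `find_problem_lines` is hypothetical and not part of NetBackup. It only filters lines containing `<16>` or the word "error"; it does not interpret them.

```python
import re
import sys

# Matches the <16> marker or the word "error" in any letter case.
PATTERN = re.compile(r"<16>|error", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines that match PATTERN."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Optional command-line use: python scan_log.py <path to a bptm log file>
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in find_problem_lines(handle):
            print(f"{number}: {text}")
```

Matching the hits against the time and pid shown in Activity Monitor should narrow things down to the job that failed.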
/usr/openv/netbackup/db/media/errors files (from each media server) can sometimes be useful.
Activity monitor usually shows the error, and if we're lucky, the pid of the log involved, along with the time.
Is the drive disappearing from the SAN? (The messages file or event logs should show this.)
Generally speaking, NBU doesn't cause drive issues. The majority of tape operations (reads, writes) are done by the OS, not NBU. Although we do send various SCSI commands (e.g. to position), even then the underlying operations go via the OS. It's not impossible for NBU to cause an issue, but it's rare.
04-03-2017 05:24 AM
Have you checked Event Viewer System log?
04-03-2017 06:08 AM
Errors from Activity Monitor are now mostly "positioning" errors (Status code 96), occurring early in the backup.
Media servers with event logs show consistent IO errors in the form of:
Log Name: System
Source: hplto
Date: 3/04/2017 5:10:30 PM
Event ID: 7
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: media_server_FQDN
Description:
The device, \Device\Tape0, has a bad block.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="hplto" />
<EventID Qualifiers="49156">7</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2017-04-03T07:10:30.278322600Z" />
<EventRecordID>720917</EventRecordID>
<Channel>System</Channel>
<Computer>media_server_FQDN</Computer>
REST DELETED.
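To see whether the bad-block events cluster on one drive or are spread across several, the events can be tallied per media server and device. This is a rough sketch, assuming the System log events have been copied out as text in the same layout as the excerpt above (a "Computer:" line followed by the "has a bad block" description). The function name `tally_bad_blocks` is hypothetical.

```python
import re
from collections import Counter

# Captures the device path from lines such as:
#   The device, \Device\Tape0, has a bad block.
DEVICE_RE = re.compile(r"The device, (\\Device\\Tape\d+), has a bad block\.")


def tally_bad_blocks(text):
    """Count bad-block events per (computer, device) pair."""
    counts = Counter()
    computer = None  # most recent "Computer:" value seen
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Computer:"):
            computer = line.split(":", 1)[1].strip()
            continue
        match = DEVICE_RE.search(line)
        if match:
            counts[(computer, match.group(1))] += 1
    return counts
```

Note that a name like \Device\Tape0 is local to each media server, so the same name on two servers may not refer to the same physical drive. Map the names to drive serial numbers before drawing conclusions.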
04-04-2017 04:52 AM
The fix "appears" to be simple: the library was rebooted. It had been running for ~4 years without a reboot. So far it hasn't reported any issues after 12 hours. Giving it a couple of days will confirm the solution.