Find the reason why Tape Drives are becoming down

Kasra_Hashemi
Level 5

Hi Everybody,

As you can see below, two drives went down, this time drives 1 and 3. Over the last three months it has not always been drives 1 and 3; sometimes it is drive 2 or 4, or some other combination of the four drives.

I have read the Windows Event Viewer log and the HP MSL log, and I have included them in this post.

I bring the drives back up manually via the NetBackup Administration Console, but after some minutes or hours I hit the same issue again.

Thank you.

Attachments: Drive Status.PNG, Event.PNG

Netdigest
1 ACCEPTED SOLUTION

Accepted Solutions

Marianne
Level 6
Partner    VIP    Accredited Certified

You will need to check the Application log for the time when the drive was DOWN'ed.

My guess is that it will be because of tapes manually loaded into seemingly empty slots. Hours later, when the job has finished, the tape cannot be returned to its 'home slot':

Error 5/28/2018 9:35:20 AM NetBackup TLD Control Daemon 13945 None TLD(0) cannot dismount drive 1, slot 34 already is full

Slot 34 will need to be manually emptied before the tape from drive 1 can be returned to this slot.

So, the operator needs to communicate with someone who has access to the GUI, and should only use the MAP/CAP to remove or load tapes.

When the MAP/CAP is used, it does not matter if jobs are running - the library will not be DOWN'ed. 

After tapes have been placed in the MAP, the operator needs to inform the backup admin.
At this point, run Inventory and select 'Empty media access port prior to update'.
The library will then place the tapes in slots that are actually free.

So, the biggest problem is not so much the library being opened while jobs are running, but rather tapes being put into slots whose own tapes are currently in use in drives.
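
To make the failure mode concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not how NetBackup or the library firmware is implemented; the class name `ToyLibrary`, the slot and drive counts, and the media IDs are all made up for illustration.

```python
class ToyLibrary:
    """Toy model of a tape library: each slot and drive holds at most one tape."""

    def __init__(self, slot_count, drive_count):
        self.slots = {n: None for n in range(1, slot_count + 1)}
        self.drives = {n: None for n in range(1, drive_count + 1)}
        self.home_slot = {}      # tape -> slot it was mounted from
        self.down_drives = set()

    def mount(self, slot, drive):
        # Move the tape from its slot into a drive and remember its home slot.
        tape = self.slots[slot]
        self.slots[slot] = None
        self.drives[drive] = tape
        self.home_slot[tape] = slot

    def manual_load(self, slot, tape):
        # The operator only sees an empty slot; nothing shows it is a home slot.
        if self.slots[slot] is not None:
            raise ValueError(f"slot {slot} is physically occupied")
        self.slots[slot] = tape

    def dismount(self, drive):
        # The tape can only go back to its home slot; if that is taken,
        # the tape stays in the drive and the drive is marked down.
        tape = self.drives[drive]
        slot = self.home_slot[tape]
        if self.slots[slot] is not None:
            self.down_drives.add(drive)
            return f"cannot dismount drive {drive}, slot {slot} already is full"
        self.slots[slot] = tape
        self.drives[drive] = None
        return f"drive {drive} dismounted to slot {slot}"


lib = ToyLibrary(slot_count=48, drive_count=4)   # hypothetical sizes
lib.slots[34] = "TAPE_A"                         # hypothetical media IDs
lib.mount(slot=34, drive=1)                      # a job starts using TAPE_A
lib.manual_load(34, "TAPE_B")                    # door opened, tape put in the "empty" slot
print(lib.dismount(1))
```

The slot looked empty only because its tape was busy in drive 1, which is why loading through the MAP and letting Inventory choose the slot avoids the conflict.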


11 REPLIES

VirgilDobos
Moderator
Partner    VIP    Accredited Certified

Hi Kasra,

Are the tape drives shared across multiple media servers?

 
--Virgil

Marianne
Level 6
Partner    VIP    Accredited Certified

@Kasra_Hashemi

The error in Application log is a good start:

"Request for media ID XXXXX is being rejected .....  reason = robotic daemon going to DOWN state "

This is normally an indication that the OS has lost connectivity to the library.

Please check Event Viewer System log for hardware and/or driver errors for the same period.

There seem to be far too many 'Device Manager' errors in the Application log.

Is it possible that you could export both Application and System logs to .txt files (not .evt) and upload here? 

@VirgilDobos 

No, I just have one media server.

Netdigest

@Marianne

Here are my System and Application logs:

https://www.dropbox.com/sh/e7pz42o7hrknu4o/AAACkzG4pX7U0xpMvPjeV27Ua?dl=0

Netdigest

Marianne
Level 6
Partner    VIP    Accredited Certified

Could you please save the Event Viewer logs in txt format? 

evt(x) format is only useful on the source server. 

Marianne
Level 6
Partner    VIP    Accredited Certified

There is nothing in the System log that looks like hardware issues.

The Application log seems to indicate operational issues in the way the tape library and the tapes are handled.

It seems to me that the operator(s) are opening the tape library door to load or unload tapes instead of using the MAP (Media Access Port).

When a library door is opened, the library is put in a 'down state' because it no longer responds to requests from NBU via the OS:

Error 5/28/2018 9:10:51 AM NetBackup TLD Daemon 5705 None TLD(0) going to DOWN state, status: Unable to sense robotic device

When the robot door is closed, the robot now takes some time to initialize: 

Error 5/28/2018 9:25:50 AM NetBackup TLD Control Daemon 13942 None TLD(0) key = 0x2, asc = 0x4, ascq = 0x1, LOGICAL UNIT IS IN PROCESS OF BECOMING READY
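
For reference, the `key`/`asc`/`ascq` values in that message are SCSI sense data: sense key 0x2 is NOT READY, and ASC 0x04 with ASCQ 0x01 is the "becoming ready" condition the daemon prints. A minimal Python sketch for decoding such a line follows; the function name `decode_sense` is made up, and the lookup tables deliberately cover only the values seen in this thread.

```python
import re

# Lookup tables limited to the values that appear in this thread.
SENSE_KEYS = {0x2: "NOT READY"}
ASC_ASCQ = {(0x04, 0x01): "LOGICAL UNIT IS IN PROCESS OF BECOMING READY"}

SENSE_RE = re.compile(
    r"key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)

def decode_sense(message):
    """Return (sense key text, asc/ascq text), or None if no sense data is found."""
    m = SENSE_RE.search(message)
    if m is None:
        return None
    key, asc, ascq = (int(v, 16) for v in m.groups())
    return (
        SENSE_KEYS.get(key, f"unknown key {key:#x}"),
        ASC_ASCQ.get((asc, ascq), f"unknown asc/ascq {asc:#x}/{ascq:#x}"),
    )

line = ("Error 5/28/2018 9:25:50 AM NetBackup TLD Control Daemon 13942 None "
        "TLD(0) key = 0x2, asc = 0x4, ascq = 0x1, "
        "LOGICAL UNIT IS IN PROCESS OF BECOMING READY")
print(decode_sense(line))
```

In other words, the robot is not reporting a fault here; it is still initializing after the door was closed.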

Here tapes were manually loaded into seemingly empty slots, but the tapes that belong in those slots were in tape drives for jobs that were still running (backup, restore, duplication, etc.). When the job finished and the tape needed to be returned to its 'home' slot, there was another tape in that slot, so the tape had to be left in the drive.
That is why the drive is DOWN'ed.

Error 5/28/2018 9:35:20 AM NetBackup TLD Control Daemon 13945 None TLD(0) cannot dismount drive 1, slot 34 already is full

Error 6/6/2018 11:20:09 AM NetBackup TLD Control Daemon 13945 None TLD(0) cannot dismount drive 3, slot 9 already is full
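
If you want to find every occurrence of this error in an exported log, something like the following Python sketch could help. It assumes the log has been saved as text with one event per line in the layout quoted above; the function name `find_blocked_dismounts` is made up for illustration.

```python
import re

# Pattern for the TLD Control Daemon message quoted in this thread.
DISMOUNT_RE = re.compile(
    r"TLD\((?P<robot>\d+)\) cannot dismount drive (?P<drive>\d+), "
    r"slot (?P<slot>\d+) already is full"
)

def find_blocked_dismounts(lines):
    """Return (robot, drive, slot) for every 'slot already is full' error."""
    hits = []
    for line in lines:
        m = DISMOUNT_RE.search(line)
        if m:
            hits.append((int(m["robot"]), int(m["drive"]), int(m["slot"])))
    return hits

# Sample lines copied from the log entries in this thread.
sample = [
    "Error 5/28/2018 9:10:51 AM NetBackup TLD Daemon 5705 None TLD(0) going to DOWN state, status: Unable to sense robotic device",
    "Error 5/28/2018 9:35:20 AM NetBackup TLD Control Daemon 13945 None TLD(0) cannot dismount drive 1, slot 34 already is full",
    "Error 6/6/2018 11:20:09 AM NetBackup TLD Control Daemon 13945 None TLD(0) cannot dismount drive 3, slot 9 already is full",
]

blocked = find_blocked_dismounts(sample)
for robot, drive, slot in blocked:
    print(f"TLD({robot}): drive {drive} holds a tape whose home slot {slot} is occupied")
```

Each hit names a slot that has to be emptied by hand before the tape in the listed drive can go home.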

In summary: 
Do not open the robot door while jobs are in progress. 
Try to only use the MAP/CAP to remove or load tapes.

Information	5/8/2018 9:42:53 AM	MsiInstaller	1029	None	Product: Veritas NetBackup. Restart required. The installation or update for the product required a restart for all changes to take effect.  The restart was deferred to a later time.
Information	6/5/2018 4:25:02 AM	MsiInstaller	1035	None	Windows Installer reconfigured the product. Product Name: Veeam Endpoint Backup. Product Version: 1.0.0.1954. Product Language: 1033. Manufacturer: Veeam Software AG. Reconfiguration success or error status: 0.

It would appear that you have Veeam installed on this server as well as NetBackup. Both products would be competing for access to the library and could therefore cause the services to go down.

Even if there are no jobs running, backup tools will place a lock on removable storage devices.

I would recommend that you remove the Veeam backup tool if you wish to use NetBackup.

Error	4/29/2018 11:22:54 AM	Microsoft-Windows-Security-SPP	8198	None	"License Activation (slui.exe) failed with the following error code:
hr=0x8007232B

Your Windows installation also does not appear to be activated, which could potentially be causing this problem.

Warning	6/5/2018 8:01:56 PM	Tcpip	4227	None	TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

Network communication could also be an issue, depending on the type of library that you have (iSCSI or another IP-based type). If that is the case, it would be good to have a dedicated backup LAN.

Also review Marianne's post, as that information is pertinent too.

@Marianne

Yes, our operator may open the library without checking whether any jobs are running, because he has no access to the Administration Console.

We off-site tapes once a week. I had already realized that opening the library may put the robot in a DOWN state, so one of my tasks was to bring the drives back up after the off-site procedure, but two or three hours later two of the four drives go down again.

What do you think about this?

Thank You

 

Netdigest

@J_MCCOLL

Yes, I have Veeam Endpoint on this server, for bare-metal restore of this physical server.

I did not add any tape library to that backup product, so this cannot be the issue.

About the Windows activation: we have an intranet KMS server, and the NetBackup server may not have access to the required ports on the KMS server, so that cannot be the issue either.

 

Netdigest