Solved: LTO drives go down (2 of 4 drives)

maciej123 · ‎11-30-2017

Hi
My problem is that 2 of the 4 drives installed in an SL150 library keep going down.

tpconfig -d
Id  DriveName        Type    Residence
      Drive Path                                                 Status
****************************************************************************
0   LTO-6.DRIVE.1    hcart2  TLD(0)  DRIVE=1
      /dev/nst3                                                  UP
1   LTO-6.DRIVE.4    hcart2  TLD(0)  DRIVE=4
      /dev/nst2                                                  UP
2   LTO-6.DRIVE.2    hcart2  TLD(0)  DRIVE=2
      /dev/nst1                                                  DOWN
3   LTO-6.DRIVE.3    hcart2  TLD(0)  DRIVE=3
      /dev/nst0                                                  DOWN

Currently defined robotics are:
TLD(0) robotic path = /dev/sg14

EMM Server = lllll

I have already restarted the SL150 library and stopped/started the media server.
Some OS logs from /var/log/messages:

Nov 30 08:45:50 ppppp tldcd[12903]: TLD(0) cannot dismount drive 2, slot 73 already is full
Nov 30 08:45:53 ppppp ltid[11722]: Operator/EMM server has DOWN'ed drive LTO-6.DRIVE.2 (device 2)
Nov 30 08:52:43 ppppp xinetd[2852]: START: nrpe pid=13101 from=::ffff:10.64.7.8
Nov 30 08:52:43 ppppp xinetd[2852]: EXIT: nrpe status=0 pid=13101 duration=0(sec)
Nov 30 08:52:52 ppppp xinetd[2852]: START: nrpe pid=13104 from=::ffff:10.64.7.8
Nov 30 08:52:52 ppppp xinetd[2852]: EXIT: nrpe status=0 pid=13104 duration=0(sec)
Nov 30 08:53:36 ppppp xinetd[2852]: START: nrpe pid=13132 from=::ffff:10.64.7.8
Nov 30 08:53:36 ppppp xinetd[2852]: EXIT: nrpe status=0 pid=13132 duration=0(sec)
Nov 30 08:53:52 ppppp xinetd[2852]: START: nrpe pid=13137 from=::ffff:10.64.7.8
Nov 30 08:53:52 ppppp xinetd[2852]: EXIT: nrpe status=0 pid=13137 duration=0(sec)
Nov 30 08:54:29 ppppp xinetd[2852]: START: nrpe pid=13166 from=::ffff:10.64.7.8
Nov 30 08:54:29 ppppp xinetd[2852]: EXIT: nrpe status=0 pid=13166 duration=0(sec)
Nov 30 08:54:30 ppppp xinetd[2852]: START: nrpe pid=13169 from=::ffff:10.64.7.8
Nov 30 08:54:30 ppppp xinetd[2852]: EXIT: nrpe status=0 pid=13169 duration=0(sec)
Nov 30 08:58:00 ppppp tldcd[13277]: TLD(0) cannot dismount drive 2, slot 73 already is full
Nov 30 08:59:25 ppppp tldd[12850]: TLD(0) [12850] timed out after waiting 855 seconds for ready, drive 3
Nov 30 09:00:06 ppppp ltid[11722]: Operator/EMM server has DOWN'ed drive LTO-6.DRIVE.3 (device 3)
Nov 30 09:02:44 ppppp xinetd[2852]: START: nrpe pid=13591 from=::ffff:10.64.7.8

An example error from the job: "1: (2009) All compatible drive paths are down but media is available", but I am not sure that is the whole picture.

In the library I can see that it tries to mount tapes. In the SL150 web GUI I can see that a tape is in a drive, but nothing more happens, and then the drive goes down.

[root@ppppp media]# vmoprcmd

HOST STATUS
Host Name Version Host Status
========================================= ======= ===========
lllll 761100 ACTIVE-DISK
ppppp 761100 ACTIVE
sssss 750000 DEACTIVATED

PENDING REQUESTS

<NONE>

DRIVE STATUS

Drive Name       Label  Ready  RecMID  ExtMID  Wr.Enbl.  Type
    Host                    DrivePath              Status
=============================================================================
LTO-6.DRIVE.1    Yes    Yes    0021L6  0021L6  Yes       hcart2
    ppppp.tpsa.pl           /dev/nst3              ACTIVE

LTO-6.DRIVE.2    No     No                     No        hcart2
    ppppp.tpsa.pl           /dev/nst1              DOWN-TLD

LTO-6.DRIVE.3    No     No                     No        hcart2
    ppppp.tpsa.pl           /dev/nst0              DOWN-TLD

LTO-6.DRIVE.4    Yes    Yes    1375L6  1375L6  Yes       hcart2
    ppppp.tpsa.pl           /dev/nst2              ACTIVE

Regards
Maciej

Nicolai · ‎11-30-2017

1: The physical tape locations do not match what NBU has recorded. Run an inventory with all tapes dismounted (problem 2 is likely also the root cause of this issue).

2: The drive order is likely wrong. NetBackup thinks the order is 1 2 3 4, when in reality it is 1 2 4 3.

TLD(0) [12850] timed out after waiting 855 seconds for ready, drive 3

NetBackup is expecting a tape to be mounted in drive 3, but it has likely been mounted in drive 4. After 855 seconds NetBackup gives up and downs the drive. When NetBackup then mounts a tape in drive 4, it ends up in drive 3, which is already full.

Also ensure NetBackup has a SCSI connection to all tape drives: use the command lsscsi or /usr/openv/volmgr/bin/scan
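
When cross-checking the drive order, it can help to list which robot drive number NetBackup has tied to which device path, so that those paths can be compared by hand against what lsscsi or scan reports. The following is only a minimal Python sketch, assuming the two-lines-per-drive layout of the tpconfig -d output pasted earlier in this thread; the sample text is copied from that post, and the function name parse_tpconfig is made up for illustration.

```python
# Sample text copied from the tpconfig -d output earlier in this thread.
TPCONFIG_OUTPUT = """\
0 LTO-6.DRIVE.1 hcart2 TLD(0) DRIVE=1
/dev/nst3 UP
1 LTO-6.DRIVE.4 hcart2 TLD(0) DRIVE=4
/dev/nst2 UP
2 LTO-6.DRIVE.2 hcart2 TLD(0) DRIVE=2
/dev/nst1 DOWN
3 LTO-6.DRIVE.3 hcart2 TLD(0) DRIVE=3
/dev/nst0 DOWN
"""


def parse_tpconfig(text):
    """Return one dict per drive (hypothetical helper, not a NetBackup API).

    Assumes each drive is described by a line ending in DRIVE=<n>,
    followed by a line holding the device path and its status.
    """
    drives = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Drive line: index, name, type, residence, DRIVE=<n>
        if fields[0].isdigit() and fields[-1].startswith("DRIVE="):
            current = {
                "index": int(fields[0]),
                "name": fields[1],
                "robot_drive": int(fields[-1].split("=", 1)[1]),
            }
        # Path line: device path followed by status
        elif current is not None and fields[0].startswith("/dev/") and len(fields) >= 2:
            current["path"] = fields[0]
            current["status"] = fields[1]
            drives.append(current)
            current = None
    return drives


drives = parse_tpconfig(TPCONFIG_OUTPUT)
down = [d for d in drives if d["status"] != "UP"]
for d in down:
    print(f"robot drive {d['robot_drive']}: {d['name']} -> {d['path']} is {d['status']}")
```

With the output from this thread, the sketch reports robot drives 2 and 3 (paths /dev/nst1 and /dev/nst0) as the ones to verify first. It only summarises what NetBackup has configured; whether those paths really belong to those physical drives still has to be checked at the OS and library level.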

Marianne · ‎11-30-2017

You have more than one problem:

TLD(0) cannot dismount drive 2, slot 73 already is full
This normally happens when tapes were manually loaded into robot slots while a tape was in use in drive 2. The tape cannot be returned to its 'home slot' now because someone has loaded another tape there. Remove the tape from slot 73 to solve this.
TLD(0) [12850] timed out after waiting 855 seconds for ready, drive 3
Something is wrong between the OS and the tape drive, or else the device mapping is incorrect.
The command was sent to the robot to mount the tape. You will now see the lights flashing on the tape drive as it rewinds and 'readies' the tape. After this, the lights stop flashing (ready state). At this point, the OS picks up the 'ready' status via the device driver. NBU then gets the 'go ahead' from the OS.
You can see that NBU waited 855 seconds before giving up.
Nothing can be done at NBU level to fix this.
You need to troubleshoot and fix it at device and/or OS level.
If it is possible to view the tape drive after a tape mount, watch the lights to see what happens.
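
To keep the two problems apart when reading /var/log/messages, the NetBackup daemon lines can be separated from the unrelated nrpe entries and sorted by error type. This is a minimal Python sketch, not a NetBackup tool: the sample lines are copied from the log in this thread, and the regular expressions and the category names (slot_full, ready_timeout, downed) are assumptions based only on those lines.

```python
import re

# Sample lines copied from the /var/log/messages excerpt in this thread.
LOG_LINES = """\
Nov 30 08:45:50 ppppp tldcd[12903]: TLD(0) cannot dismount drive 2, slot 73 already is full
Nov 30 08:45:53 ppppp ltid[11722]: Operator/EMM server has DOWN'ed drive LTO-6.DRIVE.2 (device 2)
Nov 30 08:52:43 ppppp xinetd[2852]: START: nrpe pid=13101 from=::ffff:10.64.7.8
Nov 30 08:58:00 ppppp tldcd[13277]: TLD(0) cannot dismount drive 2, slot 73 already is full
Nov 30 08:59:25 ppppp tldd[12850]: TLD(0) [12850] timed out after waiting 855 seconds for ready, drive 3
Nov 30 09:00:06 ppppp ltid[11722]: Operator/EMM server has DOWN'ed drive LTO-6.DRIVE.3 (device 3)
"""

# Only lines logged by these daemons are considered; everything else is skipped.
DAEMON = re.compile(r"\s(tldcd|tldd|ltid)\[\d+\]:")

# Category names are made up for this sketch.
PATTERNS = {
    "slot_full": re.compile(
        r"cannot dismount drive (?P<drive>\d+), slot (?P<slot>\d+) already is full"
    ),
    "ready_timeout": re.compile(
        r"timed out after waiting (?P<seconds>\d+) seconds for ready, drive (?P<drive>\d+)"
    ),
    "downed": re.compile(r"has DOWN'ed drive (?P<name>\S+) \(device (?P<device>\d+)\)"),
}


def classify(text):
    """Return (category, details) tuples for the recognised daemon lines."""
    events = []
    for line in text.splitlines():
        if not DAEMON.search(line):
            continue  # e.g. xinetd/nrpe monitoring noise
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((kind, match.groupdict()))
                break
    return events


events = classify(LOG_LINES)
for kind, details in events:
    print(kind, details)
```

On the lines from this thread, drive 2 shows up only under slot_full and drive 3 only under ready_timeout, which matches the two separate causes described above.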
