Several drives going DOWN in short period with "MEDIA LOAD OR EJECT FAILED"

derfur · ‎06-03-2020

We have NetBackup 7.7.3 with an SL3000 tape library attached to it over FC.
1. The media/master server1 (Solaris 11 SPARC) has the robot and 13 tape drives from the SL3000 attached.
2. The media server2 (SUSE ES 11) has 5 tape drives from the same SL3000 attached.
3. The media server3 (SUSE ES 11) has 4 of the same tape drives as media server2 attached, from the same SL3000.
It worked fine for several years. But now, sometimes, when backups start on server2, several drive paths go down on server2, and different drives go down on server1.

The tape library support team did not find any issues with the library.

We deleted all tape drives and added them again (this was the suggestion of the tape library support team). Strangely, the non-shared drives were added back with their old names.

This helped for two weeks. And now the problem is back.

tpconfig -l output:

Server1:
Device  Robot  Drive  Robot                           Drive                 Device                          Second
Type    Num    Index  Type    DrNum  Status  Comment  Name                  Path                            Device Path
robot   0      -      TLD     -      -       -        -                     /dev/sg/c0tw500104f000b22918l0
drive   -      0      hcart3  4      UP      -        IBM.ULTRIUM-TD6.000   /dev/rmt/7cbn
drive   -      1      hcart2  3      DOWN    -        STK.T10000D.001       /dev/rmt/8cbn
drive   -      2      hcart2  2      DOWN    -        server1_579004003808  /dev/rmt/4cbn
drive   -      3      hcart2  1      UP      -        server1_579004005455  /dev/rmt/0cbn
drive   -      4      hcart2  8      UP      -        STK.T10000D.003       /dev/rmt/3cbn
drive   -      5      hcart2  7      UP      -        server1_579004002624  /dev/rmt/1cbn
drive   -      6      hcart2  6      UP      -        server1_579004008297  /dev/rmt/5cbn
drive   -      7      hcart2  5      UP      -        server1_579004005394  /dev/rmt/6cbn
drive   -      8      hcart2  12     DOWN    -        STK.T10000D.002       /dev/rmt/2cbn
drive   -      9      hcart2  11     UP      -        STK.T10000D.000       /dev/rmt/12cbn
drive   -      10     hcart2  10     DOWN    -        server1_579004006393  /dev/rmt/11cbn
drive   -      11     hcart2  9      UP      -        server1_579004006402  /dev/rmt/10cbn
drive   -      12     hcart2  13     UP      -        server1_579004006389  /dev/rmt/9cbn

Server2:
Device  Robot  Drive  Robot                           Drive                 Device     Second
Type    Num    Index  Type    DrNum  Status  Comment  Name                  Path       Device Path
robot   0      -      TLD     -      -       -        -                     server1
drive   -      0      hcart2  12     DOWN    -        STK.T10000D.002       /dev/nst1
drive   -      1      hcart2  8      UP      -        STK.T10000D.003       /dev/nst0
drive   -      2      hcart3  4      UP      -        IBM.ULTRIUM-TD6.000   /dev/nst3
drive   -      3      hcart2  3      DOWN    -        STK.T10000D.001       /dev/nst2
drive   -      4      hcart2  11     UP      -        STK.T10000D.000       /dev/nst4

Server3:
Device  Robot  Drive  Robot                           Drive                 Device     Second
Type    Num    Index  Type    DrNum  Status  Comment  Name                  Path       Device Path
robot   0      -      TLD     -      -       -        -                     server1
drive   -      0      hcart2  11     UP      -        STK.T10000D.000       /dev/nst3
drive   -      1      hcart2  3      UP      -        STK.T10000D.001       /dev/nst2
drive   -      2      hcart2  12     UP      -        STK.T10000D.002       /dev/nst1
drive   -      3      hcart2  8      UP      -        STK.T10000D.003       /dev/nst0

In the messages file on server1 we have these errors:
Jun 2 20:20:25 server1 tldcd[18489]: [ID 702911 daemon.error] TLD(0) key = 0x4, asc = 0x53, ascq = 0x0, MEDIA LOAD OR EJECT FAILED
Jun 2 20:20:25 server1 tldcd[18489]: [ID 702911 daemon.error] TLD(0) Move_medium error
Jun 2 20:20:54 server1 ltid[2215]: [ID 702911 daemon.error] Operator/EMM server has DOWN'ed drive STK.579004003808 (device 2)
Jun 2 20:22:07 server1 avrd[2331]: [ID 702911 daemon.notice] Reservation Conflict status from STK.T10000D.000 (device 9)

At the same time, on server2 we have these messages:
Jun 2 20:15:58 server2 kernel: st 11:0:0:0: [sg2] Warning! Received an indication that the mode parameters on this target
have changed. The Linux SCSI layer does not automatically adjust these parameters.
Jun 2 20:20:12 server2 kernel: st 12:0:2:0: [sg6] Warning! Received an indication that the mode parameters on this target
have changed. The Linux SCSI layer does not automatically adjust these parameters.
Jun 2 20:23:05 server2 kernel: st 12:0:0:0: [sg4] Warning! Received an indication that the mode parameters on this target
have changed. The Linux SCSI layer does not automatically adjust these parameters.
Jun 2 20:23:54 server2 ltid[15144]: Operator/EMM server has DOWN'ed drive STK.T10000D.001 (device 3)

Tape_Archived · ‎06-03-2020

The reservation conflict message indicates your problem:

Jun 2 20:22:07 server1 avrd[2331]: [ID 702911 daemon.notice] Reservation Conflict status from STK.T10000D.000 (device 9)

Try troubleshooting with the help of one of the solutions - https://vox.veritas.com/t5/NetBackup/Reservation-Conflict-in-SSO-Drive/td-p/574316

pats_729 · ‎06-03-2020

Hi,

There are reservation conflicts happening at OS level.

Jun 2 20:22:07 server1 avrd[2331]: [ID 702911 daemon.notice] Reservation Conflict status from STK.T10000D.000 (device 9)

Have you rebooted these servers recently? If not, try a reboot, or else try to release the conflicts using this article.

https://www.veritas.com/support/en_US/article.100015350

Hope it helps.

Nicolai · ‎06-03-2020

This one is pretty clear, the robot is reporting an error:

Jun 2 20:20:25 server1 tldcd[18489]: [ID 702911 daemon.error] TLD(0) key = 0x4, asc = 0x53, ascq = 0x0, MEDIA LOAD OR EJECT FAILED

This means either there is a fault on the tape drive OR there is a fault in the configuration, e.g. Tape1 is configured as position one in the robot, but it is actually in position 3. This will cause at least two drives to be DOWN'ed.

Possibility 3: old device driver names are used that no longer represent an actual tape drive.
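
For reference, the key/asc/ascq triple in the tldcd line is standard SCSI sense data, and it can be decoded with a small lookup. This is only a sketch: the function name decode_sense and the two tables are made up for illustration, and the tables list just a handful of standard values, not the full SCSI set.

```python
import re

# A few standard SCSI sense keys (deliberately incomplete).
SENSE_KEYS = {
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}

# A few standard ASC/ASCQ pairs (deliberately incomplete).
ASC_ASCQ = {
    (0x53, 0x00): "MEDIA LOAD OR EJECT FAILED",
    (0x2A, 0x01): "MODE PARAMETERS CHANGED",
}

# Matches the "key = 0x4, asc = 0x53, ascq = 0x0" part of a tldcd message.
SENSE_RE = re.compile(
    r"key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Return (sense key text, asc/ascq text), or None if the line has no sense data."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(group, 16) for group in match.groups())
    return (
        SENSE_KEYS.get(key, "unknown sense key"),
        ASC_ASCQ.get((asc, ascq), "unknown asc/ascq"),
    )


LINE = ("Jun 2 20:20:25 server1 tldcd[18489]: [ID 702911 daemon.error] TLD(0) "
        "key = 0x4, asc = 0x53, ascq = 0x0, MEDIA LOAD OR EJECT FAILED")
print(decode_sense(LINE))
```

Sense key 0x4 is a hardware error reported by the device itself, which fits with the robot (not NetBackup) being the one that rejects the move.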

To verify:
Down the drive in NetBackup.
Manually mount a tape in the drive.
From the host, run "mt -f /dev/rmt/??cbn status". When/if the tape drive is ready, a ready/BOT message will be returned from the mt command. Unload the tape with "mt -f /dev/rmt/??cbn offline".
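
Before doing the manual test, the "fault in the configuration" possibility can be partly screened from the tpconfig -l output already posted. The sketch below assumes the whitespace-separated row layout shown in this thread (nine fields per drive row, with "-" in the comment column); parse_drives and drnum_mismatches are hypothetical helper names, not NetBackup commands. It only checks that each shared drive name has the same robot drive number on every host. It cannot tell whether that number matches the physical position in the library.

```python
def parse_drives(tpconfig_text):
    """Return {drive_name: (robot_drive_number, status, path)} for the drive rows."""
    drives = {}
    for line in tpconfig_text.splitlines():
        fields = line.split()
        # Assumed layout: drive - <index> <type> <DrNum> <status> <comment> <name> <path>
        if len(fields) == 9 and fields[0] == "drive":
            drnum, status, name, path = fields[4], fields[5], fields[7], fields[8]
            drives[name] = (int(drnum), status, path)
    return drives


def drnum_mismatches(per_host):
    """Return {drive_name: {host: DrNum}} for drives whose DrNum differs between hosts."""
    seen = {}
    for host, drives in per_host.items():
        for name, (drnum, _status, _path) in drives.items():
            seen.setdefault(name, {})[host] = drnum
    return {name: hosts for name, hosts in seen.items()
            if len(set(hosts.values())) > 1}


# Rows copied from the server2 and server3 output in this thread.
SERVER2 = """\
robot 0 - TLD - - - - server1
drive - 0 hcart2 12 DOWN - STK.T10000D.002 /dev/nst1
drive - 3 hcart2 3 DOWN - STK.T10000D.001 /dev/nst2
"""
SERVER3 = """\
robot 0 - TLD - - - - server1
drive - 1 hcart2 3 UP - STK.T10000D.001 /dev/nst2
drive - 2 hcart2 12 UP - STK.T10000D.002 /dev/nst1
"""

hosts = {"server2": parse_drives(SERVER2), "server3": parse_drives(SERVER3)}
print(drnum_mismatches(hosts))  # prints {} for the rows above
```

An empty result only means the hosts agree with each other. If they all agree on a wrong position, the manual mount test above is still what shows it.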

Marianne · ‎06-03-2020

@derfur

You may want to start troubleshooting on the Linux server (server2).

The errors in the messages file tell us that 'something' is not right at sg driver level.
This could be causing the reservation conflict on the shared drives.

Please check if the HBA, sg and st drivers are up to date at OS level.
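
To see whether the Linux-side warnings come before the errors on server1, it can help to merge the messages from both hosts into one timeline. A minimal sketch, assuming classic syslog lines as posted above: parse_syslog and merged_timeline are made-up names, the year 2020 is an assumption (syslog lines carry no year), and the ordering is only meaningful if the clocks on the two servers are in sync.

```python
from datetime import datetime


def parse_syslog(line, year=2020):
    """Split a classic syslog line into (timestamp, host, message)."""
    month, day, clock, host, message = line.split(None, 4)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, message


def merged_timeline(*logs):
    """Merge several lists of syslog lines into one list sorted by timestamp."""
    events = [parse_syslog(line) for log in logs for line in log if line.strip()]
    return sorted(events, key=lambda event: event[0])


# Lines copied from the messages files quoted in this thread.
SERVER1_LOG = [
    "Jun 2 20:20:25 server1 tldcd[18489]: [ID 702911 daemon.error] TLD(0) Move_medium error",
    "Jun 2 20:22:07 server1 avrd[2331]: [ID 702911 daemon.notice] Reservation Conflict status from STK.T10000D.000 (device 9)",
]
SERVER2_LOG = [
    "Jun 2 20:20:12 server2 kernel: st 12:0:2:0: [sg6] Warning! Received an indication that the mode parameters on this target",
    "Jun 2 20:23:54 server2 ltid[15144]: Operator/EMM server has DOWN'ed drive STK.T10000D.001 (device 3)",
]

for stamp, host, message in merged_timeline(SERVER1_LOG, SERVER2_LOG):
    print(stamp.strftime("%H:%M:%S"), host, message[:60])
```

With the lines quoted here, the server2 kernel warning at 20:20:12 lands just before the Move_medium error on server1 at 20:20:25. That ordering is consistent with the idea above, but it does not prove it.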

Some links about 'reservation conflict':

https://www.veritas.com/support/en_US/article.100027655

https://www.veritas.com/support/en_US/doc/24437881-126559615-0/v95674354-126559615

https://vox.veritas.com/t5/NetBackup/Reservation-Conflict-in-SSO-Drive/td-p/574316

TN about troubleshooting drive/robot issues:
https://www.veritas.com/support/en_US/article.100014480


Dollypee · ‎06-03-2020

@derfur start by checking that all drives are visible to the OS; see the link already provided by @Marianne. Once you have confirmed that the drives are seen, you can rule out the library as the cause for the time being. Also ensure that the library and drive firmware are up to date before ruling out the library. Then focus on the NBU side: ensure there are no path mismatches, then delete and rescan each drive. @Marianne already provided all the relevant TNs you need for this troubleshooting.
