
Drives go DOWN

Hello,

I have a problem with a robotic library managed by ACSLS: approximately 50% of the drives go DOWN every night (5-6 out of 10).

This is a clustered master server (which is also a media server) on Solaris 10, plus 10 Solaris 10 media servers, all running NetBackup Enterprise 7.1.0.1.

SCSI errors show up in /var/adm/messages:

Aug  1 20:03:08 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec16d,0 (st28):
Aug  1 20:03:08 viznar scsi: [ID 107833 kern.notice]    Requested Block: 1000000000                Error Block: 1000000000
Aug  1 20:03:08 viznar scsi: [ID 107833 kern.notice]    Vendor: HP                                 Serial Number:            
Aug  1 20:03:08 viznar scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug  1 20:03:08 viznar scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug  1 20:31:47 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec167,0 (st15):
Aug  1 20:31:47 viznar scsi: [ID 107833 kern.notice]    Requested Block: 1183                      Error Block: 1183
Aug  1 20:31:47 viznar scsi: [ID 107833 kern.notice]    Vendor: HP                                 Serial Number:            
Aug  1 20:31:47 viznar scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug  1 20:31:47 viznar scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.notice]    Requested Block: 1                         Error Block: 1
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.notice]    Vendor: HP                                 Serial Number:            
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.notice]    ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
Aug  1 22:00:49 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec15b,0 (st12):
Aug  1 22:00:49 viznar scsi: [ID 107833 kern.notice]    Requested Block: 1                         Error Block: 1
Aug  1 22:00:49 viznar scsi: [ID 107833 kern.notice]    Vendor: HP                                 Serial Number:            
Aug  1 22:00:49 viznar scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug  1 22:00:49 viznar scsi: [ID 107833 kern.notice]    ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
Aug  1 22:37:50 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug  1 22:37:50 viznar scsi: [ID 107833 kern.notice]    Requested Block: 20114                     Error Block: 20114
Aug  1 22:37:50 viznar scsi: [ID 107833 kern.notice]    Vendor: HP                                 Serial Number:            
Aug  1 22:37:50 viznar scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Aug  1 22:37:50 viznar scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
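To see at a glance which drives keep throwing media errors, the `st` instance on each WARNING line can be paired with the ASC code reported a few lines later. A minimal awk sketch, with sample lines copied from the messages above standing in for the real /var/adm/messages file:

```shell
# Pair each scsi WARNING's (stNN) instance with the ASC code that
# follows it. The here-doc stands in for /var/adm/messages; in
# practice point awk at the file itself.
cat > /tmp/messages.sample <<'EOF'
Aug  1 20:03:08 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec16d,0 (st28):
Aug  1 20:03:08 viznar scsi: [ID 107833 kern.notice]    ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug  1 21:05:42 viznar scsi: [ID 107833 kern.notice]    ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
EOF

# Remember the last (stNN) seen, then print it next to each ASC code.
awk '/WARNING:/ { dev = $NF; gsub(/[():]/, "", dev) }
     /ASC:/    { print dev, $10 }' /tmp/messages.sample
```

If the same ASC (e.g. 0xc, write error) keeps turning up across many different st instances, that points at something common to the path or the media rather than a single bad drive.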
 

There are also write and position errors:

bpadm.viznar bash # sudo tail -f /usr/openv/netbackup/db/media/errors

08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
08/01/11 20:34:48 LU1696 9 WRITE_ERROR LTO4_0012_6
08/01/11 21:05:43 LU1693 12 POSITION_ERROR LTO4_0012_8
08/01/11 22:00:53 LU1693 16 POSITION_ERROR LTO4_0012_2
08/01/11 22:40:51 LU1699 12 WRITE_ERROR LTO4_0012_8
08/02/11 01:09:13 LU1693 16 WRITE_ERROR LTO4_0012_2
08/02/11 05:06:13 LU1170 13 WRITE_ERROR LTO4_0012_0
08/02/11 05:46:45 LU1693 13 POSITION_ERROR LTO4_0012_0
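In that errors file the fields are date, time, media ID, drive index, error type, and drive name, so a quick count on fields 5 and 6 shows whether the errors cluster on particular drives. A sketch, using a few of the sample lines above in place of the real /usr/openv/netbackup/db/media/errors:

```shell
# Count error types per drive in NetBackup's media error log.
# Sample lines from the listing above stand in for the real file.
cat > /tmp/media_errors.sample <<'EOF'
08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
08/01/11 21:05:43 LU1693 12 POSITION_ERROR LTO4_0012_8
08/02/11 01:09:13 LU1693 16 WRITE_ERROR LTO4_0012_2
08/02/11 05:46:45 LU1693 13 POSITION_ERROR LTO4_0012_0
EOF

# $5 = error type, $6 = drive name
awk '{ n[$6 " " $5]++ } END { for (k in n) print n[k], k }' /tmp/media_errors.sample | sort
```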
 

bptm log fragment:

20:03:08.366 [16060] <16> io_ioctl: ioctl (MTWEOF) failed on media id LU1690, drive index 1, I/O error (bptm.c.22233)
20:03:08.366 [16060] <2> send_MDS_msg: DEVICE_STATUS 1 34821236 netbackup-sc LU1690 4003198 LTO4_0012_4 2000415 WRITE_ERROR 0 0
20:03:08.625 [16060] <2> log_media_error: successfully wrote to error file - 08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
20:03:08.649 [16060] <16> write_backup: FROZE media id LU1690, could not write tape mark to begin new image
20:03:08.650 [16060] <2> send_MDS_msg: MEDIADB 1 34821236 LU1690 4003198 *NULL* 20 1312166567 1312220056 1313429226 0 1196176768 899 899 1 18 0 1 1024 0 4675053 0
20:03:08.869 [16060] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_LTO4_0012_4, from bptm.c.16250
20:03:08.869 [16060] <2> drivename_write: Called with mode 1
20:03:08.870 [16060] <2> drivename_unlock: unlocked
20:03:08.870 [16060] <2> drivename_close: Called for file LTO4_0012_4
20:03:08.870 [16060] <2> tpunmount: NOP: MEDIA_DONE 0 3035952 0 LU1690 4003198 180 {7DF013A0-BC67-11E0-8AF6-0021280D4B0E}
20:03:08.870 [16060] <2> nbjm_media_request: Passing job control to NBJM, type FAIL/9
20:03:08.870 [16060] <2> nbjm_media_request: old_media_id = LU1690, media_id = NULL
20:03:52.384 [21143] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn LTO4_0012_4 -dp /dev/rmt/28cbn -dk 2000415 -m LU1690 -mk 4003198 -mds 8 -alocid 34821236 -jobid -1311951091 -jm
20:03:52.385 [21143] <2> check_touch_file: Found /usr/openv/volmgr/database/NO_TAPEALERT from (bptm.c.1153)
20:03:52.385 [21143] <2> vnet_same_host: ../../libvlibs/vnet_addrinfo.c.2926: 0: name2 is empty: 0 0x00000000
20:03:52.385 [21143] <4> bptm: emmserver_name = netbackup-sc
20:03:52.385 [21143] <4> bptm: emmserver_port = 1556
20:03:52.406 [21143] <2> Orb::init: initializing ORB EMMlib_Orb with: dbstunitq -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb EMMlib_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(../Orb.cpp:824)
20:03:52.418 [21143] <2> Orb::init: caching EndpointSelectorFactory(../Orb.cpp:839)
20:03:52.444 [21143] <2> send_brm_msg: PID of bpxm = 21143
20:03:52.444 [21143] <2> nbjm_media_request: Passing job control to NBJM, type UNLOAD/6
20:03:52.444 [21143] <2> nbjm_media_request: old_media_id = NULL, media_id = LU1690
20:03:52.446 [21143] <2> Orb::init: Created anon service name: NB_21143_549438776(../Orb.cpp:696)
20:03:52.446 [21143] <2> Orb::init: endpointvalue is : pbxiop://1556:NB_21143_549438776(../Orb.cpp:713)
20:03:52.446 [21143] <2> Orb::init: initializing ORB Default_DAEMON_Orb with: Unknown -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb Default_DAEMON_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBEndpoint pbxiop://1556:NB_21143_549438776 -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(../Orb.cpp:824)
20:03:52.447 [21143] <2> Orb::init: caching EndpointSelectorFactory(../Orb.cpp:839)
20:03:52.746 [21143] <2> RequestInitialResources: returning
20:03:52.746 [21143] <2> parse_resource_strings: MEDIADB 1 34821236 LU1690 4003198 ------ 20 1312166567 1312220056 1313429226 0 1196176768 899 899 1 18 0 0 1024 0 4675053 0
20:03:52.746 [21143] <2> parse_resource_strings: Parsed message type 15, version 1, 21 parameters
20:03:52.746 [21143] <2> parse_resource_strings: VOLUME 1 LU1690 4003198 LU1690 Offsite HP 1240446565 24 1 1 0 0 {00000000-0000-0000-0000-000000000000} 0
20:03:52.746 [21143] <2> parse_resource_strings: Parsed message type 16, version 1, 14 parameters
20:03:52.746 [21143] <2> parse_resource_strings: DRIVE 3 LTO4_0012_4 2000415 HU19447EAP /dev/rmt/28cbn -1 -1 -1 -1 0 0 12 4 *NULL* *NULL* *NULL* *NULL* 1 0 0 1 0 0
20:03:52.747 [21143] <2> check_touch_file: Found /usr/openv/volmgr/database/NO_TAPEALERT from (jmcomm.c.194)
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 17, version 3, 23 parameters
20:03:52.747 [21143] <2> parse_resource_strings: STORAGE 1 netbackup-sc-hcart3-robot-acs-1 20 1048576 2 1 0 0 netbackup-sc viznar netbackup-sc
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 18, version 1, 11 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKGROUP 0 6 *NULL* 6 *NULL* 6 *NULL*
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 21, version 0, 7 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKVOLUME 0 6 *NULL* 6 *NULL* 0
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 22, version 0, 6 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKMOUNTPOINT 0 6 *NULL*
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 23, version 0, 3 parameters
20:03:52.747 [21143] <2> nbjm_media_request: Job control returned to BPTM
20:03:52.747 [21143] <2> drivename_open: Called with Create 1, file LTO4_0012_4
20:03:52.747 [21143] <2> drivename_lock: lock established
20:03:52.780 [21143] <4> create_tpreq_file: symlink to path /dev/rmt/28cbn
20:03:52.780 [21143] <2> drivename_write: Called with mode 2
20:03:52.781 [21143] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_LTO4_0012_4
20:03:53.648 [21143] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
20:04:52.658 [21143] <2> send_MDS_msg: OP_STATUS 0 34821236 netbackup-sc 0 2 0 0 0 0 0 0 *NULL* 16
20:04:52.761 [21143] <2> send_operation_error: Decoded status = 0 from 2
20:04:52.761 [21143] <4> create_tpreq_file: symlink to path /dev/rmt/28cbn
20:04:52.824 [21143] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
20:04:52.825 [21143] <2> tapealert_and_release: SCSI RELEASE
20:04:52.826 [21143] <2> drivename_unlock: unlocked
20:04:52.826 [21143] <2> drivename_close: Called for file LTO4_0012_4
20:04:52.826 [21143] <2> drivename_remove: Called
20:04:52.826 [21143] <2> main: Sending [EXIT STATUS 0] to NBJM
20:04:52.827 [21143] <2> bptm: EXITING with status 0 <----------
 

I opened a case with hardware support; they tested the drives, replaced some of them, and upgraded the firmware, but the drives go DOWN again.

Any help will be appreciated.

Thanks.

1 Solution

Accepted Solution!

Hello,

In the end, we solved the problem.

As Marianne suggested, we focused on the devices and the robot because of the SCSI errors.

After several escalations with hardware support, they found a bug in the library microcode that affected tape cleaning under ACSLS, so they upgraded the library's microcode.

We then noticed that ACSLS had not been cleaning the drives for several months, so the drives were seriously degraded and all of them had to be replaced.

After that, everything went fine.

 

Thanks everybody for your help.


10 Replies


Hi,

 

You'll have to go back to the SCSI layer. It could be the drives, it could be the fibre cards, and so on. The fact that you're getting the errors in your system's event logs is an indication that the problem lies beneath NetBackup.

 

Unfortunately you'll find that HW support usually only looks at the device, and if they see it's working they leave. The fact of the matter is that there is a whole path to consider: the device, the connection (SAN fabric), the fibre card, the firmware, and the drivers installed on the OS.

 

Also, running tape devices on your master is technically not supported. Please see the tech note:

 

http://www.symantec.com/docs/TECH147444


Can NetBackup provoke a SCSI error (e.g. because of a timeout)?

On the other hand, I've realized that the drives that are DOWN in NetBackup contain cleaning tapes when I query ACSLS:

bpadm.viznar bash # sudo tpconfig -d
Id  DriveName           Type   Residence
      Drive Path                                                       Status
****************************************************************************
0   LTO4_0012_9          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=9
      /dev/rmt/27cbn                                                   UP
1   LTO4_0012_4          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=4
      /dev/rmt/28cbn                                                   DOWN
9   LTO4_0012_6          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=6
      /dev/rmt/15cbn                                                   DOWN
10  LTO4_0012_5          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=5
      /dev/rmt/14cbn                                                   DOWN
11  LTO4_0012_7          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=7
      /dev/rmt/16cbn                                                   UP
12  LTO4_0012_8          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=8
      /dev/rmt/17cbn                                                   DOWN
13  LTO4_0012_0          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=0
      /dev/rmt/10cbn                                                   UP
14  LTO4_0012_1          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=1
      /dev/rmt/11cbn                                                   UP
15  LTO4_0012_3          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=3
      /dev/rmt/13cbn                                                   DISABLED
16  LTO4_0012_2          hcart3 ACS(1)  ACS=0, LSM=0, PANEL=12, DRIVE=2
      /dev/rmt/12cbn                                                   DOWN
 

ACSSA> qu dr all
2011-08-02 12:25:23                                                     Drive Status
 Identifier   State           Status      Volume     Type
   0, 0,12, 0 online          in use      LU1026     HP-LTO4
   0, 0,12, 1 online          in use      LU1563     HP-LTO4
   0, 0,12, 2 online          in use      CLNU04     HP-LTO4
   0, 0,12, 3 online          available              HP-LTO4
   0, 0,12, 4 online          available              HP-LTO4
   0, 0,12, 5 online          in use      CLNU01     HP-LTO4
   0, 0,12, 6 online          in use      CLNU02     HP-LTO4

   0, 0,12, 7 online          in use      LU1562     HP-LTO4
   0, 0,12, 8 online          in use      CLNU03     HP-LTO4
   0, 0,12, 9 online          in use      LU1696     HP-LTO4
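The overlap between the DOWN drives in tpconfig and the drives holding CLNU volumes in ACSLS can be pulled straight out of the `qu dr all` output. A quick sketch, with sample lines from the listing above standing in for the real output:

```shell
# List the drives that currently hold a cleaning (CLNU*) volume,
# using sample lines from the ACSSA 'qu dr all' output above.
cat > /tmp/acsls.sample <<'EOF'
   0, 0,12, 2 online          in use      CLNU04     HP-LTO4
   0, 0,12, 5 online          in use      CLNU01     HP-LTO4
   0, 0,12, 6 online          in use      CLNU02     HP-LTO4
   0, 0,12, 8 online          in use      CLNU03     HP-LTO4
EOF

# Field 3 is the drive number, field 7 the mounted volume.
awk '/CLNU/ { print "drive " $3 " holds " $7 }' /tmp/acsls.sample
```

Drives 2, 5, 6 and 8 hold cleaning tapes, and those same drives are DOWN in the tpconfig listing, which supports the cleaning-tape theory.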
 

Thanks for your help.


Can NetBackup provoke a SCSI error?

Nope. NBU is reacting to device/media errors. Please see extract from Status 84 Troubleshooting Guide (attached):

As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes; OS configuration, a problem on the SCSI path, or a problem with the device.


Thanks Marianne, very accurate answer.

Do you have any clues about drives that hold cleaning tapes in ACSLS while being DOWN in NetBackup?

Need logs...

Do you have /usr/openv/volmgr/database/NO_TAPEALERT touch-file on ALL media servers?

Please also add VERBOSE entry to /usr/openv/volmgr/vm.conf on all media servers and restart NBU.

All request to ACSLS and response received from ACSLS will be logged in each media server's /var/adm/messages.

Please post messages as well as acsss_event log (as attachments) next time a drive is DOWN'ed.  NBU will DOWN a drive if an error is returned from ACSLS.

I have in the past seen drives getting DOWN'ed when this message is returned from ACSLS in response to an NBU tape mount request for a particular drive: ACS status = 29, STATUS_DRIVE_IN_USE (seen in /var/adm/messages with the VERBOSE entry in vm.conf).


Some days ago, I put the /usr/openv/volmgr/database/NO_TAPEALERT file on the media servers that manage these drives, BUT not on the other media servers that also see the drives (although they don't use them). I had no success and the drives went DOWN again.

Do you think it is important that ALL media servers that see the drives have that file, even if they don't use them?

On the other hand, I got ACS status codes 29 (STATUS_DRIVE_IN_USE) and 72 (STATUS_PENDING) on drives that go DOWN. These errors are caused by cleaning tapes that NetBackup is unable to dismount.

I attach the log file.

Regards.


The problem seems to be with the LENGTH of time that the cleaning tape is mounted on the drive. NBU keeps on resubmitting the request:

Aug  1 18:59:28 viznar acsd[17796]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE
Aug  1 18:59:28 viznar acsd[17796]: [ID 188966 daemon.error] ACS(1) waiting to resubmit dismount request  (attempt 2) for volume CLNU01 on drive (0,0,12,5)
........
........
Aug  1 19:18:20 viznar acsd[24055]: [ID 477730 daemon.error] ACS(1) waiting to resubmit dismount request (attempt 5) for volume CLNU01 on drive (0,0,12,5)
Aug  1 19:20:50 viznar acsd[24055]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE

 

Drive cleaning should not take that long - it seems NBU eventually gives up and DOWNs the drive.
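One way to quantify this is to count the dismount failures per ACS status code in /var/adm/messages; repeated STATUS_DRIVE_IN_USE across a ~20-minute window means NBU's dismount retries ran out while the cleaning tape was still mounted. A sketch, using sample acsd lines from the log above:

```shell
# Count ACSLS dismount failures per ACS status code. Sample lines
# from the acsd messages above stand in for /var/adm/messages.
cat > /tmp/acsd.sample <<'EOF'
Aug  1 18:59:28 viznar acsd[17796]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE
Aug  1 19:20:50 viznar acsd[24055]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE
EOF

# Strip everything up to the status code, then tally each code.
grep 'dismount failure' /tmp/acsd.sample | sed 's/.*ACS status = //' | sort | uniq -c
```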
 


Have you been able to find out why ACSLS drive cleaning is taking that long?

Maybe worn cleaning tapes that need to be replaced? Or maybe normal data tapes accidentally labeled with a CLN label and thus not auto-ejecting after cleaning has completed?

Check the acsss_event log on the ACSLS server for evidence of drive cleaning and the length of time taken.


It's been a while since I've worked with ACSLS and SSO, but I think there are some ACS options for vm.conf for issues like this.

It was in an appendix of the Device Configuration Guide, if I remember correctly.

Regards

Michael

 

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue