Drives go DOWN
Hello,
I've a problem with a robotic library managed by ACSLS: Aprox 50% of drives go DOWN every night (5-6 of 10).
This is a clustered master server (also is a media server) Solaris10, 10 media servers Solaris10 with Netbackup Enterprise 7.1.0.1
SCSI errors are showed on messages:
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec16d,0 (st28):
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] Requested Block: 1000000000 Error Block: 1000000000
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec167,0 (st15):
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] Requested Block: 1183 Error Block: 1183
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] Requested Block: 1 Error Block: 1
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec15b,0 (st12):
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] Requested Block: 1 Error Block: 1
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] Requested Block: 20114 Error Block: 20114
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Also write&possition errors:
bpadm.viznar bash # sudo tail -f /usr/openv/netbackup/db/media/errors
08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
08/01/11 20:34:48 LU1696 9 WRITE_ERROR LTO4_0012_6
08/01/11 21:05:43 LU1693 12 POSITION_ERROR LTO4_0012_8
08/01/11 22:00:53 LU1693 16 POSITION_ERROR LTO4_0012_2
08/01/11 22:40:51 LU1699 12 WRITE_ERROR LTO4_0012_8
08/02/11 01:09:13 LU1693 16 WRITE_ERROR LTO4_0012_2
08/02/11 05:06:13 LU1170 13 WRITE_ERROR LTO4_0012_0
08/02/11 05:46:45 LU1693 13 POSITION_ERROR LTO4_0012_0
bptm log fragment:
20:03:08.366 [16060] <16> io_ioctl: ioctl (MTWEOF) failed on media id LU1690, drive index 1, I/O error (bptm.c.22233)
20:03:08.366 [16060] <2> send_MDS_msg: DEVICE_STATUS 1 34821236 netbackup-sc LU1690 4003198 LTO4_0012_4 2000415 WRITE_ERROR 0 0
20:03:08.625 [16060] <2> log_media_error: successfully wrote to error file - 08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
20:03:08.649 [16060] <16> write_backup: FROZE media id LU1690, could not write tape mark to begin new image
20:03:08.650 [16060] <2> send_MDS_msg: MEDIADB 1 34821236 LU1690 4003198 *NULL* 20 1312166567 1312220056 1313429226 0 1196176768 899 899 1 18 0 1 1024 0 4675
053 0
20:03:08.869 [16060] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_LTO4_0012_4, from bptm.c.16250
20:03:08.869 [16060] <2> drivename_write: Called with mode 1
20:03:08.870 [16060] <2> drivename_unlock: unlocked
20:03:08.870 [16060] <2> drivename_close: Called for file LTO4_0012_4
20:03:08.870 [16060] <2> tpunmount: NOP: MEDIA_DONE 0 3035952 0 LU1690 4003198 180 {7DF013A0-BC67-11E0-8AF6-0021280D4B0E}
20:03:08.870 [16060] <2> nbjm_media_request: Passing job control to NBJM, type FAIL/9
20:03:08.870 [16060] <2> nbjm_media_request: old_media_id = LU1690, media_id = NULL
20:03:52.384 [21143] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn LTO4_0012_4 -dp /dev/rmt/28cbn -dk 2000415 -m LU1690 -mk 4003198 -mds 8 -alocid 34821236
-jobid -1311951091 -jm
20:03:52.385 [21143] <2> check_touch_file: Found /usr/openv/volmgr/database/NO_TAPEALERT from (bptm.c.1153)
20:03:52.385 [21143] <2> vnet_same_host: ../../libvlibs/vnet_addrinfo.c.2926: 0: name2 is empty: 0 0x00000000
20:03:52.385 [21143] <4> bptm: emmserver_name = netbackup-sc
20:03:52.385 [21143] <4> bptm: emmserver_port = 1556
20:03:52.406 [21143] <2> Orb::init: initializing ORB EMMlib_Orb with: dbstunitq -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "sta
tic PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolF
actory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static
PBXIOP_Evaluator_Factory '-orb EMMlib_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBSvcConf /dev/null -ORBSvcConfDi
rective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(../Orb.cpp:824)
20:03:52.418 [21143] <2> Orb::init: caching EndpointSelectorFactory(../Orb.cpp:839)
20:03:52.444 [21143] <2> send_brm_msg: PID of bpxm = 21143
20:03:52.444 [21143] <2> nbjm_media_request: Passing job control to NBJM, type UNLOAD/6
20:03:52.444 [21143] <2> nbjm_media_request: old_media_id = NULL, media_id = LU1690
20:03:52.446 [21143] <2> Orb::init: Created anon service name: NB_21143_549438776(../Orb.cpp:696)
20:03:52.446 [21143] <2> Orb::init: endpointvalue is : pbxiop://1556:NB_21143_549438776(../Orb.cpp:713)
20:03:52.446 [21143] <2> Orb::init: initializing ORB Default_DAEMON_Orb with: Unknown -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirectiv
e "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBPro
tocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "s
tatic PBXIOP_Evaluator_Factory '-orb Default_DAEMON_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBEndpoint pbxiop:/
/1556:NB_21143_549438776 -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(../Orb.cpp:824)
20:03:52.447 [21143] <2> Orb::init: caching EndpointSelectorFactory(../Orb.cpp:839)
20:03:52.746 [21143] <2> RequestInitialResources: returning
20:03:52.746 [21143] <2> parse_resource_strings: MEDIADB 1 34821236 LU1690 4003198 ------ 20 1312166567 1312220056 1313429226 0 1196176768 899 899 1 18 0 0 1
024 0 4675053 0
20:03:52.746 [21143] <2> parse_resource_strings: Parsed message type 15, version 1, 21 parameters
20:03:52.746 [21143] <2> parse_resource_strings: VOLUME 1 LU1690 4003198 LU1690 Offsite HP 1240446565 24 1 1 0 0 {00000000-0000-0000-0000-000000000000} 0
20:03:52.746 [21143] <2> parse_resource_strings: Parsed message type 16, version 1, 14 parameters
20:03:52.746 [21143] <2> parse_resource_strings: DRIVE 3 LTO4_0012_4 2000415 HU19447EAP /dev/rmt/28cbn -1 -1 -1 -1 0 0 12 4 *NULL* *NULL* *NULL* *NULL* 1 0 0
1 0 0
20:03:52.747 [21143] <2> check_touch_file: Found /usr/openv/volmgr/database/NO_TAPEALERT from (jmcomm.c.194)
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 17, version 3, 23 parameters
20:03:52.747 [21143] <2> parse_resource_strings: STORAGE 1 netbackup-sc-hcart3-robot-acs-1 20 1048576 2 1 0 0 netbackup-sc viznar netbackup-sc
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 18, version 1, 11 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKGROUP 0 6 *NULL* 6 *NULL* 6 *NULL*
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 21, version 0, 7 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKVOLUME 0 6 *NULL* 6 *NULL* 0
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 22, version 0, 6 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKMOUNTPOINT 0 6 *NULL*
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 23, version 0, 3 parameters
20:03:52.747 [21143] <2> nbjm_media_request: Job control returned to BPTM
20:03:52.747 [21143] <2> drivename_open: Called with Create 1, file LTO4_0012_4
20:03:52.747 [21143] <2> drivename_lock: lock established
20:03:52.780 [21143] <4> create_tpreq_file: symlink to path /dev/rmt/28cbn
20:03:52.780 [21143] <2> drivename_write: Called with mode 2
20:03:52.781 [21143] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_LTO4_0012_4
20:03:53.648 [21143] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
20:04:52.658 [21143] <2> send_MDS_msg: OP_STATUS 0 34821236 netbackup-sc 0 2 0 0 0 0 0 0 *NULL* 16
20:04:52.761 [21143] <2> send_operation_error: Decoded status = 0 from 2
20:04:52.761 [21143] <4> create_tpreq_file: symlink to path /dev/rmt/28cbn
20:04:52.824 [21143] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
20:04:52.825 [21143] <2> tapealert_and_release: SCSI RELEASE
20:04:52.826 [21143] <2> drivename_unlock: unlocked
20:04:52.826 [21143] <2> drivename_close: Called for file LTO4_0012_4
20:04:52.826 [21143] <2> drivename_remove: Called
20:04:52.826 [21143] <2> main: Sending [EXIT STATUS 0] to NBJM
20:04:52.827 [21143] <2> bptm: EXITING with status 0 <----------
I opened a case with HW support, they tested the drives, replaced some of them and upgraded the firmware, BUT DRIVES GO DOWN AGAIN.
Any help will be apreciated.
Thanks.
Hello,
At the end, we solved the problem.
As Marianne suggested, we focused on devices and robots because of the SCSI error.
After several scalations on HW support, they found a bug in the library tape microcode affecting the cleaning of tapes with ACSLS. So they upgraded the microcode of the library.
Then, we noticed that ACSLS was not cleaning drive tapes for several months, so the drive were seriously degraded and all of them must be replaced.
After that, everything went fine.
Thanks everybody for your help.