08-02-2011 12:38 AM
Hello,
I have a problem with a robotic tape library managed by ACSLS: approximately 50% of the drives go DOWN every night (5-6 out of 10).
The master server is clustered (and is also a media server) on Solaris 10, with 10 Solaris 10 media servers, all running NetBackup Enterprise 7.1.0.1.
SCSI errors show up in /var/adm/messages:
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec16d,0 (st28):
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] Requested Block: 1000000000 Error Block: 1000000000
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 20:03:08 viznar scsi: [ID 107833 kern.notice] ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec167,0 (st15):
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] Requested Block: 1183 Error Block: 1183
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 20:31:47 viznar scsi: [ID 107833 kern.notice] ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] Requested Block: 1 Error Block: 1
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 21:05:42 viznar scsi: [ID 107833 kern.notice] ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec15b,0 (st12):
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] Requested Block: 1 Error Block: 1
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 22:00:49 viznar scsi: [ID 107833 kern.notice] ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x0
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] Requested Block: 20114 Error Block: 20114
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] Vendor: HP Serial Number:
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] Sense Key: Media Error
Aug 1 22:37:50 viznar scsi: [ID 107833 kern.notice] ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
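To see which drives are throwing these warnings most often, the st instance names can be tallied straight from the syslog. A rough sketch (sample warning lines from above are inlined so it runs anywhere; on the server you would grep /var/adm/messages instead):

```shell
# Count kernel SCSI warnings per st instance, using the message format above.
# On the server, replace the printf with:
#   grep 'kern.warning.*(st' /var/adm/messages
printf '%s\n' \
  'Aug 1 20:03:08 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec16d,0 (st28):' \
  'Aug 1 20:31:47 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec167,0 (st15):' \
  'Aug 1 21:05:42 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):' \
  'Aug 1 22:00:49 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@36,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec15b,0 (st12):' \
  'Aug 1 22:37:50 viznar scsi: [ID 107833 kern.warning] WARNING: /pci@12,600000/SUNW,qlc@0/fp@0,0/st@w500104f000aec179,0 (st17):' \
| awk 'match($0, /\(st[0-9]+\):/) {              # pull the stNN instance name
         n[substr($0, RSTART + 1, RLENGTH - 3)]++ # strip the "(" and "):"
       }
       END { for (d in n) print n[d], d }' \
| sort -rn
```

With the sample above, st17 comes out on top with two warnings, matching drives that get DOWN'ed repeatedly.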
There are also write & position errors:
bpadm.viznar bash # sudo tail -f /usr/openv/netbackup/db/media/errors
08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
08/01/11 20:34:48 LU1696 9 WRITE_ERROR LTO4_0012_6
08/01/11 21:05:43 LU1693 12 POSITION_ERROR LTO4_0012_8
08/01/11 22:00:53 LU1693 16 POSITION_ERROR LTO4_0012_2
08/01/11 22:40:51 LU1699 12 WRITE_ERROR LTO4_0012_8
08/02/11 01:09:13 LU1693 16 WRITE_ERROR LTO4_0012_2
08/02/11 05:06:13 LU1170 13 WRITE_ERROR LTO4_0012_0
08/02/11 05:46:45 LU1693 13 POSITION_ERROR LTO4_0012_0
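The same errors file can be summarized per drive to spot the worst offenders. A minimal awk sketch (the sample lines above are inlined so the pipeline is runnable as-is; on the media server you would read /usr/openv/netbackup/db/media/errors instead):

```shell
# Count media errors per drive name (field 6 of the errors file shown above).
printf '%s\n' \
  '08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4' \
  '08/01/11 20:34:48 LU1696 9 WRITE_ERROR LTO4_0012_6' \
  '08/01/11 21:05:43 LU1693 12 POSITION_ERROR LTO4_0012_8' \
  '08/01/11 22:00:53 LU1693 16 POSITION_ERROR LTO4_0012_2' \
  '08/01/11 22:40:51 LU1699 12 WRITE_ERROR LTO4_0012_8' \
  '08/02/11 01:09:13 LU1693 16 WRITE_ERROR LTO4_0012_2' \
  '08/02/11 05:06:13 LU1170 13 WRITE_ERROR LTO4_0012_0' \
  '08/02/11 05:46:45 LU1693 13 POSITION_ERROR LTO4_0012_0' \
| awk '{ n[$6]++ } END { for (d in n) print n[d], d }' \
| sort -rn
```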
bptm log fragment:
20:03:08.366 [16060] <16> io_ioctl: ioctl (MTWEOF) failed on media id LU1690, drive index 1, I/O error (bptm.c.22233)
20:03:08.366 [16060] <2> send_MDS_msg: DEVICE_STATUS 1 34821236 netbackup-sc LU1690 4003198 LTO4_0012_4 2000415 WRITE_ERROR 0 0
20:03:08.625 [16060] <2> log_media_error: successfully wrote to error file - 08/01/11 20:03:08 LU1690 1 WRITE_ERROR LTO4_0012_4
20:03:08.649 [16060] <16> write_backup: FROZE media id LU1690, could not write tape mark to begin new image
20:03:08.650 [16060] <2> send_MDS_msg: MEDIADB 1 34821236 LU1690 4003198 *NULL* 20 1312166567 1312220056 1313429226 0 1196176768 899 899 1 18 0 1 1024 0 4675
053 0
20:03:08.869 [16060] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_LTO4_0012_4, from bptm.c.16250
20:03:08.869 [16060] <2> drivename_write: Called with mode 1
20:03:08.870 [16060] <2> drivename_unlock: unlocked
20:03:08.870 [16060] <2> drivename_close: Called for file LTO4_0012_4
20:03:08.870 [16060] <2> tpunmount: NOP: MEDIA_DONE 0 3035952 0 LU1690 4003198 180 {7DF013A0-BC67-11E0-8AF6-0021280D4B0E}
20:03:08.870 [16060] <2> nbjm_media_request: Passing job control to NBJM, type FAIL/9
20:03:08.870 [16060] <2> nbjm_media_request: old_media_id = LU1690, media_id = NULL
20:03:52.384 [21143] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn LTO4_0012_4 -dp /dev/rmt/28cbn -dk 2000415 -m LU1690 -mk 4003198 -mds 8 -alocid 34821236
-jobid -1311951091 -jm
20:03:52.385 [21143] <2> check_touch_file: Found /usr/openv/volmgr/database/NO_TAPEALERT from (bptm.c.1153)
20:03:52.385 [21143] <2> vnet_same_host: ../../libvlibs/vnet_addrinfo.c.2926: 0: name2 is empty: 0 0x00000000
20:03:52.385 [21143] <4> bptm: emmserver_name = netbackup-sc
20:03:52.385 [21143] <4> bptm: emmserver_port = 1556
20:03:52.406 [21143] <2> Orb::init: initializing ORB EMMlib_Orb with: dbstunitq -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "sta
tic PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolF
actory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "static
PBXIOP_Evaluator_Factory '-orb EMMlib_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBSvcConf /dev/null -ORBSvcConfDi
rective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(../Orb.cpp:824)
20:03:52.418 [21143] <2> Orb::init: caching EndpointSelectorFactory(../Orb.cpp:839)
20:03:52.444 [21143] <2> send_brm_msg: PID of bpxm = 21143
20:03:52.444 [21143] <2> nbjm_media_request: Passing job control to NBJM, type UNLOAD/6
20:03:52.444 [21143] <2> nbjm_media_request: old_media_id = NULL, media_id = LU1690
20:03:52.446 [21143] <2> Orb::init: Created anon service name: NB_21143_549438776(../Orb.cpp:696)
20:03:52.446 [21143] <2> Orb::init: endpointvalue is : pbxiop://1556:NB_21143_549438776(../Orb.cpp:713)
20:03:52.446 [21143] <2> Orb::init: initializing ORB Default_DAEMON_Orb with: Unknown -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirectiv
e "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBPro
tocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBDefaultInitRef '' -ORBSvcConfDirective "s
tatic PBXIOP_Evaluator_Factory '-orb Default_DAEMON_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBEndpoint pbxiop:/
/1556:NB_21143_549438776 -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(../Orb.cpp:824)
20:03:52.447 [21143] <2> Orb::init: caching EndpointSelectorFactory(../Orb.cpp:839)
20:03:52.746 [21143] <2> RequestInitialResources: returning
20:03:52.746 [21143] <2> parse_resource_strings: MEDIADB 1 34821236 LU1690 4003198 ------ 20 1312166567 1312220056 1313429226 0 1196176768 899 899 1 18 0 0 1
024 0 4675053 0
20:03:52.746 [21143] <2> parse_resource_strings: Parsed message type 15, version 1, 21 parameters
20:03:52.746 [21143] <2> parse_resource_strings: VOLUME 1 LU1690 4003198 LU1690 Offsite HP 1240446565 24 1 1 0 0 {00000000-0000-0000-0000-000000000000} 0
20:03:52.746 [21143] <2> parse_resource_strings: Parsed message type 16, version 1, 14 parameters
20:03:52.746 [21143] <2> parse_resource_strings: DRIVE 3 LTO4_0012_4 2000415 HU19447EAP /dev/rmt/28cbn -1 -1 -1 -1 0 0 12 4 *NULL* *NULL* *NULL* *NULL* 1 0 0
1 0 0
20:03:52.747 [21143] <2> check_touch_file: Found /usr/openv/volmgr/database/NO_TAPEALERT from (jmcomm.c.194)
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 17, version 3, 23 parameters
20:03:52.747 [21143] <2> parse_resource_strings: STORAGE 1 netbackup-sc-hcart3-robot-acs-1 20 1048576 2 1 0 0 netbackup-sc viznar netbackup-sc
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 18, version 1, 11 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKGROUP 0 6 *NULL* 6 *NULL* 6 *NULL*
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 21, version 0, 7 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKVOLUME 0 6 *NULL* 6 *NULL* 0
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 22, version 0, 6 parameters
20:03:52.747 [21143] <2> parse_resource_strings: DISKMOUNTPOINT 0 6 *NULL*
20:03:52.747 [21143] <2> parse_resource_strings: Parsed message type 23, version 0, 3 parameters
20:03:52.747 [21143] <2> nbjm_media_request: Job control returned to BPTM
20:03:52.747 [21143] <2> drivename_open: Called with Create 1, file LTO4_0012_4
20:03:52.747 [21143] <2> drivename_lock: lock established
20:03:52.780 [21143] <4> create_tpreq_file: symlink to path /dev/rmt/28cbn
20:03:52.780 [21143] <2> drivename_write: Called with mode 2
20:03:52.781 [21143] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_LTO4_0012_4
20:03:53.648 [21143] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
20:04:52.658 [21143] <2> send_MDS_msg: OP_STATUS 0 34821236 netbackup-sc 0 2 0 0 0 0 0 0 *NULL* 16
20:04:52.761 [21143] <2> send_operation_error: Decoded status = 0 from 2
20:04:52.761 [21143] <4> create_tpreq_file: symlink to path /dev/rmt/28cbn
20:04:52.824 [21143] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
20:04:52.825 [21143] <2> tapealert_and_release: SCSI RELEASE
20:04:52.826 [21143] <2> drivename_unlock: unlocked
20:04:52.826 [21143] <2> drivename_close: Called for file LTO4_0012_4
20:04:52.826 [21143] <2> drivename_remove: Called
20:04:52.826 [21143] <2> main: Sending [EXIT STATUS 0] to NBJM
20:04:52.827 [21143] <2> bptm: EXITING with status 0 <----------
I opened a case with hardware support; they tested the drives, replaced some of them and upgraded the firmware, BUT THE DRIVES GO DOWN AGAIN.
Any help will be appreciated.
Thanks.
11-11-2011 05:29 AM
Hello,
In the end, we solved the problem.
As Marianne suggested, we focused on the devices and the robot because of the SCSI errors.
After several escalations with hardware support, they found a bug in the library's tape microcode affecting drive cleaning under ACSLS, so they upgraded the library microcode.
We then noticed that ACSLS had not been cleaning the drives for several months, so the drives were seriously degraded and all of them had to be replaced.
After that, everything went fine.
Thanks everybody for your help.
08-02-2011 02:11 AM
Hi,
You'll have to go back to the SCSI layer. It could be the drives, it could be the fibre cards, etc. The fact that you're getting the errors in your system's event logs is an indication that the problem lies beneath NetBackup.
Unfortunately, you'll find that HW support usually only looks at the device, and if they see it's working they leave. The fact of the matter is that there is a whole path to consider: the device, the connection (SAN fabric), the fibre card, the firmware, and the drivers installed on the OS.
Also, running tape on your master is technically not supported. Please see the note:
http://www.symantec.com/docs/TECH147444
08-02-2011 03:26 AM
Can NetBackup provoke a SCSI error (e.g. because of a timeout)?
On the other hand, I've noticed that the drives that are DOWN in NetBackup contain cleaning tapes when I query ACSLS:
bpadm.viznar bash # sudo tpconfig -d
Id DriveName Type Residence
Drive Path Status
****************************************************************************
0 LTO4_0012_9 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=9
/dev/rmt/27cbn UP
1 LTO4_0012_4 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=4
/dev/rmt/28cbn DOWN
9 LTO4_0012_6 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=6
/dev/rmt/15cbn DOWN
10 LTO4_0012_5 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=5
/dev/rmt/14cbn DOWN
11 LTO4_0012_7 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=7
/dev/rmt/16cbn UP
12 LTO4_0012_8 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=8
/dev/rmt/17cbn DOWN
13 LTO4_0012_0 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=0
/dev/rmt/10cbn UP
14 LTO4_0012_1 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=1
/dev/rmt/11cbn UP
15 LTO4_0012_3 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=3
/dev/rmt/13cbn DISABLED
16 LTO4_0012_2 hcart3 ACS(1) ACS=0, LSM=0, PANEL=12, DRIVE=2
/dev/rmt/12cbn DOWN
ACSSA> qu dr all
2011-08-02 12:25:23 Drive Status
Identifier State Status Volume Type
0, 0,12, 0 online in use LU1026 HP-LTO4
0, 0,12, 1 online in use LU1563 HP-LTO4
0, 0,12, 2 online in use CLNU04 HP-LTO4
0, 0,12, 3 online available HP-LTO4
0, 0,12, 4 online available HP-LTO4
0, 0,12, 5 online in use CLNU01 HP-LTO4
0, 0,12, 6 online in use CLNU02 HP-LTO4
0, 0,12, 7 online in use LU1562 HP-LTO4
0, 0,12, 8 online in use CLNU03 HP-LTO4
0, 0,12, 9 online in use LU1696 HP-LTO4
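Putting the two listings side by side: the drive number is DRIVE=n in tpconfig and the last field of the identifier in ACSLS. A small sketch of the cross-check (the two drive lists are copied by hand from the output above, so they are a snapshot, not live data):

```shell
# Which NetBackup-DOWN drives hold a cleaning (CLNUnn) tape per ACSLS?
down_drives='4 6 5 8 2'   # DRIVE=n of every DOWN drive in tpconfig -d above
cln_drives='2 5 6 8'      # drives showing a CLNUnn volume in "qu dr all" above
for d in $down_drives; do
  case " $cln_drives " in
    *" $d "*) echo "drive $d: DOWN and holds a cleaning tape" ;;
    *)        echo "drive $d: DOWN, no cleaning tape mounted" ;;
  esac
done
```

With the snapshot above, 4 of the 5 DOWN drives are holding cleaning tapes, which is what made this pattern stand out.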
Thanks for your help.
08-02-2011 03:45 AM
Can NetBackup provoke a SCSI error?
Nope. NBU is reacting to device/media errors. Please see this extract from the Status 84 Troubleshooting Guide (attached):
As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes: OS configuration, a problem on the SCSI path, or a problem with the device.
08-02-2011 05:45 AM
Thanks Marianne, very accurate answer.
Do you have any clues about drives that are DOWN in NetBackup while holding cleaning tapes in ACSLS?
08-02-2011 06:31 AM
Do you have /usr/openv/volmgr/database/NO_TAPEALERT touch-file on ALL media servers?
Please also add VERBOSE entry to /usr/openv/volmgr/vm.conf on all media servers and restart NBU.
All requests to ACSLS and responses received from ACSLS will be logged in each media server's /var/adm/messages.
Please post the messages file as well as the acsss_event log (as attachments) the next time a drive is DOWN'ed. NBU will DOWN a drive if an error is returned from ACSLS.
I have in the past seen drives getting DOWN'ed when this message is returned from ACSLS in response to an NBU tape mount request for a particular drive: ACS status = 29, STATUS_DRIVE_IN_USE (seen in /var/adm/messages with the VERBOSE entry in vm.conf).
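For reference, the suggested setup boils down to a couple of commands per media server. A sketch under one assumption: VOLMGR_ROOT is parameterized here only so it can be dry-run safely outside a media server; on a real media server the root is /usr/openv/volmgr.

```shell
# Repeat on EVERY media server. VOLMGR_ROOT defaults to a demo directory so
# this sketch can be tried anywhere; the real root is /usr/openv/volmgr.
VOLMGR_ROOT=${VOLMGR_ROOT:-/tmp/volmgr_demo}
mkdir -p "$VOLMGR_ROOT/database"

# 1. Touch-file that stops NBU from issuing TapeAlert queries:
touch "$VOLMGR_ROOT/database/NO_TAPEALERT"

# 2. Add VERBOSE to vm.conf (idempotent append):
grep -qx 'VERBOSE' "$VOLMGR_ROOT/vm.conf" 2>/dev/null \
  || echo 'VERBOSE' >> "$VOLMGR_ROOT/vm.conf"

# 3. Restart NBU so the device daemons pick up the change, e.g.:
#    /usr/openv/netbackup/bin/bp.kill_all
#    /usr/openv/netbackup/bin/bp.start_all
```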
08-03-2011 01:39 AM
Some days ago, I put the /usr/openv/volmgr/database/NO_TAPEALERT file on the media servers that manage these drives, BUT not on other media servers that also see the drives (although they don't use them). I had no success and the drives went down again.
Do you think it is important that ALL media servers that see the drives have that file, even if they don't use them?
On the other hand, I got ACS status codes 29 (STATUS_DRIVE_IN_USE) and 72 (STATUS_PENDING) on drives that go DOWN. These errors are caused by cleaning tapes that NetBackup is unable to dismount.
The log file is attached.
Regards.
08-03-2011 02:09 AM
The problem seems to be the LENGTH of time that the cleaning tape stays mounted in the drive. NBU keeps on resubmitting the request:
Aug 1 18:59:28 viznar acsd[17796]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE
Aug 1 18:59:28 viznar acsd[17796]: [ID 188966 daemon.error] ACS(1) waiting to resubmit dismount request (attempt 2) for volume CLNU01 on drive (0,0,12,5)
........
........
Aug 1 19:18:20 viznar acsd[24055]: [ID 477730 daemon.error] ACS(1) waiting to resubmit dismount request (attempt 5) for volume CLNU01 on drive (0,0,12,5)
Aug 1 19:20:50 viznar acsd[24055]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE
Drive cleaning should not take that long - it seems NBU eventually gives up and DOWNs the drive.
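To put a number on "that long", the first and last dismount-failure timestamps for a cleaning volume can be diffed straight from the syslog lines. A sketch with the two quoted lines inlined (on the server you would feed `grep 'dismount failure for volume CLN' /var/adm/messages` into the awk instead):

```shell
# Measure how long dismounts of each cleaning volume kept failing,
# using the acsd syslog format quoted above.
printf '%s\n' \
 'Aug 1 18:59:28 viznar acsd[17796]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE' \
 'Aug 1 19:20:50 viznar acsd[24055]: [ID 560350 daemon.error] ACS(1) dismount failure for volume CLNU01 on drive (0,0,12,5), ACS status = 29, STATUS_DRIVE_IN_USE' \
| awk '/dismount failure for volume/ {
    split($3, t, ":"); s = t[1]*3600 + t[2]*60 + t[3]   # syslog time -> seconds
    for (i = 1; i <= NF; i++) if ($i == "volume") vol = $(i+1)
    if (!(vol in first)) first[vol] = s                 # first failure seen
    last[vol] = s                                       # most recent failure
  }
  END { for (v in first) printf "%s stuck for ~%d min\n", v, (last[v]-first[v])/60 }'
```

For the two lines quoted above this reports CLNU01 stuck for roughly 21 minutes, far longer than a normal cleaning cycle.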
09-27-2011 01:26 AM
Have you been able to find out why ACSLS drive cleaning is taking that long?
Maybe worn cleaning tapes that need to be replaced? Or maybe normal data tapes accidentally labelled with a CLN label and thus not auto-ejecting after cleaning has completed?
Check the acsss_event log on the ACSLS server for evidence of drive cleaning and the length of time taken.
10-12-2011 02:38 AM
It's been a while since I've worked with ACSLS and SSO, but I think there are some ACS options for vm.conf for issues like this.
It was in an appendix of the Device Configuration Guide, if I remember correctly.
Regards
Michael