Solved: ACSLS 7.3 - dismount failure down the drive

Anth105 · ‎02-10-2010

Hi all,

NBU 6.5.5
ACSLS 7.3
OS : Solaris

I have an intermittent issue where drives go Down when a dismount request is submitted. I have gone through all the usual troubleshooting steps with no joy.

Upgraded the firmware on both the SL8500 and the T10K drives
Updated the EMM mappings version
I even deleted and recreated the drives (rm -f /dev/rmt* - devfsadm - reconfigure the sg driver - sgscan ..)
Had the faulty drives replaced by the vendor
Set the Media unmount delay value to 60

Here is an excerpt of the messages:

Feb 10 06:01:55 admatriubu01 acsd[5746]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE
Feb 10 06:02:30 admatriubu01 acsd[5878]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE

Feb 10 06:20:50 admatriubu01 acsd[9563]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE
Feb 10 06:50:09 admatriubu01 acsd[14669]: [ID 498531 daemon.error] user scsi ioctl() failed, may be timeout, errno = 5, I/O error
Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 756643 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 2) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:02:13 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:02:13 admatriubu01 acsd[14669]: [ID 338898 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 3) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:09:30 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:09:30 admatriubu01 acsd[14669]: [ID 821134 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 4) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:16:47 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:16:47 admatriubu01 acsd[14669]: [ID 403389 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 5) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:20:29 admatriubu01 acsd[11585]: [ID 446325 daemon.error] ACS(2) going to DOWN state, status: Timeout waiting for robotic command
Feb 10 07:22:31 admatriubu01 acsd[11585]: [ID 964522 daemon.notice] ACS(2) going to UP state
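
An excerpt like this is easier to read once the failures are grouped by drive and status. The following is a minimal sketch, assuming the acsd message format matches the lines quoted above; the function name and the shortened sample are illustrative only and are not part of NetBackup or ACSLS.

```python
import re
from collections import Counter

# Pattern for the acsd "dismount failure" lines quoted in the excerpt.
DISMOUNT_FAILURE = re.compile(
    r"dismount failure for volume (?P<volume>\S+) "
    r"on drive \((?P<drive>[\d,]+)\), "
    r"ACS status = (?P<code>\d+), (?P<status>STATUS_\w+)"
)

def summarize_dismount_failures(log_text):
    """Count dismount failures per (drive, volume, status) in syslog text."""
    counts = Counter()
    # finditer over the whole text also copes with two log lines fused into one.
    for m in DISMOUNT_FAILURE.finditer(log_text):
        counts[(m.group("drive"), m.group("volume"), m.group("status"))] += 1
    return counts

# Three lines copied from the excerpt above.
sample = """\
Feb 10 06:01:55 admatriubu01 acsd[5746]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE
Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:02:13 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
"""

for (drive, volume, status), n in summarize_dismount_failures(sample).most_common():
    print(f"drive ({drive}) volume {volume}: {n} x {status}")
```

Run against the full messages file, this shows at a glance whether the failures cluster on one drive or one volume.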

Any help will be much appreciated

Anthony

Nicolai · ‎02-10-2010

Looks like a HW failure or a stuck tape. Use SLConsole to verify that all drives are operational from the robot side.

UPDATE:
You can download it from StorageTek/SUN/Oracle if you have signed up for an account:

Sun StorageTek SLConsole FRS 4.10 Final Release

marekkedzierski · ‎02-10-2010

Did you check the logs on the SL8500?

Marianne · ‎02-10-2010

The first thing you need to verify is that your drive mapping is 100% correct - the ACSLS drive address has to correspond with the correct /dev/rmt device name.

When a dismount is required, NBU first issues an unload command to the O/S device name (mt -f /dev/rmt/... rewoffl). Once the tape is in the unload position, the command is sent to the ACSLS server to put the tape back in its slot. If the tape is not unloaded, ACSLS will display a 'dismount failure' message.

To test, do the following:
1. mount a tape via the ACSSS user interface.
2. verify that the O/S sees the correct tape status : mt -f /dev/rmt/.... stat
3. issue O/S unload command : mt -f /dev/rmt/.... rewoffl
4. dismount the tape via ACSSS interface
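
Steps 2 and 3 can be scripted so that they are run the same way on every drive. This is only a sketch under stated assumptions: the helper names are made up for illustration, the device path in the example call is hypothetical (the real /dev/rmt name must be substituted), and the mount and dismount in steps 1 and 4 still have to be done by hand in cmd_proc.

```python
import subprocess

def os_unload_commands(device):
    """Return the two mt commands from steps 2 and 3 for one tape device."""
    return [
        ["mt", "-f", device, "stat"],     # step 2: check the tape status
        ["mt", "-f", device, "rewoffl"],  # step 3: rewind and unload
    ]

def run_os_unload_test(device, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in os_unload_commands(device):
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(cmd, check=True)

# Hypothetical device path, for illustration only; substitute the real one.
run_os_unload_test("/dev/rmt/0cbn")
```

With dry_run left at True the script only prints the commands, which is a safe way to confirm the device name before touching a drive.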

If you're not familiar with ACSSS interface:
Log on to the ACSLS server as user acsss.

Home directory should be /export/home/acsss.
$ cd log     (to access log files)

Important log files
acsss_stats.log and acsss_event.log

To open the command proc window:

$ cmd_proc -l
acsss >q dr all                                        # query drive all
acsss >m A00000 0,0,10,0                       # mount tape A00000 on drive 0,0,10,0
acsss >dism A00000 0,0,10,0                      # dismount A00000 from drive 0,0,10,0
acsss >q vol A00000                                  # query status of volume A00000
acsss > display drive *,*,*,* -f type serial_num      # get drive serial number
acsss >log                                             # log off
$

Anth105 · ‎02-10-2010

Marianne,

Thank you for your advice.

Here are the entries from acsss_event.log from when I tried to dismount a volume:

2010-02-10 16:20:21 DISMOUNT[0]:
546 N cl_log_lh_er.c 1 99
dm_lh_lib_fail: LH error type = LH_ERR_TRANSPORT_FAILURE

2010-02-10 16:20:21 ACSSA[0]:
1431 N sa_demux.c 1 296
drive 2, 2, 1, 3: Library error, Transport failure
.

2010-02-10 16:20:37 DISMOUNT[0]:
971 N mt_action_dm.c 1 1272
dm_lh_drive_busy: LH error type = LH_ERR_TRANSPORT_BUSY 2, 1, 1,14

2010-02-10 16:23:02 command process[0]:
1283 N cp_sm_error.c 1 492
cp_sm_error, line: 491, Invalid state machine state: CPS_LOGOFF, status
STATUS_PROCESS_FAILURE
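
To pull just the library-handler (LH) errors out of a longer acsss_event.log, something like the sketch below may help. It assumes each entry starts with a "timestamp COMPONENT[n]:" header line, as in the excerpt above; that layout is inferred from these few entries, not taken from ACSLS documentation, and the function name is illustrative.

```python
import re

# Header line of an entry, e.g. "2010-02-10 16:20:21 DISMOUNT[0]:"
ENTRY_HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.+)\[\d+\]:$")
LH_ERROR = re.compile(r"LH error type = (LH_ERR_\w+)")

def lh_errors(log_text):
    """Return (timestamp, component, LH error type) for each entry that has one."""
    results = []
    timestamp = component = None
    for line in log_text.splitlines():
        header = ENTRY_HEADER.match(line.strip())
        if header:
            # Remember which entry the following message lines belong to.
            timestamp, component = header.groups()
            continue
        err = LH_ERROR.search(line)
        if err and timestamp:
            results.append((timestamp, component, err.group(1)))
    return results

# Entries copied from the excerpt above.
sample = """\
2010-02-10 16:20:21 DISMOUNT[0]:
546 N cl_log_lh_er.c 1 99
dm_lh_lib_fail: LH error type = LH_ERR_TRANSPORT_FAILURE

2010-02-10 16:20:21 ACSSA[0]:
1431 N sa_demux.c 1 296
drive 2, 2, 1, 3: Library error, Transport failure

2010-02-10 16:20:37 DISMOUNT[0]:
971 N mt_action_dm.c 1 1272
dm_lh_drive_busy: LH error type = LH_ERR_TRANSPORT_BUSY 2, 1, 1,14
"""

for ts, comp, err in lh_errors(sample):
    print(ts, comp, err)
```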

David_McMullin · ‎02-10-2010

Had this problem when one of the LSMs was offline.

From cmd_proc:
q lsm all
Make sure the state is online.

I keep a window open to my ACSLS server, but I use cmd_proc -l (lowercase L), which allows you to scroll back up. Plain cmd_proc is a pain with the split screen.

Anth105 · ‎02-10-2010

The SLConsole is showing 2 drives whose Health and Device State are both in ERROR. I have lost count of the number of times the drives have been replaced when this is reported in SLConsole. This leads me to believe it cannot be a HW issue. I suspect a configuration-related issue, which could be either in NBU or at the OS level.

Since the start of the year,

cmd_proc

          Copyright 2008 Sun Microsystems, Inc. All rights reserved.
                      Use is subject to license terms.

----------------------------------ACSLS 7.3.0----------------------------------
Identifier   State            Free Cell Audit Mount Dismount Enter Eject
                               Count      C/P    C/P    C/P       C/P    C/P
   2, 0       online           0          0/0    0/0    0/0       0/0    0/0
   2, 1       online           1867       0/0    0/0    1/0       0/0    0/0
   2, 2       online           1871       0/0    0/0    0/0       0/0    0/0
   2, 3       online           0          0/0    0/0    0/0       0/0    0/0
   3, 0       online           0          0/0    0/0    0/0       0/0    0/0
   3, 1       online           1424       0/0    0/0    0/0       0/0    0/0
   3, 2       online           2318       0/0    0/0    0/0       0/0    0/0
   3, 3       online           0          0/0    0/0    0/0       0/0    0/0
ACSSA>

One more piece of information that may help: there are 18 media servers sharing 6 tape drives. I am currently implementing VTL to alleviate this issue.

Marianne · ‎02-10-2010

Did the 'mt' commands complete successfully?
If so, you need to log a call with your SUN/STK vendor.

**Edit**
Check ALL hardware components - not just tape drive - that includes fibre cable, switch port, etc...

From acsss interface - check mounts/dismounts without O/S interference.

Anth105 · ‎02-10-2010

The mt commands came back with no errors.

The next step is to check ALL HW components and then engage STK.

David_McMullin · ‎02-10-2010

I have seen some flaky performance from some drives - we have had drives replaced over and over - are you getting new or "refurbished" drives? We have had issues where newly replaced drives had problems right from the start.

Also - check the drive firmware - are similar models at the same rev? There are known issues at some firmware levels.

Why are all your tapes in LSMs 1 and 2, and none in 0 and 3? Are your drives spread among your LSMs?

Anth105 · ‎02-10-2010

David

It's nearly two weeks since I started working on this NBU system, and I have found no configuration documentation on how the whole environment is set up.

As I stated a bit earlier, I carried out all the usual troubleshooting steps (checked the firmware on the STK library and the drives). What I haven't done yet is update the firmware on the rest of the HW components.

Anth105 · ‎02-11-2010

An STK engineer was onsite for a health check and he advised me it could very well be a drive firmware related issue.

Further investigation of the robot dump file will confirm it.

I will keep you posted on progress.

Many thanks again for all your advice.

Anthony

Anth105 · ‎02-11-2010

Removed the stuck tapes
Power cycled the drives

All the drives are online and the backups are now running smoothly.

cheers
