ACSLS 7.3 - dismount failure down the drive

Anth105
Level 4
Certified

Hi all,

NBU 6.5.5
ACSLS  7.3
OS : Solaris

I have an intermittent issue where drives go DOWN when a dismount request is submitted. I have gone through all the usual troubleshooting steps with no joy:

Upgraded the firmware on both the SL8500 and the T10K drives
Updated the EMM mappings version
Even deleted and recreated the drives (rm -f /dev/rmt*, devfsadm, reconfigured the sg driver, sgscan ..)
Had the faulty drives replaced by the vendor
Set the Media unmount delay value to 60


Here is an excerpt of the messages: 

Feb 10 06:01:55 admatriubu01 acsd[5746]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE
Feb 10 06:02:30 admatriubu01 acsd[5878]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE

Feb 10 06:20:50 admatriubu01 acsd[9563]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE
Feb 10 06:50:09 admatriubu01 acsd[14669]: [ID 498531 daemon.error] user scsi ioctl() failed, may be timeout, errno = 5, I/O error
Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 756643 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 2) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:02:13 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:02:13 admatriubu01 acsd[14669]: [ID 338898 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 3) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:09:30 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:09:30 admatriubu01 acsd[14669]: [ID 821134 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 4) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:16:47 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE
Feb 10 07:16:47 admatriubu01 acsd[14669]: [ID 403389 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 5) for volume TA0255 on drive (2,2,1,15)
Feb 10 07:20:29 admatriubu01 acsd[11585]: [ID 446325 daemon.error] ACS(2) going to DOWN state, status: Timeout waiting for robotic command
Feb 10 07:22:31 admatriubu01 acsd[11585]: [ID 964522 daemon.notice] ACS(2) going to UP state
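
To see at a glance which drives and ACS statuses keep recurring, the dismount failure lines above can be tallied. This is only a rough sketch: it assumes the syslog lines keep exactly the format shown in the excerpt, and the function name `count_dismount_failures` is made up for illustration.

```python
import re
from collections import Counter

# Matches the acsd "dismount failure" lines in the format shown in the excerpt.
DISMOUNT_RE = re.compile(
    r"dismount failure for volume (?P<volume>\S+) on drive "
    r"\((?P<drive>[\d,]+)\), ACS status = (?P<code>\d+), (?P<status>\w+)"
)


def count_dismount_failures(lines):
    """Count dismount failures per (drive address, ACS status name)."""
    counts = Counter()
    for line in lines:
        match = DISMOUNT_RE.search(line)
        if match:
            counts[(match.group("drive"), match.group("status"))] += 1
    return counts


# Lines copied from the excerpt; the "waiting to resubmit" line is not a failure line.
sample_lines = [
    "Feb 10 06:01:55 admatriubu01 acsd[5746]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE",
    "Feb 10 06:02:30 admatriubu01 acsd[5878]: [ID 168411 daemon.error] ACS(2) dismount failure for volume TA0098 on drive (2,1,1,15), ACS status = 56, STATUS_LIBRARY_FAILURE",
    "Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 905004 daemon.error] ACS(2) dismount failure for volume TA0255 on drive (2,2,1,15), ACS status = 29, STATUS_DRIVE_IN_USE",
    "Feb 10 06:54:56 admatriubu01 acsd[14669]: [ID 756643 daemon.error] ACS(2) waiting to resubmit dismount request (attempt 2) for volume TA0255 on drive (2,2,1,15)",
]

failure_counts = count_dismount_failures(sample_lines)
for (drive, status), count in sorted(failure_counts.items()):
    print(f"drive ({drive}) {status}: {count}")
```

Feeding it the full messages file instead of `sample_lines` would show whether the failures cluster on one or two drive addresses.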

Any help will be much appreciated

Anthony


 

1 ACCEPTED SOLUTION

Nicolai
Moderator
Partner    VIP   

Looks like a HW failure or a stuck tape. Use SLConsole to verify that all drives are operational from the robot side.


UPDATE:

You can download it from StorageTek/SUN/Oracle if you have signed up for an account:

Sun StorageTek SLConsole FRS 4.10 Final Release

12 REPLIES

marekkedzierski
Level 6
Partner
Did you check the logs on the SL8500?

Marianne
Level 6
Partner    VIP    Accredited Certified
The first thing you need to verify is that your drive mapping is 100% correct: the ACSLS drive address has to correspond with the correct /dev/rmt device name.

When a dismount is required, NBU first issues an unload command to the O/S device name (mt -f /dev/rmt/... rewoffl). Once the tape is in the unload position, the command is sent to the ACSLS server to put the tape back in its slot. If the tape is not unloaded, ACSLS will display a 'dismount failure' message.

To test, do the following:
1. mount a tape via the ACSSS user interface.
2. verify that the O/S sees the correct tape status : mt -f /dev/rmt/.... stat
3. issue O/S unload command : mt -f /dev/rmt/.... rewoffl
4. dismount the tape via ACSSS interface
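
Steps 2 and 3 can be wrapped in a small helper so the same check is easy to repeat per drive. This is a minimal sketch, not a NetBackup or ACSLS tool: `unload_tape` is a hypothetical name, the device path is whatever /dev/rmt name maps to the drive under test, and the mount and dismount in steps 1 and 4 still have to be done from the ACSSS interface.

```python
import subprocess


def unload_tape(device, run=subprocess.run):
    """Run `mt -f <device> status` and then `mt -f <device> rewoffl`.

    Returns True only if both commands exit with status 0.
    `device` is the O/S tape device name for the drive under test.
    `run` is injectable so the helper can be exercised without a tape drive.
    """
    for action in ("status", "rewoffl"):
        result = run(["mt", "-f", device, action], capture_output=True, text=True)
        if result.returncode != 0:
            # Stop at the first failing step; the tape was not unloaded cleanly.
            return False
    return True
```

If `unload_tape` returns False for a drive that ACSLS reports as loaded, the drive mapping is the first thing to re-check.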

If you're not familiar with the ACSSS interface:
Log on to the ACSLS server as user acsss.
 
Home directory should be /export/home/acsss.
$ cd log     (to access log files)
 
Important log files
acsss_stats.log and acsss_event.log
 
To open the command proc window:
 
$ cmd_proc -l
acsss >q dr all                                   # query drive all
acsss >m A00000 0,0,10,0                          # mount tape A00000 on drive 0,0,10,0
acsss >dism A00000 0,0,10,0                       # dismount A00000 from drive 0,0,10,0
acsss >q vol A00000                               # query status of volume A00000
acsss >display drive *,*,*,* -f type serial_num   # get drive serial number
acsss >log                                        # log off
$

Anth105
Level 4
Certified

Marianne,


Thank you for your advice.

Here are the acsss_event.log entries from when I tried to dismount a volume:


2010-02-10 16:20:21 DISMOUNT[0]:
546 N cl_log_lh_er.c 1  99
dm_lh_lib_fail: LH error type = LH_ERR_TRANSPORT_FAILURE

2010-02-10 16:20:21 ACSSA[0]:
1431 N sa_demux.c 1  296
drive   2, 2, 1, 3: Library error, Transport failure
.

2010-02-10 16:20:37 DISMOUNT[0]:
971 N mt_action_dm.c 1  1272
dm_lh_drive_busy: LH error type = LH_ERR_TRANSPORT_BUSY   2, 1, 1,14

2010-02-10 16:23:02 command process[0]:
1283 N cp_sm_error.c 1  492
cp_sm_error, line: 491, Invalid state machine state: CPS_LOGOFF, status
STATUS_PROCESS_FAILURE

2010-02-10 16:23:02 command process[0]:
1283 N cp_sm_error.c 1  492
cp_sm_error, line: 491, Invalid state machine state: CPS_LOGOFF, status
STATUS_PROCESS_FAILURE





 

David_McMullin
Level 6
Had this problem when one of the LSMs was offline.

From cmd_proc:
q lsm all
Make sure the state is online.

I keep a window open to my ACSLS server, but I use cmd_proc -l (L), which allows you to scroll back up. Plain cmd_proc is a pain with the split screen.
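
If you capture the `q lsm all` output to a file, a few lines of Python can flag any LSM that is not online. This is just a sketch: it assumes the rows look like `2, 1   online   1867 ...` as ACSLS 7.3 prints them, the function name `lsms_not_online` is made up, and the `offline` row in the sample is invented purely to show a hit.

```python
import re

# A data row starts with "<acs>, <lsm>" followed by the state column.
LSM_ROW_RE = re.compile(r"^\s*(\d+),\s*(\d+)\s+(\S+)")


def lsms_not_online(output):
    """Return (acs, lsm, state) for every LSM row whose state is not 'online'."""
    problems = []
    for line in output.splitlines():
        match = LSM_ROW_RE.match(line)
        if match and match.group(3) != "online":
            problems.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return problems


# Sample in the ACSLS 7.3 layout; the offline row is made up for illustration.
sample_output = """\
 Identifier   State            Free Cell  Audit  Mount  Dismount  Enter  Eject
                               Count      C/P    C/P    C/P       C/P    C/P
   2, 0       online           0          0/0    0/0    0/0       0/0    0/0
   2, 1       offline          1867       0/0    0/0    1/0       0/0    0/0
   2, 2       online           1871       0/0    0/0    0/0       0/0    0/0
"""

for acs, lsm, state in lsms_not_online(sample_output):
    print(f"LSM {acs},{lsm} is {state}")
```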


Anth105
Level 4
Certified

SLConsole is showing 2 drives with both Health and Device State in ERROR. I have lost count of the number of times the drives have been replaced when this is reported in SLConsole. This leads me to believe it cannot be a HW issue. I suspect a configuration-related issue, which could be either in NBU or at the OS level.


Since the start of the year,

cmd_proc

          Copyright 2008 Sun Microsystems, Inc. All rights reserved.
                      Use is subject to license terms.

----------------------------------ACSLS 7.3.0----------------------------------
 Identifier   State            Free Cell  Audit  Mount  Dismount  Enter  Eject
                               Count      C/P    C/P    C/P       C/P    C/P
   2, 0       online           0          0/0    0/0    0/0       0/0    0/0
   2, 1       online           1867       0/0    0/0    1/0       0/0    0/0
   2, 2       online           1871       0/0    0/0    0/0       0/0    0/0
   2, 3       online           0          0/0    0/0    0/0       0/0    0/0
   3, 0       online           0          0/0    0/0    0/0       0/0    0/0
   3, 1       online           1424       0/0    0/0    0/0       0/0    0/0
   3, 2       online           2318       0/0    0/0    0/0       0/0    0/0
   3, 3       online           0          0/0    0/0    0/0       0/0    0/0
ACSSA>

One more piece of information that may help: there are 18 media servers sharing 6 tape drives. I am currently implementing VTL to alleviate this issue.
   
 

Marianne
Level 6
Partner    VIP    Accredited Certified
Did the 'mt' commands complete successfully?
If so, you need to log a call with your SUN/STK vendor.

**Edit**
Check ALL hardware components - not just tape drive - that includes fibre cable, switch port, etc...

From acsss interface - check mounts/dismounts without O/S interference.

Anth105
Level 4
Certified
The mt commands came back with no errors.

Next step is to check ALL HW components and then engage STK.


David_McMullin
Level 6
I have seen some flaky performance from some drives - we have had drives replaced over and over - are you getting new or "refurbished" drives? We have had issues where newly replaced drives had problems right from the start.

Also - check the drive firmware - are similar models at the same rev? There are known issues at some firmware levels.

Why are all your tapes in LSMs 1 and 2, with none in 0 and 3? Are your drives spread among your LSMs?

Anth105
Level 4
Certified
David

It's nearly two weeks since I started working on this NBU system, and I have found no documentation on how the whole environment is configured.

As I stated a bit earlier, I carried out all the usual troubleshooting steps (checked the firmware on the STK library and the drives). What I haven't done yet is update the firmware on the rest of the HW components.

Anth105
Level 4
Certified

An STK engineer was onsite for a health check, and he advised me it could very well be a drive firmware related issue.

Further investigation of the robot dump file will confirm it.

I will keep you posted on progress.

Many thanks again for all your advice.

Anthony
 

Anth105
Level 4
Certified

Removed the stuck tapes
Power cycled the drives

All the drives are online and the backups are now running smoothly.

cheers