Forum Discussion

Tulika_Shrivast
13 years ago
Solved

Scan -changer not showing the Robot on Netbackup Media Server

Hi All,

 

We have a Linux Master Server and a Linux Media Server 

 

Server Details (Media ):- 

uname -a

Linux <Server name> 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux

The following command does not show the robot.

However, all the tapes are visible.

Backups and restores are running fine.

scan -changer

************************************************************
*********************** SDT_CHANGER ************************
************************************************************
 
scsi_command shows the error below:
 
 scsi_command -d /dev/sg2
scsi inquiry command failed
status 2h, key bh, ASC 4eh, ASCQ 0h
sense 0x0b, asc 0x4e, ascq 0x00 occured
 
Can somebody please help me with this?
 
How can I make my robot visible in the scan command?
  • Hi Tulika,

    I think I answered all these questions when we took a closer look at the system.

    We found the issue was intermittent. Therefore, when the robot became available, NBU would be able to load/unload tapes, etc.

    You asked,

     

    "When the SCSI command is successfull the Server detects the Robot in the SCAN Command ,else it fails."

    This is simple. The commands:

    scan / tpautoconf -r / scsi_command etc. are all very similar. Sure, the output is different, but they all work by sending SCSI commands to the devices. So, when one breaks they all break; when one works, they all work. We found that scsi_command -d <device file to robot> failed intermittently. For a given time period, say an hour, we found it was working for 50+ minutes, but would work for just a few minutes.

    You keep looking at NetBackup. Forget it: until scsi_command -d (and the other commands) work 100% of the time, you cannot expect NetBackup to work.
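To put a number on "intermittent", one option is to repeat the inquiry on a schedule and record how often it succeeds. The sketch below is illustrative only: `probe_inquiry` and `success_rate` are hypothetical helper names, it assumes `scsi_command` is on the PATH, and it counts a run as successful only if the output contains the "Inquiry data" text seen in the working example in this thread.

```python
import subprocess
import time
from typing import Callable


def probe_inquiry(device: str = "/dev/sg2") -> bool:
    """Run `scsi_command -d <device>` once and report whether it returned inquiry data.

    Assumption: a successful run prints a line containing "Inquiry data",
    as in the successful output quoted in this thread.
    """
    result = subprocess.run(
        ["scsi_command", "-d", device],
        capture_output=True,
        text=True,
    )
    return "Inquiry data" in (result.stdout + result.stderr)


def success_rate(probe: Callable[[], bool], attempts: int, delay_s: float = 0.0) -> float:
    """Call `probe` `attempts` times, pausing `delay_s` seconds between calls.

    Returns the fraction of calls that succeeded (0.0 to 1.0).
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    ok = 0
    for i in range(attempts):
        if probe():
            ok += 1
        if i < attempts - 1:
            time.sleep(delay_s)
    return ok / attempts
```

For example, `success_rate(probe_inquiry, attempts=60, delay_s=60)` would sample the robot once a minute for an hour. Anything below 1.0 means the path to the robot still needs attention before looking at NetBackup.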

    We found that even when scsi_command was not working for the robot, it always worked for a tape drive ( scsi_command -d <tape drive device file> ). This shows that the issue is related to something between the OS and the robot.

     

    I found that the meaning of this:

     

    scsi inquiry command failed

    status 2h, key bh, ASC 4eh, ASCQ 0h
    sense 0x0b, asc 0x4e, ascq 0x00 occured
     
    is "OVERLAPPED COMMANDS ATTEMPTED"
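As a rough illustration of how that line breaks down, the sketch below parses the "status/key/ASC/ASCQ" line and maps the values to their standard SCSI names. The lookup tables are deliberately limited to the codes seen in this thread (status 02h, sense key 0Bh, ASC/ASCQ 4Eh/00h); `decode_sense_line` is a hypothetical helper, not part of NetBackup.

```python
import re

# Only the codes that appear in this thread; anything else is reported as unknown.
STATUS_NAMES = {0x02: "CHECK CONDITION"}
SENSE_KEY_NAMES = {0x0B: "ABORTED COMMAND"}
ASC_ASCQ_NAMES = {(0x4E, 0x00): "OVERLAPPED COMMANDS ATTEMPTED"}

# Matches e.g. "status 2h, key bh, ASC 4eh, ASCQ 0h" (hex values with an "h" suffix).
SENSE_LINE = re.compile(
    r"status\s+([0-9a-f]+)h,\s*key\s+([0-9a-f]+)h,\s*ASC\s+([0-9a-f]+)h,\s*ASCQ\s+([0-9a-f]+)h",
    re.IGNORECASE,
)


def decode_sense_line(line: str):
    """Return a dict of decoded names, or None if the line does not match."""
    match = SENSE_LINE.search(line)
    if match is None:
        return None
    status, key, asc, ascq = (int(group, 16) for group in match.groups())
    return {
        "status": STATUS_NAMES.get(status, f"unknown (0x{status:02x})"),
        "sense_key": SENSE_KEY_NAMES.get(key, f"unknown (0x{key:02x})"),
        "additional": ASC_ASCQ_NAMES.get(
            (asc, ascq), f"unknown (0x{asc:02x}/0x{ascq:02x})"
        ),
    }


print(decode_sense_line("status 2h, key bh, ASC 4eh, ASCQ 0h"))
```

The sense key and the ASC/ASCQ pair are reported by the target device, which is why the text points at the library or the path to it rather than at NetBackup.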
     
    Searching the NBU database I found multiple previous calls ...
     
    (1)
    SOLUTION: 
    Refer to hardware vendor to address SCSI errors. 
    TROUBLESHOOTING STEPS: 
    Sep 17 11:54:50  scsi: [ID 107833 kern.notice] ASC: 0x4e (overlapped commands attempted), ASCQ: 0x0, FRU: 0xf5 
    Sep 17 23:01:27  last message repeated 2 times 
    Sep 18 04:42:33  scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/fibre-channel@1/st@7,0 (st23): 
    Sep 18 04:42:33  SCSI transport failed: reason 'tran_err': giving up 
     
     
    (2)
    Sep  8 11:41:48  bptm[7620]: [ID 832037 daemon.error] scsi command failed, may be timeout, scsi_pkt.us_reason = 3 
    Sep  8 11:41:48  scsi: [ID 107833 kern.warning] WARNING: /pci@8,700000/fibre-channel@2/st@4,2 (st270): 
    Sep  8 11:41:48              Error for Command: rezero/rewind           Error Level: Fatal 
    Sep  8 11:41:48  scsi: [ID 107833 kern.notice]      Requested Block: 0                         Error Block: 0 
    Sep  8 11:41:48  scsi: [ID 107833 kern.notice]      Vendor: QUANTUM                            Serial Number:  W  $ i      
    Sep  8 11:41:48  scsi: [ID 107833 kern.notice]      Sense Key: Aborted Command 
    Sep  8 11:41:48  scsi: [ID 107833 kern.notice]      ASC: 0x4e (overlapped commands attempted), ASCQ: 0x0, FRU: 0x0 
     
    SOLUTION: Customer will have their SAN group take a look at the storage area hardware.
     
    (3)
    Command overlap: The library detected another command from an initiator while one was already in process. This is more often caused by third party monitoring software polling the peripheral devices.
    SOLUTION: We found that the customer had installed 'Sun Management Center Agent' on this server. The agent was disabled and no errors occurred during the monitoring period.
     
    (4)
    "Key = 0xb, asc = 0x4e, ascq = 0x0, OVERLAPPED COMMANDS ATTEMPTED"
    inquiry error: cannot determine robot type
    slot read_element_status error
    followed by move medium failed errors.
     
    Advised the customer that, since this was working before and is not a new library, there is nothing in NBU that would cause this issue. The issue lies either with the library or with the communication path between the library and the server.
     
    (5)
    From Technote  S:TECH183410
    Error
    /var/log/syslog
    OVERLAPPED COMMANDS ATTEMPTED. Disconnect during command processing. SCSI Error - ASC[0x4E], ASCQ[0x00]
    Environment
    Solaris Master
    Solaris Media server
    StorageTek SL Library
    Cause
    Vendor hardware related SCSI error with cartridge access port (CAP).
    Solution
    Library Hardware vendor required fix
     
     
    None of the previous times we have seen this issue has NBU been the cause.
     
    We know the NBU config is correct, because it does work sometimes. If the config were wrong, it would never work.
     
    The issue is outside of NetBackup.
     
    Martin
     
     

     

     

     

     

     

     

  • Anonymous

    Can you run a SCSI query against that device?

    # sginfo -a /dev/sg2

    And have you lost your robot from the NetBackup configuration, or is this a first-time installation issue?

    Other useful information...

    What connection type is this robotic storage? Fibre Channel? Local SCSI attached? VTL?

    NBU Version and your Linux flavour/version.

  • scsi_command sends an industry-standard SCSI command to the robot. The robot should reply with details showing what type of library it is.

    However, for some reason the library is having a problem and is sending back a kind of error message:

     

     

     scsi_command -d /dev/sg2
    scsi inquiry command failed
    status 2h, key bh, ASC 4eh, ASCQ 0h
    sense 0x0b, asc 0x4e, ascq 0x00 occured
     
    This error is actually sent from the library/robot, so although there is communication, it is not working properly.
    To understand this message, you should speak with the library vendor, as it is their error message.
     
    It could be worth rebooting the library. This may resolve the issue, but the library vendor should be able to explain the meaning.
     
     
    Regards,
     
    Martin
     
  • @Stuart:

     

    1) Even that command is failing:

     

      sginfo -a /dev/sg2
    Error doing INQUIRY (1)
    No serial number (error doing INQUIRY, supported VPDs)
     
    >>> Unable to read List supported pages mode page (0x3f) [mode_sense_10]
     
    2) This is not a first-time installation issue.
     
    3) The changer was visible before; we noticed this problem while working on a cleaning media issue.
     
    4) What connection type is this robotic storage?
    Fibre Channel
     
    5) NetBackup 7.1
     
     
    6) Linux:
     
    Media Server :- Linux  2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
     
    Master Server :- 
     
    Linux  2.6.32.27-0.2-default #1 SMP 2010-12-29 15:03:02 +0100 x86_64 x86_64 x86_64 GNU/Linux
     
     
     
     

     

  • @Mark 

    This error is actually sent from the library /robot - so although there is communication, it is not working properly.

     

    Ans: Has there been a communication issue, or is the communication intermittent?

     

    Why are the backups and restores all working fine? Also, why should it affect only cleaning?

     

    2) We have already logged a case with the vendor regarding this.

  • 1/

    If there are already tapes in the drive, NBU will still be able to run jobs, it just won't be able to unload/ change tapes.

    2/

    Whatever the issue is with the library, it could just be affecting certain slots - for example the ones that contain the cleaning tapes 

    3/ 

    Are there any 'dedicated' cleaning slots defined in the library config? For NBU to do the cleaning, these should be removed.

    It is fairly irrelevant why certain jobs may or may not be running.

    These are the rules:

    For reliable operation of NBU ...

    scan should show the changer

    robtest should work

    tpautoconf -r should work

    scsi_command -d /dev/sg2 should work

    In troubleshooting, these come first - any issues with these need to be resolved before you move on and look in other areas.
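The "these come first" rule can be sketched as an ordered gate: run the basic device checks in sequence and stop at the first one that fails. This is only an illustration. `first_failing_check` is a hypothetical helper, and each check is supplied by the caller as a function returning True or False, since how success is judged for each command is left as an assumption here.

```python
from typing import Callable, Optional, Sequence, Tuple

# A check is a (label, function) pair; the function returns True on success.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first failure.

    Later checks are not run once one has failed. Returns None if all pass.
    """
    for label, run in checks:
        if not run():
            return label
    return None
```

A caller might pass labels such as "scan -changer", "tpautoconf -r" and "scsi_command -d /dev/sg2", and only move on to NetBackup-level troubleshooting when the function returns None.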

     

    Martin

     

     

     

  •  

    1/

    If there are already tapes in the drive, NBU will still be able to run jobs, it just won't be able to unload/ change tapes.

    --> No, there are no tapes in the drives.

    2/

    Whatever the issue is with the library, it could just be affecting certain slots - for example the ones that contain the cleaning tapes 

    --> We have recently found that it is an intermittent issue.

    3/ 

    Are there any 'dedicated' cleaning slots defined in the library config? For NBU to do the cleaning, these should be removed.

    --> No dedicated slots. Again, the cleaning is also working intermittently.

  • Here I noticed that this is an intermittent issue:

     

    Can somebody suggest why this is so?

    What do I need to check?

    How can I isolate the fault?

    *****SUCCESSFUL********

     

     
    media Server:~ # scsi_command -d /dev/sg2
    Inquiry data: removable dev type 8h STK     SL3000          3.00
    ****FAILED *********************
     
    media Server:~ # scsi_command -d /dev/sg2
    scsi inquiry command failed
    status 2h, key bh, ASC 4eh, ASCQ 0h
    sense 0x0b, asc 0x4e, ascq 0x00 occured
     
    When the SCSI command is successful, the server detects the robot in the scan command; otherwise it fails.
     
    *********SUCCESSFUL  SCAN**********
     
    media Server:~ # scan -changer
    ************************************************************
    *********************** SDT_CHANGER ************************
    ************************************************************
    ------------------------------------------------------------
    Device Name  : "/dev/sg2"
    Passthru Name: "/dev/sg2"
    Volume Header: ""
    Port: -1; Bus: -1; Target: -1; LUN: -1
    Inquiry    : "STK     SL3000          3.00"
    Vendor ID  : "STK     "
    Product ID : "SL3000          "
    Product Rev: "3.00"
    Serial Number: "571000200821"
    WWN          : ""
    WWN Id Type  : 0
    Device Identifier: ""
    Device Type    : SDT_CHANGER
    NetBackup Robot Type: 8
    Removable      : Yes
    Device Supports: SCSI-5
    Number of Drives : 15
    Number of Slots  : 700
    Number of Media Access Ports: 52
    Drive 1 Serial Number      : "HU1037CBC3"
    Drive 2 Serial Number      : "HU1037CBD8"
    Drive 3 Serial Number      : "HUE1042L5T"
    Drive 4 Serial Number      : "HU1151L66R"
    Drive 5 Serial Number      : "HU1113FRKJ"
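When comparing good and bad runs, it can help to turn the `scan -changer` text into key/value pairs so the two cases are easy to tell apart. The sketch below is a minimal parser written against the output format shown in this thread only; `parse_scan_changer` is a hypothetical helper, and other NetBackup versions may format the output differently.

```python
def parse_scan_changer(text: str) -> dict:
    """Parse `scan -changer` style output into a dict of field name to value.

    Banner and separator lines (no colon) are skipped. A line such as
    "Port: -1; Bus: -1; Target: -1; LUN: -1" is split into separate fields.
    Surrounding quotes and padding are stripped from values.
    """
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        pieces = line.split(";")
        # Treat ";" as a field separator only if every piece looks like "key: value".
        fields = pieces if all(":" in piece for piece in pieces) else [line]
        for field in fields:
            key, _, value = field.partition(":")
            info[key.strip()] = value.strip().strip('"').strip()
    return info
```

With this, an empty result (only the SDT_CHANGER banner, as in the original post) means no changer was reported, while a populated result exposes fields such as "Device Name" and "Number of Drives" for comparison between runs.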

     

  • You say that you have logged a call with the hardware vendor. What was their response?

    As per Martin's post above, those 'sense key' error messages are coming from the robot, not NBU.

     
