Forum Discussion

liuyl's avatar
liuyl
Level 6
7 years ago

Some fixed TD unloaded media failed after its backup jobs ended successfully!

Just for sometimes, the same TD would unload media failed after its backup jobs ended successfully!
In fact, the tape media had been unloaded because I found the corresponding drive was empty via robtest!

Job detail:

10/18/2018 10:27:24 - Info nbjm(pid=16841) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=162459\
request id:{5923C0F6-D27D-11E8-8134-20900A448F0B})
10/18/2018 10:27:24 - requesting resource rzcmsdb-hcart3-robot-tld-2
10/18/2018 10:27:24 - requesting resource jchxbak.NBU_CLIENT.MAXJOBS.rzcmsdb
10/18/2018 10:27:24 - requesting resource jchxbak.NBU_POLICY.MAXJOBS.cmsdbrst
10/18/2018 10:27:24 RB - waiting for Stop Drive Scan
10/18/2018 10:27:25 - granted resource jchxbak.NBU_CLIENT.MAXJOBS.rzcmsdb
10/18/2018 10:27:25 - granted resource jchxbak.NBU_POLICY.MAXJOBS.cmsdbrst
10/18/2018 10:27:25 - granted resource 700076
10/18/2018 10:27:25 - granted resource HP.ULTRIUM6-SCSI.004
10/18/2018 10:27:25 - granted resource rzcmsdb-hcart3-robot-tld-2
10/18/2018 10:27:25 - estimated 0 kbytes needed
10/18/2018 10:27:25 - Info nbjm(pid=16841) started backup (backupid=rzcmsdb_1539829645) job for client rzcmsdb\
policy cmsdbrst\
schedule full on storage unit rzcmsdb-hcart3-robot-tld-2
10/18/2018 10:27:26 - started process bpbrm (23763)
10/18/2018 10:27:26 - connecting
10/18/2018 10:27:27 - connected; connect time: 000:00:01
10/18/2018 10:27:29 - mounting 700076
10/18/2018 10:28:25 - mounted; mount time: 000:00:56
10/18/2018 10:28:25 - positioning 700076 to file 1
10/18/2018 10:28:27 - positioned; position time: 000:00:02
10/18/2018 10:28:27 - begin writing
10/18/2018 10:28:34 - end writing; write time: 000:00:07


robot.log:

10:27:30.147 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_MOUNT request...
10:27:30.147 [14863] <2> AddTldLtiReqEntry: mount request specified Drive Name HP.ULTRIUM6-SCSI.004
10:27:30.147 [14863] <4> TldGetDriveForMountReq: Got drive id 9 for drive HP.ULTRIUM6-SCSI.004
10:27:30.147 [14863] <2> ProcessTable: found active request id 0. Child pid = 0 flags = 0 request type 1
10:27:30.147 [14863] <2> ProcessTable: ROBOT_MOUNT requested ... set this drive IN USE
10:27:30.151 [14863] <5> MountTape: TLD(2) MountTape 700076 on drive 6, from slot 216
10:27:30.151 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:27:30.189 [23775] <2> parse_tape_path: ltid request: /dev/rtape/tape122_BESTnb . Using shared memory: /dev/rtape/tape122_BESTnb
10:27:30.189 [23775] <2> parse_tape_path: Using default path
10:27:30.189 [23775] <4> io_open: Drive Path = /dev/rtape/tape122_BESTnb
10:27:30.199 [23775] <5> io_open: TLD(2) open failed in io_open, No such device or address
10:27:30.199 [23775] <5> io_open: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
10:27:30.201 [23775] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.5661: 0: fopen() failed: 2 0x00000002
10:27:30.201 [23775] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.5662: 0: fopen() failed: /usr/openv/var/host_cache/1a7/1ae1d1a7+tldcd,1,40,2,1,0+jchxbak.txt
10:27:30.231 [23775] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:27:30.231 [23775] <2> robotd_connect: Not using VxSS authentication.
10:28:24.321 [23775] <5> atl_main: TLD(2) [23775] waited 41 seconds for ready, drive 6
10:28:24.321 [23775] <4> TldChildExit: Child process terminated normally
10:28:24.343 [14863] <5> GetResponseStatus: DecodeMount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
10:28:45.910 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_DISMOUNT request...
10:28:45.910 [14863] <2> ProcessTable: found active request id -1. Child pid = 0 flags = 0 request type 3
10:28:45.910 [14863] <5> DismountTape: TLD(2) DismountTape 700076 from drive 6
10:28:45.910 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:28:45.914 [23801] <2> parse_tape_path: ltid request: /dev/rtape/tape122_BESTnb . Using shared memory: /dev/rtape/tape122_BESTnb
10:28:45.914 [23801] <2> parse_tape_path: Using default path
10:28:45.914 [23801] <4> io_open: Drive Path = /dev/rtape/tape122_BESTnb
10:29:08.062 [23801] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:29:08.063 [23801] <2> robotd_connect: Not using VxSS authentication.
10:29:18.446 [23801] <16> TldChildExit: Child process terminated abnormally with error 10
10:29:18.459 [14863] <5> GetResponseStatus: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive

 

bptm.log:

10:28:44.909 [23800] <2> nbjm_media_request: Job control returned to BPTM
10:28:44.910 [23800] <2> drivename_open: Called with Create 1, file HP.ULTRIUM6-SCSI.004
10:28:44.910 [23800] <2> drivename_lock: lock established
10:28:44.910 [23800] <4> create_tpreq_file: symlink to path /dev/rtape/tape122_BESTnb
10:28:44.910 [23800] <2> drivename_write: Called with mode 2
10:28:44.916 [23800] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
10:28:44.916 [23800] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.004
10:28:45.910 [23800] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
10:29:19.570 [23800] <2> send_MDS_msg: OP_STATUS 0 394188 rzcmsdb 10 2 0 0 0 0 0 0 *NULL* 0
10:29:19.581 [23800] <2> send_operation_error: Decoded status = 10 from 2
10:29:19.582 [23800] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jchxbak, query type 1
10:29:19.584 [23800] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:29:19.584 [23800] <2> logconnections: BPDBM CONNECT FROM 10.131.22.4.50705 TO 10.131.33.62.1556 fd = 17
10:29:19.590 [23800] <2> db_end: Need to collect reply
10:29:19.591 [23800] <16> really_tpunmount: error unloading media, TpErrno = Robot operation failed
10:29:19.591 [23800] <4> create_tpreq_file: symlink to path /dev/rtape/tape122_BESTnb
10:29:19.596 [23800] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
10:29:19.602 [23800] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
10:29:19.607 [23800] <2> tapealert_and_release: SCSI RELEASE
10:29:19.607 [23800] <2> drivename_unlock: unlocked
10:29:19.607 [23800] <2> drivename_close: Called for file HP.ULTRIUM6-SCSI.004
10:29:19.608 [23800] <2> main: Sending [EXIT STATUS 174] to NBJM
10:29:19.608 [23800] <2> bptm: EXITING with status 174 <----------

  • It seems the connection to the robot isn't stable:

    robot log:TLD(2) open failed in io_open, No such device or address
    bptm log:  really_tpunmount: error unloading media, TpErrno = Robot operation failed

    Depending on the operating system used - I would check messages/syslog for clues as to why the OS fails to open the device driver. Check for SCSI or FC errors.

    • liuyl's avatar
      liuyl
      Level 6

      The first error with "open failed in io_open ....",  I think,  came up when the tldd process tested with unload action to determine whether the requested drive was empty before it could mount a tape media!

      The second error with "error unloading media ...." is just what I concern about, and also what means about the extra message with "tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000"!

      • mph999's avatar
        mph999
        Level 6

        What does robtest do ....

        Go into robtest, use s d to see which tapes are n the drives, and which slot they have come from.

         

        Try and unload the drive:

        m d2 s4 (unload drive 2 to slot 4)

        If that fails, try

        unload d2

        m d2 s4

         

  • liuyl

    There are basic requirements in NetBackup Device Configuration Guide  for HP-UX, such as : 
    Disabling SPC-2 SCSI reserve in HP-UX
    and 
    disabling the HP-UX EMS Tape Device Monitor for a SAN

    Have you checked and confirmed that this was done on all HP-UX media servers?

    If not, this could explain the issues that you are experiencing.

    • Nicolai's avatar
      Nicolai
      Moderator

      Hi Marianne

      It not needed to disable HP-UX EMS Tape Device Monitor anymore. HP fixed that issue long time ago. Further more, disabling the EMS system and you looses all SCSI sense messages, including write and read error messages.

      • liuyl's avatar
        liuyl
        Level 6

        Nicolai
        Do you think that the NBU might not report any more warnings rather than the media read/write/position errors while its TOC is defective?