Re: Some fixed TD unloaded media failed after its ...

liuyl · ‎01-05-2019

Just for sometimes, the same TD would unload media failed after its backup jobs ended successfully！
In fact, the tape media had been unloaded because I found the corresponding drive was empty via robtest！

Job detail:

10/18/2018 10:27:24 - Info nbjm(pid=16841) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=162459\
request id:{5923C0F6-D27D-11E8-8134-20900A448F0B})
10/18/2018 10:27:24 - requesting resource rzcmsdb-hcart3-robot-tld-2
10/18/2018 10:27:24 - requesting resource jchxbak.NBU_CLIENT.MAXJOBS.rzcmsdb
10/18/2018 10:27:24 - requesting resource jchxbak.NBU_POLICY.MAXJOBS.cmsdbrst
10/18/2018 10:27:24 RB - waiting for Stop Drive Scan
10/18/2018 10:27:25 - granted resource jchxbak.NBU_CLIENT.MAXJOBS.rzcmsdb
10/18/2018 10:27:25 - granted resource jchxbak.NBU_POLICY.MAXJOBS.cmsdbrst
10/18/2018 10:27:25 - granted resource 700076
10/18/2018 10:27:25 - granted resource HP.ULTRIUM6-SCSI.004
10/18/2018 10:27:25 - granted resource rzcmsdb-hcart3-robot-tld-2
10/18/2018 10:27:25 - estimated 0 kbytes needed
10/18/2018 10:27:25 - Info nbjm(pid=16841) started backup (backupid=rzcmsdb_1539829645) job for client rzcmsdb\
policy cmsdbrst\
schedule full on storage unit rzcmsdb-hcart3-robot-tld-2
10/18/2018 10:27:26 - started process bpbrm (23763)
10/18/2018 10:27:26 - connecting
10/18/2018 10:27:27 - connected; connect time: 000:00:01
10/18/2018 10:27:29 - mounting 700076
10/18/2018 10:28:25 - mounted; mount time: 000:00:56
10/18/2018 10:28:25 - positioning 700076 to file 1
10/18/2018 10:28:27 - positioned; position time: 000:00:02
10/18/2018 10:28:27 - begin writing
10/18/2018 10:28:34 - end writing; write time: 000:00:07

robot.log:

10:27:30.147 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_MOUNT request...
10:27:30.147 [14863] <2> AddTldLtiReqEntry: mount request specified Drive Name HP.ULTRIUM6-SCSI.004
10:27:30.147 [14863] <4> TldGetDriveForMountReq: Got drive id 9 for drive HP.ULTRIUM6-SCSI.004
10:27:30.147 [14863] <2> ProcessTable: found active request id 0. Child pid = 0 flags = 0 request type 1
10:27:30.147 [14863] <2> ProcessTable: ROBOT_MOUNT requested ... set this drive IN USE
10:27:30.151 [14863] <5> MountTape: TLD(2) MountTape 700076 on drive 6, from slot 216
10:27:30.151 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:27:30.189 [23775] <2> parse_tape_path: ltid request: /dev/rtape/tape122_BESTnb . Using shared memory: /dev/rtape/tape122_BESTnb
10:27:30.189 [23775] <2> parse_tape_path: Using default path
10:27:30.189 [23775] <4> io_open: Drive Path = /dev/rtape/tape122_BESTnb
10:27:30.199 [23775] <5> io_open: TLD(2) open failed in io_open, No such device or address
10:27:30.199 [23775] <5> io_open: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
10:27:30.201 [23775] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.5661: 0: fopen() failed: 2 0x00000002
10:27:30.201 [23775] <2> file_to_addrinfo: ../../libvlibs/vnet_addrinfo.c.5662: 0: fopen() failed: /usr/openv/var/host_cache/1a7/1ae1d1a7+tldcd,1,40,2,1,0+jchxbak.txt
10:27:30.231 [23775] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:27:30.231 [23775] <2> robotd_connect: Not using VxSS authentication.
10:28:24.321 [23775] <5> atl_main: TLD(2) [23775] waited 41 seconds for ready, drive 6
10:28:24.321 [23775] <4> TldChildExit: Child process terminated normally
10:28:24.343 [14863] <5> GetResponseStatus: DecodeMount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
10:28:45.910 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_DISMOUNT request...
10:28:45.910 [14863] <2> ProcessTable: found active request id -1. Child pid = 0 flags = 0 request type 3
10:28:45.910 [14863] <5> DismountTape: TLD(2) DismountTape 700076 from drive 6
10:28:45.910 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:28:45.914 [23801] <2> parse_tape_path: ltid request: /dev/rtape/tape122_BESTnb . Using shared memory: /dev/rtape/tape122_BESTnb
10:28:45.914 [23801] <2> parse_tape_path: Using default path
10:28:45.914 [23801] <4> io_open: Drive Path = /dev/rtape/tape122_BESTnb
10:29:08.062 [23801] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:29:08.063 [23801] <2> robotd_connect: Not using VxSS authentication.
10:29:18.446 [23801] <16> TldChildExit: Child process terminated abnormally with error 10
10:29:18.459 [14863] <5> GetResponseStatus: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive

bptm.log:

10:28:44.909 [23800] <2> nbjm_media_request: Job control returned to BPTM
10:28:44.910 [23800] <2> drivename_open: Called with Create 1, file HP.ULTRIUM6-SCSI.004
10:28:44.910 [23800] <2> drivename_lock: lock established
10:28:44.910 [23800] <4> create_tpreq_file: symlink to path /dev/rtape/tape122_BESTnb
10:28:44.910 [23800] <2> drivename_write: Called with mode 2
10:28:44.916 [23800] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
10:28:44.916 [23800] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.004
10:28:45.910 [23800] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
10:29:19.570 [23800] <2> send_MDS_msg: OP_STATUS 0 394188 rzcmsdb 10 2 0 0 0 0 0 0 *NULL* 0
10:29:19.581 [23800] <2> send_operation_error: Decoded status = 10 from 2
10:29:19.582 [23800] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jchxbak, query type 1
10:29:19.584 [23800] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:29:19.584 [23800] <2> logconnections: BPDBM CONNECT FROM 10.131.22.4.50705 TO 10.131.33.62.1556 fd = 17
10:29:19.590 [23800] <2> db_end: Need to collect reply
10:29:19.591 [23800] <16> really_tpunmount: error unloading media, TpErrno = Robot operation failed
10:29:19.591 [23800] <4> create_tpreq_file: symlink to path /dev/rtape/tape122_BESTnb
10:29:19.596 [23800] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
10:29:19.602 [23800] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
10:29:19.607 [23800] <2> tapealert_and_release: SCSI RELEASE
10:29:19.607 [23800] <2> drivename_unlock: unlocked
10:29:19.607 [23800] <2> drivename_close: Called for file HP.ULTRIUM6-SCSI.004
10:29:19.608 [23800] <2> main: Sending [EXIT STATUS 174] to NBJM
10:29:19.608 [23800] <2> bptm: EXITING with status 174 <----------

Nicolai · ‎01-07-2019

It seems the connection to the robot isn't stable:

robot log:TLD(2) open failed in io_open, No such device or address
bptm log: really_tpunmount: error unloading media, TpErrno = Robot operation failed

Depending on the operating system used - I would check messages/syslog for clues as to why the OS fails to open the device driver. Check for SCSI or FC errors.

liuyl · ‎01-07-2019

The first error with "open failed in io_open ....", I think, came up when the tldd process tested with unload action to determine whether the requested drive was empty before it could mount a tape media！

The second error with "error unloading media ...." is just what I concern about, and also what means about the extra message with "tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000"！

mph999 · ‎01-07-2019

What does robtest do ....

Go into robtest, use s d to see which tapes are n the drives, and which slot they have come from.

Try and unload the drive:

m d2 s4 (unload drive 2 to slot 4)

If that fails, try

unload d2

m d2 s4

liuyl · ‎01-08-2019

I have reviewed the corresponding logs, and find that the tape media should be unloaded succesfully with the second bptm action from 10:42:49.112 to 10:43:10.906 ！

1) syslog.log
Oct 18 10:27:30 rzcmsdb ltid[14857]: LTID - Sent ROBOTIC request, Type=1, Param2=0
Oct 18 10:27:30 rzcmsdb tldd[14863]: TLD(2) MountTape 700076 on drive 6, from slot 216
Oct 18 10:27:30 rzcmsdb tldd[23775]: TLD(2) open failed in io_open, No such device or address
Oct 18 10:27:30 rzcmsdb tldd[23775]: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
Oct 18 10:28:13 rzcmsdb tldd[23775]: Reading label on HP.ULTRIUM6-SCSI.004 (device 9, /dev/rtape/tape122_BESTnb)
Oct 18 10:28:24 rzcmsdb tldd[23775]: TLD(2) [23775] waited 41 seconds for ready, drive 6
Oct 18 10:28:24 rzcmsdb tldd[14863]: DecodeMount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
Oct 18 10:28:24 rzcmsdb ltid[14857]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=9, Param2=0
Oct 18 10:28:45 rzcmsdb ltid[14857]: LTID - Sent ROBOTIC request, Type=3, Param2=9
Oct 18 10:28:45 rzcmsdb tldd[14863]: TLD(2) DismountTape 700076 from drive 6
Oct 18 10:29:18 rzcmsdb tldd[14863]: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive
Oct 18 10:29:23 rzcmsdb ltid[14857]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.004 (device 9)
Oct 18 10:36:20 rzcmsdb ltid[14857]: Dumping tables on request to /usr/openv/volmgr/misc/ltitables
Oct 18 10:36:20 rzcmsdb ltid[14857]: LTID - Sent ROBOTIC request, Type=27, Param2=8
Oct 18 10:42:49 rzcmsdb ltid[14857]: LTID - Sent ROBOTIC request, Type=3, Param2=9
Oct 18 10:42:49 rzcmsdb tldd[14863]: TLD(2) DismountTape ****** from drive 6
Oct 18 10:42:49 rzcmsdb tldd[24198]: TLD(2) open failed in io_open, No such device or address
Oct 18 10:43:04 rzcmsdb tldd[24198]: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
Oct 18 10:43:10 rzcmsdb tldd[14863]: DecodeDismount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
Oct 18 10:43:04 rzcmsdb tldd[24198]: TLD(2) open failed in io_open, No such device or address
Oct 19 10:22:50 rzcmsdb ltid[14857]: Operator/EMM server has UP'ed drive HP.ULTRIUM6-SCSI.004 (device 9)

2) bptm.log:
10:42:48.332 [24194] <2> report_drives: FILE = /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.007
10:42:48.332 [24194] <2> main: Sending [EXIT STATUS 0] to NBJM
10:42:48.332 [24194] <2> bptm: EXITING with status 0 <----------
10:42:49.112 [24197] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn HP.ULTRIUM6-SCSI.004 -dp /dev/rtape/tape122_BESTnb -dk 2000708 -mds 1032 -alocid 394191 -jobid -1538020897 -jm
10:42:49.117 [24197] <4> bptm: emmserver_name = jchxbak
10:42:49.117 [24197] <4> bptm: emmserver_port = 1556
10:42:49.145 [24197] <2> Orb::init: initializing ORB EMMlib_Orb with: dbstunitq -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb EMMlib_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(Orb.cpp:758)
10:42:49.167 [24197] <2> send_brm_msg: PID of bpxm = 24197
10:42:49.174 [24197] <2> nbjm_media_request: Passing job control to NBJM, type UNLOAD/6
10:42:49.174 [24197] <2> nbjm_media_request: old_media_id = NULL, media_id =
10:42:49.199 [24197] <2> Orb::init: Created anon service name: NB_24197_-1956322928(Orb.cpp:630)
10:42:49.199 [24197] <2> Orb::init: endpointvalue is : pbxiop://1556:NB_24197_-1956322928(Orb.cpp:648)
10:42:49.199 [24197] <2> Orb::init: initializing ORB Default_DAEMON_Orb with: Unknown -ORBSvcConfDirective "-ORBDottedDecimalAddresses 0" -ORBSvcConfDirective "static PBXIOP_Factory '-enable_keepalive'" -ORBSvcConfDirective "static EndpointSelectorFactory ''" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory PBXIOP_Factory'" -ORBSvcConfDirective "static Resource_Factory '-ORBProtocolFactory IIOP_Factory'" -ORBSvcConfDirective "static PBXIOP_Evaluator_Factory '-orb Default_DAEMON_Orb'" -ORBSvcConfDirective "static Resource_Factory '-ORBConnectionCacheMax 1024 '" -ORBEndpoint pbxiop://1556:NB_24197_-1956322928 -ORBSvcConf /dev/null -ORBSvcConfDirective "static Server_Strategy_Factory '-ORBMaxRecvGIOPPayloadSize 268435456'"(Orb.cpp:758)
10:42:49.342 [24197] <2> RequestInitialResources: returning
10:42:49.353 [24197] <2> parse_resource_strings: MEDIADB 1 394191 *NULL* 0 *NULL* 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 15, version 1, 21 parameters
10:42:49.353 [24197] <2> parse_resource_strings: VOLUME 1 *NULL* 0 *NULL* *NULL* *NULL* *NULL* 0 0 0 0 0 {00000000-0000-0000-0000-000000000000} 0
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 16, version 1, 14 parameters
10:42:49.353 [24197] <2> parse_resource_strings: DRIVE 3 HP.ULTRIUM6-SCSI.004 2000708 F00256301F /dev/rtape/tape122_BESTnb -1 -1 -1 -1 -1 -1 -1 -1 *NULL* *NULL* *NULL* *NULL* 1 1024 0 1 0 0
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 17, version 3, 23 parameters
10:42:49.353 [24197] <2> parse_resource_strings: STORAGE 1 *NULL* 0 0 0 0 0 0 rzcmsdb rzcmsdb *NULL*
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 18, version 1, 11 parameters
10:42:49.353 [24197] <2> parse_resource_strings: DISKGROUP 0 6 *NULL* 6 *NULL* 6 *NULL*
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 21, version 0, 7 parameters
10:42:49.353 [24197] <2> parse_resource_strings: DISKVOLUME 0 6 *NULL* 6 *NULL* 0
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 22, version 0, 6 parameters
10:42:49.353 [24197] <2> parse_resource_strings: DISKMOUNTPOINT 0 6 *NULL*
10:42:49.353 [24197] <2> parse_resource_strings: Parsed message type 23, version 0, 3 parameters
10:42:49.353 [24197] <2> nbjm_media_request: Job control returned to BPTM
10:42:49.354 [24197] <2> drivename_open: Called with Create 1, file HP.ULTRIUM6-SCSI.004
10:42:49.354 [24197] <2> drivename_lock: lock established
10:42:49.354 [24197] <4> create_tpreq_file: symlink to path /dev/rtape/tape122_BESTnb
10:42:49.355 [24197] <2> drivename_write: Called with mode 2
10:42:49.363 [24197] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
10:42:49.363 [24197] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.004
10:42:49.442 [24197] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
10:43:10.870 [24197] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jchxbak, query type 1
10:43:10.882 [24197] <2> vnet_pbxConnect: pbxConnectEx Succeeded
10:43:10.882 [24197] <2> logconnections: BPDBM CONNECT FROM 10.131.22.4.51192 TO 10.131.33.62.1556 fd = 17
10:43:10.888 [24197] <2> db_end: Need to collect reply
10:43:10.889 [24197] <4> report_resource_done: VBRD 1 24197 0 HP.ULTRIUM6-SCSI.004 *NULL*
10:43:10.889 [24197] <4> create_tpreq_file: symlink to path /dev/rtape/tape122_BESTnb
10:43:10.894 [24197] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
10:43:10.900 [24197] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
10:43:10.905 [24197] <2> tapealert_and_release: SCSI RELEASE
10:43:10.905 [24197] <2> drivename_unlock: unlocked
10:43:10.905 [24197] <2> drivename_close: Called for file HP.ULTRIUM6-SCSI.004
10:43:10.905 [24197] <2> drivename_remove: Called
10:43:10.906 [24197] <2> main: Sending [EXIT STATUS 0] to NBJM
10:43:10.906 [24197] <2> bptm: EXITING with status 0 <----------

So now I have the two wondering questions:
1) Why the first bptm unload action would failed after the previous backup writing ended successfully？
2) How the second bptm unload action initiated "automatically", and it unloaded the media successfully？

Nicolai · ‎01-09-2019

Oct 18 10:28:45 rzcmsdb tldd[14863]: TLD(2) DismountTape 700076 from drive 6
Oct 18 10:29:18 rzcmsdb tldd[14863]: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive
Oct 18 10:29:23 rzcmsdb ltid[14857]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.004

The sequence works as follow:

TLD mount a tape.

BPTM waits for the tape to go online. When a tape is mounted, the tape drive need to read the table of content (TOC), this will take some time. Once drive is ready, positioning and writing can start.

Upon dismount:

BPTM ask tape drive to unload tape

TLD dismout tape when drive is "offline".

However - if the TOC is damages a rewind/unload may take very long time, so EMM down drives. I am not sure if Netbackup will re-try operation where EMM has downed drive. @mph999 may be able to answer.

Try the following - down all drives in Netbackup. Load tape 700076 manual in the tape drives that matches tape122. Using the magnetic tool package run :

Run command until read BOT (Beginning of tape)

mt -f /dev/rtape/tape122_BESTnb status

then run :

mt -f /dev/dev/rtape/tape122_BESTnb offline

Monitor for how long the operation takes. Should be less that 40 seconds. Also inspect tape alert messages to see if a bad TOC is indicated.

Marianne · ‎01-09-2019

@liuyl

There are basic requirements in NetBackup Device Configuration Guide for HP-UX, such as :
Disabling SPC-2 SCSI reserve in HP-UX
and
disabling the HP-UX EMS Tape Device Monitor for a SAN

Have you checked and confirmed that this was done on all HP-UX media servers?

If not, this could explain the issues that you are experiencing.

Handy NetBackup Links

Nicolai · ‎01-09-2019

Hi Marianne

It not needed to disable HP-UX EMS Tape Device Monitor anymore. HP fixed that issue long time ago. Further more, disabling the EMS system and you looses all SCSI sense messages, including write and read error messages.

liuyl · ‎01-10-2019

@Nicolai
Do you think that the NBU might not report any more warnings rather than the media read/write/position errors while its TOC is defective？

Nicolai · ‎01-10-2019

Hi @liuyl

You should see tape alert messages in EMS/Syslog/messages if the TOC is defective or damages, but Netbackup can only report generic error media open error, media write error, media read error.

Check if it the same tape causing the media unload errors - or just suspend the tape right away.

# bpmedia -suspend - m xxxxxx

liuyl · ‎01-10-2019

Today I have successfully tested the backup jobs with the same media & TD twice！
1) The first time for unloading 700076 was successful.
2) The second time for unloading 700076 was failed, but it was "automatically" unloaded successfully after 7 mins！

So until now I think that the media should be good at least！
But why sometimes this media would only failed for unloading？

root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/# tail -1000 /var/adm/syslog/syslog.log|grep -Eiv "su:|sshd|repeat|xntpd"|tail
Jan 11 12:01:34 rzcmsdb tldd[10758]: TLD(2) MountTape 700076 on drive 6, from slot 216
Jan 11 12:01:34 rzcmsdb tldd[13970]: TLD(2) open failed in io_open, No such device or address
Jan 11 12:01:34 rzcmsdb tldd[13970]: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
Jan 11 12:02:02 rzcmsdb tldd[13970]: Reading label on HP.ULTRIUM6-SCSI.004 (device 9, /dev/rtape/tape122_BESTnb)
Jan 11 12:02:14 rzcmsdb tldd[13970]: TLD(2) [13970] waited 27 seconds for ready, drive 6
Jan 11 12:02:14 rzcmsdb tldd[10758]: DecodeMount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
Jan 11 12:02:14 rzcmsdb ltid[10753]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=9, Param2=0
Jan 11 12:02:39 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=9
Jan 11 12:02:39 rzcmsdb tldd[10758]: TLD(2) DismountTape 700076 from drive 6
Jan 11 12:03:17 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/# tail -1000 /var/adm/syslog/syslog.log|grep -Eiv "su:|sshd|repeat|xntpd"|tail
Jan 11 14:26:45 rzcmsdb tldd[17017]: TLD(2) open failed in io_open, No such device or address
Jan 11 14:26:45 rzcmsdb tldd[17017]: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
Jan 11 14:27:32 rzcmsdb tldd[17017]: Reading label on HP.ULTRIUM6-SCSI.004 (device 9, /dev/rtape/tape122_BESTnb)
Jan 11 14:27:43 rzcmsdb tldd[17017]: TLD(2) [17017] waited 41 seconds for ready, drive 6
Jan 11 14:27:43 rzcmsdb tldd[10758]: DecodeMount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
Jan 11 14:27:44 rzcmsdb ltid[10753]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=9, Param2=0
Jan 11 14:28:05 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=9
Jan 11 14:28:05 rzcmsdb tldd[10758]: TLD(2) DismountTape 700076 from drive 6
Jan 11 14:28:37 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive
Jan 11 14:28:41 rzcmsdb ltid[10753]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.004 (device 9)
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/# tail -1000 /var/adm/syslog/syslog.log|grep -Eiv "su:|sshd|repeat|xntpd"|tail
Jan 11 14:28:05 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=9
Jan 11 14:28:05 rzcmsdb tldd[10758]: TLD(2) DismountTape 700076 from drive 6
Jan 11 14:28:37 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive
Jan 11 14:28:41 rzcmsdb ltid[10753]: Operator/EMM server has DOWN'ed drive HP.ULTRIUM6-SCSI.004 (device 9)
Jan 11 14:45:15 rzcmsdb tldd[10758]: TLD(2) DismountTape ****** from drive 6
Jan 11 14:45:15 rzcmsdb tldd[17446]: TLD(2) open failed in io_open, No such device or address
Jan 11 14:45:30 rzcmsdb tldd[17446]: TLD(2) unload==TRUE, but no unload, drive 6 (device 9)
Jan 11 14:45:15 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=9
Jan 11 14:45:30 rzcmsdb tldd[17446]: TLD(2) open failed in io_open, No such device or address
Jan 11 14:45:38 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
root@rzcmsdb:/#
root@rzcmsdb:/#
root@rzcmsdb:/#

Marianne · ‎01-10-2019

There is clearly a problem between OS and tape drive.
The OS is losing connectivity to the tape drive every now and then.

The OS (not NetBackup) reported 'no such device or address'.

All errors are for this drive: /dev/rtape/tape122_BESTnb

You need to troubleshoot physical connections between the server and the tape drive - cable(s), swith ports, gbic's, etc.

Extract from an old NBU troubleshooting doc:

As an application, NetBackup has no direct access to a device, instead relying on the operating system (OS) to handle any communication with the device. This means that during a write operation NetBackup asks the OS to write to the device and report back the success or failure of that operation. (Same applies to unload operation.) If there is a failure, NetBackup will merely report that a failure occurred, and any troubleshooting should start at the OS level. If the OS is unable to perform the write, there are three likely causes; OS configuration, a problem on the SCSI path, or a problem with the device.

Handy NetBackup Links

Nicolai · ‎01-10-2019

Remove 700076 from robot and check if problem re-occour.

mph999 · ‎01-11-2019

Opps, I'm a bit late to the party ...

"I am not sure if Netbackup will re-try operation where EMM has downed drive. @mph999 may be able to answer."

My thoughts were that if a drive is down we don;t do anything with it ... but I've never investogated that further.

liuyl · ‎01-12-2019

1) I checked the corresponding FC links between the host and TDs via portshow/porterrshow on the SAN！
And the results I have observed are all right, there is no any abnormal port link statistics.

2) The tape media 700076 itself should be no problem because I had also successfully tested it with the other TDs！
And also the other TDs with their FC links are all right regardless of the "error" with open failed in io_open, which I think is just "information".

Jan 11 12:08:31 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=1, Param2=0
Jan 11 12:08:31 rzcmsdb tldd[10758]: TLD(2) MountTape 700076 on drive 4, from slot 216
Jan 11 12:08:31 rzcmsdb tldd[14162]: TLD(2) open failed in io_open, No such device or address
Jan 11 12:08:31 rzcmsdb tldd[14162]: TLD(2) unload==TRUE, but no unload, drive 4 (device 11)
Jan 11 12:08:58 rzcmsdb tldd[14162]: Reading label on HP.ULTRIUM6-SCSI.005 (device 11, /dev/rtape/tape124_BESTnb)
Jan 11 12:09:10 rzcmsdb tldd[14162]: TLD(2) [14162] waited 27 seconds for ready, drive 4
Jan 11 12:09:10 rzcmsdb tldd[10758]: DecodeMount: TLD(2) drive 4, Actual status: STATUS_SUCCESS
Jan 11 12:09:11 rzcmsdb ltid[10753]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=11, Param2=0
Jan 11 12:09:13 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=11
Jan 11 12:09:13 rzcmsdb tldd[10758]: TLD(2) DismountTape 700076 from drive 4
Jan 11 12:09:51 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 4, Actual status: STATUS_SUCCESS
Jan 11 13:56:18 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=1, Param2=0
Jan 11 13:56:18 rzcmsdb tldd[10758]: TLD(2) MountTape 700076 on drive 7, from slot 216
Jan 11 13:56:18 rzcmsdb tldd[16371]: TLD(2) open failed in io_open, No such device or address
Jan 11 13:56:18 rzcmsdb tldd[16371]: TLD(2) unload==TRUE, but no unload, drive 7 (device 13)
Jan 11 13:57:00 rzcmsdb tldd[16371]: Reading label on HP.ULTRIUM6-SCSI.000 (device 13, /dev/rtape/tape126_BESTnb)
Jan 11 13:57:11 rzcmsdb tldd[16371]: TLD(2) [16371] waited 41 seconds for ready, drive 7
Jan 11 13:57:11 rzcmsdb tldd[10758]: DecodeMount: TLD(2) drive 7, Actual status: STATUS_SUCCESS
Jan 11 13:57:11 rzcmsdb ltid[10753]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=13, Param2=0
Jan 11 13:57:32 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=13
Jan 11 13:57:32 rzcmsdb tldd[10758]: TLD(2) DismountTape 700076 from drive 7
Jan 11 13:58:10 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 7, Actual status: STATUS_SUCCESS
Jan 11 14:04:25 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=1, Param2=0
Jan 11 14:04:25 rzcmsdb tldd[10758]: TLD(2) MountTape 700076 on drive 8, from slot 216
Jan 11 14:04:25 rzcmsdb tldd[16490]: TLD(2) open failed in io_open, No such device or address
Jan 11 14:04:25 rzcmsdb tldd[16490]: TLD(2) unload==TRUE, but no unload, drive 8 (device 12)
Jan 11 14:05:07 rzcmsdb tldd[16490]: Reading label on HP.ULTRIUM6-SCSI.006 (device 12, /dev/rtape/tape125_BESTnb)
Jan 11 14:05:18 rzcmsdb tldd[16490]: TLD(2) [16490] waited 41 seconds for ready, drive 8
Jan 11 14:05:18 rzcmsdb tldd[10758]: DecodeMount: TLD(2) drive 8, Actual status: STATUS_SUCCESS
Jan 11 14:05:18 rzcmsdb ltid[10753]: LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=12, Param2=0
Jan 11 14:05:20 rzcmsdb ltid[10753]: LTID - Sent ROBOTIC request, Type=3, Param2=12
Jan 11 14:05:20 rzcmsdb tldd[10758]: TLD(2) DismountTape 700076 from drive 8
Jan 11 14:05:59 rzcmsdb tldd[10758]: DecodeDismount: TLD(2) drive 8, Actual status: STATUS_SUCCESS

3) Perhaps mph999 could infer whether my following viewpoint is correct！
The first error with "open failed in io_open ....", I think, came up when the tldd process tested with unload action to determine whether the requested drive was empty before it could mount a tape media！

4) I can infer that the NBU does re-try operation where EMM has downed drive, because everytime the unloading failed media would indeed be "automatically" unloaded again，and then succeed！

mph999 · ‎01-14-2019

Perhaps mph999 could infer whether my following viewpoint is correct！
The first error with "open failed in io_open ....", I think, came up when the tldd process tested with unload action to determine whether the requested drive was empty before it could mount a tape media！

Yes - we tet that a drive is empty before using it ... which as you see, gives a false error .... - it fails to open bcause there is no tape, but that is what we are looking for.

liuyl · ‎01-15-2019

It seems that there is still no reasonable explanation for such phenomenon !

Too uncanny, and now what I could imagine is that the TD 004 should have some specific and tiny fault ..... !

liuyl · ‎04-16-2019

3) robot.log:

10:27:30.147 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_MOUNT request...
10:27:30.147 [14863] <2> AddTldLtiReqEntry: mount request specified Drive Name HP.ULTRIUM6-SCSI.004
10:27:30.147 [14863] <4> TldGetDriveForMountReq: Got drive id 9 for drive HP.ULTRIUM6-SCSI.004
10:27:30.147 [14863] <2> ProcessTable: found active request id 0. Child pid = 0 flags = 0 request type 1
10:27:30.147 [14863] <2> ProcessTable: ROBOT_MOUNT requested ... set this drive IN USE
10:27:30.151 [14863] <5> MountTape: TLD(2) MountTape 700076 on drive 6, from slot 216
10:27:30.151 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:28:24.343 [14863] <5> GetResponseStatus: DecodeMount: TLD(2) drive 6, Actual status: STATUS_SUCCESS
10:28:45.910 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_DISMOUNT request...
10:28:45.910 [14863] <2> ProcessTable: found active request id -1. Child pid = 0 flags = 0 request type 3
10:28:45.910 [14863] <5> DismountTape: TLD(2) DismountTape 700076 from drive 6
10:28:45.910 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:29:18.459 [14863] <5> GetResponseStatus: DecodeDismount: TLD(2) drive 6, Actual status: Unable to SCSI unload drive
10:36:20.565 [14863] <2> ProcessTable: found active request id -1. Child pid = 0 flags = 0 request type 27
10:36:20.565 [14863] <4> GetDumpTablesFilename: DumpTables Filename is [/usr/openv/volmgr/misc/ltitables]
10:42:49.439 [14863] <4> AddTldLtiReqEntry: Processing ROBOT_DISMOUNT request...
10:42:49.439 [14863] <2> ProcessTable: found active request id -1. Child pid = 0 flags = 0 request type 3
10:42:49.439 [14863] <5> DismountTape: TLD(2) DismountTape ****** from drive 6
10:42:49.439 [14863] <2> SpawnChild: TLD(2) Creating Child Process
10:43:10.196 [14863] <5> GetResponseStatus: DecodeDismount: TLD(2) drive 6, Actual status: STATUS_SUCCESS

Here, what the request type 27 means, versus the previous request type 1/3 ？

VOX

Some fixed TD unloaded media failed after its backup jobs ended successfully！