cancel
Showing results for 
Search instead for 
Did you mean: 

How to understand the meaning of the downStucnts

liuyl
Level 6

1) What conditions/factors would lead to the downStu action happen?
    For example, some Tape Drive need to be cleaned, tape_alert2 = 1

2) Which background process/daemon undertakes the downStu action?

3) How to manually operate/check the downStu actions, like vmoprcmd?


05/12/2020 12:14:16.386 [DefaultQueue::queueRequest] queueing RB RBrelease : id={97CA4022-9406-11EA-85D7-63D0963F0EA9} -- retry count=-1(CallbackQueue.cpp:1384)
05/12/2020 12:14:16.386 [DefaultQueue::handle_input] sending RB RBrelease : id={97CA4022-9406-11EA-85D7-63D0963F0EA9}(CallbackQueue.cpp:1510)
05/12/2020 12:14:16.387 [EMMProvider::release(mdar)] received release of mediaId=0332L6, driveName=HP.ULTRIUM6-SCSI.007, STU=gwzycdb3-hcart3-robot-tld-1
05/12/2020 12:14:16.388 V-118-93 [ResBroker_i::release(id)] releasing allocation ID {97CA4022-9406-11EA-85D7-63D0963F0EA9}, status 0
05/12/2020 12:14:16.388 [DefaultQueue::handle_input] (0x28ca960) CallbackQueue now empty(CallbackQueue.cpp:1594)
05/12/2020 12:14:16.392 [downStuCnts] INITIATING:
05/12/2020 12:14:16.392 [downStuCnts] allocationKey = 1551504
05/12/2020 12:14:16.392 [downStuCnts] alloc_med_list = 4001129
05/12/2020 12:14:16.392 [downStuCnts] alloc_drv_list = 2000697
05/12/2020 12:14:16.395 V-118-217 [EMMAccess::downStuCnts] request to down STU counts (STU gwzycdb3-hcart3-robot-tld-1, Allocation Key 1551504) returned 0
05/12/2020 12:14:16.395 [downStuCnts] EXIT INFO:
05/12/2020 12:14:16.395 [downStuCnts] alloc_med_list = 4001129
05/12/2020 12:14:16.395 [downStuCnts] alloc_drv_list = 2000697
05/12/2020 12:14:16.395 [downStuCnts] EXIT STATUS = 0 (EMM_ERROR_Success, Success)

05/12/2020 12:14:16.396 [setAvailable] INITIATING:
05/12/2020 12:14:16.397 [setAvailable] force = 0, unloadFlags = 0
05/12/2020 12:14:16.397 [setAvailable] allocationKey = 1551504
05/12/2020 12:14:16.397 [setAvailable] alloc_med_list = 4001129
05/12/2020 12:14:16.397 [setAvailable] alloc_drv_list = 2000697
05/12/2020 12:14:16.399 V-118-219 [EMMAccess::setAvailable] request to set media-drive pair available (Media ID 0332L6, Drive Name HP.ULTRIUM
6-SCSI.007, Allocation Key 1551504, Force false, Unload Flags 0) returned 2005017
05/12/2020 12:14:16.399 [Error] V-143-1501 Drive HP.ULTRIUM6-SCSI.007 cannot be made available, it needs to be cleaned, tape_alert2 = 1
05/12/2020 12:14:16.399 [setAvailable] EXIT INFO:
05/12/2020 12:14:16.399 [setAvailable] EXIT STATUS = 2005017 (EMM_ERROR_MDS_NeedUnmount, Media needs to be unmounted from a drive)

10 REPLIES 10

mph999
Level 6
Employee Accredited

tpconfig -d would show the drive state, but there is not really any command that tells you 'why' - there are many reasons it could happen.

Activity Monitor and bptm log covering the period the drive went down are in my view, the most likely to give either a reason or some good clues.  For more involved issues you are probably looking at the media manger logs next, robots and tpcommand along with OS messages log.

1) I feel that the downStu action should do all the corresponding TDs of the STU, but not just some one TD.
2) And also the downStu action should just intervene TDs' status at the vmd layer, but not the tpconfig layer.
    Notes: because all the corresponding TDs'status are UP via tpconfig -l/d while not visible via vmoprcmd.

mph999
Level 6
Employee Accredited

.

mph999
Level 6
Employee Accredited

You need to give an example.

downStuCnts is an internal NetBackup function, undocumeneted, related to 'media / device ' selection -  determining what resources are available, part of nbemm.   It doesn't touch anything at the tpconfig level - what is configured, is still configured.

It's an internal part of the code, and apart from what I have written above, I'm not describing it any further.

What you need to concern yourself with, is what I have written previusly - looking for issues with the drives/ media  - bptm log is a good starting place.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

@liuyl wrote:

1) What conditions/factors would lead to the downStu action happen?
    For example, some Tape Drive need to be cleaned, tape_alert2 = 1

2) Which background process/daemon undertakes the downStu action?

3) How to manually operate/check the downStu actions, like vmoprcmd?

 


@liuyl 

Maybe you could tell us which log you posted? 
Maybe it's the same process? 

My guess is that 'down stu count' is not DOWN in the same sense as when a tape drive is being DOWN'ed, but rather 'reduce the count/number of available devices in the STU'.  

So, you can see when the resource broker assigns a drive:
05/12/2020 12:14:16.392 [downStuCnts] INITIATING:
05/12/2020 12:14:16.392 [downStuCnts] allocationKey = 1551504
05/12/2020 12:14:16.392 [downStuCnts] alloc_med_list = 4001129
05/12/2020 12:14:16.392 [downStuCnts] alloc_drv_list = 2000697 
... the STU count is being decreased (count is downed by one). 

Same is done when a drive needs cleaning (and therefore not available) - count is decreased again.

When a drive is released, it is '[setAvailable]'. So, the count is increased again. 

So, to manually check assign and release of Resource Broker actions, you could check output of 
nbrbutil -dump

Just my bit of logic here - I do not have insight into background processes. 

1) From the following vxlogview at that time, we may see that the STU of gwzycdb3-hcart3-robot-tld-1 could not be used ......

05/12/2020 12:14:31.589 [Info] V-116-82 received start user backup request for client=gwzycdb3, policy=SGPT-PORTAL-arch, schedule=?, stream #
=0, client type=Oracle (4), parent jobid=-1
05/12/2020 12:14:31.589 V-116-213 [ExtReqTask::run_specificPolicy] User backup request for client=gwzycdb3, policy=SGPT-PORTAL-arch, schedule
=Default-Application-Backup submitted for execution

05/12/2020 12:14:31.590 V-116-215 [BaseJob::run] jobid=2077546 submitted to nbjm for processing
05/12/2020 12:14:31.591 [Info] V-117-85 backup job submission request (jobid=2077546) for client gwzycdb3, policy SGPT-PORTAL-arch, schedule
Default-Application-Backup VBRQ 1 0 2077546 gwzycdb3 SGPT-PORTAL-arch Default-Application-Backup



05/12/2020 12:14:31.633 [allocateTwin] INITIATING:
05/12/2020 12:14:31.633 [allocateTwin] masterServer = jcbak, client = gwzycdb3, jobType = 1, capabilityFlags = 32, fatPipePreference = 0, sta
tusOnly = 0, numberOfCopies = 1, kbytesNeeded = 0
05/12/2020 12:14:31.633 [allocateTwin] Twin_Record: STUIdentifier = gwzycdb3-hcart3-robot-tld-1, STUIdentifierType = 1, PoolName = SG_ORAPT,
MediaSharingGroup = *ANY*, RetentionLevel = 5, RequiredMediaServer = , PreferredMediaServer = , PreferredFilerForNDMPSnapDupe = , RequiredDis
kVolumeMediaId = , RequiredStorageUnitName = , GetMaxFreeSpaceSTU = 0, CkptRestart = 0, CkptRestartSTUType = 0, CkptRestartSTUSubType = -1, C
kptRestartSTUName = , CkptRestartMediaServer = , CkptRestartDiskGroupName = , CkptRestartDiskGroupServerType = , MpxEnabled = 0, MustUseLocal
MediaServer = 0
05/12/2020 12:14:31.647 [select_from_one_stu] cannot select storage unit, media server is offline, name = gwzycdb3-hcart3-robot-tld-1
05/12/2020 12:14:31.647 [allocateTwin] EXIT INFO:
05/12/2020 12:14:31.647 [allocateTwin] rbActions = 1
05/12/2020 12:14:31.647 [allocateTwin] canRetryForLocalMediaServer = TRUE
05/12/2020 12:14:31.647 [allocateTwin] REQUEST NEEDS TO BE RETRIED LATER
05/12/2020 12:14:31.647 [allocateTwin] RetryInfo_Record: reason = 3 (TAPE MEDIA SERVER IS NOT ACTIVE), mediaServerName = gwzycdb3, robotNum =
1, robotType = 8, mediaTypeName = NetBackup HCART3, driveTypeName = NetBackup HCART3, mediaKey = 0, driveKey = 0, mediaId = , driveName = ,
poolName = SG_ORAPT, stuName = gwzycdb3-hcart3-robot-tld-1, timeout = 0, scanHostName = , diskGroupName = , diskVolumeName = , retryInfoStrin
g = MDS_RETRY 3 gwzycdb3 1 8 *NULL* *NULL* SG_ORAPT gwzycdb3-hcart3-robot-tld-1 *NULL* 0 *NULL* *NULL*
05/12/2020 12:14:31.647 [allocateTwin] EXIT STATUS = 0 (EMM_ERROR_Success, Success)

 


2) And also at that time,  all of the corresponding TDs had been invisible ......

#
#
# vmoprcmd -dp ds -h gwzycdb3

DRIVE STATUS

Drv DrivePath Status Label Ready
#
#

 

3) From the following bptm.log,  we find no more corresponding error/alert clues matched at that time.


12:14:08.224 [55375] <2> io_write_back_header: drive index 2, empty_file, file num = 70, mpx_headers = 0, copy 1
12:14:08.226 [55375] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.006, from ../bptm.c.8709
12:14:11.562 [55275] <2> io_terminate_tape: absolute block position prior to writing empty header is 15737612, copy 1
12:14:11.562 [55275] <2> io_terminate_tape: block position check: actual 15737612, expected 15737612
12:14:11.563 [55275] <2> send_MDS_msg: MEDIADB 1 1551504 0332L6 4001129 *NULL* 20 1587258690 1589256663 1597291863 0 1000092096 136 136 5 14 0 0 1024 0 15737612 0
12:14:11.575 [55275] <2> io_open: SCSI RESERVE
12:14:11.577 [55275] <2> io_open: file /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.007 successfully opened (mode 0)
12:14:11.577 [55275] <2> io_ioctl: command (2)MTBSF 2 0x0 from (../bptm.c.8870) on drive index 3
12:14:12.671 [55375] <2> io_terminate_tape: absolute block position prior to writing empty header is 12795871, copy 1
12:14:12.671 [55375] <2> io_terminate_tape: block position check: actual 12795871, expected 12795871
12:14:12.671 [55375] <2> send_MDS_msg: MEDIADB 1 1551505 0157L6 4001001 *NULL* 20 1587047141 1589256664 1597291864 0 811918464 68 68 5 14 0 0 1024 0 12795871 0
12:14:12.683 [55375] <2> io_open: SCSI RESERVE
12:14:12.686 [55375] <2> io_open: file /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.006 successfully opened (mode 0)
12:14:12.686 [55375] <2> io_ioctl: command (2)MTBSF 2 0x0 from (../bptm.c.8870) on drive index 2
12:14:15.991 [55275] <2> io_ioctl: command (1)MTFSF 1 0x0 from (../bptm.c.8872) on drive index 3
12:14:16.009 [55275] <2> io_close: closing /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.007, from ../bptm.c.8878
12:14:16.009 [55275] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from io_terminate_tape)
12:14:16.010 [55275] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
12:14:16.010 [55275] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
12:14:16.010 [55275] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
12:14:16.013 [55275] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
12:14:16.032 [55275] <2> logconnections: PROXY CONNECT FROM 10.131.33.167.26315 TO 10.131.33.60.1556 fd = 0
12:14:16.032 [55275] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.43240 TO 127.0.0.1.13613 fd = 0
12:14:16.032 [55275] <2> db_getdata: timeout is 7200 seconds
12:14:16.056 [55275] <2> db_end: Need to collect reply
12:14:16.056 [55275] <2> db_getdata: timeout is 7200 seconds
12:14:16.063 [55275] <4> report_throughput: VBRT 2 55275 5 1 HP.ULTRIUM6-SCSI.007 0332L6 0 1 0 7093280 7093280 9275 7 70570 159 3260 3685 15 110836 30 (../bptm.c.19657
)
12:14:16.063 [55275] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
12:14:16.063 [55275] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
12:14:16.063 [55275] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 77
12:14:16.064 [55275] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
12:14:16.080 [55275] <2> logconnections: PROXY CONNECT FROM 10.131.33.167.16784 TO 10.131.33.60.1556 fd = 0
12:14:16.080 [55275] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.13051 TO 127.0.0.1.33762 fd = 0
12:14:16.080 [55275] <2> db_getdata: timeout is 7200 seconds
12:14:16.105 [55275] <2> db_end: Need to collect reply
12:14:16.105 [55275] <2> db_getdata: timeout is 7200 seconds
12:14:16.154 [55275] <2> send_MDS_msg: MEDIADB 1 1551504 0332L6 4001129 *NULL* 20 1587258690 1589256663 1597291863 0 1007185376 137 137 5 14 0 0 1024 0 15737612 0
12:14:16.165 [55275] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
12:14:16.165 [55275] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
12:14:16.165 [55275] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
12:14:16.166 [55275] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
12:14:16.180 [55275] <2> logconnections: PROXY CONNECT FROM 10.131.33.167.34877 TO 10.131.33.60.1556 fd = 0
12:14:16.180 [55275] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.17845 TO 127.0.0.1.56869 fd = 0
12:14:16.181 [55275] <2> db_getdata: timeout is 7200 seconds
12:14:16.214 [55275] <2> db_end: Need to collect reply
12:14:16.214 [55275] <2> db_getdata: timeout is 7200 seconds
12:14:16.220 [55275] <4> write_backup: successfully wrote backup id gwzycdb3_1589256663, copy 1, fragment 1, 7093280 Kbytes at 90386.738 Kbytes/sec
12:14:16.220 [55275] <2> notify: executing - /usr/openv/netbackup/bin/backup_notify bptm gwzycdb3_1589256663
12:14:16.236 [55275] <2> Media_dispatch_signal: calling child_wait (../common.c:4143) delay 0 seconds
12:14:16.236 [55275] <2> updateEMM_freespace: updateEMM_freespace() (1 0)
12:14:16.236 [55275] <2> updateEMM_freespace: 0, gwzycdb3-hcart3-robot-tld-1 (1 0 2 1)
12:14:16.236 [55275] <2> bptm: Calling tpunmount for media 0332L6
12:14:16.236 [55275] <2> drivename_write: Called with mode 1
12:14:16.236 [55275] <2> drivename_unlock: unlocked
12:14:16.236 [55275] <2> drivename_close: Called for file HP.ULTRIUM6-SCSI.007
12:14:16.236 [55275] <2> send_MDS_msg: MEDIA_DONE 0 -1587834936 0 0332L6 4001129 180 {97CA4022-9406-11EA-85D7-63D0963F0EA9}
12:14:16.236 [55275] <2> packageBptmResourceDoneMsg: msg (MEDIA_DONE 0 -1587834936 0 0332L6 4001129 180 {97CA4022-9406-11EA-85D7-63D0963F0EA9})
12:14:16.236 [55275] <2> packageBptmResourceDoneMsg: keyword MEDIA_DONE version 0 jobid -1587834936 copyNum 0 mediaId 0332L6 mediaKey 4001129 unloadDelay 180 allocId {
97CA4022-9406-11EA-85D7-63D0963F0EA9}
12:14:16.237 [55275] <2> packageBptmResourceDoneMsg: returns 0
12:14:16.241 [55275] <2> JobInst::sendIrmMsg: returning
12:14:16.241 [55275] <2> bptm: EXITING with status 0 <----------
12:14:16.242 [55275] <2> cleanup: Detached from BPBRM shared memory
12:14:16.792 [62385] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn HP.ULTRIUM6-SCSI.007 -dp /dev/nst0 -dk 2000697 -m 0332L6 -mk 4001129 -mds 8 -alocid 1551504 -jobid
-1587834938 -jm
12:14:16.792 [62385] <4> bptm: emmserver_name = jcbak
12:14:16.793 [62385] <4> bptm: emmserver_port = 1556
12:14:16.799 [62385] <2> Orb::init: Enabling ORBNativeCharCodeSet UTF-8(Orb.cpp:953)


12:14:17.197 [55375] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 77
12:14:17.199 [55375] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
12:14:17.214 [55375] <2> logconnections: PROXY CONNECT FROM 10.131.33.167.27136 TO 10.131.33.60.1556 fd = 0
12:14:17.214 [55375] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.44809 TO 127.0.0.1.28025 fd = 0
12:14:17.214 [55375] <2> db_getdata: timeout is 7200 seconds
12:14:17.238 [55375] <2> db_end: Need to collect reply
12:14:17.238 [55375] <2> db_getdata: timeout is 7200 seconds
12:14:17.288 [55375] <2> send_MDS_msg: MEDIADB 1 1551505 0157L6 4001001 *NULL* 20 1587047141 1589256664 1597291864 0 818924704 69 69 5 14 0 0 1024 0 12795871 0
12:14:17.296 [55375] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
12:14:17.296 [55375] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
12:14:17.296 [55375] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
12:14:17.297 [55375] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
12:14:17.312 [55375] <2> logconnections: PROXY CONNECT FROM 10.131.33.167.20184 TO 10.131.33.60.1556 fd = 0
12:14:17.312 [55375] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.26733 TO 127.0.0.1.39853 fd = 0
12:14:17.312 [55375] <2> db_getdata: timeout is 7200 seconds
12:14:17.344 [55375] <2> db_end: Need to collect reply
12:14:17.344 [55375] <2> db_getdata: timeout is 7200 seconds
12:14:17.351 [55375] <4> write_backup: successfully wrote backup id gwzycdb3_1589256664, copy 1, fragment 1, 7006240 Kbytes at 90406.596 Kbytes/sec
12:14:17.351 [55375] <2> notify: executing - /usr/openv/netbackup/bin/backup_notify bptm gwzycdb3_1589256664
12:14:17.368 [55375] <2> Media_dispatch_signal: calling child_wait (../common.c:4143) delay 0 seconds
12:14:17.368 [55375] <2> updateEMM_freespace: updateEMM_freespace() (1 0)
12:14:17.368 [55375] <2> updateEMM_freespace: 0, gwzycdb3-hcart3-robot-tld-1 (1 0 2 1)
12:14:17.368 [55375] <2> bptm: Calling tpunmount for media 0157L6
12:14:17.368 [55375] <2> drivename_write: Called with mode 1
12:14:17.368 [55375] <2> drivename_unlock: unlocked
12:14:17.368 [55375] <2> drivename_close: Called for file HP.ULTRIUM6-SCSI.006
12:14:17.368 [55375] <2> send_MDS_msg: MEDIA_DONE 0 -1587834937 0 0157L6 4001001 180 {9815B11A-9406-11EA-9E82-7C49AF0E7F26}
12:14:17.368 [55375] <2> packageBptmResourceDoneMsg: msg (MEDIA_DONE 0 -1587834937 0 0157L6 4001001 180 {9815B11A-9406-11EA-9E82-7C49AF0E7F26})
12:14:17.368 [55375] <2> packageBptmResourceDoneMsg: keyword MEDIA_DONE version 0 jobid -1587834937 copyNum 0 mediaId 0157L6 mediaKey 4001001 unloadDelay 180 allocId {
9815B11A-9406-11EA-9E82-7C49AF0E7F26}
12:14:17.368 [55375] <2> packageBptmResourceDoneMsg: returns 0
12:14:17.373 [55375] <2> JobInst::sendIrmMsg: returning
12:14:17.373 [55375] <2> bptm: EXITING with status 0 <----------
12:14:17.374 [55375] <2> cleanup: Detached from BPBRM shared memory
12:14:17.566 [62385] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
12:14:17.897 [62497] <2> bptm: INITIATING (VERBOSE = 0): -unload -dn HP.ULTRIUM6-SCSI.006 -dp /dev/nst1 -dk 2000696 -m 0157L6 -mk 4001001 -mds 8 -alocid 1551505 -jobid
-1587834939 -jm

 



12:14:18.113 [62497] <4> create_tpreq_file: symlink to path /dev/nst1
12:14:18.113 [62497] <2> drivename_write: Called with mode 2
12:14:18.114 [62497] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
12:14:18.114 [62497] <2> really_tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/drive_HP.ULTRIUM6-SCSI.006
12:14:18.566 [62497] <2> tapelib: wait_for_ltid, UnMount, timeout 2400
12:15:10.119 [62497] <8> vnet_get_user_credential_path: [vnet_vxss.c:1621] status 35 0x23
12:15:10.119 [62497] <8> vnet_check_user_certificate: [vnet_vxss_helper.c:4123] vnet_get_user_credential_path failed 35 0x23
12:15:10.120 [62497] <2> ConnectionCache::connectAndCache: Acquiring new connection for host jcbak, query type 1
12:15:10.122 [62497] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
12:15:10.140 [62497] <2> logconnections: PROXY CONNECT FROM 10.131.33.167.16209 TO 10.131.33.60.1556 fd = 17
12:15:10.140 [62497] <2> logconnections: BPDBM CONNECT FROM 127.0.0.1.9815 TO 127.0.0.1.34708 fd = 17
12:15:10.141 [62497] <2> db_getdata: timeout is 7200 seconds
12:15:10.165 [62497] <2> db_end: Need to collect reply
12:15:10.165 [62497] <2> db_getdata: timeout is 7200 seconds
12:15:10.180 [62497] <4> report_resource_done: VBRD 1 62497 0 HP.ULTRIUM6-SCSI.006 0157L6
12:15:10.180 [62497] <4> create_tpreq_file: symlink to path /dev/nst1
12:15:10.181 [62497] <2> process_tapealert: TapeAlert returned 0x00000000 0x00000000 (from tapealert_and_release)
12:15:10.182 [62497] <2> tapealert_and_release: report_attr, fl1 0x00000001, fl2 0x00000000
12:15:10.183 [62497] <2> tapealert_and_release: SCSI RELEASE
12:15:10.183 [62497] <2> drivename_unlock: unlocked
12:15:10.183 [62497] <2> drivename_close: Called for file HP.ULTRIUM6-SCSI.006
12:15:10.183 [62497] <2> drivename_remove: Called
12:15:10.183 [62497] <2> main: Sending [EXIT STATUS 0] to NBJM
12:15:10.183 [62497] <2> bptm: EXITING with status 0 <----------

 

mph999
Level 6
Employee Accredited

there is nothing wrong with 'downStu ' - you need to investigate why your media server was unavailable.  

There is also no more obviuos corresponding error/alert messages within the following logs at that time

So strange that the master server suddenly only lost the media server STU at the time of 12:14:31. 


ltid.log

12:12:12.519 [3712] <4> create_tpreq_file: symlink to path /dev/nst1
12:12:12.531 [3712] <6> WriteEntry: Updating drive HP.ULTRIUM6-SCSI.006 at path /dev/nst1 on attach host
12:12:12.554 [3712] <4> send_request_answer: Pid=55375, Type=0, ReturnType=12, DataType=0, MsgRes.SC=0
12:14:17.565 [3712] <4> LtidProcCmd: Pid=62385, Data.Pid=62385, Type=12, Param1=0, Param2=0, LongParam=0
12:14:17.565 [3712] <4> tpunmount_withMntPtr: drive has been assigned
12:14:17.565 [3712] <4> add_unload_to_queue: Unload for media id 0332L6 added to queue (PID 62385, OP 2)
12:14:17.566 [3712] <2> SendRobotReq: LTID - Sent ROBOTIC request, Type=3, Param2=3
12:14:18.566 [3712] <4> LtidProcCmd: Pid=62497, Data.Pid=62497, Type=12, Param1=0, Param2=0, LongParam=0
12:14:18.566 [3712] <4> tpunmount_withMntPtr: drive has been assigned
12:14:18.566 [3712] <4> add_unload_to_queue: Unload for media id 0157L6 added to queue (PID 62497, OP 2)
12:14:18.566 [3712] <2> SendRobotReq: LTID - Sent ROBOTIC request, Type=3, Param2=2
12:15:09.571 [3712] <4> LtidProcCmd: Pid=3783, Data.Pid=3783, Type=56, Param1=2, Param2=0, LongParam=0
12:15:09.571 [3712] <6> WriteEntry: Updating drive HP.ULTRIUM6-SCSI.006 at path /dev/nst1 on attach host
12:15:09.626 [3712] <4> LtidProcCmd: Sending UNLOAD DONE msg to BPTM (dev 2, PID 62497)
12:15:09.626 [3712] <4> send_request_answer: Pid=62497, Type=0, ReturnType=12, DataType=0, MsgRes.SC=0
12:15:18.627 [3712] <4> LtidProcCmd: Pid=3783, Data.Pid=3783, Type=56, Param1=3, Param2=0, LongParam=0
12:15:18.628 [3712] <6> WriteEntry: Updating drive HP.ULTRIUM6-SCSI.007 at path /dev/nst0 on attach host
12:15:18.683 [3712] <4> LtidProcCmd: Sending UNLOAD DONE msg to BPTM (dev 3, PID 62385)
12:15:18.683 [3712] <4> send_request_answer: Pid=62385, Type=0, ReturnType=12, DataType=0, MsgRes.SC=0
12:16:54.692 [3712] <4> LtidProcCmd: Pid=67985, Data.Pid=67985, Type=100, Param1=0, Par


robots.log

12:14:17.566 [3783] <5> DismountTape: TLD(1) DismountTape 0332L6 from drive 8
12:14:17.567 [62475] <4> io_open: Drive Path = /dev/nst0
12:14:18.566 [3783] <3> Robot_dispatch_signal: calling LtiReqSig/lti_func() (../tldd.c:925) delay 0 seconds
12:14:18.566 [3783] <4> AddTldLtiReqEntry: Processing ROBOT_DISMOUNT request...
12:14:18.567 [3783] <5> DismountTape: TLD(1) DismountTape 0157L6 from drive 7
12:14:18.568 [62573] <4> io_open: Drive Path = /dev/nst1
12:15:02.235 [62573] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3021] Comparing name1:[jcbak] name2:[gwzycdb3]
12:15:02.236 [62573] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3021] Comparing name1:[jcbak] name2:[localhost]

 

daemon.log

11:47:02.952 [3715] <16> no_daemon_exists: file already locked
11:47:02.952 [3715] <16> vmd: vmd: Unable to get daemon lock file
11:47:02.952 [3715] <16> vmd: terminating - another daemon already exists (89)
12:16:53.719 [26265] <2> vnet_pbxAcceptSocket_ex: Accepted sock[14] from 10.131.33.60:56231
12:16:53.720 [26265] <2> daemon_select_and_accept_ex: pbx setsockopt SO_KEEPALIVE succeeded
12:16:53.720 [26265] <2> daemon_proxy_proto: Preparing to do daemon protocol for (10.131.33.167:1556 <- 10.131.33.60:56231)
12:16:53.737 [26265] <2> vmd: TCP_NODELAY
12:16:53.737 [26265] <4> peer_hostname_ipi: Connection from host jcbak, 10.131.33.60, port 56231
12:16:53.737 [26265] <4> peer_hostname_ipi: Connection to host localhost, 127.0.0.1, port 41129
12:16:53.738 [26265] <2> isHostidMatchesWithHostname: compared hostid:24568dc0-d974-481b-959d-45c6917387e5 and hostname:jcbak for master:jcba
k. Match-Result:true

mph999
Level 6
Employee Accredited

gwzycdb3 is unavailable for whatever reason - some connection issue most likely.

vmoprcmd will not report for that server, it's unavailable ...

The logs are simply reflecting the unavailability of the server ...

05/12/2020 12:14:31.647 [select_from_one_stu] cannot select storage unit, media server is offline, name = gwzycdb3-hcart3-robot-tld-1
05/12/2020 12:14:31.647 [allocateTwin] EXIT INFO:
05/12/2020 12:14:31.647 [allocateTwin] rbActions = 1
05/12/2020 12:14:31.647 [allocateTwin] canRetryForLocalMediaServer = TRUE
05/12/2020 12:14:31.647 [allocateTwin] REQUEST NEEDS TO BE RETRIED LATER
05/12/2020 12:14:31.647 [allocateTwin] RetryInfo_Record: reason = 3 (TAPE MEDIA SERVER IS NOT ACTIVE), mediaServerName = gwzycdb3, robotNum =

Afterwards I found the following possible cause or clue happened at 12:12:46:

12:12:46.279 [20184] <4> vmoprcmd: INITIATING
12:12:46.305 [20184] <2> vmoprcmd: argv[0] = vmoprcmd
12:12:46.305 [20184] <2> vmoprcmd: argv[1] = -reset_ltid_restarted_bit
12:12:46.305 [20184] <2> vmoprcmd: argv[2] = -h
12:12:46.305 [20184] <2> vmoprcmd: argv[3] = gwzycdb3

Would this action result in the stuck status of the ltid daemon at the gwzycdb3 ?