Re: The doubt about the OPR status under the vmoprcmd devmon output

liuyl · ‎06-30-2022

This may be an old, classic question about tape drive (TD) status, but we do sometimes find some TDs inexplicably showing the OPR status in the vmoprcmd -devmon output.

So my question is: what does the OPR status actually mean?

# vmoprcmd -devmon ds|grep jcsjdb3
jcsjdb3 /dev/nst6 TLD
jcsjdb3 /dev/nst7 TLD
jcsjdb3 /dev/nst8 TLD
jcsjdb3 /dev/nst10 TLD
jcsjdb3 /dev/nst11 TLD
jcsjdb3 /dev/nst9 ACTIVE
jcsjdb3 /dev/nst0 TLD
jcsjdb3 /dev/nst3 TLD
jcsjdb3 /dev/nst5 TLD
jcsjdb3 /dev/nst4 TLD
jcsjdb3 /dev/nst2 OPR
jcsjdb3 /dev/nst1 TLD

StefanosM · ‎06-30-2022

OPR means that the drive is under "operator" control. You can load and unload tapes by hand.
If the drive is a standalone drive, it is the correct status.
If the drive is part of a library, then there is a problem with the library or with the NetBackup configuration.

You can also start a drive in OPR mode if you want to use it manually, whatever the reason.

liuyl · ‎06-30-2022

Could you give more details about such scenarios, with a problematic tape library (TL) or a wrong configuration?

pats_729 · ‎06-30-2022

Such issues resolve after a complete power cycle of the tape library; otherwise you end up working with the tape library vendor.

liuyl · ‎07-01-2022

1) Does that mean there is no other way to reset the OPR status from the NBU layer?
2) What is the difference between the OPR and AVR statuses for a drive under robotic control?

mph999 · ‎07-01-2022

Set the drive down, and then up; it should come out of OPR mode.

AVR is different: if a drive is standalone (not in a robot), it should be in AVR mode.

If a drive is robotic, it should be in TLD or ACS mode, depending on the robot type. If a TLD drive goes to AVR mode and that drive is on the robot control host, then it is almost certain that the connection to the library has been lost; any other drives in that robot will also show as AVR. Any shared drives on 'remote' media servers will also show AVR.

If a shared drive on a media server that is NOT the robot control host (RCH) goes to AVR, but the same drive is still TLD on the robot control host itself, then there is a comms issue between the remote media server and the RCH. Specifically, tldd on the remote media server cannot communicate with tldcd on the RCH.

ACS is a bit different, as there is no robot control host; instead, each media server communicates with the ACS server, which is third party (Oracle) and takes the place of a robot control host. Here, if drives that were ACS go to AVR, then communication with the ACS server has been lost, and multiple media servers may or may not be affected, depending on the cause.

There are other reasons robotic drives can go to AVR mode, but the above is by far the most common.
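
The rules of thumb above can be turned into a quick filter over the drive status listing. This is only a minimal sketch: it assumes the three-column layout (host, device path, status) seen in the grep output at the top of this thread, and the function name flag_odd_drives is hypothetical. Other NetBackup versions may print extra columns or header lines.

```shell
# Hypothetical helper: reads "host device status" lines on stdin and
# prints only the drives whose status is OPR or AVR, with a short hint
# taken from the explanations in this thread.
flag_odd_drives() {
    awk '
        $3 == "OPR" { print $1, $2, "OPR: operator control; try setting the drive down, then up" }
        $3 == "AVR" { print $1, $2, "AVR: if the drive is robotic, check the connection to robot control" }
    '
}
```

It could be used as: vmoprcmd -devmon ds | flag_odd_drives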

liuyl · ‎07-01-2022

It is a bit strange that neither restarting ltid nor directly setting the TD up clears the OPR status.
Or should I first set the OPR TD down?

mph999 · ‎07-02-2022

Hmm, that is a bit odd. You don't see OPR status very often, and what I suggested has worked for me in the past.

It's not a NetBackup config issue as far as I can see, as nothing has changed here. You do get the odd issue where everything looks good but deleting and reconfiguring does resolve it, but I don't think we are at that point.

Can you power cycle the drive? A proper power cycle, not a 'reset' from the library console.

liuyl · ‎07-03-2022

The key question is why a TD under the same healthy robotic control would sometimes inexplicably go into OPR status.
In other words, what is the trigger condition that makes the ltid process automatically put the TD into OPR status?

mph999 · ‎07-03-2022

I don’t actually know what could cause a drive to go OPR, aside from it being set manually in the GUI. I really haven’t seen it often.

Can you confirm if it’s always the same drive?

However, let me see what I can find out.

liuyl · ‎07-03-2022

Not always the same ...

mph999 · ‎07-04-2022

Let's see if we can get something from the logs:

Create the reqlib log dir:

/usr/openv/volmgr/debug/reqlib

Add the word VERBOSE into /usr/openv/volmgr/vm.conf

Stop and restart ltid (/usr/openv/volmgr/bin/stopltid, wait a few moments and then /usr/openv/volmgr/bin/ltid -v )

/usr/openv/volmgr/debug/reqlib/<logfile> should contain content ...
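
The logging setup above could be scripted roughly as follows. This is a sketch, not a supported procedure: the function names are hypothetical, the paths are the ones quoted in this post, and ltid still has to be stopped and restarted by hand afterwards.

```shell
# Append VERBOSE to a vm.conf file unless a VERBOSE line is already there.
# $1: path to vm.conf (this thread uses /usr/openv/volmgr/vm.conf)
enable_vm_verbose() {
    conf=$1
    if [ -f "$conf" ] && grep -q '^VERBOSE[[:space:]]*$' "$conf"; then
        return 0
    fi
    # Make sure the existing last line is terminated before appending.
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi
    printf 'VERBOSE\n' >> "$conf"
}

# Create the reqlib debug directory and switch on VERBOSE.
# $1: volmgr base directory (defaults to /usr/openv/volmgr)
prepare_reqlib_logging() {
    volmgr=${1:-/usr/openv/volmgr}
    mkdir -p "$volmgr/debug/reqlib" && enable_vm_verbose "$volmgr/vm.conf"
}
```

Running prepare_reqlib_logging twice is harmless: the directory is only created if missing and VERBOSE is only added once.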

Set any drives in OPR status down:

/usr/openv/volmgr/bin/vmoprcmd -down <drive instance number>

(The drive instance number is the number to the left of the drive name as seen in tpconfig -d.)

Set the drive back up from the command line:

/usr/openv/volmgr/bin/vmoprcmd -up <drive instance number>

Wait for one of the drives to go to OPR status, then collect the reqlib log for that day.

Look in the log for the string: upopr
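
The search itself is a one-liner. Here is a small sketch that assumes the reqlib log files are plain text in the directory created earlier; the function name find_upopr is hypothetical.

```shell
# Print every line containing "upopr" from the files in a log directory,
# prefixed with file name and line number so the surrounding entries can
# be looked up afterwards.
# $1: log directory (defaults to the reqlib directory used in this thread)
find_upopr() {
    logdir=${1:-/usr/openv/volmgr/debug/reqlib}
    # The extra /dev/null forces grep to print the file name even when
    # the directory holds a single log file.
    grep -n 'upopr' "$logdir"/* /dev/null
}
```

A non-zero exit status means no line matched (or the directory was empty), which is itself useful information if a drive went to OPR during the period the logs cover.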

liuyl · ‎07-05-2022

Here are the complete volmgr debug logs for this OPR issue.
But they seem to contain almost no significant information ...

18:02:26.313 [32708] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3021] Comparing name1:[jcbak] name2:[jcbak]
18:02:26.313 [32708] <4> get_emm_server_info: EMM server = jcbak; EMM port = 1556
18:02:26.335 [32708] <2> get_master_server_name: Master server name = jcbak
18:02:46.610 [471] <4> vmoprcmd: INITIATING
18:02:46.636 [471] <2> vmoprcmd: argv[0] = vmoprcmd
18:02:46.636 [471] <2> vmoprcmd: argv[1] = -h
18:02:46.636 [471] <2> vmoprcmd: argv[2] = jcsjdb3
18:02:46.636 [471] <2> vmoprcmd: argv[3] = -up
18:02:46.636 [471] <2> vmoprcmd: argv[4] = 9
18:02:46.636 [471] <2> vmdb_start_oprd: received request to start oprd, nosig = yes
18:02:46.636 [471] <2> vnet_same_host_and_update: [vnet_addrinfo.c:3021] Comparing name1:[jcsjdb3] name2:[localhost]
18:02:46.637 [471] <2> vnet_sortaddrs: [vnet_addrinfo.c:4170] sorted addrs: 1 0x1

mph999 · ‎07-05-2022

OK, I think the best way forward, now that the logs are in place, is to set up a cron job that writes vmoprcmd output to a file every 30 minutes, e.g. vmoprcmd >/tmp/vm_out_$(date '+%Y-%m-%d_%H:%M').txt

Then, when you notice a drive in OPR, you can see to within 30 minutes when the status changed, and thus pick out the relevant log file, plus the vmoprcmd output file from the change and the one before it.
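
One way to set up the periodic capture is a small wrapper script called from cron. This is a sketch under assumptions: the function name snapshot_drives, the script path and the output directory are all hypothetical, and the file name drops the ":" from the time so that it sorts cleanly and is safe on any filesystem.

```shell
# Write one timestamped snapshot of a command's output into a directory.
# $1: output directory; remaining arguments: the command to run
# (intended use: /usr/openv/volmgr/bin/vmoprcmd -devmon ds).
snapshot_drives() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    # Lexical order of these names equals chronological order.
    "$@" > "$outdir/vm_out_$(date '+%Y-%m-%d_%H%M').txt" 2>&1
}

# Example crontab entry, assuming this file is saved as the hypothetical
# /usr/local/bin/vm_snapshot.sh and ends with a call such as
#   snapshot_drives /var/tmp/vm_out /usr/openv/volmgr/bin/vmoprcmd -devmon ds
#
#   0,30 * * * * /bin/sh /usr/local/bin/vm_snapshot.sh
```

Keeping the date format inside a script also avoids a common cron pitfall: in a crontab line an unescaped % is treated as a newline, so a command with date '+%Y-...' written directly in the crontab would need every % escaped as \%.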

I ran a quick test today on a VTL: when I set the drive manually to OPR, no SCSI commands were sent that I could see, so it's totally a NetBackup 'thing', not hardware (I think).

liuyl · ‎07-05-2022

Surely it must be a purely logical TD status that exists only in the NBU layer!

BTW, is there really no significant information in the volmgr debug logs I attached above?

mph999 · ‎07-06-2022

Yes, I believe it is at the NBU layer only, but I know of no way that a drive can be set to OPR other than manually.

Hence, by logging vmoprcmd output every 30 minutes or so to a file whose name contains the date and time of the run, you can capture the drive ‘changing’ from TLD to OPR; then, with the accompanying log, we can see if it shows anything.
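
Once a few snapshots exist, finding the change can be automated. This is a rough sketch that assumes the snapshot files are named vm_out_<date>_<time>.txt as suggested earlier and contain lines in the host / device path / status layout shown at the top of this thread. The function name find_opr_change is hypothetical.

```shell
# Print the last snapshot in which the drive was not yet OPR and the
# first one in which it was, scanning files in name (= time) order.
# $1: snapshot directory
# $2: device path as shown by vmoprcmd, e.g. /dev/nst2
find_opr_change() {
    dir=$1
    dev=$2
    prev=
    for f in "$dir"/vm_out_*.txt; do
        [ -f "$f" ] || continue
        # awk exits 0 only if some line has this device in OPR status.
        if awk -v dev="$dev" '$2 == dev && $3 == "OPR" { found = 1 } END { exit !found }' "$f"; then
            echo "before: ${prev:-none}"
            echo "changed: $f"
            return 0
        fi
        prev=$f
    done
    return 1
}
```

The time stamps in the two file names printed then bracket the window to look at in the reqlib log.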

liuyl · ‎07-06-2022

I think the relevant backline developers should know how this mechanism works ...

liuyl · ‎07-07-2022

18:02:46.812 [9948] <4> oprd: INITIATING
18:02:46.812 [9948] <2> mm_getnodename: cached_hostname jcsjdb3, cached_method 3
18:02:46.818 [9948] <2> mm_getnodename: (3) hostname jcsjdb3 (from mm_master_config.mm_server_name)
18:02:46.818 [9948] <2> oprd: got CONTINUE, connecting to ltid
18:02:47.594 [9948] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
18:02:47.654 [9948] <8> do_pbx_service: [vnet_connect.c:2581] via PBX VNETD CONNECT FROM 10.131.29.105.29439 TO 10.131.33.60.1556 fd = 9
18:02:47.655 [9948] <8> vnet_vnetd_get_security_info: [vnet_vnetd.c:2823] VN_REQUEST_GET_SECURITY_INFO 9 0x9
18:02:47.674 [9948] <8> vnet_vnetd_disconnect: [vnet_vnetd.c:201] VN_REQUEST_DISCONNECT 1 0x1
18:02:47.679 [9948] <2> process_requests: 2 9 -1 *NULL*
18:02:47.680 [9948] <2> process_requests: oprd received string 2 9 -1 *NULL*
18:02:47.680 [9948] <4> fix_serial_number: initiating with drive index 9
18:02:47.680 [9948] <4> fix_serial_number: drive index 9 is UP - cannot be swapped <<< What does this mean?
18:02:48.636 [9948] <2> process_requests: TERMINATE
18:02:48.636 [9948] <2> process_requests: received TERMINATE request

mph999 · ‎07-07-2022

18:02:47.680 [9948] <4> fix_serial_number: drive index 9 is UP - cannot be swapped <<< What does this mean?

NBU tracks drives by serial number, so to me this just looks like it is confirming the serial number for the drive. It's a level <4> log line, so it is not of any concern.

What we need is the details as I previously requested - vmoprcmd run to a dated file from cron (every 30 mins is fine) and that way we can see (within 30 mins) when the drive shows OPR, and then the corresponding logs for before, during and after that time.
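
To pull the "before, during and after" portion out of a day's log, a time-window filter is enough. This is a sketch that assumes every relevant log line starts with an HH:MM:SS.mmm time stamp, as in the excerpts posted in this thread; the function name log_window is hypothetical.

```shell
# Print the log lines whose leading HH:MM:SS time stamp falls within
# [start, end]. Plain string comparison is sufficient because the fields
# are fixed-width and zero-padded. A window crossing midnight is not
# handled, and lines without a leading time stamp are skipped.
# $1: log file, $2: start (HH:MM:SS), $3: end (HH:MM:SS)
log_window() {
    awk -v s="$2" -v e="$3" '
        { t = substr($1, 1, 8) }
        t ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && t >= s && t <= e
    ' "$1"
}
```

For example, log_window <logfile> 17:30:00 18:30:00 would cover one 30-minute snapshot interval on either side of 18:00.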
