
Media server -- Drive not visible when I run vmoprcmd command

Krishma
Level 4

Hi,

Need your help. We have a master/media server on Linux with NBU version 6.5. I am not able to bring UP a drive. It says:

oprd returned abnormal status (96)
IPC Error: Daemon may not be running

Here are the logs:


#############################

[root@veritas log]# grep "DOWN" /var/log/messages-20170716
Jul 10 11:24:53 veritas ltid[18095]: Operator/EMM server has DOWN'ed drive IBM.ULTRIUM-HH6.000 (device 0)

Jul 19 09:52:21 veritas ltid[21804]: ltid can not be started while resources are assigned to the host.

/usr/openv/volmgr/debug - Logs

09:56:22.491 [22231] <2> wait_oprd_ready: oprd response: EXIT_STATUS 278
09:56:22.491 [22231] <2> put_string: cannot write data to network: Broken pipe (32)
09:56:22.491 [22231] <16> send_string: unable to send data to socket: Broken pipe (32), stat=-5

09:56:22.491 [22233] <16> oprd: device management error: IPC Error: Daemon may not be running

[root@veritas bin]# /usr/openv/volmgr/bin/vmoprcmd

HOST STATUS
Host Name Version Host Status
========================================= ======= ===========
veritas 656000 ACTIVE-DISK

PENDING REQUESTS


<NONE>

DRIVE STATUS

Drive Name Label Ready RecMID ExtMID Wr.Enbl. Type
Host DrivePath Status
=============================================================================

[root@veritas ltid]# /usr/openv/volmgr/bin/tpconfig -l
Device Robot Drive Robot Drive Device Second
Type Num Index Type DrNum Status Comment Name Path Device Path
robot 0 - TLD - - - - /dev/sg6
drive - 0 dlt3 1 DOWN - IBM.ULTRIUM-HH6.000 /dev/nst2
drive - 1 dlt3 2 UP - IBM.ULTRIUM-HH6.001 /dev/nst5
drive - 2 dlt3 4 UP - IBM.ULTRIUM-HH6.002 /dev/nst1
drive - 3 dlt3 3 UP - IBM.ULTRIUM-HH6.003 /dev/nst3
drive - 4 dlt3 5 UP - IBM.ULTRIUM-HH6.004 /dev/nst4
drive - 5 dlt3 6 UP - IBM.ULTRIUM-HH6.005 /dev/nst0

[root@veritas bin]# /usr/openv/volmgr/bin/tpconfig -d
Id DriveName Type Residence
Drive Path Status
****************************************************************************
0 IBM.ULTRIUM-HH6.000 dlt3 TLD(0) DRIVE=1
/dev/nst2 DOWN
1 IBM.ULTRIUM-HH6.001 dlt3 TLD(0) DRIVE=2
/dev/nst5 UP
2 IBM.ULTRIUM-HH6.002 dlt3 TLD(0) DRIVE=4
/dev/nst1 UP
3 IBM.ULTRIUM-HH6.003 dlt3 TLD(0) DRIVE=3
/dev/nst3 UP
4 IBM.ULTRIUM-HH6.004 dlt3 TLD(0) DRIVE=5
/dev/nst4 UP
5 IBM.ULTRIUM-HH6.005 dlt3 TLD(0) DRIVE=6
/dev/nst0 UP

 

[root@veritas ~]# /usr/openv/volmgr/bin/vmoprcmd -up 0
oprd returned abnormal status (96)
IPC Error: Daemon may not be running


--------------------------------------------------------------------------------------------------------------------


Reports:

[root@veritas ~]# /usr/openv/volmgr/bin/tpautoconf -report_disc
======================= New Device (Robot) =======================
Inquiry = "SPECTRA PYTHON 2000"
Serial Number = 9111005888
Robot Path = /dev/sg9
Number of Drives = 6
Number of Slots = 120
Number of Media Access Ports = 8
Drive = 1, Drive Name = /dev/nst5, Serial Number = 1012005888
Drive = 2, Drive Name = /dev/nst2, Serial Number = 1013005888
Drive = 3, Drive Name = /dev/nst1, Serial Number = 1014005888
Drive = 4, Drive Name = /dev/nst3, Serial Number = 1015005888
Drive = 5, Drive Name = /dev/nst0, Serial Number = 1016005888
Drive = 6, Drive Name = /dev/nst4, Serial Number = 1017005888
======================= New Device (Drive) =======================
Inquiry = "IBM ULTRIUM-HH6 G9P1"
Serial Number = 1012005888
Drive Path = /dev/nst5
Found as drive=1 in new robot
Robot's inquiry = "SPECTRA PYTHON 2000"
Robot's Serial Number = 9111005888
Robot's device path = /dev/sg9
======================= New Device (Drive) =======================
Inquiry = "IBM ULTRIUM-HH6 G9P1"
Serial Number = 1017005888
Drive Path = /dev/nst4
Found as drive=6 in new robot
Robot's inquiry = "SPECTRA PYTHON 2000"
Robot's Serial Number = 9111005888
Robot's device path = /dev/sg9
======================= New Device (Drive) =======================
Inquiry = "IBM ULTRIUM-HH6 G9P1"
Serial Number = 1015005888
Drive Path = /dev/nst3
Found as drive=4 in new robot
Robot's inquiry = "SPECTRA PYTHON 2000"
Robot's Serial Number = 9111005888
Robot's device path = /dev/sg9
======================= New Device (Drive) =======================
Inquiry = "IBM ULTRIUM-HH6 G9P1"
Serial Number = 1013005888
Drive Path = /dev/nst2
Found as drive=2 in new robot
Robot's inquiry = "SPECTRA PYTHON 2000"
Robot's Serial Number = 9111005888
Robot's device path = /dev/sg9
======================= New Device (Drive) =======================
Inquiry = "IBM ULTRIUM-HH6 G9P1"
Serial Number = 1014005888
Drive Path = /dev/nst1
Found as drive=3 in new robot
Robot's inquiry = "SPECTRA PYTHON 2000"
Robot's Serial Number = 9111005888
Robot's device path = /dev/sg9
======================= New Device (Drive) =======================
Inquiry = "IBM ULTRIUM-HH6 G9P1"
Serial Number = 1016005888
Drive Path = /dev/nst0
Found as drive=5 in new robot
Robot's inquiry = "SPECTRA PYTHON 2000"
Robot's Serial Number = 9111005888
Robot's device path = /dev/sg9
[root@veritas ~]#


[root@veritas ltid]# /usr/openv/volmgr/bin/tpconfig -l
Device Robot Drive Robot Drive Device Second
Type Num Index Type DrNum Status Comment Name Path Device Path
robot 0 - TLD - - - - /dev/sg6
drive - 0 dlt3 1 DOWN - IBM.ULTRIUM-HH6.000 /dev/nst2
drive - 1 dlt3 2 UP - IBM.ULTRIUM-HH6.001 /dev/nst5
drive - 2 dlt3 4 UP - IBM.ULTRIUM-HH6.002 /dev/nst1
drive - 3 dlt3 3 UP - IBM.ULTRIUM-HH6.003 /dev/nst3
drive - 4 dlt3 5 UP - IBM.ULTRIUM-HH6.004 /dev/nst4
drive - 5 dlt3 6 UP - IBM.ULTRIUM-HH6.005 /dev/nst0

[root@veritas volmgr]# /usr/openv/volmgr/bin/tpautoconf -t
TPAC60 IBM ULTRIUM-HH6 G9P1 1012005888 -1 -1 -1 -1 /dev/nst5 - -
TPAC60 IBM ULTRIUM-HH6 G9P1 1017005888 -1 -1 -1 -1 /dev/nst4 - -
TPAC60 IBM ULTRIUM-HH6 G9P1 1015005888 -1 -1 -1 -1 /dev/nst3 - -
TPAC60 IBM ULTRIUM-HH6 G9P1 1013005888 -1 -1 -1 -1 /dev/nst2 - -
TPAC60 IBM ULTRIUM-HH6 G9P1 1014005888 -1 -1 -1 -1 /dev/nst1 - -
TPAC60 IBM ULTRIUM-HH6 G9P1 1016005888 -1 -1 -1 -1 /dev/nst0 - -


Tried this ....
###########################

Ran nbrbutil -resetMediaServer <media server>
Ran nbrbutil -resetall
Restarted NetBackup daemons on media server
Rebooted Media/Master Server
Rebooted Tape Library
Did not find any problems with Tape Library
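For reference, a minimal sketch of such a daemon restart on a 6.x media server, assuming a default install on Linux (not necessarily the exact commands used above):

/usr/openv/netbackup/bin/admincmd/nbrbutil -resetMediaServer veritas   # release EMM resource assignments for this host
/usr/openv/volmgr/bin/stopltid                                         # stop the device daemons
/usr/openv/volmgr/bin/ltid -v                                          # restart ltid in verbose mode
/usr/openv/volmgr/bin/vmoprcmd -d                                      # re-check host and drive status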

Thanks,

Krishma.


It has a Fibre Channel connection.

Can you please send me the link to the device configuration guide which will help me with this task?

Marianne
Level 6
Partner    VIP    Accredited Certified

I am not sure that the Device Config Guide contains info about deleting devices and running the Device Config Wizard. That is more likely to be in Admin Guide I.

Online manuals for NBU 6.5 and 7.0 have been removed. You may try the 7.1 manuals (still available, although EOSL as well).
See this extract from my Handy NBU links blog: 

NBU 6.5 Documentation: All NBU 6.x documentation has been removed - NBU 6.x reached EOSL in 2012.

NBU 7.0 Documentation: http://www.veritas.com/docs/000005902 
(Also removed in the meantime)

NBU 7.1 Documentation: http://www.veritas.com/docs/000012173 

I am unable to run the Device Config Wizard from the Admin Console. When I select

Media and Device Management > Device Monitor

I get the message: oprd returned abnormal status (96).

What is the solution for that?

X2
Moderator
Moderator
   VIP   

In your NetBackup Administration Console, click on the master server name (the root of the tree in the left pane), and in the right pane you will see the list of available wizards. Run the Configure Storage Devices wizard to reconfigure the robots and drives.

Right now, after clicking on the Configure Storage Devices wizard, I am getting this error -

oprd returned abnormal status (96)

Do you think that after deleting the existing devices, I will no longer get this error?

Before deleting the current devices, I want to make sure that I am able to use the Configure Storage Devices wizard without error.

Marianne
Level 6
Partner    VIP    Accredited Certified

@Krishma 

Please bear in mind that NBU 6.5 is VERY old (went EOSL in 2012) and that documentation is no longer available.

We are trying to help based on memory and experience.

There is an old article by Omar Villa regarding "EMM database error (196)" :
https://vox.veritas.com/t5/Articles/Troubleshooting-EMM-Error-196/ta-p/810462

You will see that he points out device config mismatch as a possible cause.

With the Media Manager processes in the current state, nothing is going to work as far as device config is concerned.

The way I see things is that you need to delete the current devices first of all. 
When all the devices are deleted and NBU/MM processes restarted, you should only see vmd under Media Manager processes.
This will enable you to run Device Config wizard.
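For example, a quick check of what is still running under Media Manager after the restart (a sketch, assuming default install paths):

/usr/openv/volmgr/bin/vmps            # should list only vmd once all devices are deleted
/usr/openv/netbackup/bin/bpps -x      # full NetBackup / Media Manager process list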

If none of my suggestions are working, it means something serious is wrong with your EMM relational databases.

You will need a Support person to have a proper look at all of your log files - not just snippets. (I am not aware of any forum user that has the time to do this for you.)

Depending on what exactly the problem is, you may need to recover from the last successful Catalog backup.

As you know - your version of NBU is no longer supported.
So, if none of these suggestions help to fix the problem, you need to go to your management and tell them that you need assistance as a matter of urgency.
Make it their problem.
They chose not to update the software and probably the maintenance contract...

Thank you so much for your time and patience. Really appreciate it. Our management team is currently working with sales to upgrade this product and renew the maintenance. That process is taking longer than expected. In the meantime, I am trying my best to fix this issue.

X2
Moderator
Moderator
   VIP   

@Krishma I understand that you want to make sure that the Device Configuration Wizard will work despite the current error once you have deleted the devices. That is a careful approach I would usually take myself. However, in your case, you are not able to use the tape library anyway. If you delete the devices and the Device Configuration Wizard fails, you are still in the same state: NetBackup has an issue and cannot find the devices where it "thinks" they are.

Also, as @Marianne said, NetBackup 6.5 is EOSL. Even if you contact Veritas Support, they will not help you with the current install. You will need to upgrade in the end to stay current. For now, see what you can find out from the added logging and from reconfiguring the devices as @Marianne and @quebek suggested.

Thank you. In the next couple of weeks we will be upgrading NetBackup, and we will have support as well.

In the meantime, I will proceed with deleting all the existing devices in the NBU config and then running the NBU Device Config Wizard.

Can someone please confirm whether this is the right command to delete all the existing devices. Do I have to follow any additional steps to delete the devices?

/usr/openv/netbackup/bin/admincmd/nbemmcmd -deletealldevices -machinename server_name -machinetype media

(You can delete all devices from a media server. The media server can be up, down, or failed and unrecoverable.)

Thank you all for your guidance !!!

 

 

Marianne
Level 6
Partner    VIP    Accredited Certified
Yes. The nbemmcmd command will be best.

Please let us know if it works.
Restart NBU after deleting the devices.
You should only have vmd running under MM processes.

PS:
Have you checked all possible issues described in Omar Villa's article?

mph999
Level 6
Employee Accredited

You can try deleting the devices for a particular media server. If this doesn't work, I would delete ALL the devices in one go:

nbemmcmd -deletealldevices -allrecords   (don't run this yet)

This (at least in my experience) completely clears the 'device' tables in NBDB when other methods have failed (which is why it exists as a command).
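A minimal sketch of that sequence, assuming default install paths (EMM/NBDB must be up for nbemmcmd to reach the database):

/usr/openv/netbackup/bin/admincmd/nbemmcmd -deletealldevices -machinename veritas -machinetype media
# only if the per-server delete fails, clear every device record in NBDB:
# /usr/openv/netbackup/bin/admincmd/nbemmcmd -deletealldevices -allrecords
# then restart NetBackup and re-run the Device Configuration Wizard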

 

Yesterday I deleted all the existing devices in the NBU config and then ran the NBU Device Config Wizard. It was successful.
The host is now active for both DISK and TAPE.
I see these MM processes -
MM Processes
------------
root 19156 1 0 10:35 pts/0 00:00:00 /usr/openv/volmgr/bin/ltid
root 19166 1 0 10:35 pts/0 00:00:00 vmd -v
root 19733 19156 1 10:36 pts/0 00:00:00 tldd -v
root 19734 19156 1 10:36 pts/0 00:00:00 avrd -v
root 19737 1 0 10:36 pts/0 00:00:00 tldcd -v

I ran one successful backup yesterday.

Now I am back to my original problem --- the database doesn't stay up. It stays up for 10-15 minutes and then goes down again.

I checked a couple of log files. What other log files can I check?

This problem started 3 weeks ago, after I took out 8 tapes (which were going to expire soon) from the tape library and loaded new tapes (into the scratch pool).
Not sure if I missed any step there.... Not sure either if this is related to the problem. Any suggestions?
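Would something like this be the right way to check that the library contents and the volume configuration still agree after the tape swap (just a sketch, assuming the TLD(0) robot and default paths)?

/usr/openv/volmgr/bin/vmcheckxxx -rt tld -rn 0 -list              # compare robot contents with the volume database
/usr/openv/volmgr/bin/vmupdate -rt tld -rn 0 -use_barcode_rules   # re-inventory the robot if they differ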

/var/log/messages:

Aug 3 11:24:40 veritas kernel: ACPI Error: No handler for Region [IPMI] (ffff88042653d420) [IPMI] (20090903/evregion-319)
Aug 3 11:24:40 veritas kernel: ACPI Error: Region IPMI(7) has no handler (20090903/exfldio-295)
Aug 3 11:24:40 veritas kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff88082603ba38), AE_NOT_EXIST
Aug 3 11:24:40 veritas kernel: ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff88082603ba88), AE_NOT_EXIST
Aug 3 11:36:35 veritas tldcd[4355]: System error occurred - Unknown error 4294967293
Aug 3 13:13:34 veritas tldcd[9550]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 13:53:27 veritas SQLAnywhere(nb_veritas): *** ERROR *** Assertion failed: 102300 (9.0.2.3900)#012File associated with given page id is invalid or not open
Aug 3 14:15:48 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:21:41 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:25:34 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:25:38 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:26:33 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:28:54 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:28:57 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 14:28:59 veritas SQLAnywhere(nb_veritas): Fatal error: database error
Aug 3 15:04:09 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 15:11:51 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 15:11:54 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 15:14:23 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1
Aug 3 15:24:00 veritas SQLAnywhere(nb_veritas): Fatal error: database error
Aug 3 15:29:34 veritas tldcd[14045]: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1

 

I. 08/03 13:06:32. Now accepting requests
I. 08/03 13:26:32. Starting checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 13:26
I. 08/03 13:26:33. Finished checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 13:26
I. 08/03 13:46:33. Starting checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 13:46
I. 08/03 13:46:33. Finished checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 13:46
E. 08/03 13:53:27. *** ERROR *** Assertion failed: 102300 (9.0.2.3900)
E. 08/03 13:53:27. File associated with given page id is invalid or not open

I. 08/03 14:25:36. Starting checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 14:25
I. 08/03 14:25:36. Finished checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 14:25
I. 08/03 14:28:58. Starting checkpoint of "NBDB" (NBDB.db) at Thu Aug 03 2017 14:28
E. 08/03 14:28:59. Fatal error: database error


/usr/openv/volmgr/debug/robots

15:28:44.803 [16822] <5> inquiry: inquiry() function processing library SPECTRA PYTHON 2000:
15:28:44.803 [16822] <6> read_element_status_drive: RES drive 2
15:28:45.139 [16822] <6> tape_in_drive: valid = 1, sel = 4107, barcode = (168230L6 )
15:28:45.139 [16822] <6> read_element_status_slot: RES storage element 12

15:29:34.464 [13958] <5> GetResponseStatus: DecodeMount: TLD(0) drive 2, Actual status: Unable to open drive
15:29:34.773 [14045] <3> mode_sense: <tldcd.c:7172> Device geometry: NumDrives = 6 at address 256
15:29:34.773 [14045] <3> mode_sense: --> NumSlots = 120 at address 4096
15:29:34.773 [14045] <3> mode_sense: --> NumTransports = 1 at address 1
15:29:34.773 [14045] <3> mode_sense: --> NumIE = 8 at address 16
15:29:34.774 [14045] <6> inquiry: <tldcd.c:7020> Read device table for SPECTRA PYTHON 2000, type 8, slots 120 and ie 8
15:29:34.774 [14045] <4> MmDeviceMappings::GetRobotAttributes
: <MmDeviceMappings.cpp:974> search robot list (length=388) for SPECTRA PYTHON, type 8

15:29:34.774 [14045] <4> MmDeviceMappings::GetRobotAttributes
: <MmDeviceMappings.cpp:1227> found match: "Spectralogic Python" SPECTRA PYTHON
15:29:34.774 [14045] <5> inquiry: inquiry() function processing library SPECTRA PYTHON 2000:
15:29:34.774 [14045] <6> listen_loop: accept: newfd = -1, error = 4, timersig = 1
15:29:34.774 [14045] <6> listen_loop: tldcd.c.2695, newfd = INVALID_SOCKET, newfd=-1, timersig=1, error=4, EINTR=4, selectret=-1

Thanks !!!

 

I am not able to view Omar Villa's article completely (the "show more" link doesn't work).

What are the possible causes for the EMM database shutting down after 1-2 hours of clean startup?

There are no disk space issues.

Thanks,

Krishma.

Marianne
Level 6
Partner    VIP    Accredited Certified
You will agree that this is the first time you have given us this crucial information:
"EMM database shutting down after 1-2 hours of clean startup."
The device path changes that happened during reboot (because there is no persistent binding between the HBA and the OS) side-tracked us for way too long.

We need to see logs in order to find the reason for emm going down. Full logs. Not snippets.

server.log is the 1st one we need. Please copy to server.txt and upload.

Second is the emm log. Extract unified log info with the vxlogview command for the period leading up to and including when emm went down.
In the example below, we extract info for the last 15 minutes.
vxlogview -o 111 -d all -t 00:15:00 > /tmp/nbemm.txt
Please upload nbemm.txt.
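For example, to pull the emm log around the assertion failure shown in your earlier post (a sketch; the server.log location and the -b/-e date format below are the usual defaults, so adjust if yours differ):

cp /usr/openv/db/log/server.log /tmp/server.txt
/usr/openv/netbackup/bin/vxlogview -o 111 -d all -b "08/03/2017 01:30:00 PM" -e "08/03/2017 02:45:00 PM" > /tmp/nbemm.txt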

There might be more, but this is a good start.

PS:
When last did a successful full catalog backup run?

Marianne,

Thank you for looking into this. I am attaching -

- server.log file

- nbemm log file (8/4/2017, between 2:30 PM and 6:30 PM) - this was the time period when the database was UP and then went down again

- nbemm log file of the last 15 minutes (8/7/2017)

The last full catalog backup happened last week, on Aug 4, when we fixed the device path issue and the database was UP for 1-2 hours.

Please let me know if you need additional log files or information.

Thanks,

Krishma.

Attaching the files ...

Marianne
Level 6
Partner    VIP    Accredited Certified

I do not see server or emm logs.

Please remember to upload them as .txt files.

Also, this is in /var/log/messages -

 

Aug 4 15:50:33 veritas xinetd[2786]: EXIT: vnetd status=0 pid=3045 duration=0(sec)
Aug 4 15:50:33 veritas SQLAnywhere(nb_veritas): *** ERROR *** Assertion failed: 102300 (9.0.2.3900)#012File associated with
given page id is invalid or not open
Aug 4 15:50:34 veritas abrt[3048]: saved core dump of pid 30206 (/usr/openv/db/bin/dbsrv9) to /var/spool/abrt/ccpp-2017-08-0
4-15:50:33-30206.new/coredump (604459008 bytes)
Aug 4 15:50:34 veritas abrtd: Directory 'ccpp-2017-08-04-15:50:33-30206' creation detected
Aug 4 15:50:34 veritas abrtd: Executable '/usr/openv/db/bin/dbsrv9' doesn't belong to any package
Aug 4 15:50:34 veritas abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2017-08-04-15:50:33-30206 (res:2), deleting
Aug 4 15:50:36 veritas xinetd[2786]: EXIT: vnetd status=0 pid=3044 duration=4(sec)
Aug 4 15:50:53 veritas xinetd[2786]: START: nrpe pid=3083 from=::ffff:149.21.228.175
Aug 4 15:50:53 veritas xinetd[2786]: EXIT: nrpe status=0 pid=3083 duration=0(sec)