Re: Netbackup Disk Backup Failover Issue with error 83

rclayton · ‎01-11-2023

Hi all,

We are experiencing an unusual issue that I've spent a few days battling to no avail.

We currently run Netbackup ES 7.1 on Solaris, which speaks to multiple Linux Netbackup 7.6.0.3 media servers. Each media server rotates an external HDD every month on a 4-month cycle.

For this we created 4 storage units, 1 for each disk, plus a storage unit group, and set that group to failover.

On 06 Jan 2023, across our entire estate of media servers (41 in total), all backups started to get stuck and show "Reason: Disk media server is not active, Media server: <MEDIA SERVER NAME>"

I've removed the media server name for security.

I'm new to this disk backup server configuration, so I worked through some of the forums here, where I found the following and applied it:

/usr/openv/netbackup/bin/admincmd/nbemmcmd -updatehost -machinename <MEDIA SERVER> -machinetype media -machinestateop set_disk_active -masterserver <MASTER SERVER>

This initially appeared to solve the issue; however, I'm now left with a problem with the failover aspect.

If I put the disk currently connected to the media server first in the list in the storage unit group, it finds the disk and backs up as normal. If I move that disk to any position other than first, it fails with error 83, and it looks like the failover is not working.

When I check the disk status using nbdevquery, it shows that all disks are up. I can set all but the current active disk to down, and then this appears to work fine regardless of the position in the storage unit group.

From reading, it appears that all but the current active disk should be in a down state in order for MDS to locate the correct disk to write to; however, that status is not changing when we conduct a disk change. This has worked fine for many years and has just stopped all of a sudden.
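To make the "only the attached disk is UP" idea concrete, here is a rough dry-run sketch. It only prints the state changes it would make and runs nothing. The storage type (BasicDisk), the storage unit names and the helper name plan_disk_states are assumptions for illustration, and the exact nbdevconfig arguments (for example whether a -dv volume is also needed) depend on the storage type, so check the command's usage output before running anything it prints.

```shell
#!/bin/sh
# Dry-run sketch: print nbdevconfig commands that would leave only the
# currently attached disk UP. Nothing is executed here.
ADMINCMD=${ADMINCMD:-/usr/openv/netbackup/bin/admincmd}
STYPE=${STYPE:-BasicDisk}   # assumption: adjust to the real storage type

# usage: plan_disk_states <attached-name> <name>...
# The names are hypothetical; substitute the real disk pool / storage unit names.
plan_disk_states() {
    attached=$1
    shift
    for dp in "$@"; do
        if [ "$dp" = "$attached" ]; then
            state=UP
        else
            state=DOWN
        fi
        echo "$ADMINCMD/nbdevconfig -changestate -stype $STYPE -dp $dp -state $state"
    done
}
```

For example, plan_disk_states stu_disk2 stu_disk1 stu_disk2 stu_disk3 stu_disk4 prints four commands, with stu_disk2 set to UP and the other three set to DOWN.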

Our current workaround involves setting all disks to UP and moving the current connected disk to position 1 in the storage unit group.

Any help with the automatic update and failover issue would be greatly appreciated.

davidmoline · ‎01-11-2023

Hi @rclayton

I suspect that if everything was working and suddenly broke everywhere, something was done on or about Jan 6 that caused the problem (maybe a Linux or, less likely, a Solaris update of some kind). Find that, revert it, and see if it fixes the problem. Are the external HDDs attached via USB?

Now, the environment itself: you know you are running unsupported versions of NetBackup and, if I understand correctly, doing something that is also not supported (removing the storage from a disk STU). What are you trying to achieve by rotating the external HDDs? There is probably a more elegant solution.

You should also be looking to upgrade to a supported version of NetBackup.

Cheers
David

rclayton · ‎01-12-2023

I've checked for any patches to any of the backup environment around that time and there are none.

Yes, the external HDDs are attached by USB 3.0.

Unfortunately the solution isn't mine and is directed from above. The idea behind the external USB disk rotation is to maintain off-site backups over a 3-month period. This is the system we have and we have to make do for now, I'm afraid.

This solution has worked fine for us for several years and has only just failed.

Due to our environment and cross-compatibility, upgrading is not an option for us at this time, but it is in line for future action. Until then I really need to find a solution to this one. You're right that the version we have is unsupported, which is why I'm here trying to find answers rather than approaching Veritas.

Thanks

Deb_Wilmot · ‎01-19-2023

Normally when disks are marked offline randomly like that there is a problem with network name resolution or connectivity. Do you know if any network changes were made recently (to include routers, switches, etc)?

It also might be related to security policies - I recently had a number of VMs stop working due to GPO changes in the org.

Here is some information on how to troubleshoot the issue with servers going offline:

Complete the following on the master server:

1. Increase debug and diagnostic levels of nbemm, corba and the NetBackup libraries to 6 by running the following commands:

/usr/openv/netbackup/bin/vxlogcfg -a -p 51216 -o 156 -s DiagnosticLevel=6 -s DebugLevel=6

/usr/openv/netbackup/bin/vxlogcfg -a -p 51216 -o 111 -s DiagnosticLevel=6 -s DebugLevel=6

/usr/openv/netbackup/bin/vxlogcfg -a -p 51216 -o 137 -s DiagnosticLevel=6 -s DebugLevel=6

2. Create the following directory:

/usr/openv/netbackup/logs/admin

3. In /usr/openv/netbackup/bp.conf, add the following:
VERBOSE = 5

4. Stop and start nbemm:

/usr/openv/netbackup/bin/nbemm -terminate
Make sure it's down by running bpps -a | grep nbemm. If needed, repeat the previous nbemm -terminate command until the process is down.
To restart the process, run: /usr/openv/netbackup/bin/nbemm

Complete the following on the media server:
5. Increase debug and diagnostic levels of nbemm and corba to 6 by running the following commands:

/usr/openv/netbackup/bin/vxlogcfg -a -p 51216 -o 156 -s DiagnosticLevel=6 -s DebugLevel=6

/usr/openv/netbackup/bin/vxlogcfg -a -p 51216 -o 111 -s DiagnosticLevel=6 -s DebugLevel=6

6. Create the following directories:

/usr/openv/volmgr/debug

/usr/openv/volmgr/debug/daemon

7. Create a /usr/openv/volmgr/vm.conf file and add VERBOSE to the file.

8. Reproduce the issue.
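As a rough sketch, the vxlogcfg commands from steps 1 and 5 can be generated with a small helper so the same level is applied consistently on each server. The function name print_log_level_cmds and the NB_BIN variable are made up for illustration; the product ID (51216) and originator IDs (156, 111, 137) are the ones listed in the steps above. It only prints the commands, so they can be reviewed before being run.

```shell
#!/bin/sh
# Dry-run sketch: print the vxlogcfg commands for a given level and a list
# of originator IDs. Nothing is executed here.
NB_BIN=${NB_BIN:-/usr/openv/netbackup/bin}

# usage: print_log_level_cmds <level> <originator-id>...
print_log_level_cmds() {
    level=$1
    shift
    for oid in "$@"; do
        echo "$NB_BIN/vxlogcfg -a -p 51216 -o $oid -s DiagnosticLevel=$level -s DebugLevel=$level"
    done
}

# Master server (step 1): originators 156, 111 and 137.
print_log_level_cmds 6 156 111 137

# Media server (step 5): originators 156 and 111.
print_log_level_cmds 6 156 111
```

Once the printed commands look right, they can be pasted into a shell on the relevant server.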

Review of the resulting /usr/openv/volmgr/debug/daemon log should show the incoming connection from the master server on the primary interface, in this example "mastclust":

13:29:10.179[14608.19392] <2> emmlib_initialize: (-) Connection attempt #<0>

13:29:10.179[14608.19392] <4> ValidateConnectionID: (-) Created new Connection ID0

13:29:10.179[14608.19392] <2> emmlib_initializeEx: (-) Connecting to the Server <mastclust> Port <1556>,

The connect back was made to the sending IP address, which is the IP associated with "mastnode2":

13:29:10.194[14608.19392] <2> TAO: TAO (14608|19392) - PBXIOP_Connector::make_connection, to <x.x.x.c:1556:EMM>

13:29:10.210[14608.19392] <2> TAO: TAO (14608|19392) PBXIOP connection to peer <x.x.x.c:1556> on 612

When the connect back is received by the master server, it is returning the "mastnode2.local" name, which is unknown to the media server:

13:29:10.210[14608.19392] <2> TAO: TAO (%P|%t) - Transport::handle_input(): bytes read from socket - HEXDUMP 528 bytes

47 49 4f 50 01 00 00 01 00 00 02 04 00 00 00 00  GIOP............

00 00 00 01 00 00 00 03 00 00 00 1b 49 44 4c 3a  ............IDL:

56 65 72 69 74 61 73 2f 45 4d 4d 2f 53 65 72 76  Veritas/EMM/Serv

65 72 3a 31 2e 30 00 20 00 00 00 01 4f 43 49 01  er:1.0.....OCI.

00 00 01 cc 00 01 02 00 00 00 00 10 73 63 74 63  ...I........mast -----> packet information with name

6c 62 6b 30 32 2e 6c 6f 63 61 6c 00 06 14 3a e8  node2.local...:¿ -----> packet information with name continued

This name was not resolvable on the media server as an alias of the master, so the media server doesn't respond to the master query. This results in the media server changing state and backups not running to the storage units associated with the media server.

In the above example, adding entries for mastnode2.local and mastnode1.local to the media server's /etc/hosts file resolved the issue and allowed the media servers to stay "Active for Disk and Tape".
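A quick way to check whether a hosts file already carries those names is sketched below. The helper name hosts_has_name is made up for illustration, and it only looks at the file it is given; it does not test DNS or any other resolver source the media server might use.

```shell
#!/bin/sh
# Sketch: succeed if <name> appears as a hostname or alias in <hosts-file>.
# usage: hosts_has_name <hosts-file> <name>
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }      # drop comments before looking at fields
        {
            # field 1 is the IP address; the rest are hostname and aliases
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(name)) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

On a media server this could be used as: for n in mastnode1.local mastnode2.local; do hosts_has_name /etc/hosts "$n" || echo "missing: $n"; done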

Hopefully this will help.
