Super Weird Issue: Only DB Backups Failing (Oracle And SQL)

winter_prince · ‎11-28-2014

Environment:

Master: NetBackup-AIX53 7.5.0.7

Media Servers: Mixed (Windows, UNIX, Linux), with mixed versions ranging from 6.5 to 7.5
Media Type: Mixed (Tape, Disk, VTLs) with appliances and deduplication (PureDisk and EMC DataDomain)
Clients: Mixed (Windows, UNIX, Linux), with mixed versions ranging from 6.5 to 7.5 (1000 clients or more)

Problems in such a complicated environment are normal, but recently we have been facing an unusual issue:

Since last week, backups on the master server (as seen in Activity Monitor) hang in "connecting" status and eventually fail with status codes 13 and 6.

The weird thing is that this happens only to DB backups (SQL and Oracle), and to all such backups running at that time: thousands of jobs.

All other backups, such as Windows and UNIX filesystem backups, work fine at the same time for the same clients.

We roped in Symantec Support and they were not able to fix the issue either. As a last resort, restarting the master server services fixed it. The issue then recurred, and Symantec wasn't able to find much on that occasion either.

Has anyone encountered this before? Any idea what the root cause might be?

Regards,
Prince.

Marianne · ‎11-28-2014

You need to check client -> master server connectivity at the time when connectivity issues are seen.
Trying to troubleshoot after the fact without logs is impossible.

2 things are needed for DB backups:
- forward and reverse lookup between master and client
- port connectivity on 1556 or 13724 between master and client
(The client -> master server connection does not happen for filesystem backups)
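
The two requirements above can be spot-checked from either side with a small script. This is only a rough sketch, not a NetBackup tool: it assumes bash (for /dev/tcp) plus the Linux-style getent and timeout utilities, and the function names are made up for this example. On AIX, host or nslookup would have to stand in for getent.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash, getent and timeout are available.
# All function names here are hypothetical helpers, not NetBackup commands.

# forward_ip NAME: print the first address NAME resolves to (empty if none).
forward_ip() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# reverse_name IP: print the first name IP resolves back to (empty if none).
reverse_name() {
    getent hosts "$1" | awk 'NR == 1 { print $2 }'
}

# port_open HOST PORT: succeed if a TCP connection opens within 5 seconds.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_peer NAME: report both lookups and the two ports named in the post.
check_peer() {
    local name=$1 ip port
    ip=$(forward_ip "$name")
    echo "forward: $name -> ${ip:-FAILED}"
    [ -n "$ip" ] && echo "reverse: $ip -> $(reverse_name "$ip")"
    for port in 1556 13724; do
        if port_open "$name" "$port"; then
            echo "port $port: open"
        else
            echo "port $port: closed"
        fi
    done
}
```

Run check_peer with the master name on the client, and with the client name on the master, while the problem is happening. The name printed by the reverse lookup should match the name each side has configured for the other.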

Troubleshooting steps:

On master, check that bprd log folder exists. If not, create the folder and restart NBU.
On client, check that bpcd log folder exists.
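
A minimal sketch of the folder check, assuming the usual UNIX install path /usr/openv/netbackup/logs (adjust if your installation differs). The helper name ensure_log_dir and the NBU_LOG_ROOT variable are invented for this example.

```shell
#!/usr/bin/env bash
# Sketch only. The default path is the common UNIX location and is an
# assumption; NBU_LOG_ROOT and ensure_log_dir are hypothetical names.
NBU_LOG_ROOT=${NBU_LOG_ROOT:-/usr/openv/netbackup/logs}

# ensure_log_dir NAME: create the per-process log folder if it is missing.
# Prints "created" or "exists" so you know whether a restart is still needed.
ensure_log_dir() {
    local dir="$NBU_LOG_ROOT/$1"
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        mkdir -p "$dir" && echo "created: $dir"
    fi
}

# Usage (as root):
#   on the master:  ensure_log_dir bprd
#   on the client:  ensure_log_dir bpcd
```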

When issue is seen, test comms as follows:

On DB client:
bpclntcmd -pn
This command will initiate a connection to the master, similar to DB backup connection.
Check bprd log on the master to see if the connection was received by the master and how the incoming IP address from the client was resolved by the master.

On master, run this command to simulate connect-back from master to client:
bptestbpcd -client <client-name> -verbose -debug
Check output of command and bpcd log on the client
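
Since the tests only help if they are run while the issue is present, it may be worth keeping their output with a timestamp so it can be matched against the bprd and bpcd logs later. A rough sketch follows; the capture helper and the OUT_DIR location are assumptions for this example, not part of NetBackup.

```shell
#!/usr/bin/env bash
# Sketch only: run a command and keep its output in a timestamped file.
# OUT_DIR and capture are hypothetical names chosen for this example.
OUT_DIR=${OUT_DIR:-/tmp/nbu-comms-tests}

# capture LABEL CMD [ARGS...]: save stdout and stderr of CMD to a file,
# append the exit status of CMD as the last line, and print the file name.
capture() {
    local label=$1 file rc=0
    shift
    mkdir -p "$OUT_DIR" || return 1
    file="$OUT_DIR/${label}_$(date +%Y%m%d_%H%M%S).log"
    "$@" >"$file" 2>&1 || rc=$?
    echo "exit status: $rc" >>"$file"
    echo "$file"
}

# On the DB client:  capture bpclntcmd_pn bpclntcmd -pn
# On the master:     capture bptestbpcd bptestbpcd -client <client-name> -verbose -debug
```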


Michael_G_Ander · ‎11-28-2014

My guess would be some sort of resource exhaustion on the master server.

Is there anything to see in nmon? I am thinking of memory (especially virtual memory) and disk I/O latency.

I ask about disk I/O latency because we had some strange issues when the latency of the disk hosting the NBDB and catalog had risen above the recommended 20 ms.

I think there is also something about the number of sessions to NBDB, though I cannot see why that should affect only database backups. But database backups are generally more sensitive to problems than file backups.

And the usual question: did anything change on the master or disk system last week?

Can you start a manual backup when this issue exists? Just to confirm see if the bprd

If the master has been running for a while, it might be worth going through its sizing for the current load to see if it is still up to the job.

I would also run bpps on a schedule to see if any NetBackup process's memory usage grows abnormally.
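
One way to make the scheduled snapshots comparable is to record process sizes in a fixed format and diff the first and last sample. The sketch below uses plain ps rather than parsing bpps output, whose layout varies by platform. The record format "epoch pid vsz_kb command", the function names, and date +%s support are all assumptions for this example.

```shell
#!/usr/bin/env bash
# Sketch only. snapshot and vsz_growth are hypothetical helpers; the record
# format "epoch pid vsz_kb command" is invented for this example.

# snapshot: print one record per process, stamped with the current time.
snapshot() {
    local now
    now=$(date +%s)
    ps -eo pid=,vsz=,comm= | awk -v t="$now" '{ print t, $1, $2, $3 }'
}

# vsz_growth: read records on stdin and print "growth_kb pid command" for
# every process whose size grew between its first and last record,
# largest growth first.
vsz_growth() {
    awk '
        { key = $2 " " $4 }
        !(key in first) { first[key] = $3 }
        { last[key] = $3 }
        END {
            for (key in first) {
                growth = last[key] - first[key]
                if (growth > 0) print growth, key
            }
        }
    ' | sort -k1,1nr
}

# Usage: append "snapshot" output to a file from cron, then run
#   vsz_growth < that_file
```

Processes that restart get a new pid and so start a new series, which is why the pid is part of the key.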

Hope this helps you

The standard questions: Have you checked: 1) What has changed. 2) The manual 3) If there are any tech notes or VOX posts regarding the issue

watsons · ‎11-28-2014

Database backups tend to use more network sockets than other backups. If needed, increase the server port ranges (subject to system resources) so that more ports are available.

Also run netstat -an on your master server to see whether a lot of network connections are sitting in TIME_WAIT. If so, try to look for a way to reduce the TCP keepalive time:

http://www.symantec.com/docs/TECH125896
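
To put a number on "a lot", the netstat output can be summarised per TCP state. A small sketch, assuming the state is the last column of each tcp line (as it is on Linux and AIX); the function name is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: tcp_state_counts is a hypothetical helper name.

# tcp_state_counts: read `netstat -an` output on stdin and print
# "count state" for TCP sockets, most common state first.
tcp_state_counts() {
    awk '$1 ~ /^tcp/ { count[$NF]++ }
         END { for (s in count) print count[s], s }' | sort -k1,1nr
}

# Usage: netstat -an | tcp_state_counts
```

Running this every few minutes while DB backups are queuing shows whether TIME_WAIT climbs just before the jobs start hanging.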

There are also a few other technotes about system tuning for a 7.5 master server, with regard to certain backup types:

http://www.symantec.com/docs/TECH202840

winter_prince · ‎12-07-2014

Thank you guys,

The issue recurred three times. After some deep diagnosis, Symantec has suggested implementing the following items:

Increase File Descriptors on the master and media servers (Currently 10000).
Rebuild NBDB database on our EMM/Master server.
Reduce Maximum jobs per client on master server from 100 to 50.
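
For the first item, a quick way to confirm what limit a shell on each server actually gets is to compare ulimit -n against the intended value. A minimal sketch; the helper name is made up, and the target in the usage line is only a placeholder, since Symantec's recommended value is not stated above.

```shell
#!/usr/bin/env bash
# Sketch only: fd_limit_ok is a hypothetical helper name.

# fd_limit_ok TARGET [CURRENT]: succeed if the soft open-files limit
# (CURRENT, defaulting to `ulimit -n`) is at least TARGET.
fd_limit_ok() {
    local target=$1 current=${2:-$(ulimit -n)}
    [ "$current" = "unlimited" ] || [ "$current" -ge "$target" ]
}

# Usage (20000 is a placeholder target, not a recommendation):
#   fd_limit_ok 20000 && echo "limit is high enough" || echo "limit too low: $(ulimit -n)"
```

Note that this reports the limit of the shell it runs in. The NetBackup daemons inherit their limit from whatever started them, so the check is most meaningful when run in the same way the services are started.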

We will implement these and keep you posted on the outcome.

Regards,
Prince.
