Backup Failure on Multiple Clients with Exit Status 40

KameliaB · ‎05-05-2020

I received this case over a month ago and have been picking my senior engineer's brain for a solution. We appear to be at a standstill, and a fresh set of eyes may help. You may see that part of the conversation is missing from the attached logs; we've attempted to get the right logs three times. Each time we had to wait for a backup to complete, which took a long time. Unfortunately, I cannot go back and ask the customer for more logs. He's been patient thus far for a resolution, not for more logs...understandably so.

During a meeting we saw that the customer runs a checkpoint every 7 minutes. One of his clients, bkoweb26, began to back up successfully after an exclusion list was created. We also noticed a timeout of infinity in bpbrm (see snippet). We have made the suggestions and recommendations below. Some were implemented, but not all.

Snippet of the backup log:

Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP BACKUP bkoweb26_1585391368
Mar 28, 2020 5:53:17 AM - Info bpbrm (pid=7471230) media manager for backup id bkoweb26_1585391368 exited with status 150: termination requested by administrator
Mar 28, 2020 5:53:17 AM - end writing; write time: 0:23:42
network connection broken  (40)

Snippet of Media bpbrm

00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
00:07:52.390 [18415728.1] <2> db_end: Need to collect reply
00:07:52.390 [18415728.1] <2> db_getdata: timeout is 0 (infinite)

Environment info:

Master: nmbackup01, configured on third-party, NBU version 8.1.1, Platform: AIX ver. 7.1

Media: nmbpmed05, configured on third-party server, NBU version 8.1.1, Platform AIX ver. 7.1

Clients: nmocmi02, bkoweb26 and bkoweb25

Logs attached:

Clients (nmocmi02, nmsplkstore01): bpbkar

Master (nmbackup01): bpbrm, bptm

Media (nmbmed03, nmbpmed05): bpbrm, bptm

Recommendations:

Increase checkpoints to every 60min. (not done)
Decrease timeouts to 7200 instead of infinite (not done). I'm thinking these are CLIENT_READ_TIMEOUT or CLIENT_CONNECT_TIMEOUT = 7200 in bp.conf
Since exclusions helped one client succeed, I wondered if the test in this technote could apply to this situation: https://www.veritas.com/support/en_US/article.100003560
Check for communication related patches (tried but did not find any online)
Run bppllist and bpplinfo on media. Run bpgetconfig from a failing and a successful client, to compare. (not done)
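
For the last recommendation, the comparison itself could be sketched roughly as below. This is only a minimal sketch, assuming the bpgetconfig output from each client has been saved as plain text in bp.conf-style "KEY = value" lines; the function names and the sample values are hypothetical and not taken from the customer's systems.

```python
def parse_config(text):
    """Parse 'KEY = value' lines into a dict mapping key -> list of values.

    A list is used because some keys (for example SERVER) can repeat.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings


def diff_configs(failing_text, working_text):
    """Return {key: (failing_values, working_values)} for keys that differ."""
    failing = parse_config(failing_text)
    working = parse_config(working_text)
    differences = {}
    for key in sorted(set(failing) | set(working)):
        if failing.get(key) != working.get(key):
            differences[key] = (failing.get(key), working.get(key))
    return differences


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    failing_sample = "CLIENT_READ_TIMEOUT = 300\nCLIENT_CONNECT_TIMEOUT = 7200\n"
    working_sample = "CLIENT_READ_TIMEOUT = 7200\nCLIENT_CONNECT_TIMEOUT = 7200\n"
    for key, (bad, good) in diff_configs(failing_sample, working_sample).items():
        print(f"{key}: failing={bad} working={good}")
```

A key that exists on only one side shows up with None for the other, which makes a missing setting as visible as a changed one.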

Any other recommendation you may have, will be very much appreciated!

Hamza_H · ‎05-05-2020

Hello,
Could you please take note of this post..

https://vox.veritas.com/t5/NetBackup/Status-41-Network-Connection-Timed-out-on-full-jobs-only/td-p/8...

EthanH · ‎05-05-2020

Have you checked the PBX logs for the clients?

Use vxlogview -p 50936 -o 103, or check the /var/adm/syslog for the PBX logs written by the O/S.

Would also help to check the bprd logs on the master. You'll see the connection attempts to the master, and you may see the cause of the network disruption.

Status 40 is generally related to another process terminating the connection, like a firewall or something on the system interrupting the connection as opposed to something timing out.

Marianne · ‎05-06-2020

I am prepared to go through ONE set of logs for the failure that you posted, but we need to see ALL text in job details to show timestamps and PIDs

Can you please point out which logs exactly contain the specific job, timestamps and PID reflected here:

 5:53:15 AM - Info bpbrm (pid=7471230)

Important to look at one set of logs when you troubleshoot - see which client and which media server exactly, what the timestamps and PIDs are, then follow the process flow in the relevant logs.

For example, it does not help to compare this activity monitor entry with the bpbrm log snippet that happened at a different time and with different PIDs. Are these even the same media server?:

Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP BACKUP bkoweb26_1585391368

00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
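
The mismatch can be checked mechanically. The following is a rough sketch, assuming job detail lines carry the PID as "(pid=N)" and legacy log lines start with "HH:MM:SS.mmm [PID.thread]", as in the snippets quoted in this thread; the function names are hypothetical.

```python
import re

# Patterns assumed from the snippets quoted in this thread.
JOB_DETAIL_PID = re.compile(r"\(pid=(\d+)\)")
LEGACY_LOG_PID = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} \[(\d+)(?:\.\d+)?\]")


def job_detail_pids(lines):
    """Collect every PID mentioned in the job detail lines."""
    pids = set()
    for line in lines:
        pids.update(JOB_DETAIL_PID.findall(line))
    return pids


def matching_log_lines(log_lines, pids):
    """Return the legacy log lines written by one of the given PIDs."""
    matches = []
    for line in log_lines:
        found = LEGACY_LOG_PID.match(line.strip())
        if found and found.group(1) in pids:
            matches.append(line)
    return matches


if __name__ == "__main__":
    job_details = [
        "Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection broken (40)",
        "Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP BACKUP bkoweb26_1585391368",
    ]
    bpbrm_log = [
        "00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)",
    ]
    pids = job_detail_pids(job_details)
    print("PIDs in job details:", sorted(pids))
    print("Matching log lines:", matching_log_lines(bpbrm_log, pids))
```

An empty match list means the log snippet was written by a different process than the ones in the job details, so the two cannot be read as one sequence of events.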

Which timeouts exactly are configured as 0 (infinite) ?
IMHO, there is hardly ever any reason to increase Client Connect and Client Read timeouts to more than 1800. (The default is 300, which is in most instances sufficient.)

About status 40:

The error says the connection is broken outside of NBU - therefore not a NBU timeout or failure.

Level 3 bpbrm and bptm logs on media server will confirm if data was received from client in a continuous stream when network failure occurred.
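
One rough way to look for interruptions in that stream is to scan a log for long silences between consecutive entries. The sketch below assumes each entry starts with a time of day in "HH:MM:SS.mmm" form, as in the bpbrm snippet above; the function names and the threshold are hypothetical. A gap in logging is only a hint of where to look, not proof that the connection was idle.

```python
import re

# Time-of-day prefix assumed from the bpbrm snippet quoted in this thread.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def seconds_of_day(line):
    """Return the entry's time of day in seconds, or None if no timestamp."""
    found = TIMESTAMP.match(line.strip())
    if not found:
        return None
    hours, minutes, seconds, millis = (int(group) for group in found.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def find_gaps(lines, threshold_seconds):
    """Return (gap_seconds, previous_line, next_line) for each long silence."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        current = seconds_of_day(line)
        if current is None:
            continue  # continuation lines without a timestamp are skipped
        if prev_time is not None:
            delta = current - prev_time
            if delta < 0:
                delta += 86400  # assume the log rolled over midnight once
            if delta >= threshold_seconds:
                gaps.append((delta, prev_line.strip(), line.strip()))
        prev_time, prev_line = current, line
    return gaps


if __name__ == "__main__":
    # The third line is made up to illustrate a gap.
    sample = [
        "00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)",
        "00:07:52.390 [18415728.1] <2> db_end: Need to collect reply",
        "00:40:00.000 [18415728.1] <2> hypothetical later entry",
    ]
    for gap, before, after in find_gaps(sample, threshold_seconds=600):
        print(f"{gap:.0f}s of silence between:\n  {before}\n  {after}")
```

Any gap found this way can then be lined up against the failure time in the job details and handed to the network team as a window to monitor.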

Always best to involve network and firewall team in situations like these - they need to monitor the port connections while the backup is running.


KameliaB · ‎05-06-2020

@Hamza_H

The issue you posted closely resembles the issue at hand because (I failed to mention) only FULL backups are failing with this exit status. INCs are successfully backing up.

To answer the questions you posed in that discussion:

Master and Media are separate.
Unsure about a FW between the client, master and media, the question alone was taboo. However, we received a bptestnetconn output and all is well. See attached.
I also believe this backup is failing due to the amount of data in a full vs. an inc. It makes sense that the media is waiting for the set timeout but a FW may not be set to the same timeout and has already dropped the connection.
I will run with your suggestions: activate Accelerator and run a full, try a test dedup, and check whether the switches, router, and client NICs are on full duplex rather than half.

I'll keep you updated of any progress.

KameliaB · ‎05-06-2020

@EthanH

Thank you, I will place this on the list of things to request as well. It will be helpful to find something pointing to FW, if it is the culprit, so we can finally engage their networking team.

KameliaB · ‎05-06-2020

@Marianne

And here lies the frustration...I kept getting half the conversation no matter how many times I re-requested logs. So the job detail I posted was the original job log. I can post the whole log, but it seems moot since that bkoweb26 client is now backing up successfully.

"Which timeouts exactly are configured as 0 (infinite)?"

I have no idea, nor does my senior engineer. I only suggested decreasing it to 7200 because all the other timeouts on his server appear to be set to that, aside from the infinite value, which I am guessing was set on purpose.

Do you think checkpoints are an issue? If not, I will stop harping on that with him.

"The error says the connection is broken outside of NBU."

Could you point me to where you could deduce that? That is helpful to know.

The logs attached are the most recent bpbrm and bptm logs at a verbose level 3. I'd really love to involve their networking team, but we received pushback, until we have concrete evidence it's network or FW related.

EthanH · ‎05-06-2020

@KameliaB

I've been in that situation before; it is quite frustrating to constantly bear the burden of proof just to receive logs.

Can you post the Detailed Status from the failing client?

Would also help to see the output from bptestbpcd -client <client name> -verbose -debug

This will give you a detailed output of the connection attempt through BPCD to the client.

What kind of O/S is the client? If it's Windows, look at Windows Defender/Firewall. If it's Linux, take a look at the iptables.

Is this a recent occurrence, or has this client always had issues? Lots of background info needed, but based off of what you've said you should focus your troubleshooting on the client.

Are you using Accelerator for this backup?

KameliaB · ‎05-07-2020

@EthanH
Thank you. I have requested a bptestbpcd, bpgetconfig -e, vxlogview and bppllist -allpolicies from the failing client and master. It turned out @Marianne was right: we've been looking at the wrong media server, which we realized when we did not see the client listed in the media's policies. So, I mustered up the courage to ask for the full gamut of logs (again), this time by hostname and not by item. Stay tuned for that info...

The media and master are running AIX and the clients are running RHEL7.7. I did not look into iptables because smaller fulls and incs are completing successfully. Something must have changed because this is the first time this client is having any issues (I've checked into the last year). They use Accelerator in some of their policies, but I did not see it in this client's log. I will find out soon though.

I believe everyone here is on the same page and knows it's a network or FW issue. I have a strong feeling it's the FW timeout, but I HAVE to give the proof before suggesting it. So, your keen eyes and patience on this will be much appreciated!

Tape_Archived · ‎05-10-2020

This is a communication gap or issue between the master and media servers. I have faced a similar situation, and it is really hard to pinpoint the problem.

Please verify whether you are using multiple NICs or IPs to communicate with the media servers. Try using only one IP or NIC on the master server to communicate with all the media servers and check if it helps.

KameliaB · ‎05-14-2020

@Tape_Archived

I just received a response. They have a dedicated NIC for backups with only a single IP configured.

KameliaB · ‎05-14-2020

@EthanH

I received a response today. He sent only the vxlogs for the days of failure and confirmed that he has a dedicated NIC for backups, with no multiple IP configs.

He said he had "no binary file is created for bpgetconfig" and left it at that. I have the vxlogs you requested; I looked at them and didn't really know what to look for or how to tell if something is off. Please let me know if you see anything out of the ordinary.
