I received this case over a month ago, and my senior engineer and I have been racking our brains for a solution. We appear to be at a standstill, and a fresh set of eyes may help. You may notice that part of the conversation is missing from the attached logs; we have tried to get the right logs three times, and each time we had to wait for a backup to complete, which took a long time. Unfortunately, I cannot go back and ask the customer for more logs. He has been patient so far waiting for a resolution, not for more logs... understandably so.
During a meeting we saw that the customer takes a checkpoint every 7 minutes. One of his clients, bkoweb26, began to back up successfully after an exclusion list was created. We also noticed a timeout of infinity in bpbrm (see the snippet). We have made suggestions and recommendations, below. Some were implemented, but not all.
Snippet of the backup log:
Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP BACKUP bkoweb26_1585391368
Mar 28, 2020 5:53:17 AM - Info bpbrm (pid=7471230) media manager for backup id bkoweb26_1585391368 exited with status 150: termination requested by administrator
Mar 28, 2020 5:53:17 AM - end writing; write time: 0:23:42
network connection broken (40)
Snippet of Media bpbrm
00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
00:07:52.390 [18415728.1] <2> db_end: Need to collect reply
00:07:52.390 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
Master: nmbackup01, configured on third-party, NBU version 8.1.1, Platform: AIX ver. 7.1
Media: nmbpmed05, configured on third-party server, NBU version 8.1.1, Platform AIX ver. 7.1
Clients: nmocmi02, bkoweb26 and bkoweb25
Clients (nmocmi02, nmsplkstore01): bpbkar
Master (nmbackup01): bpbrm, bptm
Media (nmbmed03, nmbpmed05): bpbrm, bptm
Any other recommendation you may have, will be very much appreciated!
Have you checked the PBX logs for the clients?
Use vxlogview -p 50936 -o 103, or check the /var/adm/syslog for the PBX logs written by the O/S.
Would also help to check the bprd logs on the master. You'll see the connection attempts to the master, and you may see the cause of the network disruption.
Status 40 is generally related to another process terminating the connection, like a firewall or something on the system interrupting the connection as opposed to something timing out.
I am prepared to go through ONE set of logs for the failure that you posted, but we need to see ALL text in the job details to show timestamps and PIDs.
Can you please point out which logs exactly contain the specific job, timestamps and PID reflected here:
5:53:15 AM - Info bpbrm (pid=7471230)
Important to look at one set of logs when you troubleshoot - see which client and which media server exactly, what the timestamps and PIDs are, then follow the process flow in the relevant logs.
For example, it does not help to compare this activity monitor entry with the bpbrm log snippet that happened at a different time and with different PIDs. Are these even the same media server?:
Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP BACKUP bkoweb26_1585391368
00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
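To illustrate the point about following one set of PIDs, here is a minimal Python sketch that pulls the PIDs out of the job details and keeps only the legacy log lines written by those PIDs. The function names (`job_pids`, `lines_for_pids`) are made up for this example, and the regular expressions assume the two line formats quoted above; they are not taken from any NetBackup tool.

```python
import re

# Job-details lines look like: "... - Error bpbrm (pid=10289380) ..."
JOB_PID = re.compile(r"\b(\w+) \(pid=(\d+)\)")
# Legacy debug log lines look like: "00:07:52.377 [18415728.1] <2> ..."
LEGACY_PID = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} \[(\d+)(?:\.\d+)?\]")


def job_pids(job_details):
    """Return {process_name: set of PIDs} found in the job details text."""
    found = {}
    for name, pid in JOB_PID.findall(job_details):
        found.setdefault(name, set()).add(int(pid))
    return found


def lines_for_pids(log_lines, pids):
    """Keep only the log lines whose [PID.thread] prefix matches one of pids."""
    kept = []
    for line in log_lines:
        match = LEGACY_PID.match(line)
        if match and int(match.group(1)) in pids:
            kept.append(line)
    return kept


job_details = """\
Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP BACKUP bkoweb26_1585391368
"""
bpbrm_log = [
    "00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)",
]

pids = job_pids(job_details)
matching = lines_for_pids(bpbrm_log, pids.get("bpbrm", set()))
print(pids)
print(len(matching), "matching bpbrm log lines")
```

With the two snippets posted so far, no bpbrm log line matches either PID from the job details, which is exactly the mismatch described above.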
Which timeouts exactly are configured as 0 (infinite) ?
IMHO, there is hardly ever any reason to increase the Client Connect and Client Read timeouts to more than 1800. (The default is 300, which is sufficient in most instances.)
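One way to answer the timeout question is to list every timeout entry in the configuration and flag the unusual ones. The Python sketch below assumes the configuration is available as plain `NAME = value` text (the bp.conf style); the sample values are hypothetical, and treating 0 as suspicious is an assumption based only on the "0 (infinite)" wording in the bpbrm log, which may come from an internal call rather than a configured entry.

```python
import re

# Matches entries such as "CLIENT_READ_TIMEOUT = 300"
ENTRY = re.compile(r"^\s*([A-Z_]*TIMEOUT[A-Z_]*)\s*=\s*(\d+)\s*$")


def timeouts(config_text):
    """Return {entry_name: seconds} for every timeout entry in the text."""
    result = {}
    for line in config_text.splitlines():
        match = ENTRY.match(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


def flag(values, ceiling=1800):
    """Return the entries that are 0 or larger than the suggested ceiling."""
    return {name: v for name, v in values.items() if v == 0 or v > ceiling}


# Hypothetical sample, not taken from the customer's servers.
sample = """\
SERVER = nmbackup01
CLIENT_CONNECT_TIMEOUT = 300
CLIENT_READ_TIMEOUT = 7200
"""

found = timeouts(sample)
flagged = flag(found)
print(flagged)
```

The 1800 ceiling is just the figure suggested above, passed as a parameter so it can be changed.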
About status 40:
The error says the connection is broken outside of NBU - therefore not a NBU timeout or failure.
Level 3 bpbrm and bptm logs on media server will confirm if data was received from client in a continuous stream when network failure occurred.
Always best to involve network and firewall team in situations like these - they need to monitor the port connections while the backup is running.
The issue you posted closely resembles the issue at hand because (I failed to mention) only FULL backups are failing with this exit status. INCs are successfully backing up.
To answer the questions you posed in that discussion:
I'll keep you updated of any progress.
And here lies the frustration...I kept getting half the conversation no matter how many times I re-requested logs. So the job detail I posted was the original job log. I can post the whole log, but it seems moot since that bkoweb26 client is now backing up successfully.
"Which timeouts exactly are configured as 0 (infinite)?"
I have no idea, nor does my senior engineer. I only suggested decreasing it to 7200 because all the other timeouts on his server appear to be set to that, aside from the infinite value, which I am guessing was set on purpose.
Do you think checkpoints are an issue? If not, I will stop harping on that with him.
"The error says the connection is broken outside of NBU."
Could you point me to where you deduced that? That would be helpful to know.
The attached logs are the most recent bpbrm and bptm logs at verbose level 3. I'd really love to involve their networking team, but we received pushback until we have concrete evidence that it's network or FW related.
I've been in that situation before. It is quite frustrating to constantly carry the burden of proof just to receive logs.
Can you post the Detailed Status from the failing client?
Would also help to see the output from bptestbpcd -client <client name> -verbose -debug
This will give you a detailed output of the connection attempt through BPCD to the client.
What kind of O/S is the client? If it's Windows, look at Windows Defender/Firewall. If it's Linux, take a look at the iptables.
Is this a recent occurrence, or has this client always had issues? Lots of background info needed, but based off of what you've said you should focus your troubleshooting on the client.
Are you using Accelerator for this backup?
Thank you. I have requested a bptestbpcd, bpgetconfig -e, vxlogview and bppllist -allpolicies from the failing client and master. It turned out @Marianne was right: we've been looking at the wrong media server, which we realized when we did not see the client listed in the media server's policies. So, I mustered up the courage to ask for the full gamut of logs (again), this time by hostname and not by item. Stay tuned for that info...
The media and master are running AIX and the clients are running RHEL 7.7. I did not look into iptables because smaller fulls and incs are completing successfully. Something must have changed, because this is the first time this client has had any issues (I've checked back over the last year). They use Accelerator in some of their policies, but I did not see it in this client's log. I will find out soon though.
I believe everyone here is on the same page and knows it's a network or FW issue. I have a strong feeling it's the FW timeout, but I HAVE to give the proof before suggesting it. So, your keen eyes and patience on this will be much appreciated!
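As a rough way to look for evidence of an idle timeout, the Python sketch below scans a legacy log and reports the longest pause between consecutive entries. It assumes each entry starts with an `HH:MM:SS.mmm` timestamp, as in the bpbrm snippet, and that the file covers a single day; the second sample line is hypothetical. A long quiet stretch just before the failure would only be consistent with a firewall idle timeout, not proof of one, and the firewall's actual timeout value still has to come from the network team.

```python
from datetime import datetime, timedelta


def largest_gap(log_lines):
    """Return (gap, line_before, line_after) for the longest pause between entries."""
    best = (timedelta(0), None, None)
    prev_time, prev_line = None, None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:12], "%H:%M:%S.%f")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_time is not None:
            gap = stamp - prev_time
            if gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = stamp, line
    return best


sample_log = [
    "00:07:52.390 [18415728.1] <2> db_getdata: timeout is 0 (infinite)",
    # Hypothetical entry, added only to show a long pause.
    "01:07:55.000 [18415728.1] <2> example: next entry after a long pause",
]

gap, before, after = largest_gap(sample_log)
print("longest pause:", gap)
print("before:", before)
print("after: ", after)
```

Running the same function over the bptm log for the same job and time window would show whether data was still flowing while the bpbrm side was quiet.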
This looks like a communication gap or issue between the master and media servers. I have faced a similar situation, and it is really hard to pinpoint the problem.
Please verify whether you are using multiple NICs or IPs to communicate with the media servers. Try using only one IP or NIC on the master server to communicate with all the media servers and check if it helps.
I received a response today. He sent only the vxlogs for the days of failure, and said that he has a single dedicated NIC for backups, with no multiple IP configs.
He said "no binary file is created for bpgetconfig" and left it at that. I have the vxlogs you requested. I looked at them and didn't really know what to look for or how to tell if something is off. Please let me know if you see anything out of the ordinary.