03-15-2021 02:40 PM
Hi guys,
I've been experiencing some issues with my NetBackup 7.7.3 environment (yes, we are planning to upgrade). I have 8 appliances running 2.7.3.
A few days ago I had to perform a factory reset on one of my appliances. Since then I'm seeing a lot of error 54 failures around 12 pm and 0:00 am (weird, huh?), not only with this appliance but with all of them.
Before the factory reset, I had no problems.
To get the jobs running normally, I have to shut down the services on this appliance. During the rest of the backup window I have no problems, even with this appliance.
I hope I was clear.
03-15-2021 04:56 PM
Hi @AndresV
So a status 54 is a "timed out connecting to client" issue. When you performed the factory reset, did you retain the network settings? Even if so, were there any bp.conf parameters that you may have modified (in particular the timeout-related ones) that were not restored after the reset?
Next - how are these errors appearing (not so clear to me)? Are they backup failures (status 54) with random clients that use this media server, occurring at 0:00 and 12:00? At what stage do the backups fail?
Have you verified that you have restored ALL the required SERVER entries into the media server bp.conf?
Does the appliance behave correctly during the rest of the time?
Lots of possibilities - I think for us to help, we need more information.
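A rough way to check the bp.conf questions above is to compare the pre-reset backup of the file with the current one. The sketch below is only an illustration: it assumes bp.conf uses plain `KEY = value` lines with repeatable SERVER entries, it only looks at CLIENT_CONNECT_TIMEOUT and CLIENT_READ_TIMEOUT, and the host names and values in the example are made up.

```python
from collections import defaultdict

# Timeout entries to compare; extend this tuple if you tuned other settings.
TIMEOUT_KEYS = ("CLIENT_CONNECT_TIMEOUT", "CLIENT_READ_TIMEOUT")


def parse_bp_conf(text):
    """Return {KEY: [values]} for 'KEY = value' lines; keys such as SERVER may repeat."""
    entries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip().upper()].append(value.strip())
    return dict(entries)


def last_value(entries, key):
    """Return the last value given for key, or None if the key is absent."""
    values = entries.get(key)
    return values[-1] if values else None


def compare_bp_conf(before_text, after_text):
    """Report SERVER entries lost after the reset and timeout values that differ."""
    before = parse_bp_conf(before_text)
    after = parse_bp_conf(after_text)
    servers_before = {s.lower() for s in before.get("SERVER", [])}
    servers_after = {s.lower() for s in after.get("SERVER", [])}
    missing_servers = sorted(servers_before - servers_after)
    timeout_changes = {}
    for key in TIMEOUT_KEYS:
        old, new = last_value(before, key), last_value(after, key)
        if old != new:
            timeout_changes[key] = (old, new)
    return missing_servers, timeout_changes


# Hypothetical file contents, for illustration only.
backup_copy = "SERVER = Master\nSERVER = MediaAppl1\nCLIENT_CONNECT_TIMEOUT = 900\n"
current_copy = "SERVER = Master\n"
print(compare_bp_conf(backup_copy, current_copy))
```

In practice you would read the two files from disk and pass their contents in; anything reported as missing or changed is a candidate for what the reset dropped.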
David
03-16-2021 07:21 AM
Hi David,
I appreciate your answer. I didn't retain the network settings because I had an ifconfig backup, an /etc/hosts backup and a bp.conf backup.
We run differential jobs every 2 hours, but the jobs which have to run at 12 pm and 0:00 am start failing with error 54, not only jobs on this appliance but on all appliances. The next differential jobs run fine on every appliance.
The rest of the day the appliances work fine including the one which was reset.
Yes, I checked the entries in the bp.conf file and in Host Properties -> Media Server -> Servers tab.
03-16-2021 07:32 AM
Can you please share all text in Job Details of a status 54 job?
It is okay if you want to replace server names with generic names, e.g. Master, MediaAppl1, Client1, etc.
03-16-2021 07:48 AM
Thanks Marianne,
03/16/2021 00:25:43 - Info nbjm (pid=6104) starting backup job (jobid=4675867) for client ECBPPRQ67, policy ECBPPRQ67_SQL_Diferencial, schedule Default-Application-Backup
03/16/2021 00:25:43 - Info nbjm (pid=6104) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=4675867, request id:{8271B3A4-C324-4B35-BB4C-FB9380C16506})
03/16/2021 00:25:43 - requesting resource STUG_NB1-NB2_DIFF
03/16/2021 00:25:43 - requesting resource ECBPPRNBM1.NBU_CLIENT.MAXJOBS.ECBPPRQ67
03/16/2021 00:25:43 - requesting resource ECBPPRNBM1.NBU_POLICY.MAXJOBS.ECBPPRQ67_SQL_Diferencial
03/16/2021 00:25:52 - granted resource ECBPPRNBM1.NBU_CLIENT.MAXJOBS.ECBPPRQ67
03/16/2021 00:25:52 - granted resource ECBPPRNBM1.NBU_POLICY.MAXJOBS.ECBPPRQ67_SQL_Diferencial
03/16/2021 00:25:52 - granted resource MediaID=@aaacq;DiskVolume=PureDiskVolume;DiskPool=dp_disk_ecbpprnb3;Path=PureDiskVolume;StorageServer=ecbpprnb3;MediaServer=ecbpprnb3
03/16/2021 00:25:52 - granted resource stu_disk_ecbpprnb3
03/16/2021 00:26:35 - Info bpbrm (pid=303787) ECBPPRQ67 is the host to backup data from
03/16/2021 00:26:35 - Info bpbrm (pid=303787) reading file list for client
03/16/2021 00:26:39 - Info bpbrm (pid=303787) listening for client connection
03/16/2021 00:26:40 - Info bpbrm (pid=303787) INF - Client read timeout = 300
03/16/2021 00:27:31 - estimated 0 kbytes needed
03/16/2021 00:27:31 - Info nbjm (pid=6104) started backup (backupid=ECBPPRQ67_1615872406) job for client ECBPPRQ67, policy ECBPPRQ67_SQL_Diferencial, schedule Default-Application-Backup on storage unit stu_disk_ecbpprnb3
03/16/2021 00:27:34 - started process bpbrm (pid=303787)
03/16/2021 00:27:38 - connecting
03/16/2021 00:31:40 - Error bpbrm (pid=303787) listen for client timeout during accept from data listen socket after 60 seconds
03/16/2021 00:31:41 - Info dbclient (pid=0) done. status: 54: timed out connecting to client
03/16/2021 00:32:43 - end writing
timed out connecting to client (54)
ECBPPRNB3 is my NetBackup Appliance which was reset.
03-16-2021 08:03 AM
Something else is happening here - in this particular job, the media server was waiting for comms from the client, but did not receive anything:
03/16/2021 00:26:39 - Info bpbrm (pid=303787) listening for client connection
03/16/2021 00:26:40 - Info bpbrm (pid=303787) INF - Client read timeout = 300
03/16/2021 00:31:41 - Info dbclient (pid=0) done. status: 54: timed out connecting to client
All we see is the timeout that happened after 300 sec (5 minutes).
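To double-check that gap, the timestamps at the start of the Job Details lines can be subtracted directly. This is just a small sketch that assumes every line begins with a `MM/DD/YYYY HH:MM:SS` timestamp, as in the output posted above; the helper names are made up for illustration.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading 'MM/DD/YYYY HH:MM:SS' timestamp of a Job Details line."""
    return datetime.strptime(line[:19], "%m/%d/%Y %H:%M:%S")


def seconds_between(start_line, end_line):
    """Whole seconds elapsed between two Job Details lines."""
    return int((parse_ts(end_line) - parse_ts(start_line)).total_seconds())


# Two lines copied from the job details posted in this thread.
listening = "03/16/2021 00:26:40 - Info bpbrm (pid=303787) INF - Client read timeout = 300"
failed = "03/16/2021 00:31:41 - Info dbclient (pid=0) done. status: 54: timed out connecting to client"

print(seconds_between(listening, failed))  # 301
```

The 301 seconds line up with the 300-second Client read timeout reported by bpbrm.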
You will need to see what is happening on the client at that time that could cause a delay in the start of SQL backup.
Check dbclient log, Event Viewer Application Log, SQL Errorlog and VDI log.
I cannot see anything that could be related to Appliance reset.
Maybe the Client Connect and Client Read Timeouts were set higher on the Appliance before the reset, to accommodate other processes running on the clients around midnight that could delay backup processing.
03-16-2021 08:24 AM
Well, the problem is that all differential jobs fail, which means all clients fail with the same error.
If I shut down the services on this particular appliance, the jobs run fine even at 0:00 am and 12 pm.
03-16-2021 11:38 PM
Unfortunately we can only work with the evidence we have.
To troubleshoot an issue with the Appliance OS (where the issue probably is), you would need to log a call with Veritas Support.
As you already know, version 2.7.3 is EOL. Veritas would also ask you to upgrade.
Maybe you should focus efforts on upgrading the environment?
If the issue persists past the upgrade, you will be able to log a Support call.
In the meantime, try to increase timeouts to 900 or even 1800.
03-17-2021 01:15 AM
Hi @AndresV
While I think @Marianne has the most sensible suggestion and you should start planning an upgrade, I'm curious about what might be the cause of the problem you are seeing. I struggle to see why an issue on the one appliance would cause widespread issues as you describe and only at those specific times.
Can you answer some questions please?
To me it seems like one of the housekeeping tasks the appliance runs is causing the issue, but why it affects all backups at that time eludes me for now. I don't have ready access to an appliance to see what might be running at that time (but I'll fire up a lab when I have a spare moment and see what I can see).
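If it helps to confirm the pattern first, a quick tally of status 54 failures by start hour would show whether they really cluster at 0:00 and 12:00. This is only a sketch: it assumes you can export (start time, status) pairs for your jobs in some way, and apart from the first entry, which is the job posted in this thread, the sample records are made up.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(records, status=54):
    """Count jobs that ended with `status`, keyed by the hour the job started."""
    counts = Counter()
    for started, job_status in records:
        if job_status == status:
            hour = datetime.strptime(started, "%m/%d/%Y %H:%M:%S").hour
            counts[hour] += 1
    return dict(sorted(counts.items()))


sample = [
    ("03/16/2021 00:25:43", 54),  # start time of the job posted in this thread
    ("03/16/2021 02:25:00", 0),   # hypothetical record
    ("03/16/2021 12:25:00", 54),  # hypothetical record
]
print(failures_by_hour(sample))
```

If the counts pile up only in hours 0 and 12 across all media servers, that points at something scheduled at those times rather than at individual clients.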
David
03-17-2021 06:25 AM
@davidmoline wrote:
I struggle to see why an issue on the one appliance would cause widespread issues as you describe and only at those specific times.
Can you answer some questions please?
- If you remove the problem appliance from backups, how do you manage the lack of the media server and associated disk pool? Is this a manual process to change the storage or what?
David
Several good questions from David. When you start the troubleshooting, start with the above - remove, if possible, the appliance on which you performed the factory reset. Do backups run normally after that? If yes, you know where to concentrate to find the cause.
03-17-2021 07:01 AM
Thank you David,
03-17-2021 03:27 PM
Hi @AndresV
Thanks for your answers - but of course this leads to more questions....
Can you provide details of a job that has failed on a different appliance also please?
Assuming you can manage the data on this appliance (discarding it or duplicating it elsewhere if needed), an interim option could be to reimage the appliance (rather than factory reset it) using the USB stick that should have been provided. I still think upgrading to a currently supported version of NetBackup is your best bet (but certainly more painful than fixing a single appliance).
David