Error 54 every 12 hours

AndresV
VIP
Partner    VIP   

Hi guys,

I've been experiencing some issues with my NetBackup 7.7.3 environment (yes, we are planning to upgrade). I have 8 appliances running 2.7.3.

A few days ago I had to perform a factory reset on one of my appliances. Since then I'm seeing a lot of status 54 errors at 12 pm and 0:00 am (weird, huh?), not only on this appliance but on all of them.

Before the factory reset, I had no problems.

To get the jobs running normally, I have to shut down the services on this appliance. During the rest of the backup window I have no problems, even with this appliance.

I hope I was clear.

11 REPLIES 11

davidmoline
Level 6
Employee

Hi @AndresV 

So a status 54 is a timeout connecting to the client. When you performed the factory reset, did you retain the network settings? Even if you did, were there any bp.conf parameters that you had modified (in particular the timeout-related ones) that were not restored after the reset?

Next - how are these errors appearing (not so clear to me)? Are they backup issues (status 54) with random clients that are using this media server which occur at 0:00 and 12:00? At what stage are the backups?

Have you verified that you have restored ALL the required SERVER entries into the media server bp.conf?

Does the appliance behave correctly during the rest of the time? 

Lots of possibilities - I think for us to help, we need more information.
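If you still have the bp.conf copy from before the reset, one quick way to answer the bp.conf questions above is to diff it against the live file. Below is a minimal Python sketch, not a supported tool. It assumes the usual `KEY = value` line layout, that a key such as SERVER can repeat, and that lines starting with `#` can be ignored. The host names in the sample text (master1, mediaappl1) are placeholders, and the list of keys to compare is only my suggestion.

```python
from collections import defaultdict

# Keys worth comparing after a reset (a suggested list, adjust as needed).
KEYS_TO_CHECK = (
    "SERVER",
    "MEDIA_SERVER",
    "CLIENT_CONNECT_TIMEOUT",
    "CLIENT_READ_TIMEOUT",
)


def parse_bp_conf(text):
    """Parse 'KEY = value' lines into {KEY: [values...]}; keys may repeat."""
    entries = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines, '#' lines and lines without '=' carry no setting.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip().upper()].append(value.strip())
    return dict(entries)


def diff_bp_conf(before_text, after_text, keys=KEYS_TO_CHECK):
    """Report values that disappeared or appeared for the selected keys."""
    before = parse_bp_conf(before_text)
    after = parse_bp_conf(after_text)
    report = {}
    for key in keys:
        old = before.get(key, [])
        new = after.get(key, [])
        missing = [v for v in old if v not in new]
        added = [v for v in new if v not in old]
        if missing or added:
            report[key] = {"missing": missing, "added": added}
    return report


# Placeholder content; in practice read the backup copy and the live bp.conf.
before = "SERVER = master1\nSERVER = mediaappl1\nCLIENT_READ_TIMEOUT = 1800\n"
after = "SERVER = master1\nCLIENT_READ_TIMEOUT = 300\n"
report = diff_bp_conf(before, after)
for key, change in report.items():
    print(key, change)
```

Anything listed under "missing" for SERVER or the timeout keys would be the first thing I'd put back.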

David

AndresV
VIP
Partner    VIP   

Hi David,

I appreciate your answer. I didn't retain the network settings because I had an ifconfig backup, an /etc/hosts backup and a bp.conf backup.

We run differential jobs every 2 hours, but the jobs scheduled for 12 pm and 0:00 am fail with error 54, not only on this appliance but on all appliances. The next differential jobs run fine on every appliance.

The rest of the day the appliances work fine, including the one that was reset.

Yes, I checked the entries in the bp.conf file and in Host Properties -> Media Server -> Servers tab.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

@AndresV 

Can you please share all text in Job Details of a status 54 job?

It is okay if you want to replace server names with generic names, e.g. Master, MediaAppl1, Client1, etc.

AndresV
VIP
Partner    VIP   

Thanks Marianne,

 

03/16/2021 00:25:43 - Info nbjm (pid=6104) starting backup job (jobid=4675867) for client ECBPPRQ67, policy ECBPPRQ67_SQL_Diferencial, schedule Default-Application-Backup

03/16/2021 00:25:43 - Info nbjm (pid=6104) requesting STANDARD_RESOURCE resources from RB for backup job (jobid=4675867, request id:{8271B3A4-C324-4B35-BB4C-FB9380C16506})

03/16/2021 00:25:43 - requesting resource STUG_NB1-NB2_DIFF

03/16/2021 00:25:43 - requesting resource ECBPPRNBM1.NBU_CLIENT.MAXJOBS.ECBPPRQ67

03/16/2021 00:25:43 - requesting resource ECBPPRNBM1.NBU_POLICY.MAXJOBS.ECBPPRQ67_SQL_Diferencial

03/16/2021 00:25:52 - granted resource  ECBPPRNBM1.NBU_CLIENT.MAXJOBS.ECBPPRQ67

03/16/2021 00:25:52 - granted resource  ECBPPRNBM1.NBU_POLICY.MAXJOBS.ECBPPRQ67_SQL_Diferencial

03/16/2021 00:25:52 - granted resource  MediaID=@aaacq;DiskVolume=PureDiskVolume;DiskPool=dp_disk_ecbpprnb3;Path=PureDiskVolume;StorageServer=ecbpprnb3;MediaServer=ecbpprnb3

03/16/2021 00:25:52 - granted resource  stu_disk_ecbpprnb3

03/16/2021 00:26:35 - Info bpbrm (pid=303787) ECBPPRQ67 is the host to backup data from

03/16/2021 00:26:35 - Info bpbrm (pid=303787) reading file list for client

03/16/2021 00:26:39 - Info bpbrm (pid=303787) listening for client connection

03/16/2021 00:26:40 - Info bpbrm (pid=303787) INF - Client read timeout = 300

03/16/2021 00:27:31 - estimated 0 kbytes needed

03/16/2021 00:27:31 - Info nbjm (pid=6104) started backup (backupid=ECBPPRQ67_1615872406) job for client ECBPPRQ67, policy ECBPPRQ67_SQL_Diferencial, schedule Default-Application-Backup on storage unit stu_disk_ecbpprnb3

03/16/2021 00:27:34 - started process bpbrm (pid=303787)

03/16/2021 00:27:38 - connecting

03/16/2021 00:31:40 - Error bpbrm (pid=303787) listen for client timeout during accept from data listen socket after 60 seconds

03/16/2021 00:31:41 - Info dbclient (pid=0) done. status: 54: timed out connecting to client

03/16/2021 00:32:43 - end writing

timed out connecting to client  (54)

ECBPPRNB3 is my NetBackup appliance that was reset.

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

Something else is happening here - in this particular job, the media server was waiting for comms from the client, but did not receive anything:

03/16/2021 00:26:39 - Info bpbrm (pid=303787) listening for client connection

03/16/2021 00:26:40 - Info bpbrm (pid=303787) INF - Client read timeout = 300

03/16/2021 00:31:41 - Info dbclient (pid=0) done. status: 54: timed out connecting to client

All we see is the timeout that happened after 300 sec (5 minutes)

You will need to see what is happening on the client at that time that could cause a delay in the start of SQL backup.
Check dbclient log, Event Viewer Application Log, SQL Errorlog and VDI log.

I cannot see anything that could be related to Appliance reset.
Maybe Client Connect and Client Read Timeout was longer on the Appliance to accommodate other processes running on clients roundabout midnight that could delay backup processing.
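To make the timing in the job details explicit, here is a small Python sketch that parses the pasted lines and measures the gap between the "Client read timeout" line and the bpbrm error. It assumes timestamps in `MM/DD/YYYY HH:MM:SS` form followed by ` - `, as in the job details posted above. The function names are my own.

```python
import re
from datetime import datetime

# One job-details line: timestamp, " - ", then the message text.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) - (.*)$")


def parse_job_details(text):
    """Return a list of (timestamp, message) tuples for lines that match."""
    events = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            events.append((ts, match.group(2)))
    return events


def seconds_between(events, start_marker, end_marker):
    """Seconds from the first message containing start_marker to the first containing end_marker."""
    start = next((ts for ts, msg in events if start_marker in msg), None)
    end = next((ts for ts, msg in events if end_marker in msg), None)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Lines copied from the job details in this thread.
JOB_DETAILS = """\
03/16/2021 00:26:39 - Info bpbrm (pid=303787) listening for client connection
03/16/2021 00:26:40 - Info bpbrm (pid=303787) INF - Client read timeout = 300
03/16/2021 00:31:40 - Error bpbrm (pid=303787) listen for client timeout during accept from data listen socket after 60 seconds
03/16/2021 00:31:41 - Info dbclient (pid=0) done. status: 54: timed out connecting to client
"""

events = parse_job_details(JOB_DETAILS)
gap = seconds_between(events, "Client read timeout", "listen for client timeout")
print(f"Seconds between timeout setting and bpbrm error: {gap}")
```

For this job the gap is 300 seconds, which matches the Client read timeout value shown in the log. Running the same check against the job details of failures on the other appliances would show whether they all hit the same limit.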

 

AndresV
VIP
Partner    VIP   

Well, the problem is that all differential jobs fail, which means all clients fail with the same error.

If I shut down the services on this particular appliance, the jobs run fine, even at 0:00 am and 12 pm.

 

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

@AndresV 

Unfortunately we can only work with the evidence we have.

To troubleshoot an issue with the Appliance OS (where the issue probably is), you would need to log a call with Veritas Support.
As you already know, version 2.7.3 is EOL. Veritas would also ask you to upgrade.

Maybe you should focus efforts on upgrading the environment?
If the issue persists past the upgrade, you will be able to log a Support call.

In the meantime, try to increase timeouts to 900 or even 1800.

Hi @AndresV 

While I think @Marianne has the most sensible suggestion and you should start planning an upgrade, I'm curious about what might be the cause of the problem you are seeing. I struggle to see why an issue on the one appliance would cause widespread issues as you describe and only at those specific times. 

Can you answer some questions please?

  1. Are the backups that are failing all SQL backups - and are they transaction log backups? (The connection setup for this type of backup is somewhat different to a standard backup.)
  2. For the job details you provided, I see it was using the media server (and disk pool) from the server ecbpprnb3. Is this the problem server?
  3. Are the other backups that fail at the same time using this media server in any way (load balancing, disk pool etc)? 
  4. If you remove the problem appliance from backups, how do you manage the lack of the media server and associated disk pool? Is this a manual process to change the storage or what?
  5. Although I can't see why this might be time related, are you sure you restored all network routing correctly to the appliance?
  6. When you performed the factory reset, did you retain the backup data on the appliance?
  7. Can the data on that appliance be discarded or recreated? - a possible solution might be to reimage the appliance and start from scratch.
  8. Have you re-installed all EEBs that were installed on the appliance (and are installed on the other appliances)?
  9. What was the reason for performing the factory reset in the first place?

To me it seems like one of the housekeeping tasks the appliance runs is causing the issue, but why it affects all backups at that time eludes me for now. I don't have ready access to an appliance to see what might be running at that time (but I'll fire up a lab when I have a spare moment and see what I can see).

David

X2
Moderator
Moderator
   VIP   

@davidmoline wrote:

 I struggle to see why an issue on the one appliance would cause widespread issues as you describe and only at those specific times. 

Can you answer some questions please?

  1. If you remove the problem appliance from backups, how do you manage the lack of the media server and associated disk pool? Is this a manual process to change the storage or what?

David


Several good questions from David. When you start the troubleshooting, start with the above - remove, if possible, the appliance on which you performed the factory reset. Do backups run normally after that? If yes, you know where to concentrate to find the cause.

AndresV
VIP
Partner    VIP   

Thank you David,

  1. Are the backups that are failing all SQL backups - and are they transaction log backups? (The connection setup for this type of backup is somewhat different to a standard backup.) Answer: All the jobs are differential backups which have to run every 2 hours (yesterday they failed at 0:00 am, 8:00 am and 12 pm).
  2. For the job details you provided, I see it was using the media server (and disk pool) from the server ecbpprnb3. Is this the problem server? Answer: Yes, that is the problem appliance, but jobs fail not only on this appliance but on all appliances at these particular hours.
  3. Are the other backups that fail at the same time using this media server in any way (load balancing, disk pool etc)? Answer: The jobs use STU.
  4. If you remove the problem appliance from backups, how do you manage the lack of the media server and associated disk pool? Is this a manual process to change the storage or what? Answer: I kill the services on this appliance and I change the policy to another STU.
  5. Although I can't see why this might be time related, are you sure you restored all network routing correctly to the appliance? Answer: I have another 7 appliances, so I compared against them. At first the gateway was set wrong, so I corrected it, but the problem is still there.
  6. When you performed the factory reset, did you retain the backup data on the appliance? Answer: No.
  7. Can the data on that appliance be discarded or recreated? - a possible solution might be to reimage the appliance and start from scratch. Answer: I already discarded everything and performed a factory reset.
  8. Have you re-installed all EEBs that were installed on the appliance (and are installed on the other appliances)? Answer: Yes, I did; however, I can't find an EEB related to this problem.
  9. What was the reason for performing the factory reset in the first place? Answer: Data corruption.

Hi @AndresV 

Thanks for your answers - but of course this leads to more questions....

  • So if I understand correctly from your replies, the failed jobs use media server load balancing to write to the STU on the problem appliance - can you confirm this please?
  • What is the nature of all the data that is backed up by these differential backups every 2 hours? (SQL, file, is it compressed or encrypted?)
  • Besides the 2 hourly differential backups what else happens around these problem times in NetBackup (other backup jobs, tape duplications)?
  • Okay - so tell us a bit more about the configuration of the problem appliance - is it the same as was originally delivered or has it been upgraded subsequently (officially or otherwise)? How much storage and how much RAM is in the appliance? Is all or most of the storage utilised as MSDP?
  • For the backups that are running, does anything different happen at the source around the problem times (extra processing leading to more data to be backed up, for instance)?
  • Have you checked that there are no underlying hardware issues on this problem appliance (RAID set issues for instance - CLISH: Support->Test Hardware)?
  • When you modify the policies to avoid using the problem appliance do you change everything to use a single different STU?
  • Have you just tried changing the policy to the different STU without shutting down NBU on the problem appliance? Does it cause backup failures?

Can you provide details of a job that has failed on a different appliance also please?

Assuming you can manage the data on this appliance (discarding or duplicating elsewhere if needed), an interim option could be to reimage the appliance (rather than factory reset) using the USB stick that should have been provided. I still think upgrading to a current supported version of NetBackup is your best bet (but certainly more painful than fixing a single appliance).

David