Forum Discussion

sshagent
Level 5
13 years ago
Solved

backups hanging

Is anyone experiencing lots of hung backups?  

We're not using anything fancy, just regular backups to tape (or disk) with no advanced features. Linux master, media servers and clients. We're running 7.5.0.3, so there is nowhere to go patch-wise.

Basically my backups start off, and then they just don't seem to be doing anything. From what I've seen, it looks like bpbkar is disappearing.

The end of the bpbkar log for a currently hung backup has this...

 

21:11:57.746 [14008] <4> bpbkar: INF - Processing /path/to/bond23/sbp_141/sbp_141_0250
23:14:00.278 [14008] <16> bpbkar: ERR - bpbkar FATAL exit status = 23: socket read failed
23:14:00.278 [14008] <4> bpbkar: INF - EXIT STATUS 23: socket read failed
23:14:00.278 [14008] <4> bpbkar: INF - setenv FINISHED=0
 
...but surely if bpbkar dies, the rest of the processes should abort and the job should error out with that exit code, which isn't happening.
If I check the job via vxlogs or bperror, there is no sign of that bpbkar error.
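
For anyone reading along, the gap between the two timestamps in the excerpt above can be worked out with a short script. This is only a sketch: the two log lines are copied from the excerpt, while the helper names (`parse_time`, `gap_between`) and the midnight-rollover handling are assumptions made for illustration, since these bpbkar log lines carry a time of day but no date.

```python
import re
from datetime import datetime, timedelta

# The two relevant lines, copied verbatim from the bpbkar log excerpt.
LOG_EXCERPT = """\
21:11:57.746 [14008] <4> bpbkar: INF - Processing /path/to/bond23/sbp_141/sbp_141_0250
23:14:00.278 [14008] <16> bpbkar: ERR - bpbkar FATAL exit status = 23: socket read failed
"""

# Matches the leading HH:MM:SS.mmm field of a log line.
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) ")


def parse_time(line):
    """Return the time-of-day stamp at the start of a log line."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError(f"no timestamp at start of line: {line!r}")
    return datetime.strptime(match.group(1), "%H:%M:%S.%f")


def gap_between(first_line, second_line):
    """Return the elapsed time between two log lines.

    Assumption: if the second stamp is earlier than the first, the log
    crossed midnight exactly once.
    """
    start = parse_time(first_line)
    end = parse_time(second_line)
    if end < start:
        end += timedelta(days=1)
    return end - start


lines = LOG_EXCERPT.splitlines()
gap = gap_between(lines[0], lines[1])
print(gap)  # 2:02:02.532000
```

So bpbkar logged nothing for a little over two hours between starting that file and the socket read failure.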
 
Oh, it's probably worth mentioning that there are no firewalls involved either. This has me puzzled. Most backups go through, but some don't (on seemingly random clients and media servers).
 
thanks for your time
 
 
 
 
 
 
  • That file is just shy of 1 GB. I ran a while/do/done loop (100 iterations) copying that file to see whether it would randomly hang, but it just flew through them with no hassles.
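
A rough Python equivalent of that kind of repeated-copy test is sketched below. It is not the shell loop that was actually run; the function name `copy_repeatedly`, the scratch file name and the idea of timing each copy are all assumptions added for illustration.

```python
import shutil
import time
from pathlib import Path


def copy_repeatedly(src, dest_dir, loops=100):
    """Copy src into dest_dir `loops` times and return each copy's duration.

    The same scratch file is overwritten on every pass, so only one extra
    copy of the data exists at any time. A copy that hangs would show up
    as the loop never returning; a slow one as an outlier in the result.
    """
    target = Path(dest_dir) / "copy_test.tmp"  # hypothetical scratch name
    durations = []
    for _ in range(loops):
        start = time.monotonic()
        shutil.copyfile(src, target)
        durations.append(time.monotonic() - start)
    target.unlink(missing_ok=True)  # clean up the scratch copy
    return durations
```

Calling `max(copy_repeatedly(source_file, scratch_dir))` with the suspect file would give the slowest of the 100 copies, which is a quick way to spot a read that stalls without hanging outright.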

     

     

  • None of the backups are hanging this morning, which is a lovely sight! So it 'seems' that disabling CPR has been a workaround.

    Does a CPR interval of 180 minutes mean more workload than, say, 30 minutes? I ask because I wonder whether I should track down where the fault lies once I have a decent backup set. Presumably every X (currently 180) minutes it updates its progress, but with less frequent CPR updates, surely each update has more data to parse? What I'm getting at is whether I should look into whichever process is struggling to do the CPR updates. Any idea which process handles that?

    It's no major issue to leave CPR disabled until a patch comes along, but I would like to diagnose this completely and report it so that others don't experience it.

  • There is a slight performance hit using CPR, but not huge.

    Every x (180) minutes it will want to perform a checkpoint and will prepare itself for that, but it can only actually do so at file/folder boundaries, so it must finish backing up the file it is on before it can save the checkpoint.

    That is the reason I asked about the size of the last listed file and how long it might take to back up.

    Assuming there are no firewalls etc., the only thing the 2 hours can be put down to, as far as I can see, is the keep alive timeout. The settings I gave earlier may well overcome that for you, allowing you to keep the 180 minute CPR. By default CPR is 15 minutes, and most of my customers use 30 to 60 minutes.

    Perhaps having the CPR interval in excess of the 2 hour keep alive is the issue, so it may be worth setting it to less than 120.

    Hope this helps
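
To check that comparison on a Linux host, something like the sketch below could be used. It reads the kernel's TCP keepalive values from `/proc/sys/net/ipv4` (where `tcp_keepalive_time` defaults to 7200 seconds, i.e. 2 hours) and compares them with a checkpoint interval. The function names are made up for illustration, and whether the keepalive timer is really what kills these jobs is the hypothesis above, not something the script can confirm.

```python
from pathlib import Path

# Standard location of the Linux TCP keepalive sysctls.
KEEPALIVE_DIR = Path("/proc/sys/net/ipv4")


def read_keepalive_settings(base=KEEPALIVE_DIR):
    """Return the kernel TCP keepalive sysctls as a dict of ints.

    A value is None if the file is missing or unreadable, for example
    on a non-Linux host.
    """
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    settings = {}
    for name in names:
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


def checkpoint_exceeds_keepalive(checkpoint_minutes, keepalive_seconds):
    """True if the checkpoint interval is longer than the keepalive idle time."""
    return checkpoint_minutes * 60 > keepalive_seconds


settings = read_keepalive_settings()
print(settings)

idle = settings["tcp_keepalive_time"]
if idle is not None:
    # 180 is the checkpoint interval quoted in this thread.
    print("180 minute checkpoint exceeds keepalive:",
          checkpoint_exceeds_keepalive(180, idle))
```

With the default 7200 second idle time, a 180 minute interval exceeds it while 30 to 60 minutes does not, which matches the suggestion to stay under 120.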