backups hanging

sshagent
Level 4

Is anyone experiencing lots of hung backups?  

We're not using anything fancy, just regular backups to tape (or disk): no advanced features and such. Linux master, media servers and clients. Running 7.5.0.3, so there's nowhere to go patch-wise.

Basically my backups start off and then just don't seem to be doing anything. From what I've seen it looks like bpbkar is disappearing.

The end of the bpbkar log for a currently hung backup has this...

 

21:11:57.746 [14008] <4> bpbkar: INF - Processing /path/to/bond23/sbp_141/sbp_141_0250
23:14:00.278 [14008] <16> bpbkar: ERR - bpbkar FATAL exit status = 23: socket read failed
23:14:00.278 [14008] <4> bpbkar: INF - EXIT STATUS 23: socket read failed
23:14:00.278 [14008] <4> bpbkar: INF - setenv FINISHED=0
 
...but surely if bpbkar dies, the rest of the processes should abort and report the error exit code... which isn't happening. If I check the job via vxlogs or bperror there is no sign of that bpbkar error.

Oh, it's probably worth mentioning there are no firewalls involved either. It has me puzzled. Most backups go through, but some don't (on seemingly random clients and media servers).
 
thanks for your time
 
 
 
 
 
 


Marianne
Level 6
Partner    VIP    Accredited Certified

Check bpbrm and bptm logs on media server to see up to when data was received from client.

What is Client Read Timeout on media server(s)?
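For reference, on a Linux media server the client read timeout is a bp.conf entry; a minimal sketch, assuming the default install path (the 1800 shown here is only an example value, and if the entry is absent NetBackup falls back to its documented default of 300 seconds):

```
# /usr/openv/netbackup/bp.conf (media server)
CLIENT_READ_TIMEOUT = 1800
```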

Mark_Solutions
Level 6
Partner Accredited Certified

This could be a number of things, and from what you have said it would need to be pinned down on the client initially.

Your mention of a firewall sounded promising, but you say none are active - so look at any anti-virus next to see if it is killing the processes.

If the job does not fail, then the media server is not getting that broken-connection status from the client, which you would expect it to after 2 hours ... 2 hours - now that rings bells!!

Keep-alive times have a default of 2 hours - as do firewalls!!

The clients may be doing a lot of processing, so no data goes back to the media server within that time, but bpbkar crashing out after 2 hours indicates that it is either sending data in the wrong direction (multiple networks?) or that its keep-alive time, interval and probes need to be adjusted.

So check the networks, check if any data has passed at all, double-check again for firewalls, and check the keep-alive settings on the clients.

You may need to increase the client and media server logging levels and see what is in the bpcd and bpbkar logs on the client, as well as bpbrm and bpcd on the media server.

Hope this helps

sshagent
Level 4

There isn't a firewall; this is all on the backup network.

My client read timeout is 7200, so that might explain the timing... but then the backup should fail rather than sitting active without doing anything. It's a bit on the excessive side, but historically it was needed. I could probably bring it down, but I'm not necessarily convinced NetBackup is using this timeout value, as it's not actually erroring.

This is all Linux so no anti-virus involved; I am getting the disk application checked out in case it's objecting to something.

 

bpbkar

21:11:57.746 [14008] <4> bpbkar: INF - Processing /path/to/bond23/sbp_141/sbp_141_0250
23:14:00.278 [14008] <16> bpbkar: ERR - bpbkar FATAL exit status = 23: socket read failed
23:14:00.278 [14008] <4> bpbkar: INF - EXIT STATUS 23: socket read failed
23:14:00.278 [14008] <4> bpbkar: INF - setenv FINISHED=0


bpbrm
00:13:56.948 [13985] <2> bpbrm send_parent_msg: WROTE nbmedia1_1345472039 600000 0 53971.800 0
00:13:58.948 [13985] <2> bpbrm read_parent_msg: read from parent
00:14:03.448 [13985] <2> bpbrm read_media_msg: read from media manager: WROTE nbmedia1_1345472039 600000 0 53979.470 0
00:14:03.448 [13985] <2> bpbrm send_parent_msg: WROTE nbmedia1_1345472039 600000 0 53979.470 0
00:14:09.448 [13985] <2> bpbrm read_media_msg: read from media manager: WROTE nbmedia1_1345472039 600000 0 53987.894 0
00:14:09.448 [13985] <2> bpbrm send_parent_msg: WROTE nbmedia1_1345472039 600000 0 53987.894 0
00:23:58.966 [13985] <2> bpbrm read_parent_msg: read from parent
00:33:58.985 [13985] <2> bpbrm read_parent_msg: read from parent
00:43:59.003 [13985] <2> bpbrm read_parent_msg: read from parent
00:53:59.021 [13985] <2> bpbrm read_parent_msg: read from parent

...those read_parent_msg lines repeat forever


bptm
00:11:43.578 [13989] <2> write_data: Received checkpoint for backup id nbmedia1_1345472038, calculated blocks: 2224455927 blocks in cpr: 2224455943
00:11:43.593 [13989] <2> write_data: Received checkpoint for backup id nbmedia1_1345472038, calculated blocks: 2224455927 blocks in cpr: 2224455943
00:11:43.608 [13989] <2> write_data: Received checkpoint for backup id nbmedia1_1345472038, calculated blocks: 2224455927 blocks in cpr: 2224455943
00:14:44.534 [19660] <2> drivename_checklock: PID 13989 has lock
00:14:44.540 [19660] <2> report_drives: PID = 13989
00:24:43.725 [20387] <2> drivename_checklock: PID 13989 has lock
00:24:43.725 [20387] <2> report_drives: PID = 13989

Mark_Solutions
Level 6
Partner Accredited Certified

That could be partly it... the client read times out at 2 hours, and the client will know that value when it is sent the job, hence bpbkar stopping - but you would expect it to report this back to the media server, unless the keep-alive (which also defaults to 2 hours on most systems) stops that communication from happening.

It still tends to indicate there is something wrong - perhaps, as you say, a disk issue causing nothing to happen for so long.

It does look like it has got stuck somewhere, so try reducing the timeout to see if the true error gets reported back before the keep-alives everywhere expire.

sshagent
Level 4

Agreed. I've reduced the timeout to 3600. So if tonight's failures still show the 2-hour timeout, we know the problem lies elsewhere.

 

Thanks for your time

sshagent
Level 4

So the backups failed again with the 2-hour delay (timeout), on a different path of the backup this time.

Is there any way to find out what bpbkar is doing for those 2 hours (and a few seconds) when nothing happens? I'd like to be able to confirm whether bpbkar is doing nothing and is at fault.

The area it's reading from is definitely available and is high-performance disk that pretty much our entire company uses, so if there were occasional hang-ups they would have been noticed elsewhere.

Mark_Solutions
Level 6
Partner Accredited Certified

I think you would need to put the bpbkar logging up to maximum.

If I remember correctly, on the client you can also create a file in the NetBackup directory named bpbkar_path_tr (no extension and nothing in it).

This will then log exactly what bpbkar is doing and list every file it looks at - the last one in the list when it fails is the file it has issues with.

Hope this helps
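On the client that comes down to something like this, assuming the default Linux install path:

```
# Create the empty trigger file (no extension, no content) that makes
# bpbkar log every file it processes:
touch /usr/openv/netbackup/bpbkar_path_tr

# And raise the logging level in /usr/openv/netbackup/bp.conf, e.g.:
# VERBOSE = 5
```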

sshagent
Level 4

I had set VERBOSE = 5 this morning, but have just touched that file and started a backup off.

 

Thanks for your time Mark

sshagent
Level 4

00:16:56.478 [7826] <2> bpbkar SelectFile: INF - path = E_sle_162_0302_UV_project_lookdev_v2.1027.tex
00:17:35.083 [7826] <2> bpbkar SelectFile: INF - cwd = /path/to/bond23/sle_162/sle_162_0302/elements/E_sle_162_0302_UV_project_lookdev_v2/16384x16384
00:17:35.083 [7826] <2> bpbkar SelectFile: INF - path = E_sle_162_0302_UV_project_lookdev_v2.1028.exr
00:17:47.917 [7826] <2> bpbkar SelectFile: INF - cwd = /path/to/bond23/sle_162/sle_162_0302/elements/E_sle_162_0302_UV_project_lookdev_v2/16384x16384
00:17:47.917 [7826] <2> bpbkar SelectFile: INF - path = E_sle_162_0302_UV_project_lookdev_v2.1028.tex
00:18:37.731 [7826] <2> bpbkar PrintFile: CPR - 2532322104 7826 1345645094 1345677517 306854 0 1 655 243090546 1206 656 23 127 0 750000 0 0 0 146 /path/to/bond23/sle_162/sle_162_0302/elements/E_sle_162_0302_UV_project_lookdev_v2/16384x16384/E_sle_162_0302_UV_project_lookdev_v2.1028.tex
02:18:37.964 [7826] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 23: socket read failed
02:18:37.964 [7826] <4> bpbkar Exit: INF - EXIT STATUS 23: socket read failed
02:18:37.965 [7826] <4> bpbkar Exit: INF - setenv FINISHED=0

 

So from what I've googled, that bpbkar PrintFile: CPR line is checkpoint restart (checkpoint recovery). So I guess it gets to the checkpoint, passes that info off to another process which locks up and never returns to bpbkar? Then after 2 hours it is terminated as part of some keep-alive timeout. Is there logging for CPR?

CPR is set at 180 minutes for some reason; I'll try disabling it entirely and running the backup. If that works, I guess that proves the issue.
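For the record, checkpoint restart is a per-policy attribute, so (if my memory of the 7.5 bpplinfo options is right) it can be toggled from the command line; a hedged sketch, with "BackupPolicy" as a placeholder policy name:

```
# Disable checkpoint restart for one policy, or re-enable it with a
# shorter interval in minutes (-chkpt 1 -chkpt_intrvl 30):
/usr/openv/netbackup/bin/admincmd/bpplinfo BackupPolicy -modify -chkpt 0
```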

Mark_Solutions
Level 6
Partner Accredited Certified

Do you know how large that last listed file is and how long it may take to back up?

The checkpoint is listed but will not be saved until that file has been backed up.

If the file could take more than 2 hours, the keep-alive could override the 180-minute CPR - does that make sense??

I am not sure of the exact setting for Linux, but on SUSE (N5000 appliances) I use the following settings - worth applying on the master, media server and client, and they do not need a restart to take effect:

To check the settings:

# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

To change the settings:

# echo 510 > /proc/sys/net/ipv4/tcp_keepalive_time
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_intvl
# echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes

To make the changes persistent across a reboot, add something like the following to /etc/sysctl.conf (use the vi editor):

## Keepalive at 8.5 minutes
# start probing for heartbeat after 8.5 idle minutes (default 7200 sec)
net.ipv4.tcp_keepalive_time=510
# close connection after 3 unanswered probes (default 9)
net.ipv4.tcp_keepalive_probes=3
# wait 3 seconds for response to each probe (default 75)
net.ipv4.tcp_keepalive_intvl=3

These don't need a restart to take effect, but run:

chkconfig boot.sysctl on

to ensure they are persistent
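On most Linux distributions the same changes can also be applied with sysctl, as an alternative to echoing into /proc (root required in either case):

```
# Apply the values immediately:
sysctl -w net.ipv4.tcp_keepalive_time=510
sysctl -w net.ipv4.tcp_keepalive_intvl=3
sysctl -w net.ipv4.tcp_keepalive_probes=3

# Re-read /etc/sysctl.conf after editing it:
sysctl -p
```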

Hope this helps

sshagent
Level 4

That file is just shy of 1 GB. I ran a while/do/done loop (100 loops) to copy that file and see if it would randomly hang, but it just flew through them with no hassles.
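For anyone wanting to reproduce that check, the loop was roughly this shape (the file name, size and pass count here are stand-ins; the real test copied the ~1 GB .tex file 100 times):

```shell
# Create a small stand-in source file (the real test used a ~1 GB file).
FILE=./copytest_src.dat
DEST=./copytest
dd if=/dev/zero of="$FILE" bs=1024 count=1024 2>/dev/null
mkdir -p "$DEST"

# Copy it repeatedly, timing each pass so a stalled copy stands out.
i=1
while [ "$i" -le 5 ]; do            # the real run used 100 passes
  start=$(date +%s)
  cp "$FILE" "$DEST/copy.$i" || { echo "copy $i failed"; break; }
  end=$(date +%s)
  echo "pass $i took $((end - start))s"
  rm -f "$DEST/copy.$i"
  i=$((i + 1))
done
```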

 

 

sshagent
Level 4

None of the backups are hanging this morning, which is a lovely sight! So it 'seems' that disabling CPR is a workaround.

Does a CPR interval of 180 minutes mean more workload than, say, 30 minutes? I wonder whether I should track down where the fault lies once I have a decent backup set. Presumably every X (currently 180) minutes it will update its progress... but by having less frequent CPR updates, surely each one means more data to be parsed? What I'm getting at is whether I should look into whatever process is struggling to do the CPR updates. Any idea what process would handle that?

It's no major issue to leave CPR disabled until a patch comes along, but I would like to diagnose this completely and report it in so others don't experience it.

Mark_Solutions
Level 6
Partner Accredited Certified

There is a slight performance hit using CPR, but not a huge one.

Every x (180) minutes it will want to perform a checkpoint and will prepare itself for that, but it can only actually do so at file/folder boundaries - so it must finish backing up the file it is on before it can save the checkpoint.

That is the reason I asked about the size of that last listed file and how long it might take to back up.

Assuming there are no firewalls etc., the only thing the 2 hours can be put down to, as far as I can see, is the keep-alive timeout. The settings I gave earlier may well overcome that for you, allowing you to keep the 180-minute CPR. By default CPR is 15 minutes, and most of my customers use 30 to 60 minutes.

Perhaps having the CPR interval in excess of the 2-hour keep-alive is the issue, so it may be worth setting it below 120 minutes.
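That arithmetic can be sketched as a quick sanity check (the 120-minute idle limit is just the keep-alive figure discussed in this thread, nothing NetBackup-specific):

```python
# Flag a checkpoint interval that exceeds the idle window the network
# will tolerate (TCP keep-alive / firewall timeouts, here 120 minutes).
def cpr_interval_safe(cpr_minutes, idle_limit_minutes=120):
    """True if a checkpoint should land before the idle limit expires."""
    return cpr_minutes < idle_limit_minutes

print(cpr_interval_safe(180))  # the 180-minute CPR in this thread -> False
print(cpr_interval_safe(60))   # within the 30-60 minute range -> True
```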

Hope this helps