04-25-2014 01:09 PM
Has anyone ever experienced two backup jobs trying to use the same bpbrm process (same PID) on a media server? It hangs both jobs, and both end up failing. Aren't backup jobs supposed to have their own bpbrm process on the media server? Is this a problem related to multiplexing? Do jobs piggyback on other bpbrm processes?
04-25-2014 06:24 PM
Never seen this. Which NBU version?
Please show us an example from the Activity Monitor Details tab, plus the media server's bptm and bpbrm logs.
Upload log files as File attachments.
04-26-2014 05:03 AM
Likely a bpbrm has spawned - check whether bpbrm has a parent PID or child processes.
04-28-2014 07:59 AM
I'll enable logging and collect the logs. I can see bpbrm starting with a parent PID and then spawning child PIDs. Those look correct to me. When the problem occurs, Activity Monitor shows the PID in the details section, but at the top of the Details tab the "Job PID" field in the GUI is blank. Then I go on the media server and run ps -ef | grep for the PID number from the details section, and it's the PID of another job that has already completed.
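One caveat with ps -ef | grep for a PID: grep matches the number anywhere on the line, so 4242 would also hit 14242 or any process whose parent is 4242. Below is a minimal sketch that compares the PID and PPID columns exactly. It assumes the usual ps -ef column order (UID, PID, PPID, ...); pid_family is a made-up helper name and 4242 is a placeholder PID.

```shell
# Print only the ps -ef lines whose PID (column 2) or PPID (column 3)
# is exactly the given number, so substrings such as 14242 do not match.
# Usage: ps -ef | pid_family <pid>
pid_family() {
    awk -v pid="$1" '$2 == pid || $3 == pid'
}
```

For example, ps -ef | pid_family 4242 would list the process itself plus its direct children, which makes it easier to tell whether the PID still belongs to a live bpbrm or has been reused by something else.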
04-28-2014 08:01 AM
Sorry... My master is RHEL 6 with NetBackup 7.5.0.6, and my media servers are Solaris 10, also with 7.5.0.6.
04-28-2014 10:02 AM
Does job 2 start before job 1 has finished? I've never heard of a PID being 'shared' - I would have doubted this was even possible until your post came along ...
Is it possible that job 1 fails, job 2 starts after this and just happens to re-use the PID, making it look like the PID was used by both at the same time?
04-28-2014 04:41 PM
What do you mean by "details tab next to 'Job PID' in the GUI is blank"? A screenshot would make it clearer for us.
A better way to confirm that the jobs used the same PID is to go to your master server and run:
# cd /usr/openv/netbackup/db/jobs/trylogs
# grep "pid:" *.t > /tmp/pids.txt
View the output to see if you can spot any common child PID. Note that the parent PID can be the same.
Also, I think it's possible for two jobs to get the same child PID, but NOT at the same time: the first process has to exit before the next job can reuse that PID.
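Scanning /tmp/pids.txt by eye gets tedious with many jobs, so here is a rough sketch that lists any PID appearing in more than one try file. It assumes each matching line looks like "<file>:<text>pid: <number>", which is how grep prefixes output when given several files; the exact try-file layout is an assumption on my part, and shared_pids is a made-up helper name.

```shell
# Read the output of `grep "pid:" *.t` on stdin and print every PID
# that appears in more than one file, followed by the files it was seen in.
shared_pids() {
    awk '
        match($0, /pid: *[0-9]+/) {
            file = $0; sub(/:.*/, "", file)          # text before the first colon
            pid = substr($0, RSTART, RLENGTH)        # e.g. "pid: 4242"
            gsub(/[^0-9]/, "", pid)                  # keep only the digits
            if (!seen[pid, file]++) {                # count each file once per PID
                count[pid]++
                files[pid] = files[pid] " " file
            }
        }
        END { for (p in count) if (count[p] > 1) print p ":" files[p] }
    '
}
```

Run it from the trylogs directory as grep "pid:" *.t | shared_pids. A PID listed here only means two jobs recorded the same number; whether they held it at the same time still has to be checked against the job start and end times.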
04-29-2014 11:59 AM
If I suspend the jobs in Activity Monitor, kill all the bpbrm processes on the media server, and then resume the jobs one at a time, the first one acquires resources and starts streaming data to tape. Once I resume one of the jobs that was fighting for the same PID, the two hang each other: no data is streamed to tape and they both just sit there until the timeout settings fail them.
I'm starting to look at my TCP keepalive settings. I've never seen this problem before in an all-Unix environment. I even turned up my logging yesterday and only allowed these clients to run, and nothing relevant or useful came out in the logs.
04-29-2014 04:31 PM
Thanks for the confirmation.
You mention mpx - are your backups multiplexed? I'm guessing they are, as you mentioned mpx in your first post.
If the backups are mpx, could you try to reproduce the problem with non-mpx backups and see if it still happens? That's the quickest way to see whether mpx is part of this issue or not.
05-01-2014 06:48 AM
My TCP keepalive settings were too low. I increased them yesterday and the problem seems to have cleared up.
tcp_keepalive_time - was 7200 and I changed it to 600000
tcp_keepalive_intvl - was 75 - kept it the same
tcp_keepalive_probes - was 9 - changed it to 20
Thanks for all your input!
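To put those numbers in perspective, a dead peer is only declared after the idle time plus every unanswered probe, i.e. roughly time + intvl * probes. Here is a small sketch of that arithmetic. It assumes the values are in seconds, as they are for the Linux sysctl parameters of the same names; the thread does not say which host was changed, and Solaris names and scales its keepalive tunables differently. keepalive_detect_seconds is a made-up helper name.

```shell
# Approximate time before an idle, dead connection is dropped:
# tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
# Usage: keepalive_detect_seconds <time> <intvl> <probes>
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_seconds 7200 75 9      # old settings: prints 7875
keepalive_detect_seconds 600000 75 20   # new settings: prints 601500
```

On Linux the current values can be read with sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes.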