Mismatched/Crossed PID #'s between Master and Media server

LucSkywalker195
Level 4
Certified

Has anyone ever seen two backup jobs trying to use the same bpbrm process (same PID) on a media server? It hangs both jobs, and both eventually fail. Aren't backup jobs supposed to have their own bpbrm process on the media server? Is this a problem related to multiplexing? Do jobs piggyback on other jobs' bpbrm processes?

1 ACCEPTED SOLUTION

LucSkywalker195
Level 4
Certified

My TCP keep alive settings were too low. I increased them yesterday and the problem seems to have cleared up.

tcp_keepalive_time - was 7200 and I changed it to 600000

tcp_keepalive_intvl - was 75 - kept it the same

tcp_keepalive_probes - was 9 and I changed it to 20

Thanks for all your input!
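
For anyone comparing these numbers: a rough sketch of how the three settings combine, assuming Linux semantics (all three values in seconds, as on the RHEL master). An idle connection whose peer has gone away is dropped after roughly time + intvl * probes. `keepalive_timeout` is a made-up helper name, not a NetBackup or OS command, and the loop only reads the current values.

```shell
# Hypothetical helper: seconds before a dead, idle TCP connection is
# dropped, assuming Linux semantics (all three values in seconds).
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_timeout 7200 75 9       # old settings -> 7875
keepalive_timeout 600000 75 20    # new settings -> 601500

# Show the current values on a Linux host (read-only).
for p in time intvl probes; do
    f="/proc/sys/net/ipv4/tcp_keepalive_$p"
    if [ -r "$f" ]; then
        echo "tcp_keepalive_$p = $(cat "$f")"
    fi
done
```

On Solaris media servers the tunables have different names and units, so the arithmetic above should not be applied there without checking.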

9 REPLIES

Marianne
Moderator
Partner    VIP    Accredited Certified

Never seen this. Which NBU version?

Please show us an example from the Activity Monitor Details tab, along with the media server's bptm and bpbrm logs.
Upload log files as File attachments.

Nicolai
Moderator
Partner    VIP   

Likely a bpbrm has spawned child processes - check whether the bpbrm has a parent PID or children of its own.

LucSkywalker195
Level 4
Certified

I'll enable logging and collect the logs. I can see bpbrm starting with a parent PID and then spawning child PIDs. Those look correct to me. When the problem occurs, the Activity Monitor has the PID in the details section, but at the top of the Details tab, next to "Job PID" in the GUI, it's blank. Then I go on the media server and run ps -ef | grep for the PID number from the details section, and it's the PID of another job that already completed.
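
A plain grep also matches the number anywhere in the line, so here is a slightly stricter sketch of the same check: filter `ps -ef` on the PID and PPID columns only. `pid_family` is a made-up helper name, and the sketch assumes the usual `ps -ef` column order (UID, PID, PPID, ...). On Solaris 10 you may need nawk in place of awk.

```shell
# Hypothetical helper: from `ps -ef` output on stdin, print the header,
# the process with the given PID, and its direct children.
# Assumes column 2 is PID and column 3 is PPID.
pid_family() {
    awk -v pid="$1" 'NR == 1 || $2 == pid || $3 == pid'
}

# Usage (12345 is a placeholder for the PID from the job details):
#   ps -ef | pid_family 12345
```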

LucSkywalker195
Level 4
Certified

Sorry... My master is RHEL 6 with NetBackup 7.5.0.6, and my media servers are Solaris 10, also with 7.5.0.6.

mph999
Level 6
Employee Accredited

Does job 2 start before job 1 has finished? I've never heard of a PID being 'shared' - until your post came along, I would have doubted it was even possible.

Is it possible that job 1 fails, job 2 starts afterwards and just happens to reuse the PID, making it look like the PID was used by both at the same time?

watsons
Level 6

What do you mean by "details tab next to 'Job PID' in the GUI is blank"? A screenshot would make this clearer for us.

A better way to confirm that the jobs used the same PID is to run the following on your master server:

# cd /usr/openv/netbackup/db/jobs/trylogs

# grep "pid:" *.t   > /tmp/pids.txt

View the output to see if you can spot any common child PID. Note that the parent PID can be the same.

Also, I think it's possible for two jobs to have the same child PID, but NOT at the same time: the first process has to complete before the next job can reuse the same PID.
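
Spotting repeats by eye in /tmp/pids.txt gets tedious, so here is a rough sketch that lists every PID appearing in more than one try file. `shared_pids` is a made-up helper name, and the line layout ("<file>:... pid: <number> ...") is an assumption based on the grep above, so adjust the pattern to match what your try files actually contain. Parent PIDs will show up too and can be ignored, as noted.

```shell
# Hypothetical helper: print each PID that appears in more than one
# try file, followed by the files it was seen in.
# Assumes grep output lines of the form "<file>:... pid: <number> ...".
shared_pids() {
    awk '
    {
        file = $0
        sub(/:.*/, "", file)            # text before the first colon = file name
        rest = $0
        while (match(rest, /pid: *[0-9]+/)) {
            pid = substr(rest, RSTART, RLENGTH)
            sub(/pid: */, "", pid)      # keep only the number
            if (!((pid, file) in seen)) {
                seen[pid, file] = 1
                count[pid]++
                files[pid] = files[pid] " " file
            }
            rest = substr(rest, RSTART + RLENGTH)
        }
    }
    END {
        for (pid in count)
            if (count[pid] > 1)
                print pid ":" files[pid]
    }' "$1"
}

# Usage:
#   shared_pids /tmp/pids.txt
```

A PID listed here was recorded by at least two jobs; whether the two uses overlapped in time still has to be checked against the job start and end times.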

LucSkywalker195
Level 4
Certified

If I suspend the jobs in Activity Monitor, kill all the bpbrm processes on the media server, and then resume the jobs one at a time, the first one acquires resources and starts streaming data to tape. Once I resume one of the jobs that was fighting for the same PID, the two hang each other: no data is streamed to tape, and they both just sit there until the timeout settings fail them.

I'm starting to look at my TCP keep alive settings. I've never seen this problem before in an all-Unix environment. I even turned up my logging yesterday and allowed only these clients to run, and nothing relevant or useful came out in the logs.

mph999
Level 6
Employee Accredited

Thanks for the confirmation.

You mention mpx - are your backups mpx? I'm guessing they are, since you mentioned mpx in your first post.

If the backups are mpx, could you try to reproduce the problem with non-mpx backups and see if it still happens? That's the quickest way to see whether mpx is part of this issue or not.
