NetBackup unexpectedly going offline - VCS

mrmadej
Level 4
Partner Accredited

Hello,

I have looked on many forums but cannot find anything helpful.

I have NetBackup 7.6.0.2 on VCS with GCO on RedHat 6.

Since the last upgrade I have had an issue with VCS restarting NetBackup. It looks like one of the monitored NBU processes is not responding.

I cannot find out which process exactly is not responding.

AGENT_DEBUG.log shows nothing, and the VCS engine_A.log and NetBackup_A.log are also without any detailed information.

I have enabled debug logging on VCS, but without the expected result.

 

Can someone help me find that particular process?

Many thanks in advance
Madej

27 REPLIES

mrmadej
Level 4
Partner Accredited

Yes, I was thinking about that. But since the last failure the problem has not occurred again. I changed only the parameters you suggested. Maybe this solved our issue. I will have to observe it for the next few weeks.

Thanks

Marianne
Level 6
Partner    VIP    Accredited Certified

The bprd log that was posted a couple of days ago was already quite big.
Log entries between 3:50 and 4:39 produced a log with more than 180 000 lines! 
At level 0 logging!

So, level 5 for an extended period in this environment would probably not be a good idea...

IMHO, a Support call with an EEB request sounds like something I would do...


 

mph999
Level 6
Employee Accredited

bprd is reported down, but when you look, it is not ...

Hmm, sounds like eTrack 3504280.

This is resolved in 7.6.1, which is released in Q1 2015. In the meantime, to avoid the issue, I can suggest that you simply stop monitoring bprd within the cluster. I appreciate this isn't ideal, but you may decide that the risk of bprd failing without causing a cluster failover is smaller than the risk of the problem happening again.

If you wish to do this (a quick sketch of the edit follows the steps):

1. Go to /usr/openv/netbackup/bin/cluster 
2. Edit the NBU_RSP file to remove the entry bprd from the line beginning PROBE_PROCS
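
A rough sketch of what that edit might look like (the exact contents of your NBU_RSP will differ, so just take bprd out of the PROBE_PROCS list and leave the rest of the line untouched):

  cd /usr/openv/netbackup/bin/cluster
  cp NBU_RSP NBU_RSP.orig        # keep a copy before editing
  grep PROBE_PROCS NBU_RSP       # see which processes are currently probed
  vi NBU_RSP                     # delete bprd from the PROBE_PROCS line only

I'd expect the agent to pick the change up on its next monitor cycle, but verify the behaviour after editing.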

Looking into this further, I believe the following 'might' avoid the issue while keeping bprd monitored: turn up the logging for bprd to verbose 5. Perhaps not the best idea if the bprd log gets large, unless you put in a cron script to clear it every couple of hours or similar (rough sketch below).
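
If you go the verbose-5 route, a hedged sketch of how I would approach it (BPRD_VERBOSE in bp.conf should, as far as I recall, raise the level for bprd only rather than for all legacy logs, and the cron schedule and paths here are just examples):

  # raise legacy logging for bprd only
  mkdir -p /usr/openv/netbackup/logs/bprd
  echo "BPRD_VERBOSE = 5" >> /usr/openv/netbackup/bp.conf

  # example crontab entry: truncate the bprd legacy logs every 2 hours
  0 */2 * * * for f in /usr/openv/netbackup/logs/bprd/log.*; do : > "$f"; done

I believe bprd only reads bp.conf at startup, so plan a restart of NetBackup (or a cluster switchover) for the new level to take effect.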

Martin

mrmadej
Level 4
Partner Accredited

Today the problem occurred again.

I opened a case with Support, and now I am waiting.

 

Regards

Madej

jim_dalton
Level 6

I would go with looking to see if it actually stops, i.e. whether bprd has really died or the cluster framework is just having a hard time determining whether it has or hasn't.

Take the resource out of being managed as above, then knock up a script to do something useful, even if it's just a simple loop around ps -p <the pid of bprd>, writing the info to a log.

Or maybe you have something a bit classier to monitor processes, like serveralive. Either way, something that reports on the process and its state.

Ideally, when it gets unresponsive, that's the time to get strace on it (sketch below).
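
Something along these lines would do (log path and interval are just examples, adjust to taste):

  #!/bin/sh
  # crude watcher: note once a minute whether bprd is still alive
  LOG=/var/tmp/bprd_watch.log
  while true
  do
      PID=`pgrep -x bprd | head -1`
      if [ -n "$PID" ] && ps -p "$PID" >/dev/null 2>&1
      then
          echo "`date '+%Y-%m-%d %H:%M:%S'` bprd running, pid $PID" >> "$LOG"
      else
          echo "`date '+%Y-%m-%d %H:%M:%S'` bprd NOT running" >> "$LOG"
      fi
      sleep 60
  done

And when it looks hung rather than dead, attach strace to see what it is stuck on:

  strace -f -tt -p <the pid of bprd> -o /var/tmp/bprd_strace.out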

If it has really died then it's going to be a NetBackup/OS issue, otherwise it's going to be a cluster/OS issue. From what I've read it sounds like the latter, possibly caused by being super-busy.

Jim

mph999
Level 6
Employee Accredited

Did you mention eTrack 3504280?

If not, please do.

It looks like we are going to get an EEB for this. At the minute the EEB is awaiting testing before being released, but you will need to alert the TSE who has the case (in case he didn't see the eTrack).

Martin

mrmadej
Level 4
Partner Accredited

Hi,

I have mentioned the eTrack to the TSE as you recommended.

Thanks

Madej

mrmadej
Level 4
Partner Accredited

Hi,

 

The problem is still not solved.

But we have found a workaround which gives us time to wait for the official fix in the 7.6.1 release.

The workaround is to set ToleranceLimit to '1' on the whole NetBackup resource type in VCS (command sketch below).

And it looks good. In the engine_A.log we can find entries like this:

2014/09/26 04:20:33 VCS INFO V-16-2-13075 (*****) Resource(nbu_server) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
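
For anyone else hitting this, the change itself is the standard VCS type-level modification, something like the following (assuming the resource type really is called NetBackup in your configuration; check with hatype -list first):

  haconf -makerw
  hatype -modify NetBackup ToleranceLimit 1
  haconf -dump -makero
  hatype -display NetBackup -attribute ToleranceLimit   # verify the new value

As I understand it, with ToleranceLimit set to 1 the agent tolerates one unexpected OFFLINE report before it treats the resource as faulted, which matches the engine_A.log entry above.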

 

Thanks All

Regards

Madej