Solved: NetBackup unexpectedly going offline - VCS

mrmadej · ‎08-18-2014

Hello,

I have looked on many forums but cannot find anything helpful.

I have NetBackup 7.6.0.2 on VCS with GCO on RedHat 6.

Since the last upgrade, VCS has been restarting NBU unexpectedly. It looks like one of the monitored NBU processes is not responding.

I cannot find out exactly which process is not responding.

AGENT_DEBUG.log shows nothing, and the VCS engine_A.log and NetBackup_A.log contain no detailed information either.

I have enabled debug logging on VCS, but it did not give the expected result.

Can someone help me find that particular process?

Many thanks in advance
Madej

mrmadej · ‎09-30-2014

Hi,

The problem is still not solved.

But we have found a workaround that gives us time to wait for an official fix in the 7.6.1 release.

The workaround is to set ToleranceLimit to '1' for the whole NetBackup resource type in VCS.

And it looks good. In engine_A.log we can find entries like this:

2014/09/26 04:20:33 VCS INFO V-16-2-13075 (*****) Resource(nbu_server) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
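
Since the workaround hides the restarts, it may be worth keeping count of how often it kicks in. A minimal Python sketch, assuming the message format of the single entry quoted above (the function name `tolerated_offlines` is made up for illustration):

```python
import re
from datetime import datetime

# Pattern built from the one V-16-2-13075 entry quoted in this post.
TOLERATED = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS INFO V-16-2-13075 "
    r"\([^)]*\) Resource\((?P<resource>[^)]+)\) has reported unexpected OFFLINE "
    r"(?P<count>\d+) times, which is still within the ToleranceLimit\((?P<limit>\d+)\)"
)

def tolerated_offlines(lines):
    """Return (timestamp, resource, offline_count, limit) for each tolerated OFFLINE."""
    events = []
    for line in lines:
        match = TOLERATED.match(line.strip())
        if match:
            events.append((
                datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S"),
                match.group("resource"),
                int(match.group("count")),
                int(match.group("limit")),
            ))
    return events

sample_engine_log = [
    "2014/09/26 04:20:33 VCS INFO V-16-2-13075 (*****) Resource(nbu_server) "
    "has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).",
]
events = tolerated_offlines(sample_engine_log)
print(len(events), "tolerated OFFLINE report(s)")
```

Feeding it the lines of engine_A.log instead of the sample list gives one tuple per tolerated glitch.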

Thanks All

Regards

Madej

trv · ‎08-18-2014

Check logs in /usr/openv/netbackup/logs/cluster/ on active node. There should be something like

02:22:31.972 [27044] <16> monitor:processStatus: Some Processes are DOWN while others are UP
02:22:31.972 [27044] <16> monitor:processStatus: Following Process are found DOWN: bprd
02:22:31.972 [27044] <16> monitor:processStatus: Following Process are found UP: nbevtmgr nbstserv vmd bpdbm nbpem nbjm nbemm nbrb NB_dbsrv nbaudit nbsl nbrmms nbatd nbdisco
02:22:32.854 [27060] <4> Offline::main: Offline called with 2 Parameters

Yes we are seeing some issue on our NBU clusters too. It's always bprd in our case.
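
To see at a glance which process the monitor keeps flagging, the "found DOWN" lines can be tallied. This is only a sketch in Python, assuming every report follows the format of the lines quoted above. The helper names (`down_processes`, `count_down`) are hypothetical, and the sample is copied from this post rather than read from a real log file:

```python
import re
from collections import Counter

# Matches the "found DOWN" report of the cluster monitor, as quoted above.
DOWN_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\] <\d+> "
    r"monitor:processStatus: Following Process are found DOWN: (?P<procs>.+)$"
)

def down_processes(lines):
    """Yield (timestamp, [process names]) for every DOWN report."""
    for line in lines:
        match = DOWN_LINE.match(line.strip())
        if match:
            yield match.group("time"), match.group("procs").split()

def count_down(lines):
    """Count how often each process was reported DOWN."""
    counts = Counter()
    for _, procs in down_processes(lines):
        counts.update(procs)
    return counts

sample_cluster_log = [
    "02:22:31.972 [27044] <16> monitor:processStatus: Some Processes are DOWN while others are UP",
    "02:22:31.972 [27044] <16> monitor:processStatus: Following Process are found DOWN: bprd",
    "02:22:31.972 [27044] <16> monitor:processStatus: Following Process are found UP: nbevtmgr nbstserv vmd",
    "02:22:32.854 [27060] <4> Offline::main: Offline called with 2 Parameters",
]
down_counts = count_down(sample_cluster_log)
print(down_counts.most_common())
```

On a real system the list would be replaced by the lines of a file from the cluster log directory; the file names there are not shown in this thread, so none is assumed here.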

Nicolai · ‎08-18-2014

Does /var/log/messages show any signs of core dumps?

mrmadej · ‎08-19-2014

Hi,

Thanks !!

I see the issue with bprd also.

03:53:01.308 [12386] <16> monitor:processStatus: Some Processes are DOWN while others are UP
03:53:01.308 [12386] <16> monitor:processStatus: Following Process are found DOWN: bprd
03:53:01.308 [12386] <16> monitor:processStatus: Following Process are found UP: nbevtmgr nbstserv vmd bpdbm nbpem nbjm nbaudit nbsl nbrmms nbemm nbrb NB_dbsrv nbatd nbdisco
03:53:02.122 [12434] <4> Offline::main: Offline called with 2 Parameters
03:53:02.122 [12434] <4> Offline::main: Initializing NBCluster using /usr/openv/netbackup/bin/cluster/NBU_RSP file

Have you found any solution?

trv · ‎08-19-2014

Nope, I have found no solution so far. We didn't log a case either, as there is no core dump and nothing in the standard logs. It happens roughly once a month in our case, and we can't afford to turn on verbose bpcd logging: tens of thousands of jobs run daily on every affected master server. I am waiting for some bigger customer to push this issue through 1st-level support by force tbh :)

Nicolai · ‎08-19-2014

You may have to enable core dumping.

http://www.unixmen.com/how-to-enable-core-dumps-in-rhel6/

But you should still see a message in /var/log/messages if it hits a segmentation fault. However, if bprd quits internally, you will see no cores.

Marianne · ‎08-19-2014

I would like to see command-line output of a manual start-up of NBU.

So, use hares to offline the NBU resource, leaving all other resources online:

hares -offline nbu_server -sys <live-nbu-node>

Use bpps -x to ensure all server processes terminate.
bpcd, vnetd and pbx must be left running.

Create bprd log folder under /usr/openv/netbackup/logs.

Start NetBackup:
/usr/openv/netbackup/bin/goodies/netbackup start

Check the output and run 'bpps -x' in another window.
Repeat bpps roughly every 30 seconds, 3-4 times.

If bprd starts up initially and then 'dies', check bprd log for errors.
Please copy the log file to bprd.txt and upload here as File attachment.

Please also paste output of 'netbackup start' and bpps output into a text file (e.g. startup.txt) and upload as File attachment.
This will tell us why bprd is terminating and if any other processes are affected.
I have previously seen an expired eval key cause bprd to be stopped.

Handy NetBackup Links

mrmadej · ‎08-19-2014

Hi,

The problem occurs from time to time, on average once every two or three weeks.
I found that the cluster debug log contains entries saying that bprd is DOWN, but at the same time entries are still being written to the bprd debug log. It looks like the process is still alive.
As trv wrote, we also have thousands of jobs per day, so we can't turn on verbose logging. In addition, we can't stop NBU just like that; we have to schedule this activity.

There is nothing about a segmentation fault in /var/log/messages.

I have to look at the new 7.6.0.3 MR. Maybe the problem is solved in this release.

Regards

trv · ‎08-19-2014

We have core dumping enabled of course :) But there is no segfault message in /var/log/messages or dmesg, so I assume bprd exits cleanly or there is something wrong with cluster process monitoring.

Marianne · ‎08-19-2014

I agree - no need to turn on verbose logging.
Level 0 logging is perfectly fine.
bprd is one of the logs that I view as a 'must' on a master server.

In busy environments we have seen the monitor cycle for NBU simply take too long, resulting in false timeout/failure reports.

So, check the current MonitorInterval and MonitorTimeout attributes for the nbu_server resource.
If you change MonitorTimeout to more than 60 sec (e.g. 80 or 90 sec), ensure that MonitorInterval is greater than or equal to that value.
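
The rule of thumb above can be written down as a small sanity check before changing anything. This is just an illustrative Python sketch: `check_monitor_settings` is a made-up helper, not part of VCS, and it only encodes the "interval not shorter than timeout" advice from this post.

```python
def check_monitor_settings(monitor_interval, monitor_timeout):
    """Return warnings for a MonitorInterval/MonitorTimeout pair (both in seconds)."""
    warnings = []
    if monitor_interval < monitor_timeout:
        warnings.append(
            f"MonitorInterval ({monitor_interval}s) is shorter than "
            f"MonitorTimeout ({monitor_timeout}s); raise the interval to at least "
            f"{monitor_timeout}s"
        )
    return warnings

# The values listed below are examples only; read the real ones from the cluster.
for interval, timeout in [(60, 60), (60, 80), (90, 80)]:
    problems = check_monitor_settings(interval, timeout)
    print(interval, timeout, "OK" if not problems else problems[0])
```

The presumed reason for the rule is that a monitor run which uses its full timeout should be able to finish before the next cycle is due.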

mrmadej · ‎08-19-2014

I think we have default settings:

NetBackup MonitorInterval 60
NetBackup MonitorTimeout 60

Do you think I should change these values, for example to:

NetBackup MonitorInterval 90
NetBackup MonitorTimeout 80

Regards

Marianne · ‎08-19-2014

Yes. That is what we did for busy environments where there was a clear 'MonitorTimeout' fault in the engine_A log.

Do you perhaps have a snippet from the VCS logs from when the issue is seen?

mrmadej · ‎08-19-2014

Below is an excerpt from engine_A.log:

2014/08/17 03:53:02 VCS ERROR V-16-2-13067 (eufrat) Agent is calling clean for resource(nbu_server) because the resource became OFFLINE unexpectedly, on its own.
2014/08/17 03:57:02 VCS INFO V-16-2-13716 (eufrat) Resource(nbu_server): Output of the completed operation (clean)
==============================================

Looking for NetBackup processes that need to be terminated.
Stopping nbcssc...
Stopping nbvault...

There may be backups and/or restores active.
 They will be terminated....
Suspending or cancelling selective jobs...
Stopping bprd...
Killing bpbackup processes...
Stopping nbjm...
Stopping nbars...
Stopping nbim...
Stopping nbsl...
Stopping nbrmms...
Stopping nbstserv...
Stopping nbpem...
Stopping nbproxy...
Stopping bpcompatd...
Stopping bpdbm...

And NetBackup_A.log:

2014/08/17 03:53:01 VCS ERROR V-16-2-13067 Thread(4146068336) Agent is calling clean for resource(nbu_server) because the resource became OFFLINE unexpectedly, on its own.
2014/08/17 03:57:02 VCS ERROR V-16-2-13069 Thread(4146068336) Resource(nbu_server) - clean failed.
2014/08/17 04:36:44 VCS ERROR V-16-2-13078 Thread(4146068336) Resource(nbu_server) - clean completed successfully after 12 failed attempts.
2014/08/17 04:36:44 VCS ERROR V-16-2-13073 Thread(4146068336) Resource(nbu_server) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.
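
The timestamps above also show how long the resource was stuck in clean. A short Python sketch for pulling that out, assuming the two message IDs seen in this excerpt (V-16-2-13067 when clean is called, V-16-2-13078 when it finally succeeds); `clean_duration` is a hypothetical helper name:

```python
import re
from datetime import datetime

# Timestamp and message ID at the start of a VCS agent log line.
VCS_LINE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS \w+ (V-16-2-\d+) ")

def clean_duration(lines):
    """Return the time between 'calling clean' and 'clean completed', or None."""
    started = None
    for line in lines:
        match = VCS_LINE.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        code = match.group(2)
        if code == "V-16-2-13067" and started is None:
            started = when            # agent begins the clean procedure
        elif code == "V-16-2-13078" and started is not None:
            return when - started     # clean finally succeeded
    return None

sample_agent_log = [
    "2014/08/17 03:53:01 VCS ERROR V-16-2-13067 Thread(4146068336) Agent is calling clean for resource(nbu_server) because the resource became OFFLINE unexpectedly, on its own.",
    "2014/08/17 03:57:02 VCS ERROR V-16-2-13069 Thread(4146068336) Resource(nbu_server) - clean failed.",
    "2014/08/17 04:36:44 VCS ERROR V-16-2-13078 Thread(4146068336) Resource(nbu_server) - clean completed successfully after 12 failed attempts.",
]
outage = clean_duration(sample_agent_log)
print(outage)  # 0:43:43
```

For this incident that is almost 44 minutes between the first clean call and the successful one.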

Nicolai · ‎08-19-2014

It may not be a viable option if this only happens from time to time, but I have previously used strace successfully to track down errors. If bprd exits cleanly, strace should be able to show us why.

http://www.cyberciti.biz/tips/linux-strace-command-examples.html

mrmadej · ‎08-20-2014

Yes, I thought about that.

I will probably use a "while" loop so that when the cluster reports the bprd failure, strace captures the output from the process. I have to test it in my test environment first. I hope this will not increase the load too much.

Marianne · ‎08-20-2014

Nothing in the logs prior to 03:53?

Curious to know what was running on the master server at this point in time.
We see a 'bpbackup' process getting killed as part of the 'clean' process.
When this happened previously, can you see which processes were still running at the time?

This error is normally seen when someone has manually killed or restarted NBU (but I guess that is highly unlikely at 3:53 in the morning!):

"the resource became OFFLINE unexpectedly, on its own."

Please create bprd log folder as suggested earlier.
The log will be enabled next time this 'glitch' occurs.

mrmadej · ‎08-20-2014

I have attached the bprd log "bprd.log.1.gz", extracted for the time range from 3:50 to 4:39. I think it is big enough.

Regards

Marianne · ‎08-20-2014

I have had a quick look at the bprd log.

I am surprised to see the size of the log for such a short period.
More than 180 000 lines!!!

This is why I asked previously - what was happening on the master server at this point in time?

Any idea of active backups at this time?
And any user-directed operations that need a separate bprd process for each request?
Any idea about the 'bpbackup' process?

I see LOTS of connection attempts from IP address 10.21.157.130.
25 979 lines containing this IP address.
The master server cannot resolve this IP to a hostname and keeps rejecting the connection requests:

04:39:59.928 [43388] <16> connected_peer: Unable to look up host name for IP address 10.21.157.130: Name or service not known (-2)
04:39:59.928 [43388] <2> connected_peer: Connection from host 10.21.157.130, 10.21.157.130, on non-reserved port 63815
04:39:59.944 [43388] <2> db_valid_master_server: 10.21.157.130 is not a valid server
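
A quick way to find out whether other clients have the same lookup problem is to count the failed lookups per IP address. A minimal Python sketch, assuming the connected_peer message format shown in the three lines above (`unresolved_peers` is a hypothetical name). Note that it counts only the "Unable to look up host name" lines, not every line that mentions the address, so the totals will be lower than the line count quoted earlier:

```python
import re
from collections import Counter

# Matches the reverse-lookup failure logged by bprd, as quoted above.
LOOKUP_FAILED = re.compile(
    r"connected_peer: Unable to look up host name for IP address "
    r"(\d{1,3}(?:\.\d{1,3}){3}):"
)

def unresolved_peers(lines):
    """Count failed host name lookups per IP address."""
    counts = Counter()
    for line in lines:
        match = LOOKUP_FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample_bprd_log = [
    "04:39:59.928 [43388] <16> connected_peer: Unable to look up host name for IP address 10.21.157.130: Name or service not known (-2)",
    "04:39:59.928 [43388] <2> connected_peer: Connection from host 10.21.157.130, 10.21.157.130, on non-reserved port 63815",
    "04:39:59.944 [43388] <2> db_valid_master_server: 10.21.157.130 is not a valid server",
]
lookup_failures = unresolved_peers(sample_bprd_log)
print(lookup_failures.most_common(5))
```

Run over the full bprd log, the top entries show which addresses need a DNS or hosts-file fix first.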

Do you perhaps have any kind of resource monitoring in place?
I am curious to know what amount of memory/cpu is consumed by all of these bprd processes.

Was any kind of performance tuning done at NBU and OS level?
NBU 7.6 needs additional tuning at OS level, e.g.:
https://www.symantec.com/docs/TECH75332
and https://www.symantec.com/docs/TECH203066

mrmadej · ‎08-21-2014

Hi,

This is normal activity for this master server. It runs about 150-250 concurrent jobs, with the same number in the queue.

This particular problem occurred at night, and I think there were a lot of backups running. But I don't think there were any user-directed operations.

I don't know why the master server can't resolve this IP to a hostname. This is a client that is backed up correctly. But I will ask the administrator to check this with the LAN admins.

Regarding the tuning: it was performed according to the Support recommendations. For example, the cache size for NBDB is set to 9GB. The previous 6GB filled up very often and the master was going down.

Regards

Marianne · ‎08-27-2014

If you say this started happening since the 7.6 upgrade, then it is probably time to log a Support call....
