08-18-2014 05:49 AM
Hello,
I have looked on many forums but cannot find anything helpful.
I have NetBackup 7.6.0.2 on VCS with GCO on RedHat 6.
Since the last upgrade I have had an issue with VCS restarting NBU. It looks like one of the monitored NBU processes is not responding.
I cannot find out exactly which process is not responding.
AGENT_DEBUG.log shows nothing, and the VCS engine_A.log and NetBackup_A.log contain no detailed information either.
I have enabled debug logging on VCS, but without the expected result.
Can someone help me figure out how to find that particular process?
Many thanks in advance
Madej
09-30-2014 01:31 AM
Hi,
The problem is still not solved.
But we have found a workaround that buys us time while we wait for an official fix in the 7.6.1 release.
The workaround is to set ToleranceLimit to '1' on the whole NetBackup resource type in VCS.
It looks good so far. In engine_A.log we now find entries like this:
2014/09/26 04:20:33 VCS INFO V-16-2-13075 (*****) Resource(nbu_server) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
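For reference, the ToleranceLimit change can be made with the standard VCS commands. This is a sketch: "NetBackup" is the resource type name as it appears in this thread, and the commands must be run as root on a cluster node.

```shell
# Open the VCS configuration for writing
haconf -makerw
# Allow one unexpected OFFLINE monitor result before the agent reacts
hatype -modify NetBackup ToleranceLimit 1
# Save the configuration and make it read-only again
haconf -dump -makero
# Verify the new value
hatype -display NetBackup -attribute ToleranceLimit
```

Because this is set at the type level, it applies to every resource of type NetBackup in the cluster; a per-resource override would need `hares -override` first.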
Thanks All
Regards
Madej
08-18-2014 09:24 AM
does /var/log/messages show any signs of core dumps ?
08-19-2014 12:20 AM
Hi,
Thanks !!
I see the issue with bprd also.
03:53:01.308 [12386] <16> monitor:processStatus: Some Processes are DOWN while others are UP
03:53:01.308 [12386] <16> monitor:processStatus: Following Process are found DOWN: bprd
03:53:01.308 [12386] <16> monitor:processStatus: Following Process are found UP: nbevtmgr nbstserv vmd bpdbm nbpem nbjm nbaudit nbsl nbrmms nbemm nbrb NB_dbsrv nbatd nbdisco
03:53:02.122 [12434] <4> Offline::main: Offline called with 2 Parameters
03:53:02.122 [12434] <4> Offline::main: Initializing NBCluster using /usr/openv/netbackup/bin/cluster/NBU_RSP file
Have you found any solution?
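As a side note, the DOWN processes can be pulled out of AGENT_DEBUG.log entries like the ones above with a small script. This is a sketch that assumes the log format matches the sample exactly:

```python
import re

def down_processes(log_text):
    """Collect the process names the NBU agent monitor reported as DOWN."""
    down = set()
    for line in log_text.splitlines():
        # Matches lines like:
        # "... monitor:processStatus: Following Process are found DOWN: bprd"
        m = re.search(r"Following Process are found DOWN:\s*(.+)", line)
        if m:
            down.update(m.group(1).split())
    return sorted(down)

sample = (
    "03:53:01.308 [12386] <16> monitor:processStatus: "
    "Following Process are found DOWN: bprd\n"
    "03:53:01.308 [12386] <16> monitor:processStatus: "
    "Following Process are found UP: nbevtmgr nbstserv vmd\n"
)
print(down_processes(sample))  # ['bprd']
```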
08-19-2014 12:52 AM
You may have to enable core dumping.
http://www.unixmen.com/how-to-enable-core-dumps-in-rhel6/
But you should still see a message in /var/log/messages if it crashed with a segmentation fault. However, if bprd exited cleanly on its own, there will be no core.
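A rough sketch of enabling core dumps on RHEL 6, following the article above. The paths and values are illustrative defaults; verify them for your system, and note that daemons started by VCS inherit their limits from the agent, so the persistent limit matters more than the shell one:

```shell
# Current shell only: remove the core size limit
ulimit -c unlimited

# Persistent per-process limit (requires root, takes effect on new logins)
echo '* soft core unlimited' >> /etc/security/limits.conf

# Write cores to a known location, with program name (%e) and PID (%p)
# in the filename, so a bprd core is easy to find
mkdir -p /var/cores
sysctl -w kernel.core_pattern=/var/cores/core.%e.%p
```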
08-19-2014 12:53 AM
I would like to see command-line output of a manual start-up of NBU.
So, use hares to offline the NBU resource, leaving all other resources online:
hares -offline nbu_server -sys <live-nbu-node>
Use bpps -x to ensure all server processes terminate.
bpcd, vnetd and pbx must be left running.
Create bprd log folder under /usr/openv/netbackup/logs.
Start NetBackup:
/usr/openv/netbackup/bin/goodies/netbackup start
Check output and run 'bpps -x' in another window.
Repeat bpps roughly every 30 seconds, 3-4 times.
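The steps above, collected into one rough script. This is a sketch only: `nbu_server` and `live-nbu-node` are the example names from this thread, and the paths are the standard NBU locations.

```shell
#!/bin/sh
# Take only the NBU resource offline; other resources in the group stay up
hares -offline nbu_server -sys live-nbu-node

# Confirm the server processes are gone (bpcd, vnetd and pbx stay running)
/usr/openv/netbackup/bin/bpps -x

# Create the bprd log folder so bprd logging is active on the next start
mkdir -p /usr/openv/netbackup/logs/bprd

# Start NetBackup manually and watch its output
/usr/openv/netbackup/bin/goodies/netbackup start

# Re-check the process list every 30 seconds for a couple of minutes
for i in 1 2 3 4; do
    sleep 30
    /usr/openv/netbackup/bin/bpps -x
done
```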
If bprd starts up initially and then 'dies', check bprd log for errors.
Please copy the log file to bprd.txt and upload here as File attachment.
Please also paste output of 'netbackup start' and bpps output into a text file (e.g. startup.txt) and upload as File attachment.
This will tell us why bprd is terminating and if any other processes are affected.
I have previously seen that expired eval key caused bprd to be stopped.
08-19-2014 01:17 AM
Hi,
The problem occurs from time to time, on average once every two or three weeks.
I found that the cluster debug log contains entries saying that bprd is DOWN, but at the same time entries are still being written to the bprd debug log. It looks like the process is still alive.
As trv wrote, we also have thousands of jobs per day, so we can't turn on verbose logging. And in addition we can't stop NBU just like that; we have to schedule this activity.
There is nothing about segmentation faults in /var/log/messages.
I will have a look at the new 7.6.0.3 MR. Maybe the problem is solved in that release.
Regards
08-19-2014 01:42 AM
I agree - no need to turn on verbose logging.
Level 0 logging is perfectly fine.
bprd is one of the logs that I view as a 'must' on a master server.
We have seen in busy environments that the monitor cycle for NBU simply takes too long, resulting in false timeout/failure reports.
So, check the current MonitorInterval and MonitorTimeout attributes for the nbu_server resource.
If you change MonitorTimeout to more than 60 sec (e.g. 80 or 90 sec), make sure that MonitorInterval is set to an equal or greater value.
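The check and the change can be done with the usual VCS commands. A sketch, assuming the attributes are set at the type level (as the default values quoted later in the thread suggest) and the values are the examples discussed here:

```shell
# Show the current values for the NetBackup resource type
hatype -display NetBackup -attribute MonitorInterval
hatype -display NetBackup -attribute MonitorTimeout

# Raise them: monitor every 90s, allow each monitor cycle up to 80s
haconf -makerw
hatype -modify NetBackup MonitorInterval 90
hatype -modify NetBackup MonitorTimeout 80
haconf -dump -makero
```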
08-19-2014 04:02 AM
I think we have default settings:
NetBackup MonitorInterval 60
NetBackup MonitorTimeout 60
Do you think I should change these values, for example:
NetBackup MonitorInterval 90
NetBackup MonitorTimeout 80
Regards
08-19-2014 04:27 AM
Yes. That is what we did for busy environments where there was a clear 'MonitorTimeout' fault in engine_A log.
Do you perhaps have a snippet from the VCS logs when issue is seen?
08-19-2014 04:45 AM
Below is the piece of engine_A.log:
2014/08/17 03:53:02 VCS ERROR V-16-2-13067 (eufrat) Agent is calling clean for resource(nbu_server) because the resource became OFFLINE unexpectedly, on its own.
2014/08/17 03:57:02 VCS INFO V-16-2-13716 (eufrat) Resource(nbu_server): Output of the completed operation (clean)
==============================================
Looking for NetBackup processes that need to be terminated.
Stopping nbcssc...
Stopping nbvault...
There may be backups and/or restores active. They will be terminated....
Suspending or cancelling selective jobs...
Stopping bprd...
Killing bpbackup processes...
Stopping nbjm...
Stopping nbars...
Stopping nbim...
Stopping nbsl...
Stopping nbrmms...
Stopping nbstserv...
Stopping nbpem...
Stopping nbproxy...
Stopping bpcompatd...
Stopping bpdbm...
And NetBackup_A.log:
2014/08/17 03:53:01 VCS ERROR V-16-2-13067 Thread(4146068336) Agent is calling clean for resource(nbu_server) because the resource became OFFLINE unexpectedly, on its own.
2014/08/17 03:57:02 VCS ERROR V-16-2-13069 Thread(4146068336) Resource(nbu_server) - clean failed.
2014/08/17 04:36:44 VCS ERROR V-16-2-13078 Thread(4146068336) Resource(nbu_server) - clean completed successfully after 12 failed attempts.
2014/08/17 04:36:44 VCS ERROR V-16-2-13073 Thread(4146068336) Resource(nbu_server) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 2) the resource.
08-19-2014 06:29 AM
It may not be a viable option if this only happens from time to time, but I have previously used strace successfully to track down errors. If bprd exits cleanly, strace should be able to show us why.
http://www.cyberciti.biz/tips/linux-strace-command-examples.html
08-20-2014 12:17 AM
Yes, I thought about that.
I will probably use a "while" loop, so that when the cluster reports the failure with bprd, strace will already have captured the output from the process. I have to test it in my test environment first. I hope this does not increase the load too much.
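One way the idea above could be sketched: attach strace to the running bprd and leave it tracing, so that if bprd later exits the tail of the trace file shows how. This is a hypothetical sketch (needs root; strace does add overhead to the traced process):

```shell
#!/bin/sh
# Find the main bprd process (pgrep -x matches the exact process name)
PID=$(pgrep -x bprd | head -1)
if [ -z "$PID" ]; then
    echo "bprd is not running" >&2
    exit 1
fi

# -p: attach to PID, -f: follow forked children,
# -tt: microsecond timestamps, -o: write trace to a file
strace -f -tt -o "/var/tmp/bprd.strace.$PID" -p "$PID" &

# Later, when the cluster reports bprd DOWN, the end of the trace file
# should show whether bprd called exit_group() (clean internal exit)
# or was terminated by a signal.
```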
08-20-2014 01:08 AM
Nothing in the logs prior to 03:53?
Curious to know what was running on the master server at this point in time.
We see a 'bpbackup' process getting killed as part of the 'clean' process.
Can you see when this happened previously which processes were still running at the time?
This error is normally seen when someone has manually killed or restarted NBU (but I guess that is highly unlikely at 3:53 in the morning!):
the resource became OFFLINE unexpectedly, on its own.
Please create bprd log folder as suggested earlier.
The log will be enabled next time this 'glitch' occurs.
08-20-2014 01:38 AM
I attached the bprd log "bprd.log.1.gz" with the time range from 3:50 to 4:39 extracted. I think it is big enough.
Regards
08-20-2014 06:41 AM
I have had a quick look at the bprd log.
I am surprised to see the size of the log for such a short period.
More than 180 000 lines!!!
This is why I asked previously - what was happening on the master server at this point in time?
Any idea of active backups at this time?
Any user-directed operations that need a separate bprd process for each request?
Any idea about the 'bpbackup' process?
I see LOTS of connection attempts from IP address 10.21.157.130.
25 979 lines containing this IP address.
The master server cannot resolve this IP to a hostname and keeps on rejecting connection request:
04:39:59.928 [43388] <16> connected_peer: Unable to look up host name for IP address 10.21.157.130: Name or service not known (-2)
04:39:59.928 [43388] <2> connected_peer: Connection from host 10.21.157.130, 10.21.157.130, on non-reserved port 63815
04:39:59.944 [43388] <2> db_valid_master_server: 10.21.157.130 is not a valid server
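The failing lookup in the log is a reverse DNS (PTR) query. A quick way to reproduce what the master is attempting, as a sketch (what actually matters is the master's own nsswitch/DNS configuration, so run any such check on the master itself):

```python
import socket

def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if reverse DNS fails
    (the situation bprd logs as 'Unable to look up host name')."""
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except OSError:  # covers socket.herror / socket.gaierror
        return None

# Example with the client IP from the log above
print(reverse_lookup("10.21.157.130"))
```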
Do you perhaps have any kind of resource monitoring in place?
I am curious to know what amount of memory/cpu is consumed by all of these bprd processes.
Was any kind of performance tuning done at NBU and OS level?
NBU 7.6 needs additional tuning at OS level, e.g.:
https://www.symantec.com/docs/TECH75332
and https://www.symantec.com/docs/TECH203066
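The kernel IPC limits (shared memory and semaphores) are among the usual OS tunables for NBU on Linux. A sketch of how such a change is applied; the actual values must come from the TECH notes above and from sizing for this master, the numbers below are purely illustrative:

```shell
# Show the current semaphore limits: semmsl semmns semopm semmni
sysctl kernel.sem

# Persist an increased setting (ILLUSTRATIVE values, not a recommendation)
echo 'kernel.sem = 300 307200 32 1024' >> /etc/sysctl.conf

# Apply the settings from /etc/sysctl.conf now
sysctl -p
```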
08-21-2014 02:25 AM
Hi,
This is normal activity for this master server. It runs about 150-250 concurrent jobs, with the same number in the queue.
This particular problem occurred at night, and I think a lot of backups were running. But I don't think there were any user-directed operations.
I don't know why the master server can't resolve this IP to a hostname. This is a client which is backed up correctly. But I will ask the administrator to check this with the LAN admins.
Regarding the tuning - it was performed according to the support recommendations. For example, the cache size for NBDB is set to 9GB. The previous 6GB was filled up very often and the master was going down.
Regards
08-27-2014 01:04 AM
If you say this started happening since the 7.6 upgrade, then it is probably time to log a Support call....