05-04-2011 08:11 AM
Hi all,
As mentioned in the subject: in the past we had this kind of issue about once per month, but now it happens once per week. The Netbackup VCS group goes into a PARTIAL|FAULTED state. The application keeps running and the jobs that were already started stay active, but new jobs cannot be started from the external scheduler (status 25, cannot connect on socket). To fix it (or at least as a fast workaround) we need to offline the group, kill -9 the stuck processes (listed by bpps -a) because bp.kill_all -FORCE is not able to terminate them, clear the group, and bring it online again.
The master is Netbackup 6.5.3 running on Solaris 5.10, VCS Veritas-5.0MP3RP4-05.
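For reference, the recovery sequence described above can be written down as a dry-run checklist. This is only a sketch: the group name `nbu_group` and the system name `master1` are placeholders invented for illustration, and the script only prints the commands instead of running them, so the PIDs to kill still have to be picked by hand from the `bpps -a` output.

```shell
#!/bin/sh
# Dry-run sketch of the manual workaround: prints the commands, runs nothing.
# GROUP and SYSTEM are hypothetical placeholders, not names from this cluster.
GROUP="${GROUP:-nbu_group}"
SYSTEM="${SYSTEM:-master1}"

print_recovery_plan() {
    group="$1"
    system="$2"
    # 1. Take the service group offline on the node where it is stuck.
    echo "hagrp -offline $group -sys $system"
    # 2. List what is still running; bp.kill_all -FORCE did not stop these.
    echo "/usr/openv/netbackup/bin/bpps -a"
    # 3. Kill the leftovers by hand (PIDs come from the bpps output).
    echo "kill -9 <pid-from-bpps>"
    # 4. Clear the FAULTED state, then bring the group back online.
    echo "hagrp -clear $group -sys $system"
    echo "hagrp -online $group -sys $system"
}

print_recovery_plan "$GROUP" "$SYSTEM"
```

Printing the plan first makes it easy to review the order of the steps before anything is killed with -9.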
05-05-2011 12:02 AM
If your Netbackup cluster switches or restarts unexpectedly, you can find the process causing the switch in:
/usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log
An example:
# grep "Detected process offline" AGENT_DEBUG.log
02/23/10 14:55:091 Detected process offline:nbemm monitor::main
02/23/10 15:00:331 Detected process offline:vmd monitor::main
02/23/10 15:02:331 Detected process offline:nbemm monitor::main
02/23/10 15:42:031 Detected process offline:vmd monitor::main
02/23/10 15:44:031 Detected process offline:nbstserv monitor::main
Upgrading to 6.5.6 may be the best fix.
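To see at a glance which daemon goes offline most often, the output of the grep above can be tallied per process. This is a minimal sketch that assumes the message format shown in the example ("Detected process offline:<name> monitor::main"); the function name is made up for illustration.

```shell
#!/bin/sh
# Tally "Detected process offline" events per daemon.
# Assumes lines look like: "... Detected process offline:<name> monitor::main"
count_offline_events() {
    # $1 = path to AGENT_DEBUG.log
    grep "Detected process offline:" "$1" |
        sed 's/.*Detected process offline:\([^ ]*\).*/\1/' |
        sort | uniq -c | sort -rn
}

LOG="${1:-/usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log}"
if [ -r "$LOG" ]; then
    count_offline_events "$LOG"
fi
```

The daemon with the highest count is the first one to look at.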
05-05-2011 01:11 AM
Hi Nicolai, thank you for your opinion. In AGENT_DEBUG.log I can see mostly the nbpem process:
Tue May 3 09:24:14 2011 vcs/Online.pl: Start Online.......
05/03/11 09:24:143 Common Online called online::main
05/03/11 09:24:141 Calling start command online::main
05/04/11 07:05:571 Detected process offline:nbpem monitor::main
05/04/11 07:10:581 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 07:12:011 Detected process offline:nbpem monitor::main
05/04/11 07:17:021 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 07:18:031 Detected process offline:nbpem monitor::main
05/04/11 07:23:051 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 07:24:061 Detected process offline:nbpem monitor::main
05/04/11 07:29:071 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 07:30:081 Detected process offline:nbpem monitor::main
05/04/11 07:35:101 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 07:36:111 Detected process offline:nbpem monitor::main
05/04/11 07:41:121 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 07:42:131 Detected process offline:nbpem monitor::main
Wed May 4 07:54:33 2011 vcs/Online.pl: Start Online.......
05/04/11 07:54:333 Common Online called online::main
05/04/11 07:54:331 Calling start command online::main
05/04/11 11:11:041 Detected process offline:nbpem monitor::main
05/04/11 11:16:051 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 11:17:061 Detected process offline:nbpem monitor::main
05/04/11 11:22:081 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 11:23:081 Detected process offline:nbpem monitor::main
05/04/11 11:28:101 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 11:29:101 Detected process offline:nbpem monitor::main
05/04/11 11:34:121 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 11:35:131 Detected process offline:nbpem monitor::main
05/04/11 11:40:141 detected process online when in OFFLINE state: nbevtmgr nbstserv vmd bprd bpdbm nbjm nbemm nbrb NB_dbsrv monitor::main
05/04/11 11:41:161 Detected process offline:nbpem monitor::main
Wed May 4 11:41:16 2011 clean.pl: Start clean.......
05/04/11 11:43:493 Common Offline : Stop command completed. offline::main
Wed May 4 11:43:50 2011 vcs/Online.pl: Start Online.......
05/04/11 11:43:503 Common Online called online::main
05/04/11 11:43:511 Calling start command online::main
Wed May 4 12:12:53 2011 offline.pl: Start Offline.......
05/04/11 12:16:241 Common offline: Stop command failed. offline::main
Wed May 4 12:12:53 2011 offline.pl: offline exited with 99
Wed May 4 12:19:20 2011 vcs/Online.pl: Start Online.......
05/04/11 12:19:203 Common Online called online::main
05/04/11 12:19:201 Calling start command online::main
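To judge how long the daemons survive after each start, the log can be reduced to pairs of "start" and "first failure after that start". The following is a rough sketch that assumes the line layout of the excerpt above (date, time, then the message); the function name is invented for illustration and the timestamps are printed exactly as logged.

```shell
#!/bin/sh
# For each "Calling start command" line, print the first
# "Detected process offline" line that follows it.
# Assumes the AGENT_DEBUG.log layout shown above: date, time, message.
first_failure_after_start() {
    awk '
        /Calling start command/ {
            started = $1 " " $2      # remember when the group was started
            waiting = 1
            next
        }
        waiting && /Detected process offline:/ {
            proc = $5                # field looks like "offline:nbpem"
            sub(/^offline:/, "", proc)
            print "started " started " -> first offline " $1 " " $2 " (" proc ")"
            waiting = 0              # ignore repeats until the next start
        }
    ' "$1"
}

if [ -r "${1:-}" ]; then
    first_failure_after_start "$1"
fi
```

Run against the excerpt above, this should pair the 05/03 09:24 start with the nbpem failure at 07:05 on 05/04, and the 07:54 start with the failure at 11:11, so the second run lasted only a little over three hours.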
05-05-2011 08:57 AM
Please consider upgrading to 6.5.6 and installing the patched nbpem: http://www.symantec.com/docs/TECH141606. There is indeed a memory leak.
05-05-2011 08:31 PM
Seems the memory leak is INTRODUCED with 6.5.6!!
05-05-2011 10:20 PM
I meant that if they upgrade to 6.5.6, this patch must be applied to eliminate the memory leak. It is quite logical.
05-05-2011 10:46 PM
Really? INTRODUCED in 6.5.6 does not by default mean that the memory leak exists prior to 6.5.6.... Well, not according to my logic!
I do agree that there seems to be a problem with nbpem. VCS is merely reacting to nbpem terminating for some reason or other...
We have seen customers experiencing problems with nbpem core dumping and leaving defunct processes, all resolved since they upgraded to 7.0.1.
05-06-2011 04:11 AM
When it comes to Solaris 10, the lesson I got from Support is always to "disable tcp_fusion" first:
http://www.symantec.com/docs/TECH62004
tcp_fusion appears to cause many issues with Netbackup...
05-06-2011 05:24 AM
This is an issue we had at our site. NBPEM does leak memory, but releases it when it reaches a quiet state. In our case NBPEM always had something to do, so it kept allocating memory until it reached 4GB and dumped core.
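A simple way to catch this before the core dump is to sample the size of nbpem from cron and warn when it gets close to the limit. This is only a sketch: using the virtual size reported by `ps` as the measure is an assumption, and the 3 GB warning threshold (`WARN_KB`) and the function name are hypothetical.

```shell
#!/bin/sh
# Warn when a process's virtual size approaches the 4GB point where nbpem dumped core.
# WARN_KB is a hypothetical threshold (3 GB expressed in KB); adjust to taste.
WARN_KB="${WARN_KB:-3145728}"

check_vsz() {
    # $1 = process name, $2 = virtual size in KB, $3 = warning threshold in KB
    if [ "$2" -ge "$3" ]; then
        echo "WARNING: $1 vsz $2 KB >= $3 KB"
    else
        echo "OK: $1 vsz $2 KB"
    fi
}

# Sample every nbpem process currently running (prints nothing if there is none).
ps -e -o vsz= -o comm= 2>/dev/null | awk '$2 ~ /nbpem$/ {print $1}' |
while read -r vsz; do
    check_vsz nbpem "$vsz" "$WARN_KB"
done
```

If the size only ever grows between samples, that points to the "never quiet" pattern described above.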
05-06-2011 07:13 AM
Thank you all, I will check it with a Symantec engineer.
05-06-2011 07:37 AM
I'm sorry, I was not precise enough when writing my previous posts. I wanted to say that they should install the patch AFTER the upgrade to 6.5.6. I didn't imply that the bug exists in 6.5.3. 6.5.3 is quite an outdated version, so, in my opinion, they should upgrade and THEN install the nbpem patch to fix the memory leak bug found in 6.5.6. Sorry for the confusion.
05-06-2011 08:11 AM
The memory leak introduced with 6.5.6 has an EEB to fix -
05-06-2011 09:00 AM
Marcinn - It seems the Netbackup Policy Execution Manager is the culprit, not VCS.
Do you see any core files on the master server? (Is core file creation enabled?)
There are a number of EEBs for NBPEM; I had nbpem version 6 when I ran 6.5.3. It sounds like the best solution is to upgrade to 6.5.6. NBPEM is very stable in this version (see my other note about when NBPEM leaks memory).