
NetBackup processes going down

Amar_Rajan
Level 3

Hi All, the nbvault process is going down every night, though not at the same time each night, which causes my catalog backup to fail with status 150. I have checked various logs but can't find any clue. Could you guys please help me with what needs to be checked, and where?

 

Also, on some days all the processes go down and come back up by themselves, and because of this all the running backups are killed with EC 50. Again I can't find the cause; the only hint is in the VCS engine logs, which say:

*******************************************************log SNIP***********************

"clean procedure did not complete within the expected time"

"monitor procedure finished successfully after failing to complete within the expected time for (8) consecutive times"

"Agent is calling clean for resource(NetBackup_$service) because  4 successive invocations of the monitor procedure did not complete within the expected time"

Some Processes are DOWN while others are UP
Following Process are found DOWN: bprd nbjm
Following Process are found UP: vmd bpdbm nbpem nbevtmgr nbemm nbrb NB_dbsrv nbaudit

Looking for NetBackup processes that need to be terminated.
Stopping nbpem...
Stopping nbproxy...
Stopping bpcompatd...
Stopping bpdbm...

 

*************************************************************************

Could you guys please help.

 

 

ACCEPTED SOLUTION

Marianne
Moderator
Partner    VIP    Accredited Certified

We see that someone has changed the Critical attribute to 0.

This means that VCS will NOT take the resource down when the MonitorTimeout is exceeded.

But it seems that someone or something other than VCS has killed some monitored NBU processes:

2015/07/25 05:26:49 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
Some Processes are DOWN while others are UP
Following Process are found DOWN: vmd bprd bpdbm nbpem nbjm
Following Process are found UP: nbevtmgr nbemm nbrb NB_dbsrv nbaudit

This is why the CLEAN entry point was called: 


2015/07/25 05:20:45 VCS ERROR V-16-2-13067 Thread(4145675152) Agent is calling clean for resource(NetBackup_$master)
because the resource became OFFLINE unexpectedly, on its own.

 

You need to enable additional logging as per Martin's suggestion and also check /var/adm/messages for this date and time. 

Ensure logging is enabled for processes that were found to be DOWN.
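To pull the relevant window out of /var/adm/messages, a minimal sketch (the Jul 25 05:2x timestamp pattern is taken from the VCS log above; adjust to the incident you are chasing):

```shell
# Show syslog lines around the time the monitored NBU processes were
# found DOWN (Jul 25 05:2x per the VCS log above). MESSAGES can be
# pointed at a copy of the file for offline analysis.
MESSAGES="${MESSAGES:-/var/adm/messages}"
if [ -r "$MESSAGES" ]; then
  grep 'Jul 25 05:2' "$MESSAGES"
fi
```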


17 Replies

Marianne
Moderator
Partner    VIP    Accredited Certified
Which NBU version? Which VCS version? Hopefully higher than 7.x and taking hot catalog backup? 6.x cold catalog backup will take down NBU emm processes which may be seen as a fault in VCS. It seems that VCS is taking NBU down because of a monitor timeout. A workaround is to increase the VCS monitor timeout for the NBU resource.

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified

Please attach the engine_a and netbackup agent logs as attachments.

mph999
Level 6
Employee Accredited

These are the logs/files I collect for cluster issues - probably overkill for this issue, but for future reference:

mkdir /tmp/sym
cp /etc/VRTSvcs/conf/config/main.cf /tmp/sym/main.cf
cp /usr/openv/netbackup/bin/cluster/AGENT_DEBUG.log /tmp/sym/agent_debug.txt
cp /usr/openv/netbackup/bin/cluster/NBU_RSP /tmp/sym/nbu_rsp.txt
cp /usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf /tmp/sym/nbu_types_cf.txt
cd /etc/VRTSvcs/conf/ ;tar cvf /tmp/sym/VRTSvcs.tar config
cp /var/VRTSvcs/log/engine_A.log /tmp/sym/engine_a.txt
cp /var/VRTSvcs/log/NetBackup_A.log /tmp/sym/netbackup_a.txt
cp /var/adm/messages /tmp/sym/messages.txt
cp /usr/openv/netbackup/logs/cluster/log.<date> /tmp/sym/cluster_log.txt

Note: at 7.x, but before 7.1.x:

Set DEBUG_LEVEL=1 in the /usr/openv/netbackup/bin/cluster/NBU_RSP configuration file. Prior to 7.1 this is needed to get sufficient detail in cluster.log to determine the process that caused the cluster to fail; it is not required at or after 7.1.

6.5.x is way out of support, but there the file showing which processes were down was different: AGENT_DEBUG.log.

Marianne
Moderator
Partner    VIP    Accredited Certified

I've been thinking about this one since opening it this morning - I have clustered more NBU masters on Solaris and Linux than I can remember, since 5.1 days.

I have never seen anything that is described here.

nbvault going down and causing catalog backup to fail?
No idea how the 2 are related? 
Or is this a vault catalog backup?
And why would nbvault go down?

Or is nbvault going down when the monitor timeout kicks in and all processes are taken down?

The logs and config files that Martin requested will hopefully help to understand what is happening here...

mph999
Level 6
Employee Accredited

I was wondering the same thing ...

Amar_Rajan
Level 3

Thanks for the response guys.

NetBackup version is 7.1

VCS version 5.0 MP3

I have posted the available logs. Just FYI, the cluster setup is similar on all the masters, but only this master is having this issue. The env is heavily customized.

Agent_debug.log:

Fri Jul 24 20:54:35 2015 Start Offline.......

Fri Jul 24 20:57:37 2015 Start Offline.......

Fri Jul 24 21:08:39 2015 Start Offline.......

Sat Jul 25 05:15:41 2015 Start Offline.......

Sat Jul 25 05:18:43 2015 Start Offline.......

Sat Jul 25 05:20:45 2015 Start Offline.......

Sat Jul 25 05:23:47 2015 Start Offline.......

Sat Jul 25 05:26:49 2015 Start Offline.......

Sat Jul 25 05:28:50 2015 Start Offline.......

Sat Jul 25 05:31:20 2015 Start Online.......

Sun Jul 26 19:04:34 2015 Start Offline.......

Sun Jul 26 19:07:36 2015 Start Offline.......

Sun Jul 26 19:10:38 2015 Start Offline.......

Sun Jul 26 19:13:40 2015 Start Offline.......

 

nbu_rsp:

NBU_GROUP=$master
NODES=$nodes
SHARED_DISK=/usr/openv
VNAME=$master
PROBE_PROCS=vmd bprd bpdbm nbpem nbjm nbevtmgr nbemm nbrb NB_dbsrv nbaudit
CLUTYPE=VCS
PRODUCT_CODE=NBU
START_PROCS=NB_dbsrv nbevtmgr nbemm nbrb ltid vmd bpcompatd nbjm nbpem nbstserv nbrmms nbsl nbvault nbsvcmon bpdbm bprd bptm bpbrmds bpsched bpcd bpversion bpjobd nbproxy vltcore acsd tl8cd odld tldcd tl4d tlmd tshd rsmd tlhcd pbx_exchange nbkms nbaudit nbatd nbazd
DIR=kms mv

 

nbu_types_cf -- No such file (the env is customized, for your information)

 

engine log:

 

2015/07/24 21:09:40 VCS ERROR V-16-2-13006 (server node) Resource(NetBackup_$master): clean procedure did not complete
within the expected time.
2015/07/24 21:11:26 VCS INFO V-16-2-13026 (server node) Resource(NetBackup_$master) - monitor procedure finished succes
sfully after failing to complete within the expected time for (4) consecutive times.
2015/07/25 01:50:52 VCS INFO V-16-1-50135 User root fired command: haconf -dump from localhost
2015/07/25 05:09:43 VCS ERROR V-16-2-13027 (server node) Resource(NetBackup_$master) - monitor procedure did not comple
te within the expected time.
2015/07/25 05:15:41 VCS ERROR V-16-2-13210 (server node) Agent is calling clean for resource(NetBackup_$master) because
 4 successive invocations of the monitor procedure did not complete within the expected time.
2015/07/25 05:16:42 VCS ERROR V-16-2-13006 (server node) Resource(NetBackup_$master): clean procedure did not complete
within the expected time.
2015/07/25 05:20:45 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
Some Processes are DOWN while others are UP
Following Process are found DOWN: nbjm
Following Process are found UP: vmd bprd bpdbm nbpem nbevtmgr nbemm nbrb NB_dbsrv nbaudit
2015/07/25 05:20:45 VCS INFO V-16-2-13026 (server node) Resource(NetBackup_$master) - monitor procedure finished succes
sfully after failing to complete within the expected time for (5) consecutive times.
2015/07/25 05:20:45 VCS ERROR V-16-2-13067 (server node) Agent is calling clean for resource(NetBackup_$master) because
 the resource became OFFLINE unexpectedly, on its own.
2015/07/25 05:21:46 VCS ERROR V-16-2-13006 (server node) Resource(NetBackup_$master): clean procedure did not complete
within the expected time.
2015/07/25 05:22:47 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
Some Processes are DOWN while others are UP
Following Process are found DOWN: bprd nbjm
Following Process are found UP: vmd bpdbm nbpem nbevtmgr nbemm nbrb NB_dbsrv nbaudit
2015/07/25 05:23:47 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
Some Processes are DOWN while others are UP
Following Process are found DOWN: bprd nbpem nbjm
Following Process are found UP: vmd bpdbm nbevtmgr nbemm nbrb NB_dbsrv nbaudit
2015/07/25 05:24:48 VCS INFO V-16-2-13003 (server node) Resource(NetBackup_$master): Output of the timed out operation
(clean)

Looking for NetBackup processes that need to be terminated.
Stopping nbpem...
Stopping nbproxy...
Stopping bpcompatd...
Stopping bpdbm...

The following processes are still active
root      3599     1  0 Jul24 ?        00:06:05 /usr/openv/netbackup/bin/bpdbm
root      3605  3599  3 Jul24 ?        00:32:37 /usr/openv/netbackup/bin/bpjobd
root      5370     1  0 Jul24 ?        00:00:05 /usr/openv/netbackup/bin/nbproxy dblib nbpem_email


root     10592 24181  0 05:22 ?        00:00:00 /usr/openv/netbackup/bin/admincmd/bpdbjobs -cancel 1108528
root     10595 24222  0 05:22 ?        00:00:00 /usr/openv/netbackup/bin/admincmd/bpdbjobs -summary -ignore_parent_job
s -all_columns
root     10637 24185  0 05:22 ?        00:00:00 /usr/openv/net
2015/07/25 05:25:49 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
Some Processes are DOWN while others are UP
Following Process are found DOWN: vmd bprd bpdbm nbpem nbjm
Following Process are found UP: nbevtmgr nbemm nbrb NB_dbsrv nbaudit
2015/07/25 05:26:49 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
Some Processes are DOWN while others are UP
Following Process are found DOWN: vmd bprd bpdbm nbpem nbjm
Following Process are found UP: nbevtmgr nbemm nbrb NB_dbsrv nbaudit
2015/07/25 05:27:49 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(clean)

Looking for NetBackup processes that need to be terminated.

Looking for Media Manager processes that need to be terminated.

Looking for more NetBackup processes that need to be terminated.
Stopping nbrb...
Stopping nbemm...
Stopping nbaudit...
Stopping nbevtmgr...
Stopping nbazd...
Stopping VxDBMS database server ...
Stopping bpcd...
Stopping vnetd...
Stopping nbatd...
/usr/openv/netbackup/bin/bp.kill_all FORCEKILL  2>&1 < /dev/null succeeded.
2015/07/25 05:29:22 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(clean)

Looking for NetBackup processes that need to be terminated.

Looking for Media Manager processes that need to be terminated.

Looking for more NetBackup processes that need to be terminated.
Stopping bpcd...
Stopping vnetd...
/usr/openv/netbackup/bin/bp.kill_all FORCEKILL  2>&1 < /dev/null succeeded.
2015/07/25 05:29:22 VCS INFO V-16-2-13078 (server node) Resource(NetBackup_$master) - clean completed successfully afte
r 3 failed attempts.
2015/07/25 05:29:22 VCS ERROR V-16-2-13073 (server node) Resource(NetBackup_$master) became OFFLINE unexpectedly on its
 own. Agent is restarting (attempt number 1 of 2) the resource.
2015/07/25 05:31:33 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(online)
no new style logging available
2015/07/25 05:31:34 VCS NOTICE V-16-2-13076 (server node) Agent has successfully restarted resource(NetBackup_$master).
2015/07/25 10:21:34 VCS ERROR V-16-2-13027 (server node) Resource(NetBackup_$master) - monitor procedure did not comple
te within the expected time.
2015/07/25 10:26:56 VCS INFO V-16-2-13026 (server node) Resource(NetBackup_$master) - monitor procedure finished succes
sfully after failing to complete within the expected time for (3) consecutive times.
2015/07/25 17:12:44 VCS INFO V-16-1-50135 User root fired command: haconf -dump from localhost
2015/07/25 17:14:58 VCS INFO V-16-1-50135 User root fired command: haconf -dump from localhost
2015/07/26 00:45:59 VCS INFO V-16-2-13001 (server node) Resource(NetBackup_$master): Output of the completed operation
(monitor)
do_ypcall: clnt_call: RPC: Timed out
2015/07/26 08:40:28 VCS INFO V-16-1-50135 User root fired command: haconf -dump from localhost
2015/07/26 08:42:36 VCS INFO V-16-1-50135 User root fired command: haconf -dump from localhost

 

 

Netbackup resource log (vcs):
2015/07/24 21:11:26 VCS INFO V-16-2-13026 Thread(4133485456) Resource(NetBackup_$master) - monitor procedure finished
 successfully after failing to complete within the expected time for (4) consecutive times.
2015/07/25 05:09:42 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 05:09:43 VCS ERROR V-16-2-13027 Thread(4145675152) Resource(NetBackup_$master) - monitor procedure did not
 complete within the expected time.
2015/07/25 05:11:40 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/25 05:13:40 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 05:15:40 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/25 05:15:41 VCS ERROR V-16-2-13210 Thread(4133485456) Agent is calling clean for resource(NetBackup_$master)
because 4 successive invocations of the monitor procedure did not complete within the expected time.
2015/07/25 05:16:41 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 05:16:42 VCS ERROR V-16-2-13006 Thread(4145675152) Resource(NetBackup_$master): clean procedure did not co
mplete within the expected time.
2015/07/25 05:18:42 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/25 05:19:43 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 05:20:45 VCS INFO V-16-2-13026 Thread(4145675152) Resource(NetBackup_$master) - monitor procedure finished
 successfully after failing to complete within the expected time for (5) consecutive times.
2015/07/25 05:20:45 VCS ERROR V-16-2-13067 Thread(4145675152) Agent is calling clean for resource(NetBackup_$master)
because the resource became OFFLINE unexpectedly, on its own.
2015/07/25 05:21:45 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/25 05:21:46 VCS ERROR V-16-2-13006 Thread(4133485456) Resource(NetBackup_$master): clean procedure did not co
mplete within the expected time.
2015/07/25 05:24:47 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 05:27:49 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/25 05:29:22 VCS ERROR V-16-2-13078 Thread(4133485456) Resource(NetBackup_$master) - clean completed successfu
lly after 3 failed attempts.
2015/07/25 05:29:22 VCS ERROR V-16-2-13073 Thread(4133485456) Resource(NetBackup_$master) became OFFLINE unexpectedly
 on its own. Agent is restarting (attempt number 1 of 2) the resource.
2015/07/25 10:21:33 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 10:21:34 VCS ERROR V-16-2-13027 Thread(4145675152) Resource(NetBackup_$master) - monitor procedure did not
 complete within the expected time.
2015/07/25 10:23:33 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/25 10:25:33 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4133485456)
2015/07/25 10:26:56 VCS INFO V-16-2-13026 Thread(4145675152) Resource(NetBackup_$master) - monitor procedure finished
 successfully after failing to complete within the expected time for (3) consecutive times.
2015/07/26 18:38:33 VCS WARNING V-16-2-13139 Thread(4156165008) Canceling thread (4145675152)
2015/07/26 18:38:34 VCS ERROR V-16-2-13027 Thread(4133485456) Resource(NetBackup_$master) - monitor procedure did not
 complete within the expected time.


 

@Marianne :

We aren't taking a vault catalog backup, just the regular images catalog backup. We have been trying to resize the catalog partition for 2 weeks, but due to the unsuccessful catalog backups we are sitting tight.

 

Marianne
Moderator
Partner    VIP    Accredited Certified

All I can see here is that the MonitorTimeout for NBU resource seems to be too small.

monitor procedure did not
 complete within the expected time.

 

nbvault is not the cause here. VCS is only monitoring these processes:

PROBE_PROCS=vmd bprd bpdbm nbpem nbjm nbevtmgr nbemm nbrb NB_dbsrv nbaudit 

nbvault is simply terminated when NBU is taken down because of the MonitorTimeout.

 

The offline action is also taking too long to complete, which makes VCS get its 'knickers in a knot'.

 

Please increase all Timeouts for the NetBackup resource:

OnlineTimeout
MonitorTimeout
OfflineTimeout

This has worked for me in the past in large environments where NBU got restarted 'out of the blue' due to MonitorTimeout being too small.
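As a sketch, these timeouts could be raised per-resource with the VCS CLI. NetBackup_$master is the sanitized resource name from this thread, and the values are examples only; measure your own environment first, as the Admin Guide extract below recommends.

```shell
# Sketch only -- needs a live VCS cluster and root. NetBackup_$master is
# the sanitized resource name used in this thread; substitute your own.
# Values are illustrative, not a definitive prescription.
haconf -makerw                                      # open the config read-write
hares -override NetBackup_$master MonitorTimeout    # make the type-level attribute resource-local
hares -modify NetBackup_$master MonitorTimeout 120
hares -override NetBackup_$master OfflineTimeout
hares -modify NetBackup_$master OfflineTimeout 600
hares -override NetBackup_$master OnlineTimeout
hares -modify NetBackup_$master OnlineTimeout 600
haconf -dump -makero                                # save and close the config
```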

 

Extract from VCS Admin Guide:

Monitoring resource type and agent configuration

......

You can ... adjust how often VCS monitors various functions by modifying
their associated attributes. The attributes MonitorTimeout, OnlineTimeOut, and
OfflineTimeout indicate the maximum time (in seconds) within which the
monitor, online, and offline functions must complete or else be terminated.

The default for the MonitorTimeout attribute is 60 seconds. The defaults for the OnlineTimeout and OfflineTimeout attributes are 300 seconds.

For best results, Symantec recommends measuring the time it takes to bring a resource online,
take it offline, and monitor before modifying the defaults. Issue an online or
offline command to measure the time it takes for each action. To measure how
long it takes to monitor a resource, fault the resource and issue a probe, or bring
the resource online outside of VCS control and issue a probe.

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified

Hi,

 

What you could do is freeze the service group, or force-stop had completely. That will allow you to see which NBU processes, if any, are going down and causing the cluster to take down all the other processes. If no processes are found to be down, then you can assume it is the monitor procedure that is taking too long, and that the cluster is actually restarting the services because of that. In that case you can increase the monitor timeouts.
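The freeze approach can be sketched as follows ($master is the sanitized service group name from this thread; a persistent freeze survives a had restart):

```shell
# Sketch only -- requires a live VCS cluster and root. $master is the
# sanitized service group name used in this thread.
haconf -makerw
hagrp -freeze $master -persistent    # VCS stops reacting to resource faults
haconf -dump -makero

# ...run backups and watch the NBU processes yourself, e.g.:
/usr/openv/netbackup/bin/bpps -a

# When done, unfreeze:
haconf -makerw
hagrp -unfreeze $master -persistent
haconf -dump -makero
```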

Amar_Rajan
Level 3

Thanks for the info, Marianne. Let me check with the eng team. I think the monitor timeout could help, but I am not sure about the other timeouts, since this is not happening during our weekly failovers but at various other times.

Is there any other place/log in NetBackup, other than VCS, where I can try to track what's going on with the NBU processes?

Amar_Rajan
Level 3

Thanks Riaan, let me check that as well.

mph999
Level 6
Employee Accredited

As Riaan suggests, freeze the cluster, run NBU and see if anything goes down.  If so, we then know what.  If not, the cluster is reporting incorrectly.

The nice thing about that method, is no logs are needed.

Marianne
Moderator
Partner    VIP    Accredited Certified

The problem with log snippets is that we cannot see where the issue was first detected. 
Please copy engine_a log to engine_a.txt and upload as File attachment.

All I can see is the MonitorTimeout that kicked in. This will cause VCS to take action as per configuration.

The default for NBU is to offline the resource and start again on same node before failover will be attempted.
Another problem kicks in if the NBU processes cannot be stopped within the OfflineTimeout and the clean action needs to be called.
We see that happening in the log snippet and the clean procedure also takes too long to kill all running NBU processes.

main.cf, types.cf and NetBackupTypes.cf will tell us what kind of resource customisation is done.
You can also copy these files to .txt files and upload.

As per my previous post - all of this is probably the same as I have seen in large environments - that the MonitorTimeout of 60 seconds is too short. 
Changing this Timeout to 120 seconds fixed the issue.

It will be good to increase the online and offline timeouts as well.

Freeze the ServiceGroup, then stop NBU.
Record how long NBU is taking to go down.
Bear in mind that in an active, live environment where lots of backups are running, it will take longer to stop NBU than doing this at a time when no backups are running.

The time that NBU takes to go down in both the active and non-active scenarios needs to be checked.
Adjust the OfflineTimeout to allow enough time for NBU processes to be taken down before the timeout kicks in.

Start NBU and record how long it takes for all processes to come online.
(The default OnlineTimeout of 5 min is normally sufficient.) 

Adjust timeouts accordingly.
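With the group frozen, the stop/start timings above can be measured with something like the following (a sketch; the bp.kill_all path appears in this thread's clean-procedure output, and the goodies/netbackup startup script path is an assumption for this 7.1 environment):

```shell
# Sketch: time a full NBU stop and start, with the service group frozen
# so VCS does not react. Repeat during a busy window and a quiet window,
# and size OfflineTimeout/OnlineTimeout from the worst case observed.
time /usr/openv/netbackup/bin/bp.kill_all
time /usr/openv/netbackup/bin/goodies/netbackup start
```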

So - my suggestions in summary:

1. Increase MonitorTimeout

2. Increase OfflineTimeout 

Amar_Rajan
Level 3

Hi Guys, thanks for the suggestions.

Yes, the environment is quite big, with a lot of clients. Unfortunately we can't try the fixes immediately, so we are planning to schedule these tryouts. I am working with one of our seniors; let's see how it goes.

First: We are going to check the timeout issue

Second: If increasing the timeouts doesn't yield any result, then the freezing option may be tried before our weekly reboots.

 

For the engine log, if you look at the snippet, I gave it from 07/24, when the processes were running fine, through 07/25. There are no errors or information showing what caused the processes to stop.

Also, the main.cf and types.cf are simple; there is not much customization.

 

include "types.cf"

cluster $clusid (
        UserNames = { VCSGuest = XXXXXXXXXXX }
        ClusterLocation = "use_alt_vcs_llt_ports:nopublic"
        )

system $node1 (
        )

system $node2 (
        )

group $master (
        SystemList = { $node1 = 0, $node2 = 1 }
        UserStrGlobal = NBU
        AutoStartList = { $node1, $node2 }
        )

        DiskGroup DiskGroup_$master (
                ResourceOwner = root
                DiskGroup = "$master.gnr.0"
                )

        Mount Mount_$master_d1 (
                ResourceOwner = root
                BlockDevice = "/dev/vx/dsk/$master.gnr.0/gnr.0"
                FSType = ext3
                MountPoint = "/d/$master/d1"
                )

        Mount Mount_$master_d2 (
                ResourceOwner = root
                BlockDevice = "/dev/vx/dsk/$master.gnr.0/gnr.1"
                FSType = ext3
                MountPoint = "/d/$master/d2"
                )

        Mount Mount_$master_d3 (
                ResourceOwner = root
                BlockDevice = "/dev/vx/dsk/$master.gnr.0/gnr.2"
                FSType = ext3
                MountPoint = "/d/$master/d3"
                )

        NetBackup NetBackup_$master (
                Critical = 0
                ResourceOwner = root
                ServerType = NBUMasterwoMM
                )

        ZIP ZIP_$master (
                ResourceOwner = root
                ServiceAddress = "10.195.205.45"
                )

        Mount_$master_d1 requires DiskGroup_$master
        Mount_$master_d2 requires DiskGroup_$master
        Mount_$master_d3 requires DiskGroup_$master
        NetBackup_$master requires Mount_$master_d1
        NetBackup_$master requires Mount_$master_d2
        NetBackup_$master requires Mount_$master_d3
        NetBackup_$master requires ZIP_$master

*********************************************************************************

Thanks

mph999
Level 6
Employee Accredited

The engine log doesn't tell you which process the cluster detected (rightly or wrongly) as stopped.

You want cluster.log for that (at 7.x)

cp /usr/openv/netbackup/logs/cluster/log.<date> /tmp/sym/cluster_log.txt

Note: at 7.x, but before 7.1.x:

Set DEBUG_LEVEL=1 in the /usr/openv/netbackup/bin/cluster/NBU_RSP configuration file. Prior to 7.1 this is needed to get sufficient detail in cluster.log to determine the process that caused the cluster to fail; it is not required at or after 7.1.
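That setting could be applied with a small guard that appends it only if absent (NBU_RSP path as given in this thread; a sketch, and back the file up first):

```shell
# Sketch: add DEBUG_LEVEL=1 to NBU_RSP if it is not already set.
# RSP defaults to the path given in this thread; the script is a no-op
# if the file does not exist or the setting is already present.
RSP="${RSP:-/usr/openv/netbackup/bin/cluster/NBU_RSP}"
if [ -f "$RSP" ] && ! grep -q '^DEBUG_LEVEL=' "$RSP"; then
  cp "$RSP" "$RSP.bak"            # keep a backup of the original
  echo 'DEBUG_LEVEL=1' >> "$RSP"
fi
```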

Amar_Rajan
Level 3

Hi Martin, unfortunately there is no such folder /usr/openv/netbackup/logs/cluster/ in our setup. That's why I couldn't give it to you.

mph999
Level 6
Employee Accredited

You may need to create it in that case ...

 
