Understanding RestartLimit for a non-critical resource

Hello,

We have some trouble with our Oracle listener process. Sometimes the listener is killed by VCS, and we don't know why.

xxx VCS ERROR V-16-2-13027 (node1) Resource(lsnr-ORADB1) - monitor procedure did not complete within the expected time.
xxx VCS ERROR V-16-2-13210 (node1) Agent is calling clean for resource(lsnr-ORADB1) because 4 successive invocations of the monitor procedure did not complete within the expected time.
xxx VCS NOTICE V-16-20002-42 (node1) Netlsnr:lsnr-ORADB1:clean:Listener(LISTENER) kill TERM  2342
xxx VCS INFO V-16-2-13068 (node1) Resource(lsnr-ORADB1) - clean completed successfully.
xxx VCS INFO V-16-2-13026 (node1) Resource(lsnr-ORADB1) - monitor procedure finished successfully after failing to complete within the expected time for (4) consecutive times.
xxx VCS INFO V-16-1-10307 Resource lsnr-ORADB1 (Owner: unknown, Group: ORADB1) is offline on node1 (Not initiated by VCS)

However, Resource(lsnr-ORADB1) is set to non-critical to prevent a failover. I'll now set a RestartLimit for Resource(lsnr-ORADB1) to let the cluster try to restart the listener, but what happens if that fails? Will the resource stay offline, or will the cluster initiate a failover of the whole ResourceGroup?

Thanks in advance for any help!

 

3 Replies

Hi, Does the lsnr get

Hi,

 

Does the lsnr get started by the agent? Does it start OK? Does a subsequent probe of the lsnr find it online or not? Does the scenario only occur after some time?

 

I think you need to investigate why the lsnr is reporting as not being online instead of trying to change the VCS behaviour.

 

If you believe it's online, VCS should not report that it's not.

The reason seems to be clear

The reason seems to be clear in the log snippet:

monitor procedure did not complete within the expected time.

You may want to increase the MonitorTimeout value.
The default is 60 sec. 

Extract from vcs_admin guide:
For best results, Symantec recommends measuring the time it takes to bring a resource online, take it offline, and monitor before modifying the defaults. Issue an online or offline command to measure the time it takes for each action. To measure how long it takes to monitor a resource, fault the resource and issue a probe, or bring the resource online outside of VCS control and issue a probe.

Also have a look at FaultOnMonitorTimeouts attribute.
Extract from vcs_admin manual:

The FaultOnMonitorTimeouts attribute defines whether VCS interprets a Monitor function timeout as a resource fault.
If the attribute is set to 0, VCS does not treat Monitor timeouts as a resource faults.
If the attribute is set to 1, VCS interprets the timeout as a resource fault and the agent calls the Clean function to shut the resource down.
By default, the FaultOnMonitorTimeouts attribute is set to 4. This means that the Monitor function must time out four times in a row before the resource is marked faulted. The first monitor time out timer and the counter of time outs are reset after one hour of the first monitor time out.
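
To make the counting rule quoted above concrete, here is a toy Python model of it. This is only a sketch, not VCS code: the function name `first_clean_call` is made up for illustration, and it assumes that a monitor cycle that completes in time resets the count (the guide says "in a row"). It ignores the one-hour reset timer.

```python
def first_clean_call(timeouts, fault_on_monitor_timeouts=4):
    """Return the 1-based monitor cycle at which clean would be called.

    timeouts: sequence of bools, True if that monitor cycle timed out.
    Returns None if the threshold is never reached.
    Illustrative model only; the name and signature are hypothetical.
    """
    if fault_on_monitor_timeouts == 0:
        # 0 means monitor timeouts are never treated as a resource fault.
        return None
    consecutive = 0
    for cycle, timed_out in enumerate(timeouts, start=1):
        # A monitor that completes in time breaks the run of timeouts.
        consecutive = consecutive + 1 if timed_out else 0
        if consecutive >= fault_on_monitor_timeouts:
            return cycle
    return None


# Four timeouts in a row with the default of 4: clean on the 4th cycle.
print(first_clean_call([True, True, True, True]))
# Same timeouts with the attribute set to 0: clean is never called.
print(first_clean_call([True, True, True, True], fault_on_monitor_timeouts=0))
```

With the default of 4, the log in the original post matches the first case: clean is called after the fourth successive timeout.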

 

You may also want to read through this topic in the Admin Guide:

How VCS handles resource faults

Hi, Before going into details

Hi,

Before going into the details of RestartLimit, let's go over the root cause of the issue.

#1: Resource(lsnr-ORADB1) - monitor procedure did not complete within the expected time.

Using predefined parameters, the monitor procedure determines whether the managed object (in this case, the Oracle listener) is healthy. MonitorTimeout (in seconds) is the maximum time the monitor procedure is allowed to take to determine that. In this particular case, for some as-yet-unknown reason, the monitor procedure isn't completing within that time. You can check the agent's log for more details. To be on the safe side, you can increase the MonitorTimeout value (default 60 seconds). This gives the monitor procedure additional time to probe the state of the Oracle listener.

#2: Sometimes the listener is killed by VCS. We don't know why.

This is a byproduct of #1. As you can observe in the logs, "Agent is calling clean for resource(lsnr-ORADB1) because 4 successive invocations of the monitor procedure did not complete within the expected time." After 4 successive monitor timeouts, VCS starts its resource fault routine. This behavior is controlled by the FaultOnMonitorTimeouts attribute (default value 4), which defines whether VCS interprets a Monitor function timeout as a resource fault. When the monitor times out as many times in a row as the value specified, the resource is brought down by calling the clean function. That is what happened here: the Oracle listener was terminated by the clean procedure, which was invoked after 4 monitors timed out. You can increase the FaultOnMonitorTimeouts value so that more monitors are attempted before the clean procedure is initiated. To be on the safe side, you may also disable this behavior by setting FaultOnMonitorTimeouts to 0.

#3: Setting RestartLimit

RestartLimit (default 0) is the number of times VCS retries bringing a resource online after it goes offline unexpectedly, before VCS declares it FAULTED. You can set this to a value greater than 0. The ResourceGroup will then not be failed over immediately if the Oracle listener faults. VCS will at least retry before initiating the service group's failover/offline.
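
As a rough illustration of the retry behaviour described above, here is a small Python sketch. It is a simplified model, not how the agent is implemented: the function name `state_after_unexpected_offline` is hypothetical, and it does not model any time window over which restarts are counted.

```python
def state_after_unexpected_offline(restart_outcomes, restart_limit=0):
    """Return "ONLINE" if a restart succeeds within RestartLimit, else "FAULTED".

    restart_outcomes: iterable of bools, True if that restart attempt
    brought the resource online. Hypothetical helper for illustration.
    """
    # zip() stops after restart_limit attempts, so extra outcomes are ignored.
    for _, succeeded in zip(range(restart_limit), restart_outcomes):
        if succeeded:
            return "ONLINE"
    # No attempts allowed (limit 0) or all allowed attempts failed.
    return "FAULTED"


# Default RestartLimit of 0: no retry, the resource is declared FAULTED.
print(state_after_unexpected_offline([True]))
# RestartLimit of 2: the second restart attempt succeeds.
print(state_after_unexpected_offline([False, True], restart_limit=2))
```

Only when the sketch returns "FAULTED" does the question in #4 come into play.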

#4: What if the restart attempts fail? Will the resource stay offline, or will the cluster initiate a failover of the whole ResourceGroup?

If Critical==1,

    VCS will try to fail over the whole ResourceGroup.

Else If Critical==0 AND there is a Critical==1 parent resource in the dependency tree

    VCS will try to fail over the whole ResourceGroup.

Else If Critical==0

    The Oracle listener resource will remain offline and the service group will be in the PARTIAL state. VCS will not initiate a failover of the ResourceGroup.
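
The three cases above can be sketched as a short Python function. This only encodes the rule as stated in this reply. The function name and the parent resource name `app-parent` are made up for the example, and group-level attributes that can also influence failover are not modelled.

```python
def fault_triggers_failover(resource, critical, parents):
    """True if a fault of `resource` reaches a Critical resource.

    critical: dict of resource name -> bool (value of Critical).
    parents:  dict of resource name -> list of parent resource names.
    Hypothetical helper; both dict layouts are assumptions of this sketch.
    """
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Resources not listed are treated as Critical, the VCS default.
        if critical.get(current, True):
            return True
        # Keep walking up the dependency tree through the parents.
        stack.extend(parents.get(current, []))
    return False


critical = {"lsnr-ORADB1": False, "app-parent": True}

# Non-critical listener with no parents: stays offline, group is PARTIAL.
print(fault_triggers_failover("lsnr-ORADB1", critical, {}))
# Same listener under a critical parent: VCS tries to fail over the group.
print(fault_triggers_failover("lsnr-ORADB1", critical,
                              {"lsnr-ORADB1": ["app-parent"]}))
```

So for the listener in the original post, the answer depends on whether any resource above it in the dependency tree is still marked Critical.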

 

Thanks & Regards,

Sunil Y