cluster behavior needed, which cfg vars to modify

pb227
Not applicable

Hello,

 

I would like the following behavior from a Veritas cluster that monitors a resource (an application):

When the resource fails, the cluster first tries to restart it on the same node; if that does not work, it migrates the resource to the second node.

Is there also a setting that forces the resource to migrate directly if it fails too many times within a given timeframe, instead of starting it again on the same node?

When testing, I see different behavior depending on how long I wait between manually killing the app, and I do not know exactly which settings I have to edit. Basically, the question is: how long do I have to wait between manual failures of the resource so that the cluster restarts it again on the _same_ node?

 

Configuration so far: ToleranceLimit = 0, RestartLimit = 1, OnlineTimeout = 300.

 

1 ACCEPTED SOLUTION

mikebounds
Level 6
Partner Accredited

The attribute you are missing is ConfInterval:

When a resource has remained online for the specified time (in
seconds), previous faults and restart attempts are ignored by
the agent. (See ToleranceLimit and RestartLimit attributes for
details.)
■ Type and dimension: integer-scalar
■ Default: 600 seconds

So with the default ConfInterval of 600 seconds (10 minutes):

With RestartLimit=1, a resource is restarted once. If it fails again within 10 minutes of coming back online, it fails over; if it fails after more than 10 minutes, it is restarted again.

With ToleranceLimit=1, a failure is ignored the first time. If the resource fails again within 10 minutes, it fails over; if it fails after more than 10 minutes, the failure is ignored again.
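To make the timing concrete, here is a minimal Python sketch of the RestartLimit and ConfInterval interaction described above. It is only a simplified model for reasoning about test results, not how the VCS agent framework is actually implemented. The class name `RestartPolicy`, the `on_failure` method, and the assumptions that a restart succeeds instantly and that the counter resets at exactly ConfInterval seconds are all hypothetical. In a real cluster a failure is only noticed on the next monitor cycle, so timings measured from a manual kill are approximate.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    """Simplified, hypothetical model of RestartLimit plus ConfInterval."""
    restart_limit: int = 1      # RestartLimit
    conf_interval: int = 600    # ConfInterval, in seconds
    restarts_used: int = 0      # restart attempts counted so far
    online_since: float = 0.0   # time the resource last came online

    def on_failure(self, now: float) -> str:
        """Return 'restart' or 'failover' for a failure seen at time `now`."""
        # Online for at least ConfInterval: earlier restarts are forgotten.
        if now - self.online_since >= self.conf_interval:
            self.restarts_used = 0
        if self.restarts_used < self.restart_limit:
            self.restarts_used += 1
            # Assumption: the restart succeeds and the resource is online at once.
            self.online_since = now
            return "restart"
        return "failover"


# Second failure 300 s after the restart: inside ConfInterval.
quick = RestartPolicy()
print(quick.on_failure(now=1000), quick.on_failure(now=1300))

# Second failure 700 s after the restart: outside ConfInterval.
slow = RestartPolicy()
print(slow.on_failure(now=1000), slow.on_failure(now=1700))
```

Under these assumptions, with the settings from the question (RestartLimit = 1, default ConfInterval), the model restarts on the same node for the second kill only if the resource had stayed online for at least 600 seconds since the previous restart.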

Mike
