03-01-2013 06:39 AM
Hello,
I would like the following behavior from a Veritas cluster that monitors a resource (app):
when the resource fails, first attempt to restart it on the same node; if that does not work, migrate it to the second node.
However, is there another mechanism that forces the resource to migrate directly if it fails too many times within a given timeframe, instead of starting it again on the same node?
When testing, I see different behavior depending on how long I wait between manual kills of the app, and I do not know exactly which settings I have to edit. Basically, the question is: how much time must pass between manual failures of the resource for the cluster to restart it on the _same_ node again?
Config so far: ToleranceLimit = 0, RestartLimit = 1, OnlineTimeout = 300.
03-01-2013 07:18 AM
The attribute you are missing is
ConfInterval
When a resource has remained online for the specified time (in seconds), previous faults and restart attempts are ignored by the agent. (See ToleranceLimit and RestartLimit attributes for details.)
■ Type and dimension: integer-scalar
■ Default: 600 seconds
So with the default ConfInterval of 600 seconds (10 minutes):
With RestartLimit=1, the resource is restarted once on the same node. If it fails again within 10 minutes, it fails over; if it fails after more than 10 minutes, the earlier restart is no longer counted and the resource is restarted again.
With ToleranceLimit=1, the first failure is ignored. If the resource fails again within 10 minutes, it fails over; if it fails after more than 10 minutes, the failure is ignored again.
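To make the timing concrete for your config (RestartLimit=1, ToleranceLimit=0), here is a rough Python sketch of the decision logic described above. It is only a simplified model, not VCS code: the function name `simulate` is made up, the restart is assumed to bring the resource online instantly, and monitor intervals and OnlineTimeout are ignored.

```python
def simulate(fault_times, restart_limit=1, conf_interval=600):
    """Return (time, action) for each fault, stopping at the first failover.

    Simplified model (assumption): a restart brings the resource back
    online instantly, so "time online" is measured from the last restart.
    """
    actions = []
    restarts_used = 0
    online_since = 0  # time (seconds) at which the resource last came online
    for t in fault_times:
        # Online for at least ConfInterval: earlier restarts are forgotten.
        if t - online_since >= conf_interval:
            restarts_used = 0
        if restarts_used < restart_limit:
            restarts_used += 1
            online_since = t
            actions.append((t, "restart"))
        else:
            actions.append((t, "failover"))
            break
    return actions


# Second kill 300 s after the restart: inside ConfInterval, so failover.
print(simulate([100, 400]))
# Second kill 700 s after the restart: counter was reset, so restart again.
print(simulate([100, 800]))
```

In this model, killing the app a second time less than 600 seconds after it came back gives a failover, while waiting 600 seconds or more gives another local restart, which would explain the different behaviors you saw in testing.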
Mike