06-18-2015 02:50 AM
Hi there,
I have a system where the cleanup script can fail or time out, and I want to execute another script when this happens. I was wondering what the best way of doing this is.
In the Veritas Cluster Server Administrator's Guide for Linux I found the RESNOTOFF trigger.
From the documentation, my understanding is that this trigger is invoked in the following cases:
I have tested this, and RESNOTOFF works in the first scenario but not in the second.
To test the second scenario, I kill the service, and I can see the following message in engine_A.log:
VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(service1) because the resource became OFFLINE unexpectedly, on its own.
When the cleanup fails, I would expect the resource to become UNABLE TO OFFLINE. However, the state of the resource is still ONLINE:
# hares -state service1
#Resource    Attribute    System    Value
service1     State        node1     ONLINE
service1     State        node2     OFFLINE
So the resource stays ONLINE and VCS keeps running the cleanup command indefinitely (and it keeps failing).
Do I need to configure something else to make RESNOTOFF work in this particular scenario?
Thanks,
06-25-2015 03:59 AM
Hi,
The cleanup script (that is, the clean entry point) is invoked in several different scenarios, and the resulting state transition depends on whether the clean entry point succeeds or fails. Since you specifically mentioned "I have a system where the cleanup script can fail/timeout", I will cover only the failure scenarios of the clean entry point.
Scenario # 1
Resource ONLINE --> Resource attempting OFFLINE --> Resource fails to go OFFLINE --> Clean entry point invoked --> Clean entry point fails --> Resource moves to ONLINE|UNABLE TO OFFLINE
Scenario # 2
Resource ONLINE --> Resource unexpectedly went OFFLINE --> Clean entry point invoked --> Clean entry point fails --> If Type::CleanRetryLimit == 0, clean entry point is retried indefinitely --> Until clean entry point succeeds, resource remains ONLINE
Scenario # 3
Resource ONLINE --> Resource unexpectedly went OFFLINE --> Clean entry point invoked --> Clean entry point fails --> If Type::CleanRetryLimit != 0, clean entry point is retried CleanRetryLimit times --> If it still fails, resource moves to ONLINE|ADMIN_WAIT
Scenario # 4
Resource OFFLINE --> Resource attempting ONLINE --> Resource fails to go ONLINE --> Clean entry point invoked --> Clean entry point fails --> Resource moves to OFFLINE|ADMIN_WAIT
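To make the four scenarios easier to compare, here is a small lookup written as a shell function. It is only a summary of the transitions listed above together with the trigger that applies to each one; the function name and the keyword arguments (offline_failed, unexpected_offline, online_failed) are made up for illustration and are not VCS terms.

```shell
#!/bin/sh
# Summarise what happens when the clean entry point FAILS, per the scenarios above.
#   $1 = what led to clean being called (hypothetical keywords):
#        offline_failed | unexpected_offline | online_failed
#   $2 = CleanRetryLimit of the resource type (only used for unexpected_offline)
clean_failure_outcome() {
    case "$1" in
        offline_failed)
            # Scenario 1
            echo "state=ONLINE|UNABLE TO OFFLINE trigger=RESNOTOFF" ;;
        unexpected_offline)
            if [ "${2:-0}" -eq 0 ]; then
                # Scenario 2: clean is retried until it succeeds, no state change
                echo "state=ONLINE trigger=none"
            else
                # Scenario 3: retries are exhausted
                echo "state=ONLINE|ADMIN_WAIT trigger=RESADMINWAIT"
            fi ;;
        online_failed)
            # Scenario 4
            echo "state=OFFLINE|ADMIN_WAIT trigger=RESADMINWAIT" ;;
        *)
            echo "state=unknown trigger=none" ;;
    esac
}

# The case described in the question: service killed, CleanRetryLimit left at 0
clean_failure_outcome unexpected_offline 0
```

With CleanRetryLimit at 0 the lookup reports no state change and no trigger, which matches what you are seeing in `hares -state`.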
RESNOTOFF is invoked on a system when a resource in a service group does not go offline even after the offline command has been issued to the resource. This trigger covers only scenario # 1, which is what you also verified in your test environment.
From your description, you are hitting either scenario # 2 or # 3. RESNOTOFF is not executed in these scenarios; this is expected behavior. You needn’t worry about scenario # 2: the clean entry point is retried indefinitely, and the resource remains ONLINE until one of those attempts succeeds.
For scenarios # 3 and # 4, you can use the RESADMINWAIT trigger, which is invoked when a resource enters the ADMIN_WAIT state.
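Note that for the unexpected-offline case RESADMINWAIT can only help if you are in scenario # 3, i.e. CleanRetryLimit is non-zero for the resource type. A minimal sketch of how that could be set, assuming the resource is of type Application and a limit of 3 (both are placeholders, I don't know your actual type). The function below only prints the commands so you can review them before running anything on the cluster.

```shell
#!/bin/sh
# Print (do not run) the commands that would set CleanRetryLimit on a type.
#   $1 = resource type name (assumption: "Application" in the example call)
#   $2 = number of clean retries before the resource goes to ADMIN_WAIT
clean_retry_commands() {
    type_name="$1"
    limit="$2"
    printf '%s\n' \
        "haconf -makerw" \
        "hatype -modify $type_name CleanRetryLimit $limit" \
        "haconf -dump -makero"
}

clean_retry_commands Application 3
```

This changes the attribute for every resource of that type, so check whether that is acceptable for the other resources before applying it.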
To cover the clean entry point failure scenarios that end in a state change (# 1, # 3 and # 4), you should use both the RESNOTOFF and RESADMINWAIT triggers.
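Since you want to run another script when the cleanup fails, both triggers could call one shared handler. The following is only a sketch: the handler functions, the action names and the path /usr/local/bin/cleanup_fallback.sh are hypothetical, and how VCS passes the system and resource names to each trigger script should be taken from the triggers chapter of the administrator's guide for your release rather than from this example.

```shell
#!/bin/sh
# Hypothetical shared handler to be called from the resnotoff and
# resadminwait trigger scripts.

# Hypothetical follow-up script; replace with your own.
FOLLOWUP_SCRIPT="${FOLLOWUP_SCRIPT:-/usr/local/bin/cleanup_fallback.sh}"

# Pick a follow-up action for a trigger name (action names are made up).
followup_action() {
    case "$1" in
        resnotoff)    echo "force-stop" ;;
        resadminwait) echo "notify-admin" ;;
        *)            echo "ignore" ;;
    esac
}

# $1 = trigger name, $2 = system, $3 = resource
handle_trigger() {
    trigger="$1"
    system="$2"
    resource="$3"
    action=$(followup_action "$trigger")
    # Log line; in a real trigger this would normally go to a log file.
    echo "trigger=$trigger system=$system resource=$resource action=$action"
    # Run the follow-up script only if there is something to do and it exists.
    if [ "$action" != "ignore" ] && [ -x "$FOLLOWUP_SCRIPT" ]; then
        "$FOLLOWUP_SCRIPT" "$action" "$system" "$resource"
    fi
}

# Example call, as the resadminwait trigger script might make it:
# handle_trigger resadminwait node1 service1
```

Keeping the decision in followup_action means the two trigger scripts stay one-liners and the follow-up logic lives in one place.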
Thanks & Regards,
Sunil Y
06-24-2015 12:15 PM
Can you provide the full logs? Above you have:
VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(service1) because the resource became OFFLINE unexpectedly, on its own.
So here the resource is OFFLINE,
and then you have:
# hares -state service1
#Resource    Attribute    System    Value
service1     State        node1     ONLINE
service1     State        node2     OFFLINE
so here the resource is ONLINE.
Mike
08-04-2015 11:07 AM
Gentle reminder! This discussion has been open for the last 1.5 months and keeps appearing in the "Can you solve these?" section. If your query is resolved, please mark the appropriate comment as the solution.
Thanks & Regards,
Sunil Y