Solved: Trigger after failed cleanup script

javierrv · 06-18-2015

Hi there,

I have a system where the cleanup script can fail or time out, and I want to execute another script when this happens. I was wondering what the best way of doing this would be.

In the Veritas Cluster Server Administrator's Guide for Linux I found the RESNOTOFF trigger.

From the documentation, my understanding is that this trigger is invoked in the following cases:

1. A resource fails to go offline (offline initiated by VCS) and the cleanup fails.
2. A resource goes offline unexpectedly and the cleanup fails.

I have tested this, and RESNOTOFF fires in the first scenario but not in the second.

For testing the second scenario I kill the service and I can see the following message in the engine_A.log:

VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(service1) because the resource became OFFLINE unexpectedly, on its own.

When the cleanup fails I would expect the resource to become UNABLE TO OFFLINE. However, the status of the resource is still ONLINE:

# hares -state service1
#Resource                    Attribute             System     Value
service1                         State                 node1      ONLINE
service1                         State                 node2      OFFLINE

So the resource is ONLINE and VCS keeps running the cleanup command indefinitely (which is failing).
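
For reference, the state check above can be scripted. This is only a minimal sketch: it assumes the column layout of the `hares -state` output quoted above (Resource, Attribute, System, Value), and `resource_state_on` is a hypothetical helper name, not a VCS command.

```shell
#!/bin/sh
# Print the State value that `hares -state <resource>` reports for one system.
# The command output is read on stdin, so this can be tried without a cluster.
# resource_state_on is a hypothetical helper name, not part of VCS.
resource_state_on() {
    # Skip lines starting with "#" (the header); match the State row of the
    # requested system and print everything from the fourth column onward,
    # so multi-word values such as "ONLINE|UNABLE TO OFFLINE" survive intact.
    awk -v sys="$1" '!/^#/ && $2 == "State" && $3 == sys {
        $1 = $2 = $3 = ""
        sub(/^ +/, "")
        print
    }'
}

# Sample input copied from the output quoted above. On a cluster node one
# would pipe the real command in instead: hares -state service1 | resource_state_on node1
sample='#Resource                    Attribute             System     Value
service1                         State                 node1      ONLINE
service1                         State                 node2      OFFLINE'

printf '%s\n' "$sample" | resource_state_on node1   # prints: ONLINE
```

Reading from stdin keeps the parsing separate from the cluster command, which makes the helper easy to try against saved output.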

I was wondering whether I need to configure something else to make RESNOTOFF work in this particular scenario.

Thanks,

Sunil_Yadav · 06-25-2015

Hi,

The cleanup script (a.k.a. the clean entry point) is invoked in different scenarios, and the resulting state transition depends on whether the clean entry point succeeds or fails. Since you specifically mentioned that your cleanup script can fail or time out, I will cover only the failure scenarios of the clean entry point.

Scenario # 1

Resource ONLINE --> Resource attempting OFFLINE --> Resource fails to go OFFLINE --> Clean entry point invoked --> Clean entry point fails --> Resource moves to ONLINE|UNABLE TO OFFLINE

Scenario # 2

Resource ONLINE --> Resource unexpectedly went OFFLINE --> Clean entry point invoked --> Clean entry point fails --> If Type::CleanRetryLimit == 0, clean entry point is retried indefinitely --> Until clean entry point succeeds, resource remains ONLINE

Scenario # 3

Resource ONLINE --> Resource unexpectedly went OFFLINE --> Clean entry point invoked --> Clean entry point fails --> If Type::CleanRetryLimit != 0, clean entry point is retried CleanRetryLimit times --> If it still fails, resource moves to ONLINE|ADMIN_WAIT

Scenario # 4

Resource OFFLINE --> Resource attempting ONLINE --> Resource fails to go ONLINE --> Clean entry point invoked --> Clean entry point fails --> Resource moves to OFFLINE|ADMIN_WAIT
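
The four scenarios can be condensed into a small lookup. The following is only an illustrative model of the transitions listed above, not VCS code: `clean_failure_outcome` and its event names are hypothetical, and the trigger column reflects which trigger covers each end state.

```shell
#!/bin/sh
# Model of the four clean-failure scenarios listed above.
# clean_failure_outcome and the event names are hypothetical, not part of VCS.
# Arguments:
#   $1  what happened before clean ran:
#       offline_failed | unexpected_offline | online_failed
#   $2  CleanRetryLimit (only relevant for unexpected_offline; defaults to 0)
# Prints "<resulting state>;<trigger that covers it, or none>".
clean_failure_outcome() {
    case "$1" in
        offline_failed)          # scenario 1
            echo "ONLINE|UNABLE TO OFFLINE;RESNOTOFF" ;;
        unexpected_offline)
            if [ "${2:-0}" -eq 0 ]; then
                # scenario 2: clean is retried until it succeeds, no trigger
                echo "ONLINE;none"
            else
                # scenario 3: retries exhausted
                echo "ONLINE|ADMIN_WAIT;RESADMINWAIT"
            fi ;;
        online_failed)           # scenario 4
            echo "OFFLINE|ADMIN_WAIT;RESADMINWAIT" ;;
        *)
            echo "unknown event: $1" >&2
            return 1 ;;
    esac
}

clean_failure_outcome unexpected_offline 0   # prints: ONLINE;none
clean_failure_outcome unexpected_offline 3   # prints: ONLINE|ADMIN_WAIT;RESADMINWAIT
```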

RESNOTOFF is invoked on the system if a resource in a service group does not go offline even after the offline command has been issued to the resource. This event trigger covers only scenario # 1, which you have also verified in your test environment.

From your description, you are hitting either scenario # 2 or # 3. RESNOTOFF is not executed in these scenarios; this is expected behavior. You needn’t worry about scenario # 2: the clean entry point is retried indefinitely, and eventually, at some point, it will succeed.
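
If retrying forever is not acceptable, a non-zero CleanRetryLimit moves the resource from scenario # 2 into scenario # 3. A minimal sketch of how that change might be made follows. It assumes the usual `haconf` and `hatype -modify` commands apply to this attribute in your VCS version; `MyServiceType` is a placeholder type name, and `set_clean_retry_limit`, `run_or_print` and `RUN` are hypothetical names. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# Print the commands that would set CleanRetryLimit on a resource type;
# with RUN=1 in the environment they are executed instead.
# run_or_print, set_clean_retry_limit and RUN are hypothetical names.
run_or_print() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# $1 = resource type name (placeholder), $2 = new CleanRetryLimit value
set_clean_retry_limit() {
    run_or_print haconf -makerw &&                         # open the configuration for writing
    run_or_print hatype -modify "$1" CleanRetryLimit "$2" &&
    run_or_print haconf -dump -makero                      # save and make it read-only again
}

# Dry run: prints the three commands without touching the cluster.
set_clean_retry_limit MyServiceType 3
```

Check the Administrator's Guide for your release before running anything like this with RUN=1 on a production cluster.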

For scenarios # 3 and # 4, you can use the RESADMINWAIT trigger, which is invoked when a resource enters the ADMIN_WAIT state.

To cover all possible failure scenarios of the clean entry point, you should use both the RESNOTOFF and RESADMINWAIT triggers.
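
Since both triggers would end up running the same fallback script, the common part can live in one handler. The sketch below is an assumption-heavy illustration, not the VCS trigger interface: `handle_clean_failure`, `FALLBACK_CMD` and `demo_fallback` are hypothetical names, and the handler simply expects to be given the trigger name, system and resource by whatever trigger script calls it.

```shell
#!/bin/sh
# Shared handler that a resnotoff or resadminwait trigger script could call.
# Assumptions (not from this thread): the caller passes trigger name, system
# and resource, and FALLBACK_CMD names the command or script to run.
# handle_clean_failure, FALLBACK_CMD and demo_fallback are hypothetical names.
handle_clean_failure() {
    case "$1" in
        resnotoff|resadminwait) ;;
        *) echo "unexpected trigger: $1" >&2; return 2 ;;
    esac
    if [ -z "${FALLBACK_CMD:-}" ] || ! command -v "$FALLBACK_CMD" >/dev/null 2>&1; then
        echo "no usable fallback command configured" >&2
        return 1
    fi
    # Pass the context on; the fallback's exit status becomes the handler's.
    "$FALLBACK_CMD" "$1" "$2" "$3"
}

# Demonstration with a stand-in fallback instead of a real script.
demo_fallback() { echo "fallback invoked by $1 for $3 on $2"; }
FALLBACK_CMD=demo_fallback
handle_clean_failure resadminwait node1 service1
# prints: fallback invoked by resadminwait for service1 on node1
```

The exact arguments each trigger receives are documented in the Administrator's Guide; the real trigger scripts would map those onto the three values this handler expects.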

Thanks & Regards,
Sunil Y

mikebounds · 06-24-2015

Can you provide the full logs? Above you have:

VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(service1) because the resource became OFFLINE unexpectedly, on its own.

So here the resource is OFFLINE

and then you have

# hares -state service1
#Resource                    Attribute             System     Value
service1                         State                 node1      ONLINE
service1                         State                 node2      OFFLINE

so here the resource is ONLINE

Mike

Sunil_Yadav · 08-04-2015

Gentle reminder! This discussion has been open for the last 1.5 months and keeps popping up in the "Can you solve these?" section. If your query is resolved, please mark the appropriate comment as the solution.

Thanks & Regards,
Sunil Y
