Trigger after failed cleanup script

Question

Hi there,

I have a system where the cleanup script can fail or time out, and I want to run another script when that happens. What is the best way to do this?

In the Veritas Cluster Server Administrator's Guide for Linux I found the RESNOTOFF trigger.

From the documentation, my understanding is that this trigger fires in the following cases:

A resource fails to go offline (offline initiated by VCS) and the cleanup fails.
A resource goes offline unexpectedly and the cleanup fails.

I have tested this, and RESNOTOFF works in the first scenario but not in the second.

To test the second scenario I kill the service, and I see the following message in engine_A.log:

VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(service1) because the resource became OFFLINE unexpectedly, on its own.

When the cleanup fails I would expect the resource to become UNABLE TO OFFLINE. However, the state of the resource is still ONLINE:

# hares -state service1
#Resource    Attribute    System    Value
service1     State        node1     ONLINE
service1     State        node2     OFFLINE

So the resource stays ONLINE and VCS keeps re-running the cleanup command, which keeps failing.

Do I need to configure something else to make RESNOTOFF work in this particular scenario?

Thanks,

sunil_yadav · Accepted Answer

Hi,

The cleanup script (a.k.a. the clean entry point) is invoked in several scenarios, and the state transitions differ depending on whether it succeeds or fails. Since you specifically mentioned failure ("I have a system where the cleanup script can fail/timeout"), I will cover only the failure scenarios of the clean entry point.

Scenario # 1

Resource ONLINE --> Resource attempting OFFLINE --> Resource fails to go OFFLINE --> Clean entry point invoked --> Clean entry point fails --> Resource moves to ONLINE|UNABLE TO OFFLINE

Scenario # 2

Resource ONLINE --> Resource unexpectedly went OFFLINE --> Clean entry point invoked --> Clean entry point fails --> If Type::CleanRetryLimit == 0, clean entry point is retried indefinitely --> Until clean entry point succeeds, resource remains ONLINE

Scenario # 3

Resource ONLINE --> Resource unexpectedly went OFFLINE --> Clean entry point invoked --> Clean entry point fails --> If Type::CleanRetryLimit != 0, clean entry point is retried CleanRetryLimit times --> If it still fails, resource moves to ONLINE|ADMIN_WAIT

Scenario # 4

Resource OFFLINE --> Resource attempting ONLINE --> Resource fails to go ONLINE --> Clean entry point invoked --> Clean entry point fails --> Resource moves to OFFLINE|ADMIN_WAIT
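
As a rough illustration, the four scenarios above can be written down as a small Python model. This is only a sketch of the transitions as described in this thread, not VCS code: the names `Event` and `state_after_failed_clean` are made up for the example, and the function assumes the clean entry point keeps failing on every attempt.

```python
from enum import Enum


class Event(Enum):
    """What led VCS to call the clean entry point (hypothetical names)."""
    OFFLINE_FAILED = "resource failed to go offline"       # scenario 1
    UNEXPECTED_OFFLINE = "resource went offline on its own"  # scenarios 2 and 3
    ONLINE_FAILED = "resource failed to go online"         # scenario 4


def state_after_failed_clean(event: Event, clean_retry_limit: int = 0) -> str:
    """Return the resource state reported once clean has failed.

    Assumes clean fails on every attempt, as in the scenarios above.
    """
    if event is Event.OFFLINE_FAILED:
        return "ONLINE|UNABLE TO OFFLINE"
    if event is Event.ONLINE_FAILED:
        return "OFFLINE|ADMIN_WAIT"
    if event is Event.UNEXPECTED_OFFLINE:
        if clean_retry_limit == 0:
            # Clean is retried without limit; the state never changes.
            return "ONLINE"
        # Retries are exhausted after clean_retry_limit attempts.
        return "ONLINE|ADMIN_WAIT"
    raise ValueError(f"unknown event: {event!r}")


if __name__ == "__main__":
    for ev, limit in [
        (Event.OFFLINE_FAILED, 0),
        (Event.UNEXPECTED_OFFLINE, 0),
        (Event.UNEXPECTED_OFFLINE, 3),
        (Event.ONLINE_FAILED, 0),
    ]:
        print(ev.name, limit, "->", state_after_failed_clean(ev, limit))
```

The second case (unexpected offline with CleanRetryLimit at 0) is the one that leaves the resource reported as plain ONLINE, which matches the `hares -state` output in the question.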


RESNOTOFF is invoked on the system if a resource in a service group does not go offline even after the offline command has been issued to the resource. This trigger covers only scenario # 1, which you have also verified in your test environment.

From your description, you are hitting either scenario # 2 or # 3. RESNOTOFF is not executed in these scenarios; this is expected behavior. You needn’t worry about scenario # 2: the clean entry point is retried indefinitely, so eventually, at some point, it will succeed.

For scenarios # 3 and # 4, you can use the RESADMINWAIT trigger, which is invoked when a resource enters the ADMIN_WAIT state.


To cover all possible failure scenarios of the clean entry point, you should use both the RESNOTOFF and RESADMINWAIT triggers.
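
To make the coverage explicit, here is a minimal Python sketch that maps a resource state string (in the `A|B` form used in this thread) to the trigger that would fire for it, according to the explanation above. The function name `trigger_for_state` is hypothetical and the mapping only restates this answer; it does not call or emulate any VCS interface.

```python
from typing import Optional


def trigger_for_state(state: str) -> Optional[str]:
    """Return the trigger this thread associates with a resource state.

    `state` uses the pipe-separated form seen above, for example
    "ONLINE|ADMIN_WAIT". Returns None when neither trigger applies.
    """
    flags = [flag.strip() for flag in state.split("|")]
    if "UNABLE TO OFFLINE" in flags:
        return "RESNOTOFF"      # scenario 1
    if "ADMIN_WAIT" in flags:
        return "RESADMINWAIT"   # scenarios 3 and 4
    return None                 # scenario 2: clean is still being retried


if __name__ == "__main__":
    for s in ("ONLINE|UNABLE TO OFFLINE", "ONLINE", "ONLINE|ADMIN_WAIT", "OFFLINE|ADMIN_WAIT"):
        print(f"{s!r:32} -> {trigger_for_state(s)}")
```

Scenario # 2 is the only case with no trigger, because the resource never leaves the ONLINE state while clean is being retried.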


Thanks & Regards,
Sunil Y


mikebounds · Answer

Can you provide the full logs? Above you have:

VCS ERROR V-16-2-13067 (node1) Agent is calling clean for resource(service1) because the resource became OFFLINE unexpectedly, on its own.

So here the resource is OFFLINE

and then you have

# hares -state service1
#Resource    Attribute    System    Value
service1     State        node1     ONLINE
service1     State        node2     OFFLINE

so here the resource is ONLINE

Mike

sunil_yadav · Answer

Gentle reminder! This discussion has been open for the last 1.5 months and keeps popping up in the "Can you solve these?" section. If your query is resolved, please mark the appropriate comment as the solution.

Thanks & Regards,
Sunil Y
