Solved: Long running action on custom agent timing out

Bill_Hurn · ‎10-30-2014

Hi,

I've created an action on a custom agent (based on ApplicationAgent) which can take a couple of minutes to complete. However, the action will time out after MonitorInterval / 2. If I set the MonitorInterval to a sufficiently high value, the action will complete, but the cluster manager will then take a very long time to recognise that the application has started (some multiple of the MonitorInterval), causing the dependent applications in the group to take too long to start up.

I had hoped that I could override the action timeout using VCSAG_SET_RES_EP_TIMEOUT from ag_i18n_inc.sh, but this does not appear to affect the MonitorInterval / 2 maximum so it does not help.

In other instances we have created a completely separate custom Agent with the custom action so that its MonitorInterval can be set very high without changing the value for the real applications; this still leaves the application groups in 'Partial_Online' state for far longer than is acceptable.

I have also contemplated changing the MonitorInterval on the custom agent to a high value only during the period when the long running action is to be carried out, switching it back when the action completes, but there is a risk that the value might not get switched back, again causing slow startup.

Is there any way of allowing my custom action to use a timeout of several minutes without affecting the cluster manager's rapid ability to confirm the correct startup of the applications?

Any suggestions gratefully received

thanks,

Bill Hurn

kjbss · ‎10-30-2014

AFAIK, changing the MonitorInterval does not affect how long it takes a resource to online, although your first paragraph suggests that it does. If you stand by that, I'd like to understand how and get the details from you...

As far as changing the ActionTimeout value goes, the VCS Agent Dev guide says that you should be able to use that script function (VCSAG_SET_RES_EP_TIMEOUT) to extend the ActionTimeout value dynamically from within the entry point's execution context.

So it would seem that it should work; therefore, can you show us how you are calling this function from the called action program?

Also, just before the call to VCSAG_SET_RES_EP_TIMEOUT, insert the line:

echo "$(date +'%H:%M:%S')  -- Before the call to VCSAG_SET_RES_EP_TIMEOUT..." >> /tmp/showMe

...and just after the call, insert the line (saving $? first, so that the date command substitution cannot overwrite it):

rc=$?; echo "$(date +'%H:%M:%S')  -- After the call to VCSAG_SET_RES_EP_TIMEOUT, which returned '$rc'..." >> /tmp/showMe

This will first prove that you are calling the routine from where you are assuming you are calling it, and second it will be sort of nice to see how long it takes to run (should be less than one second, so the times should be identical most of the time).
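The two debug lines above can also be folded into one small wrapper, so the return code is captured once and the log format stays consistent. This is only a sketch: trace_call and TRACE_LOG are made-up names, not part of ag_i18n_inc.sh, and it assumes VCSAG_SET_RES_EP_TIMEOUT has already been sourced into the action script.

```shell
# Hypothetical helper: run a command, log a timestamp before and after it,
# and hand back the command's own return code.
# TRACE_LOG is an assumed variable name; /tmp/showMe matches the lines above.
TRACE_LOG="${TRACE_LOG:-/tmp/showMe}"

trace_call() {
    echo "$(date +'%H:%M:%S')  -- Before the call to $1..." >> "$TRACE_LOG"
    _trace_rc=0
    "$@" || _trace_rc=$?    # capture the status before anything else runs
    echo "$(date +'%H:%M:%S')  -- After the call to $1, which returned '$_trace_rc'..." >> "$TRACE_LOG"
    return "$_trace_rc"
}

# Usage inside the action script (function provided by ag_i18n_inc.sh):
#   trace_call VCSAG_SET_RES_EP_TIMEOUT 300 || echo "timeout extension failed" >&2
```

Because the wrapper returns the wrapped command's status, the action script can still test the result of the call in the usual way.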

Of course, you are (aren't you?) testing the return code that is set when you make the call to VCSAG_SET_RES_EP_TIMEOUT -- what is it? (I assume it should be '0', but I am not sure... for all I know it is programmed to do something clever like return the number of seconds beyond the previous EP timeout it has been changed to -- but I doubt it...)

mikebounds · ‎10-31-2014

Have you set the action timeout via the resource attribute "ActionTimeout"? See this extract from the VCS admin guide:

ActionTimeout (user-defined)

Timeout value for the Action function.
■ Type and dimension: integer-scalar
■ Default: 30 seconds

Also, you could use nohup in the action script so that it can take as long as it likes, but this means that VCS won't capture the output of the action script, which you may or may not need.

If neither of these works, then it would be useful if you could explain why you need to use an action script.

Mike

Bill_Hurn · ‎10-31-2014

The MonitorInterval doesn't change how long a resource takes to start, but it does change how long it takes the cluster manager to decide that it is online. I noticed this when I removed the overridden MonitorInterval (we had it set to 5 seconds, rather than the default 30), and it started taking far longer to determine that the application had started. As we have other applications which depend on the first, this slowed the whole start-up considerably; putting the 5 seconds value back in speeded it up again.

I had only just discovered the VCSAG_SET_RES_EP_TIMEOUT function as it is not very well documented, but with a bit of trial and error I came up with:

# source this script to include the VCSAG_SET_RES_EP_TIMEOUT function
. ../ag_i18n_inc.sh

# set logging properties as they are used to determine where the timeout value file is written
VCSAG_SET_ENVS "$1" action gather

# set action timeout to 300 seconds
VCSAG_SET_RES_EP_TIMEOUT 300

I didn’t check the return code, but I did confirm that the file was being created, and contained the value (300) I was expecting. However, the action was still being killed after a couple of seconds.

This is I believe down to the capping of the ActionTimeout; from the ActionTimeout entry in the Agent Developer’s guide:

The default is 30 seconds. The value of the ActionTimeout attribute is internally capped at MonitorInterval / 2. If the ActionTimeout attribute is set to a value greater than MonitorInterval/2, then MonitorInterval/2 is used instead of ActionTimeout. If ActionTimeout value is less than MonitorInterval/2, then the ActionTimeout value is honored.
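To make the quoted capping rule concrete, here is a small illustrative sketch of the arithmetic. It is not taken from the guide or from the agent framework: the function name effective_action_timeout is made up, and the use of integer division for MonitorInterval / 2 is an assumption.

```shell
# effective_action_timeout ACTION_TIMEOUT MONITOR_INTERVAL
# Prints the timeout that would apply under the quoted rule:
# the smaller of ActionTimeout and MonitorInterval / 2.
effective_action_timeout() {
    action_timeout=$1
    cap=$(( $2 / 2 ))    # assumption: the division truncates to whole seconds
    if [ "$action_timeout" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$action_timeout"
    fi
}

effective_action_timeout 30 5      # MonitorInterval overridden to 5: prints 2
effective_action_timeout 30 60     # default MonitorInterval region: prints 30
effective_action_timeout 300 600   # MonitorInterval raised to 600: prints 300
```

With MonitorInterval at 5 seconds the cap works out to about 2 seconds, which matches the action being killed after a couple of seconds. A 300 second action would need MonitorInterval to be at least 600 under this rule.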

The next paragraph suggests that the VCSAG_SET_RES_EP_TIMEOUT should not be constrained by this limit, but this did not seem to be the case.

I cannot run the action in the background as it forms part of the end-of-week shutdown and is used to generate a heap dump and gather log files present on a filesystem mounted by Veritas for the application. Therefore, I need to have the application online while the action runs and then bring it offline once the action has finished its work. The use of an Action was a nice way of running the task in the same place as a running application, while ensuring that it only happened on a specific end-of-week shutdown and not in any other offline event (as it is too slow).

Another mechanism has been suggested to me, to use temporary flags within Veritas to indicate that the normal offline script should include the additional gathering task, which will hopefully only be subject to the longer OfflineTimeout and does not require the custom action, so I’m investigating that one at the moment.

This is my first foray into the world of Veritas customisation, so please excuse all of the inaccuracies in my comments!

Thanks

Bill

mikebounds · ‎11-01-2014

As I understand it, you have a process at the end of the week where you run some custom scripts and then offline the application. You could just do this via a script from the CLI, but I guess you want the option to be able to do this from the Java or VOM GUI and to integrate more tightly with VCS.

I think using the offline script is a good idea, but I am not sure how you would avoid using an action script: if you manually offline the resource mid-week, how would the script know whether it needs to run the end-of-week processes, unless it determines this by looking at the date? I would implement this something like:

Create an Application or custom agent resource called something like "end-of-week-shutdown" and make this dependent on your application resource that needs to shut down. Have the online of this resource just touch a file and the monitor just check that the file exists.

Then have an action on this resource which:

Either sets a VCS temporary attribute or creates a file in a set location
Runs hares -offline to offline the resource itself

The offline then checks for the VCS temporary attribute or the file in the set location. If it is set, the offline runs the end-of-week shutdown processes and then removes the file that the resource monitors; if it is not set, the offline just removes the file that the resource monitors.

Using a separate resource means you will have a visual representation of when the end-of-week shutdown processes are running (when the "end-of-week-shutdown" resource is offlining) which is separate from when the application is shutting down (the normal application resource is offlining).
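A rough sketch of the file-based variant of this idea is shown here. Every path and function name in it is a hypothetical placeholder, the slow work is reduced to an echo, and real entry points must follow the agent framework's conventions for arguments and exit codes, which this sketch ignores.

```shell
# All paths and function names are hypothetical placeholders.
FLAG_FILE="${FLAG_FILE:-/var/tmp/end-of-week-shutdown.flag}"          # created by the action
MONITOR_FILE="${MONITOR_FILE:-/var/tmp/end-of-week-shutdown.online}"  # checked by the monitor

run_end_of_week_tasks() {
    # Placeholder for the slow work (for example a heap dump and log gathering).
    echo "running end-of-week tasks"
}

resource_online()  { touch "$MONITOR_FILE"; }     # online: just touch a file
resource_monitor() { [ -f "$MONITOR_FILE" ]; }    # status 0 while the file exists
resource_action()  { touch "$FLAG_FILE"; }        # a real action would then call hares -offline

resource_offline() {
    # Run the extra tasks only when the action has requested them.
    if [ -f "$FLAG_FILE" ]; then
        run_end_of_week_tasks
        rm -f "$FLAG_FILE"
    fi
    rm -f "$MONITOR_FILE"    # always remove the file the monitor checks
}
```

The slow tasks run inside the offline, so in this arrangement they would be governed by the offline timeout rather than the action timeout, and a mid-week offline skips them because the flag is absent.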

Mike

kjbss · ‎11-01-2014

Well I have news on the usage of the VCSAG_SET_RES_EP_TIMEOUT function found in /opt/VRTSvcs/bin/ag_i18n_inc.sh --

There is a bug in some versions of this file -- at least with the one installed with the following version for Solaris:

-$ pkginfo -l VRTSvcs | egrep 'VERSION|PSTAMP'
   VERSION:  6.0.100.000
    PSTAMP:  6.0.100.000-GA-2012-07-20-16.30.01

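To fix it, make the following changes (in the diff below, the '<' lines are the fixed file and the '>' lines are the original as shipped):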

-$ diff /opt/VRTSvcs/bin/ag_i18n_inc.sh /opt/VRTSvcs/bin/ag_i18n_inc.sh.orig
389c389
<     timeout_file="${VCSLOG}/log/tmp/${VCS_LOG_RESOURCE_NAME}.tmo";
---
>     timeout_file="${VCSLOG}/log/tmp/.${VCS_LOG_RESOURCE_NAME}.tmo";
391c391,392
<     printf ${timeout} > "${timeout_file}"
---
>     echo ${timeout} > "${timeout_file}"

I figured it out because it worked on my SFHA 6.1 cluster, and I noticed the difference in the two VCSAG_SET_RES_EP_TIMEOUT shell functions.
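For anyone comparing the two versions, the sketch below only illustrates what differs on disk between the two write styles in the diff: the patched file writes a name without the leading dot using printf, while the shipped original writes a dot-prefixed name using echo. It does not establish which of the two changes matters to the agent framework. The helper names and the "res" file name are made up, and the sketch uses printf '%s' rather than passing the value as the format string.

```shell
# Hypothetical helpers that mimic the two write styles seen in the diff.
write_with_printf() { printf '%s' "$2" > "$1"; }   # writes the digits only
write_with_echo()   { echo "$2" > "$1"; }          # appends a trailing newline

demo_dir="$(mktemp -d)"
write_with_printf "$demo_dir/res.tmo" 120     # patched style: no leading dot
write_with_echo   "$demo_dir/.res.tmo" 120    # original style: hidden file

# The byte counts differ by one: the newline added by echo.
wc -c < "$demo_dir/res.tmo"
wc -c < "$demo_dir/.res.tmo"

# A plain ls hides the dot-prefixed file; ls -A lists both.
ls "$demo_dir"
ls -A "$demo_dir"
rm -rf "$demo_dir"
```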

Once you make the change, you can see it working, because the VCS engine places the following into the engine_A.log file:

2014/11/01 17:55:35 VCS NOTICE V-16-2-13033 (sol10u10-00) Monitor entry point of resource(kjbsg_KJBfoo) requested that the timeout be extended by (120) seconds

...and...

2014/11/01 17:44:37 VCS NOTICE V-16-2-13034 (sol10u10-00) Offline entry point of resource(kjbsg_KJBfoo) requested that the timeout be extended by (120) seconds

etc...

I haven't tested it, but it seems nearly certain that this takes effect for all entry points, including the action entry point that you are attempting...

Let us know if this provides you with the behaviour you are after (and give me a "solution" if it works! ;) )

Bill_Hurn · ‎11-03-2014

Hi Mike,

I've gone for a solution similar to your suggestion: the end-of-week shutdown script sets a temporary flag in Veritas and the offline (StopProgram) script checks for the flag and, if set, runs the additional tasks before resetting the flag. The online (StartProgram) script also resets the flag, just in case something odd happened during the offline process. This achieves the requirement, although I'm always a little hesitant about separating the decision to run the additional tasks from the script that runs them; use of the Action would have allowed tighter control, but this mechanism does mean there are far fewer changes to the Veritas setup (no custom type, no action and minimal change to the main.cf, only as an additional parameter is now needed by the shutdown).

Thank you for your help

Bill Hurn

Bill_Hurn · ‎11-03-2014

Hi,

Nice spot on the bug! I'll try the change and see if I can see the difference. I have in fact switched to using a Veritas flag with a modified offline script instead of using the custom action, so this won't be directly applicable, but other groups do use custom actions, so I'll make sure they are informed of the issue.

Thanks again

Bill Hurn
