Long running action on custom agent timing out
Hi,
I've created an action on a custom agent (based on ApplicationAgent) which can take a couple of minutes to complete. However, the action will time out after MonitorInterval / 2. If I set the MonitorInterval to a sufficiently high value, the action will complete, but the cluster manager will then take a very long time to recognise that the application has started (some multiple of the MonitorInterval), causing the dependent applications in the group to take too long to start up.
I had hoped that I could override the action timeout using VCSAG_SET_RES_EP_TIMEOUT from ag_i18n_inc.sh, but this does not appear to affect the MonitorInterval / 2 maximum so it does not help.
In other instances we have created a completely separate custom Agent with the custom action so that its MonitorInterval can be set very high without changing the value for the real applications; this still leaves the application groups in 'Partial_Online' state for far longer than is acceptable.
I have also contemplated changing the MonitorInterval on the custom agent to a high value only during the period when the long running action is to be carried out, switching it back when the action completes, but there is a risk that the value might not get switched back, again causing slow startup.
Is there any way of allowing my custom action to use a timeout of several minutes without affecting the cluster manager's rapid ability to confirm the correct startup of the applications?
Any suggestions gratefully received
thanks,
Bill Hurn
As I understand it, you have a process at the end of the week where you run some custom scripts and then offline the application. You could just do this via a script from the CLI, but I guess you want the option to be able to do this from the Java or VOM GUI and to integrate more tightly with VCS.
I think using the offline script is a good idea, but I am not sure how you would avoid using an action script: if you manually offline the resource mid week, how would the script know whether it needs to run the end-of-week processes, unless it determines this by looking at the date? I would implement this something like:
Create an Application or custom agent resource called something like "end-of-week-shutdown" and make this dependent on your application resource that needs to shut down. Have the online of this resource just touch a file and the monitor just check that the file exists.
Then have an action on this resource which:
- Either sets a VCS temporary attribute or creates a file in a set location
- Runs hares -offline to offline the resource itself
The offline then checks for the VCS temporary attribute or the file in the set location. If it is set, offline runs the end-of-week shutdown processes and then removes the file that the resource monitors. If it is not set, offline just removes the file that the resource monitors.
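As a rough sketch of the flag-file variant of the steps above (not the temporary-attribute variant), the four entry points could look something like this. Everything named here is a placeholder I made up for illustration: the STATE_DIR location, the file names, and the run_end_of_week and request_offline hooks. In real entry point scripts the functions would be separate scripts that exit with these codes rather than return them; 110 and 100 are the usual script-agent monitor codes for online and offline.

```shell
#!/bin/sh
# Sketch only: STATE_DIR, the file names and the two hooks are hypothetical.
STATE_DIR="${STATE_DIR:-/var/tmp/eow-shutdown}"

# Hypothetical hook: replace with the real end-of-week processing.
run_end_of_week() {
    echo "running end-of-week processes"
}

# Hypothetical hook: in a cluster this would call something like
#   hares -offline <resource> -sys <system>
request_offline() {
    :
}

# online: just touch the file that monitor checks.
eow_online() {
    mkdir -p "${STATE_DIR}" && touch "${STATE_DIR}/online"
}

# monitor: 110 = online, 100 = offline.
eow_monitor() {
    if [ -f "${STATE_DIR}/online" ]; then
        return 110
    fi
    return 100
}

# action: record that end-of-week work is wanted, then offline ourselves.
eow_action() {
    mkdir -p "${STATE_DIR}" && touch "${STATE_DIR}/run-end-of-week"
    request_offline
}

# offline: run the end-of-week work only if the action asked for it,
# then remove the file that monitor checks.
eow_offline() {
    if [ -f "${STATE_DIR}/run-end-of-week" ]; then
        run_end_of_week
        rm -f "${STATE_DIR}/run-end-of-week"
    fi
    rm -f "${STATE_DIR}/online"
}
```

Because the long-running work happens inside offline rather than inside the action, it is bounded by the offline timeout instead of the action timeout, and a manual mid-week offline skips it because the flag file is absent.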
Using a separate resource means you will have a visual representation of when the end-of-week shutdown processes are running (when the "end-of-week-shutdown" resource is offlining) which is separate from when the application is shutting down (the normal application resource is offlining).
Mike
Well I have news on the usage of the VCSAG_SET_RES_EP_TIMEOUT function found in /opt/VRTSvcs/bin/ag_i18n_inc.sh --
There is a bug in some versions of this file -- at least with the one installed with the following version for Solaris:
-$ pkginfo -l VRTSvcs | egrep 'VERSION|PSTAMP'
VERSION: 6.0.100.000
PSTAMP: 6.0.100.000-GA-2012-07-20-16.30.01

To fix it, make the following changes:

-$ diff /opt/VRTSvcs/bin/ag_i18n_inc.sh /opt/VRTSvcs/bin/ag_i18n_inc.sh.orig
389c389
< timeout_file="${VCSLOG}/log/tmp/${VCS_LOG_RESOURCE_NAME}.tmo";
---
> timeout_file="${VCSLOG}/log/tmp/.${VCS_LOG_RESOURCE_NAME}.tmo";
391c391,392
< printf ${timeout} > "${timeout_file}"
---
> echo ${timeout} > "${timeout_file}"
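In that diff the "<" lines are the corrected file and the ">" lines are the original. To see what the two corrected lines do in isolation, here is a small stand-in that mirrors only those lines. It is not the real VCSAG_SET_RES_EP_TIMEOUT function; the name write_ep_timeout is made up, and I have not confirmed which of the two differences (the leading dot in the file name, or echo's trailing newline) the engine is actually sensitive to.

```shell
#!/bin/sh
# Stand-in mirroring the two corrected lines of the diff; NOT the real
# VCSAG_SET_RES_EP_TIMEOUT. The function name is hypothetical.
# Expects VCSLOG and VCS_LOG_RESOURCE_NAME to be set by the caller, and
# ${VCSLOG}/log/tmp to exist.
write_ep_timeout() {
    timeout="$1"
    # Corrected line 389: no leading dot in the file name.
    timeout_file="${VCSLOG}/log/tmp/${VCS_LOG_RESOURCE_NAME}.tmo"
    # Corrected line 391: printf writes the value without a trailing
    # newline (for a plain number this matches 'printf ${timeout}').
    printf '%s' "${timeout}" > "${timeout_file}"
}
```

Pointing VCSLOG at a scratch directory and calling write_ep_timeout 120 for a resource named kjbsg_KJBfoo leaves a three-byte file log/tmp/kjbsg_KJBfoo.tmo containing 120, which makes it easy to compare against what an unpatched function writes.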
I figured it out because it worked on my SFHA 6.1 cluster, and I noticed the difference in the two VCSAG_SET_RES_EP_TIMEOUT shell functions. Once you make the change, you can see it working, because the VCS engine places the following into the engine_A.log file:
2014/11/01 17:55:35 VCS NOTICE V-16-2-13033 (sol10u10-00) Monitor entry point of resource(kjbsg_KJBfoo) requested that the timeout be extended by (120) seconds
...and...
2014/11/01 17:44:37 VCS NOTICE V-16-2-13034 (sol10u10-00) Offline entry point of resource(kjbsg_KJBfoo) requested that the timeout be extended by (120) seconds
etc...
I haven't tested it, but it seems nearly certain that this takes effect for all entry points, including the action entry point that you are attempting...
Let us know if this provides you with the behaviour you are after (and give me a "solution" if it works! ;-) )