02-20-2012 07:07 AM
Here is the set up:
-$ hatype -display foodo -attribute MonitorInterval ActionTimeout SupportedActions
#Type    Attribute          Value
foodo    ActionTimeout      30
foodo    MonitorInterval    10
foodo    SupportedActions   touch rmSleepAndRecreate

The 'rmSleepAndRecreate' action performs a sleep for 15 seconds so that it is greater than the MonitorInterval (10 seconds) period.
This is on purpose to see what would happen in the event a Supported-Action was invoked just before the MonitorInterval timer fired. It seems to me that this is something that can quite easily happen in the real world, and therefore should be well understood.
To my great disappointment, the VCS engine (or Agent) kills the action entry point thread when the monitor entry point is invoked:
-$ hares -action kjb_foodo rmSleepAndRecreate -sys centos57x64-01
VCS WARNING V-16-13343 Action entrypoint action timed out for resource(kjb_foodo)

...and in the VCS log:

2012/02/20 13:44:31 VCS INFO V-16-2-13003 (centos57x64-01) Resource(kjb_foodo): Output of the timed out operation (actions)
<<<this is just stdout from the 'rmSleepAndRecreate' action entry point>>>
****DEBUG****
allArgs=[kjb_foodo PathName 1 /tmp/kjb_foodo ACTION_ARGS 0]
resource_name=[kjb_foodo]
PathName=[/tmp/kjb_foodo]
<<<END STDOUT>>>

...and in the /var/VRTSvcs/log/foodo_A.log:

2012/02/20 13:44:30 VCS WARNING V-16-2-13139 Thread(4154313616) Canceling thread (4153260944)
2012/02/20 13:44:31 VCS WARNING V-16-2-13343 Thread(4150262672) Action entrypoint action timed out for resource(kjb_foodo)
This is NOT a good implementation; what if the action procedure was in the middle of performing something important or particularly sensitive?
I could find no documentation warning about this behaviour. If this rather dodgy implementation is "as designed", then the documentation describing the action entry point needs to have big danger signs all over it, warning that you must not have an action perform any procedure that would mind getting killed in mid-flight!
I believe a much better implementation would be one where a scheduled monitor entry point does not get invoked until an action procedure has completed (or has been timed out via the ActionTimeout timer).
But hey, maybe this is particular to my Linux setup? Here's the particulars on that:
centos57x64-01.localdomain(root) /root: -$ printVRTSreleaseLevels
Name           Release         Source RPM
VRTSperl       RHEL5.3         VRTSperl-5.10.0.7-RHEL5.3.src.rpm
VRTSatClient   0               VRTSatClient-5.0.32.0-0.src.rpm
VRTSvxfs       SP1RP2_RHEL5    VRTSvxfs-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSvcsag      SP1RP2_RHEL5    VRTSvcsag-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSaslapm     SP1_RHEL5       VRTSaslapm-5.1.100.000-SP1_RHEL5.src.rpm
VRTSgab        SP1RP2_RHEL5    VRTSgab-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSdbed       SP1RP2_RHEL5    VRTSdbed-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSlvmconv    SP1RP2_RHEL5    VRTSlvmconv-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSllt        SP1RP2_RHEL5    VRTSllt-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSvcs        SP1RP2_RHEL5    VRTSvcs-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSvcsea      SP1RP2_RHEL5    VRTSvcsea-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSobgui      0               VRTSobgui-3.4.15.0-0.src.rpm
VRTSspt        GA              VRTSspt-5.5.000.005-GA.src.rpm
VRTSatServer   0               VRTSatServer-5.0.32.0-0.src.rpm
VRTSob         0               VRTSob-3.4.312-0.src.rpm
VRTSfssdk      SP1RP2_RHEL5    VRTSfssdk-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSamf        SP1RP2_RHEL5    VRTSamf-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSvcsdr      SP1RP2_RHEL5    VRTSvcsdr-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTScscm       Linux_GENERIC   VRTScscm-5.1.00.20-Linux_GENERIC.src.rpm
VRTSvxvm       SP1RP2_RHEL5    VRTSvxvm-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSvxfen      SP1RP2_RHEL5    VRTSvxfen-5.1.132.000-SP1RP2_RHEL5.src.rpm
VRTSodm        RP1_RHEL5       VRTSodm-5.1.101.000-RP1_RHEL5.src.rpm
VRTSvlic       0               VRTSvlic-3.02.51.010-0.src.rpm
VRTSsfmh       0               VRTSsfmh-3.1.830.0-0.src.rpm
VRTScps        SP1RP2_RHEL5    VRTScps-5.1.132.000-SP1RP2_RHEL5.src.rpm
Do others agree that this is a potentially dangerous VCS bug, or is it simply "works as designed, deal with it!"?
02-21-2012 03:36 AM
This behavior is documented in the Agent Developer Guide.
ActionTimeout
After the hares -action command has instructed the agent to perform a specified action, the action entry point has the time specified by the ActionTimeout attribute (scalar-integer) to perform the action. The value of ActionTimeout may be set for individual resources, if overridden.
Whether overridden or not, no matter what value is specified for ActionTimeout, the value is internally limited to 0.5 * MonitorInterval. You can extend this value by using the VCSAgSetResEPTimeout (for C/C++ entry point)/VCSAG_SET_RES_EP_TIMEOUT (for script entry point). The default is 30 seconds. The ActionTimeout attribute value can be overridden.
02-21-2012 04:49 AM
Satish -- thanks for finding that bit of documentation that had eluded me. That really explains it.
My testing has revealed that this is indeed how the *effective* action timeout value is determined and enforced, AND, most importantly, if you kick off the action just before the monitor would have run, the monitor invocation is delayed (good) until the action has completed (or gets timed out by either the ActionTimeout or half the MonitorInterval, whichever is smaller).
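To put numbers on that, here is a small sketch of the rule as described above: the effective limit is the smaller of ActionTimeout and half the MonitorInterval. The function name is made up for illustration, and it deliberately ignores any extension made through VCSAgSetResEPTimeout / VCSAG_SET_RES_EP_TIMEOUT.

```python
def effective_action_timeout(action_timeout, monitor_interval):
    """Smaller of ActionTimeout and 0.5 * MonitorInterval, in seconds.

    Hypothetical helper: models only the rule quoted from the Agent
    Developer Guide, not any per-entry-point timeout extension.
    """
    return min(action_timeout, 0.5 * monitor_interval)


# Values from the original post: ActionTimeout 30, MonitorInterval 10.
limit = effective_action_timeout(30, 10)
print(limit)       # 5.0
print(15 > limit)  # True: a 15 second sleep exceeds the limit
```

With the settings from the first post, the action gets about 5 seconds rather than the 30 that ActionTimeout suggests, so the 15 second sleep in 'rmSleepAndRecreate' was always going to be cut short.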
I don't see a good reason to overcomplicate this as they have (why bother with an ActionTimeout attribute if you are sometimes not going to use it? Just document that the action timeout is half the monitor interval), but I don't mind as long as I know what it is doing.
Bottom line is that if you are going to kick off an action procedure that would not like getting killed in-flight, you had better determine the amount of time you have to perform the procedure and only start it if there seems to be enough time. Wait, this cannot always be determined with consistent precision, for *lots* of reasons (variations in system load, etc). Therefore, the real bottom line is that you should never do anything in an action entry point that would have negative consequences if it gets killed while it's running.
I still think this should be highlighted in the documentation, that the action is pretty vulnerable to getting killed, so be careful what you action.
In my case, I've got to fork off a separate process and disconnect it from the parent so it will not get killed by the VCS engine (or agent framework). I then also need to modify the monitor to take this into account and react accordingly, to suit the circumstances. And there are potential pitfalls here to contend with, but it is all doable...
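A minimal sketch of the "fork off and disconnect" idea, assuming Python is available on the node and that putting the worker in its own session is enough to keep it out of reach of whatever the agent framework cancels on timeout (that second point is an assumption, not something confirmed in this thread). The names launch_detached and worker_still_running are hypothetical helpers, not part of VCS.

```python
import os
import subprocess
import sys


def launch_detached(argv):
    """Start argv in a new session, detached from the caller's stdio.

    Returns the worker's PID so the caller can record it somewhere the
    monitor entry point can find it (for example a PID file).
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid() before exec
    )
    return proc.pid


def worker_still_running(pid):
    """Best-effort liveness check a monitor could use (signal 0 probe)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Stand-in for the real long-running procedure.
    pid = launch_detached([sys.executable, "-c", "import time; time.sleep(15)"])
    print("worker pid:", pid, "running:", worker_still_running(pid))
```

The action entry point can then return well inside the effective timeout, while the monitor checks the recorded PID and decides what state to report while the worker is still busy. A PID probe like this can be fooled by PID reuse, which is one of the pitfalls to allow for.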
02-21-2012 07:15 AM
This is useful to know and explains why I have sometimes seen actions time out unexpectedly. You say "most importantly, if you kick off the action just before the monitor would have run, the monitor invocation is delayed (good) until the action has completed"
Why is this good? For me, running an action should not normally affect the state of the resource, and I don't know of any Symantec-supplied actions that affect the state, so why should it cause an issue to run the monitor entry point at the same time? If an action could affect the state, then the user should freeze the service group prior to running the action (or the action code could freeze the service group).
It frustrates me that this information about timeouts is in the developers guide when timeouts are set by the USER without having to write any code, and so I don't understand why this information cannot be put in the VCS USERS guide. I don't do any development of VCS agents, so most of the stuff in this guide is not relevant to me and does not make sense to me, so I guess this guide may tell me that 2 entry points can't run at the same time for a given resource.
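As a rough sketch of the freeze-then-act suggestion, the sequence could be built like this. The service group name "kjb_group" is made up for illustration; the resource, action, and system names are the ones from the first post. The snippet only prints the commands and does not run them.

```python
import shlex


def action_with_freeze(group, resource, action, system):
    """Return the commands, in order: freeze, run the action, unfreeze."""
    return [
        ["hagrp", "-freeze", group],
        ["hares", "-action", resource, action, "-sys", system],
        ["hagrp", "-unfreeze", group],
    ]


# "kjb_group" is a hypothetical service group name.
for cmd in action_with_freeze(
    "kjb_group", "kjb_foodo", "rmSleepAndRecreate", "centos57x64-01"
):
    print(shlex.join(cmd))
```

Anything that actually runs these should make sure the unfreeze step still happens when the action fails or times out. Freezing does nothing about the timeout itself.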
Mike
02-21-2012 08:24 AM
Well, it's "horses for courses" (what is suitable for one person or situation might be unsuitable for another)....
All that really matters is that it is documented in the appropriate sections, as Mike implies.
And I would strongly advocate that, because there is this (ill-conceived, IMHO) interplay between ActionTimeout and MonitorInterval, it be clearly described wherever either of these two attributes is defined or described in the documentation. That means the Admin and User guides, as well as the developer guide, need to state this relationship.
I've always had the idea that VCS agent entry points are designed to be "single threaded" per resource, per system. That is, only one entry point runs at a time. However, I'm not sure where/if this is specifically documented.
If the above is true, whether or not the action would temporarily cause a monitor to fail would be irrelevant, as long as the action left the resource environment in a state that would pass a monitor invocation before the action terminated (and all of that done before it got timed out!).