Solved: VCS action entry point gets killed by scheduled monitor timer

kjbss · ‎02-20-2012

Here is the set up:

 -$ hatype -display foodo -attribute MonitorInterval ActionTimeout SupportedActions
#Type        Attribute              Value
foodo        ActionTimeout          30
foodo        MonitorInterval        10
foodo        SupportedActions       touch       rmSleepAndRecreate

The 'rmSleepAndRecreate' action sleeps for 15 seconds, so that it runs longer than the MonitorInterval period (10 seconds).

This is on purpose to see what would happen in the event a Supported-Action was invoked just before the MonitorInterval timer fired. It seems to me that this is something that can quite easily happen in the real world, and therefore should be well understood.

To my great disappointment, the VCS engine (or Agent) kills the action entry point thread when the monitor entry point is invoked:

 -$ hares -action kjb_foodo rmSleepAndRecreate -sys centos57x64-01
VCS WARNING V-16-13343 Action entrypoint action timed out for resource(kjb_foodo) 

...and in the VCS log:
2012/02/20 13:44:31 VCS INFO V-16-2-13003 (centos57x64-01) Resource(kjb_foodo): Output of the timed out operation (actions)

<<<this is just stdout from the 'rmSleepAndRecreate' action entry point>>>
  ****DEBUG****

allArgs=[kjb_foodo PathName 1 /tmp/kjb_foodo ACTION_ARGS 0]
resource_name=[kjb_foodo]
PathName=[/tmp/kjb_foodo]

<<<END STDOUT>>>


...and in the /var/VRTSvcs/log/foodo_A.log:
2012/02/20 13:44:30 VCS WARNING V-16-2-13139 Thread(4154313616) Canceling thread (4153260944)
2012/02/20 13:44:31 VCS WARNING V-16-2-13343 Thread(4150262672) Action entrypoint action timed out for resource(kjb_foodo)

This is NOT a good implementation; what if the action procedure was in the middle of performing something important or particularly sensitive?

I could find no documentation warning about this behaviour. If this rather dodgy implementation is "as designed", then the documentation describing the action entry point needs to have big danger signs all over it, warning that you must not have an action perform any procedure that would mind getting killed in mid-flight!

I believe a much better implementation would be one where a scheduled monitor entry point does not get invoked until an action procedure has completed (or has been timed out via the ActionTimeout timer).

But hey, maybe this is particular to my Linux setup? Here are the particulars on that:

 centos57x64-01.localdomain(root) /root:
-$ printVRTSreleaseLevels
Name : VRTSperl                      Release : RHEL5.3                        Source RPM : VRTSperl-5.10.0.7-RHEL5.3.src.rpm
Name : VRTSatClient                  Release : 0                              Source RPM : VRTSatClient-5.0.32.0-0.src.rpm
Name : VRTSvxfs                      Release : SP1RP2_RHEL5                   Source RPM : VRTSvxfs-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSvcsag                     Release : SP1RP2_RHEL5                   Source RPM : VRTSvcsag-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSaslapm                    Release : SP1_RHEL5                      Source RPM : VRTSaslapm-5.1.100.000-SP1_RHEL5.src.rpm
Name : VRTSgab                       Release : SP1RP2_RHEL5                   Source RPM : VRTSgab-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSdbed                      Release : SP1RP2_RHEL5                   Source RPM : VRTSdbed-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSlvmconv                   Release : SP1RP2_RHEL5                   Source RPM : VRTSlvmconv-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSllt                       Release : SP1RP2_RHEL5                   Source RPM : VRTSllt-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSvcs                       Release : SP1RP2_RHEL5                   Source RPM : VRTSvcs-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSvcsea                     Release : SP1RP2_RHEL5                   Source RPM : VRTSvcsea-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSobgui                     Release : 0                              Source RPM : VRTSobgui-3.4.15.0-0.src.rpm
Name : VRTSspt                       Release : GA                             Source RPM : VRTSspt-5.5.000.005-GA.src.rpm
Name : VRTSatServer                  Release : 0                              Source RPM : VRTSatServer-5.0.32.0-0.src.rpm
Name : VRTSob                        Release : 0                              Source RPM : VRTSob-3.4.312-0.src.rpm
Name : VRTSfssdk                     Release : SP1RP2_RHEL5                   Source RPM : VRTSfssdk-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSamf                       Release : SP1RP2_RHEL5                   Source RPM : VRTSamf-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSvcsdr                     Release : SP1RP2_RHEL5                   Source RPM : VRTSvcsdr-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTScscm                      Release : Linux_GENERIC                  Source RPM : VRTScscm-5.1.00.20-Linux_GENERIC.src.rpm
Name : VRTSvxvm                      Release : SP1RP2_RHEL5                   Source RPM : VRTSvxvm-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSvxfen                     Release : SP1RP2_RHEL5                   Source RPM : VRTSvxfen-5.1.132.000-SP1RP2_RHEL5.src.rpm
Name : VRTSodm                       Release : RP1_RHEL5                      Source RPM : VRTSodm-5.1.101.000-RP1_RHEL5.src.rpm
Name : VRTSvlic                      Release : 0                              Source RPM : VRTSvlic-3.02.51.010-0.src.rpm
Name : VRTSsfmh                      Release : 0                              Source RPM : VRTSsfmh-3.1.830.0-0.src.rpm
Name : VRTScps                       Release : SP1RP2_RHEL5                   Source RPM : VRTScps-5.1.132.000-SP1RP2_RHEL5.src.rpm

Do others agree that this is a potentially dangerous VCS bug, or is it simply "works as designed, deal with it!"?

Satish_K__Pagar · ‎02-21-2012

This behavior is documented in the Agent Developer Guide.

ActionTimeout

After the hares -action command has instructed the agent to perform a specified action, the action entry point has the time specified by the ActionTimeout attribute (scalar-integer) to perform the action. The value of ActionTimeout may be set for individual resources, if overridden.

Whether overridden or not, no matter what value is specified for ActionTimeout, the value is internally limited to 0.5 * MonitorInterval. You can extend this value by using the VCSAgSetResEPTimeout (for C/C++ entry point)/VCSAG_SET_RES_EP_TIMEOUT (for script entry point). The default is 30 seconds. The ActionTimeout attribute value can be overridden.

kjbss · ‎02-21-2012

Satish -- thanks for finding that bit of documentation that had eluded me. That really explains it.

My testing has revealed that this is indeed how the *effective* action timeout value is determined and enforced, AND, most importantly, if you kick off the action just before the monitor would have run, the monitor invocation is delayed (good) until the action has completed (or gets timed out by either the ActionTimeout or half the MonitorInterval, whichever is smaller).
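
As a rough illustration of the rule described above, here is a minimal Python sketch of how the effective limit works out for the values in the original post. The function name is hypothetical (it is not part of VCS), and the sketch ignores any extension made through VCSAgSetResEPTimeout / VCSAG_SET_RES_EP_TIMEOUT.

```python
def effective_action_timeout(action_timeout, monitor_interval):
    """Hypothetical helper: the smaller of ActionTimeout and half of
    MonitorInterval, as described in this thread. Not a VCS API."""
    return min(action_timeout, 0.5 * monitor_interval)


# Values from the original post: ActionTimeout=30, MonitorInterval=10.
limit = effective_action_timeout(30, 10)

# The rmSleepAndRecreate action sleeps for 15 seconds.
action_duration = 15

print(limit)                    # 5.0
print(action_duration > limit)  # True: the action outlives its limit
```

With these numbers the action has roughly 5 seconds rather than 30, which is consistent with the 15-second sleep being timed out.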

I don't see a good reason to overcomplicate this as they have (why bother with an ActionTimeout attribute if you are sometimes not going to use it; just document that the action timeout is half the monitor interval), but I don't mind as long as I know what it is doing.

Bottom line is that if you are going to kick off an action procedure that would not like getting killed in-flight, you had better determine the amount of time you have to perform the procedure and only start it if there seems to be enough time. Wait, this cannot always be determined with consistent precision, for *lots* of reasons (variations in system load, etc). Therefore, the real bottom line is that you should never do anything in an action entry point that would have negative consequences if it gets killed while it's running.

I still think this should be highlighted in the documentation: the action is pretty vulnerable to getting killed, so be careful what you action.

In my case, I've got to fork off a separate process and disconnect it from the parent so it will not get killed by the VCS engine (or agent framework). I then also need to modify the monitor to take this into account and react accordingly, to suit the circumstances. And there are potential pitfalls here to contend with, but it is all doable...
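
One way the "fork off and disconnect" idea could look, sketched in Python under stated assumptions: the helper name and the example command are hypothetical, and whether a process started this way actually survives the agent framework's thread cancellation has not been verified against VCS. The sketch only shows the generic POSIX technique of starting a child in its own session with its standard streams detached.

```python
import subprocess
import sys


def start_detached(command):
    """Hypothetical helper: start `command` in a new session so it is
    not tied to the caller's session or standard streams.
    Returns the child's PID without waiting for it to finish."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid() (POSIX only)
        close_fds=True,
    )
    return proc.pid


if __name__ == "__main__":
    # Stand-in for the real long-running work (an assumption for the demo).
    pid = start_detached([sys.executable, "-c", "import time; time.sleep(1)"])
    print(pid)
```

The action entry point would return right after launching the worker, so the monitor still needs some way to tell that work is in progress (for example a marker file that the worker creates and removes), which is where the pitfalls mentioned above come in.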

mikebounds · ‎02-21-2012

This is useful to know and explains why I have sometimes seen actions time out unexpectedly. You say "most importantly, if you kick off the action just before the monitor would have run, the monitor invocation is delayed (good) until the action has completed"

Why is this good? For me, running an action should not normally affect the state of the resource, and I don't know of any Symantec-supplied actions that affect the state, so why should it cause an issue to run the monitor entry point at the same time? If an action could affect the state, then the user should freeze the service group prior to running the action (or the action code could freeze the service group).

It frustrates me that this information about timeouts is in the developer's guide when timeouts are set by the USER without having to write any code, so I don't understand why this information cannot be put in the VCS USERS guide. I don't do any development of VCS agents, so most of the material in that guide is not relevant to me and does not make sense to me; I guess that guide may tell me that two entry points can't run at the same time for a given resource.

Mike

kjbss · ‎02-21-2012

Well, it's "horses for courses" (what is suitable for one person or situation might be unsuitable for another)....

All that really matters is that it is documented in the appropriate sections, as Mike implies.

And I would strongly advocate that, because there is this (ill-conceived, IMHO) interplay between ActionTimeout and MonitorInterval, it be clearly described wherever either of these two attributes is defined or described within the documentation. That means the Admin and User guides, as well as the developer guide, need to state this relationship.

I've always had the idea that VCS agent entry points are designed to be "single threaded" per resource, per system. That is, only one entry point runs at a time. However, I'm not sure where/if this is specifically documented.

If the above is true, whether or not the action would temporarily cause a monitor to fail would be irrelevant, as long as the action left the resource environment in a state that would pass a monitor invocation before the action terminated (and all of that done before it got timed out!).
