Resource goes offline
Hello,
We are managing a few applications using VCS. One of them needs to be restarted manually every day, but we don't do that ourselves: the application team restarts it outside the cluster, and since the resource is not critical, the resource goes offline. Because of this our SG always shows as partially online. We don't want the resource to show as offline, since the application has been started manually. How can we make the resource come online automatically? We don't want to see the SG as partially online either.
Thanks in advance.
If the application team starts the application immediately after stopping it (as opposed to leaving it down for a while during maintenance), then you should freeze the service group first. When the group is frozen, VCS will still monitor the resource, but if it sees the resource is offline it will mark it as offline rather than faulted. This also means you do not need to make the resource non-critical if it should really be critical (i.e. if the application fails in a real situation, do you want VCS to fail the service group over?).
When you start the application, VCS should recognise that the resource is online within 5 mins, and you can manually probe the resource to speed this up. If VCS does not see the resource as online, then the application team must be starting the application differently from how VCS starts it. If this is the case, please post the extract of your service group from main.cf.
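A minimal sketch of that freeze, restart, probe, unfreeze sequence as a shell function. The group, resource and system names are placeholders, and the `run` helper with its `DRY_RUN` switch is only an illustration so the commands can be reviewed before anything is executed. As far as I know, a freeze without `-persistent` is temporary and does not need the configuration opened read-write, but check that against your version.

```shell
#!/bin/bash
# Sketch only: wrap an out-of-cluster restart in a freeze/unfreeze pair.
# All names passed in (group, resource, system) are placeholders.

run() {
    # With DRY_RUN=1 the command is printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_under_freeze() {
    local group=$1 resource=$2 system=$3 rc
    shift 3
    run hagrp -freeze "$group" || return 1       # temporary freeze
    run "$@"                                     # the team's own restart command
    rc=$?
    run hares -probe "$resource" -sys "$system"  # ask VCS to re-check now
    run hagrp -unfreeze "$group"
    return $rc
}
```

With `DRY_RUN=1` set, calling `restart_under_freeze mySG myRes node1 /path/to/restart_script` only prints the four commands in order; the restart script path here is hypothetical.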
Mike
mokkan -- let me recap and confirm your requirement:
- You have one particular application that is managed by VCS that needs to be restarted every day -- I am going to assume that it is managed by an Application-Type resource.
- The Server Admins (IE: Cluster Admins) do not actively manage this application -- it is managed by the "Application Team"
- You do not want the Service Group to show a state of PARTIAL when the Application Team needs to restart it (which evidently takes longer than the in-effect MonitorInterval for that resource).
If I got that right, what you need to do is roll your own "Intentional Offline" feature.
To do this, modify the resource's currently defined MonitorProgram to check for the existence of the file '/tmp/<resourceName>.IntentionalOffline'. If the file exists, the MonitorProgram should exit 110 to indicate that the resource is to be considered online (regardless of whether the actual application is running).
NOTE: the required exit code is dependent upon which version of the Application agent you are using -- you may need to exit 0 for online and 1 for offline -- you need to read the appropriate agent documentation.
Here is sample code to add to the MonitorProgram to accomplish this functionality:
    #!/bin/bash
    OFFLINE=100
    ONLINE=110
    if [[ -f /tmp/mySG_myRes_app.IntentionalOffline ]]
    then
        # The resource name is passed to hares so that only its state is returned.
        if [[ "$(hares -state mySG_myRes_app -sys "$(lltstat -H)")" == "OFFLINE" ]]
        then
            # For some reason the IntentionalOffline file exists, yet VCS already
            # believes the resource is OFFLINE on this host.
            # Best thing to do here is to exit "OFFLINE".
            exit $OFFLINE
        else
            exit $ONLINE
        fi
    fi
    # ...rest of normal MonitorProgram logic follows....
I would also add the following logic to the beginning of the defined StartProgram, CleanProgram, and StopProgram in order to keep things in order and clean up stale "Intentional Offline" flag-files:
[[ -f /tmp/mySG_myRes_app.IntentionalOffline ]] && rm /tmp/mySG_myRes_app.IntentionalOffline
Now, create a workflow for the Application Team to use when they need to restart the application:
- Execute: touch /tmp/<resourceName>.IntentionalOffline
- Do the maintenance on the application environment.
- Restart the application.
- Execute: rm /tmp/<resourceName>.IntentionalOffline
The above can of course be scripted to make life easier for the Application Team.
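One possible way to script that workflow is sketched below. The function name and the `FLAG_DIR` variable are made up for illustration; the flag-file name follows the '/tmp/<resourceName>.IntentionalOffline' convention described above. The flag is removed even when the restart command fails, so a failed restart is not hidden from VCS for longer than the maintenance itself.

```shell
#!/bin/bash
# Sketch only: run a maintenance/restart command while the
# "Intentional Offline" flag file for a resource is in place.

with_intentional_offline() {
    local resource=$1 flag rc
    shift
    flag="${FLAG_DIR:-/tmp}/$resource.IntentionalOffline"
    touch "$flag" || return 1   # step 1: raise the flag
    "$@"                        # steps 2-3: maintenance and restart
    rc=$?
    rm -f "$flag"               # step 4: always remove the flag
    return $rc                  # report the restart command's status
}
```

The Application Team would then run something like `with_intentional_offline mySG_myRes_app /path/to/restart_script`, where the script path is a placeholder for whatever they use today.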
Once that is in place, your Application Team will be able to do their daily maintenance on the application environment without having the service group show up as PARTIAL.
1. If the Application Team is mucking about with an application that is actively managed by VCS, then that needs to stop ASAP! -- provide the Application Team with the appropriate access level via the VCS GUI (or CLI) so that they can restart the application via supported VCS interfaces (either one of several GUI ones or via the CLI).
2. If several other applications are dependent upon this master application, AND this dependency is appropriately represented in the service group's hierarchy (via resource dependencies), then VCS will not let anyone restart the master application without first offlining the dependent application resources.
You say: "The VCS resource connect to another server to be online all the time" -- this makes no sense to me (sorry that I can't figure it out, but... I can't) and needs explaining in order for you to get the help you are looking for.
These "5 resrources depends on master application" -- are all of these in the same service group, and linked appropriately?
Please provide the names of the 5 dependent resources, and the output of
hares -dep <master_Application_Resource_Name>
If the monitor for the dependent applications is failing while the applications are actually still up, then freeze the application service group(s) before offlining the master so that they do not fault. When the master application is back online, the dependent applications will automatically show as online, and you can then unfreeze the groups.
If the dependent applications actually fail when the master is taken down, then you can set RestartLimit on the resources so that they will try to restart automatically. If the master application is down for a while, then you will need to set an appropriate OnlineTimeout and/or set the OnlineRetryLimit.
Note that the RestartLimit is used when the application is up and it dies, and the OnlineRetryLimit is used when the application is down and it is trying to start. So you will probably want to set the RestartLimit to 1, and set the OnlineRetryLimit along with the OnlineTimeout depending on how long the master application is down. For example, if it is down for 15 mins, then you could set OnlineRetryLimit to 3 with OnlineTimeout = 300 seconds (5 mins x 3 = 15 mins).
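The arithmetic in that example can be written as a small helper. This is only a sketch of the rule of thumb above (downtime divided by OnlineTimeout, rounded up); the function name is made up, and it deliberately ignores anything else that may affect real timing, such as the initial online attempt or monitor intervals.

```shell
#!/bin/bash
# Sketch only: estimate an OnlineRetryLimit from the expected downtime of
# the master application and the chosen OnlineTimeout, both in seconds.

retry_limit_for() {
    local downtime=$1 timeout=$2
    # Integer ceiling of downtime / timeout.
    echo $(( (downtime + timeout - 1) / timeout ))
}

retry_limit_for 900 300   # 15 mins down, 5 min timeout: prints 3
```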
Note you set RestartLimit, OnlineRetryLimit and OnlineTimeout on the type (example type Application), not the resource, so for example you use:
hatype -modify Application RestartLimit 1
and all resources of type Application will then have a RestartLimit of 1, but if you have other resources of type Application that should not be restarted, then you can override the default type attributes using "hares -override" - example:
hares -override resource_name1 RestartLimit
hares -modify resource_name1 RestartLimit 1
So if RestartLimit is 0 (the default) for type Application, then all resources of type Application will have a RestartLimit of 0, except for those which you have overridden.
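For completeness, a sketch of the full sequence for one resource, assuming the cluster configuration is currently read-only and therefore has to be opened with haconf first and saved afterwards. The function name is made up, and it only prints the commands so that they can be reviewed (and then piped to sh on a cluster node if they look right).

```shell
#!/bin/bash
# Sketch only: print the commands that override RestartLimit on a single
# resource, bracketed by opening and saving the cluster configuration.

restart_limit_override_cmds() {
    local resource=$1 limit=$2
    printf '%s\n' \
        "haconf -makerw" \
        "hares -override $resource RestartLimit" \
        "hares -modify $resource RestartLimit $limit" \
        "haconf -dump -makero"
}

restart_limit_override_cmds resource_name1 1
```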
Mike