Forum Discussion

mokkan
Level 6
11 years ago

Resource goes offline

Hello,

We manage a few applications with VCS. One of them needs to be restarted every day, but we don't do that ourselves: the application team restarts it outside the cluster. Since the resource is not critical, it goes offline, and because of this our SG always shows as partially online. We don't want the resource to show as offline, since the application has been started manually. How can we make the resource come online automatically? We don't want to see the SG in a partial state either.

Thanks in advance.

  • Hi,

    Have you analysed the logs to see why VCS reports the application as offline? Is the monitor timing out, or is it something else? Check engine_A.log to understand why this happens.

    Secondly, once the app team restarts the application, VCS should automatically probe it and declare the group online. By default the monitor cycle runs at the defined interval, so once the app is restarted the monitor will detect it as online and update the status accordingly. If this is not happening, quite possibly the monitor script does not return the appropriate exit codes of 110 (successful) and 100 (unsuccessful).
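
    To illustrate the exit-code point, here is a minimal sketch of a monitor check that returns 110 or 100. The function name monitor_status and the PID-file approach are assumptions for illustration only; your MonitorProgram may detect the application in a different way, and the exit codes expected by your agent version should be confirmed in its documentation.

```shell
#!/bin/sh
# Sketch only: exit codes as quoted in this thread (110 = online, 100 = offline).
ONLINE=110
OFFLINE=100

# Returns 110 if the PID stored in the given file belongs to a live process,
# otherwise 100. The PID-file location is a hypothetical detail.
# Note: kill -0 also fails for processes the caller may not signal.
monitor_status() {
  ms_pidfile=$1
  [ -r "$ms_pidfile" ] || return "$OFFLINE"
  ms_pid=$(cat "$ms_pidfile")
  case $ms_pid in
    ''|*[!0-9]*) return "$OFFLINE" ;;   # empty or non-numeric content
  esac
  if kill -0 "$ms_pid" 2>/dev/null; then
    return "$ONLINE"
  fi
  return "$OFFLINE"
}

# A real MonitorProgram would finish with something like:
#   monitor_status /var/run/myapp.pid; exit $?
```

    Running the script by hand and checking "echo $?" afterwards is a quick way to see which code VCS would receive.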

     

    G

  • If the application team starts the application immediately after stopping it (as opposed to leaving it down for a while during maintenance), then you should freeze the service group first. When the group is frozen, VCS still monitors the resource, but if it sees the resource is offline it marks it as offline rather than faulted. (This also means you do not need to make the resource non-critical if it should really be critical, i.e. if you want VCS to fail the service group over when the application genuinely fails.)

    When you start the application, VCS should recognise the resource as online within 5 minutes, and you can manually probe the resource to speed this up. If VCS does not see the resource as online, then the application team must be starting the application differently from how VCS starts it. If this is the case, please post the extract of your service group from main.cf.
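
    The freeze, restart, probe and unfreeze sequence described here could be wrapped roughly as follows. This is a sketch under assumptions: the function name restart_frozen and the group, resource and system names are placeholders, and the restart command is whatever the application team normally runs. It uses a temporary (non-persistent) freeze.

```shell
#!/bin/sh
# Sketch: run a restart command while the service group is frozen.
# Usage: restart_frozen <group> <resource> <system> <restart command...>
restart_frozen() {
  rf_group=$1; rf_res=$2; rf_sys=$3
  shift 3
  hagrp -freeze "$rf_group" || return 1      # temporary freeze
  rf_rc=0
  "$@" || rf_rc=$?                           # the application team's restart command
  hares -probe "$rf_res" -sys "$rf_sys"      # ask VCS to re-monitor now
  hagrp -unfreeze "$rf_group"
  return "$rf_rc"                            # report the restart command's status
}

# Example call (all names hypothetical):
#   restart_frozen appSG appRes node1 /opt/myapp/bin/restart.sh
```

    It may be worth confirming with "hares -state" that the resource shows ONLINE before the group is unfrozen, since the probe completes asynchronously.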

    Mike

  • mokkan -- let me recap and confirm your requirement:

    1. You have one particular application that is managed by VCS that needs to be restarted every day -- I am going to assume that it is managed by an Application-Type resource.
    2. The Server Admins (i.e. Cluster Admins) do not actively manage this application -- it is managed by the "Application Team".
    3. You do not want the Service Group to show a state of PARTIAL when the Application Team needs to restart it (which evidently takes longer than the in-effect MonitorInterval for that resource).


    If I got that right, what you need to do is roll your own "Intentional Offline" feature:

    To do this, modify the resource's currently defined MonitorProgram to check for the existence of the file '/tmp/<resourceName>.IntentionalOffline'. If the file exists, the program should exit 110 to indicate that the resource is to be considered online (regardless of whether the actual application is running).

    NOTE:  the required exit code depends on which version of the Application agent you are using -- you may need to exit 0 for online and 1 for offline -- so read the appropriate agent documentation.


    Here is sample code to add to the MonitorProgram to accomplish this functionality:

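    #!/bin/bash

    OFFLINE=100
    ONLINE=110

    # Resource name and flag file; these must match what the Application Team uses.
    RES=mySG_myRes_app
    FLAG=/tmp/${RES}.IntentionalOffline

    if [[ -f $FLAG ]]
    then
      # hares -state needs the resource name. lltstat -H is kept from the original
      # post to name the local system; verify its output on your version.
      if [[ $(hares -state "$RES" -sys "$(lltstat -H)") == "OFFLINE" ]]
      then
        # The IntentionalOffline file exists, yet VCS already believes the
        # resource is OFFLINE on this host, so the best thing is to exit "OFFLINE".
        exit $OFFLINE
      else
        exit $ONLINE
      fi
    fi

    # ...rest of normal MonitorProgram logic follows....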

     

    I would also add the following logic to the beginning of the defined StartProgram, CleanProgram, and StopProgram in order to keep things in order and clean up stale "Intentional Offline" flag-files:

    [[ -f /tmp/mySG_myRes_app.IntentionalOffline ]] && rm /tmp/mySG_myRes_app.IntentionalOffline

     

    Now, create a workflow for the Application Team to use when they need to restart the application:

    1. Execute: touch /tmp/<resourceName>.IntentionalOffline
    2. Do the maintenance on the application environment.  
    3. Restart the application.
    4. Execute:  rm /tmp/<resourceName>.IntentionalOffline

    The above can of course be scripted to make life easier for the Application Team.
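
    As a sketch of what such a script might look like, the wrapper below creates the flag file, runs the restart command, and removes the flag afterwards whether or not the restart succeeded. The function name with_intentional_offline and the FLAG_DIR variable are assumptions for illustration; the flag path follows the '/tmp/<resourceName>.IntentionalOffline' pattern above.

```shell
#!/bin/sh
# Sketch: wrap a maintenance/restart command in the flag-file workflow.
# Usage: with_intentional_offline <resourceName> <command...>
# FLAG_DIR is a hypothetical override; it defaults to /tmp as in the post.
with_intentional_offline() {
  wio_flag="${FLAG_DIR:-/tmp}/$1.IntentionalOffline"
  shift
  touch "$wio_flag" || return 1   # step 1: tell the monitor to report ONLINE
  wio_rc=0
  "$@" || wio_rc=$?               # steps 2-3: maintenance and restart
  rm -f "$wio_flag"               # step 4: resume normal monitoring
  return "$wio_rc"                # pass the command's status back to the caller
}

# Example call (script path hypothetical):
#   with_intentional_offline mySG_myRes_app /opt/myapp/bin/restart.sh
```

    If the wrapper itself is killed before it reaches the rm step, the flag file is left behind, which is why the cleanup in StartProgram, StopProgram and CleanProgram is still worth having.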


    Once that is in place, your Application Team will be able to do their daily maintenance on the application environment without having the service group show up as PARTIAL.

     

     

  • Hello All,

    Thank you for all the replies. I made a mistake in the information I posted: the resource is not going offline, it is getting faulted. I will update you with the correct information.

     

    Thanks again for all the quick replies.

  • Why not train application owners to offline and online resources using cluster commands?

     hares -offline <res-name> -sys <system>
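
    A restart through the cluster could be sketched as an offline followed by an online, waiting for each state change in between. The function name vcs_restart, its default timeout and the example names are assumptions; the "hares -wait" usage in particular should be checked against the hares manual page for your VCS version.

```shell
#!/bin/sh
# Sketch: restart a resource through VCS instead of outside the cluster.
# Usage: vcs_restart <resource> <system> [timeout_seconds]
vcs_restart() {
  vr_res=$1; vr_sys=$2; vr_timeout=${3:-300}
  hares -offline "$vr_res" -sys "$vr_sys" || return 1
  # Wait until VCS reports the resource OFFLINE before starting it again.
  hares -wait "$vr_res" State OFFLINE -sys "$vr_sys" -time "$vr_timeout" || return 1
  hares -online "$vr_res" -sys "$vr_sys" || return 1
  hares -wait "$vr_res" State ONLINE -sys "$vr_sys" -time "$vr_timeout"
}

# Example call (names hypothetical):
#   vcs_restart appRes node1 120
```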

  • Sorry guys. This is exactly what is happening: the application team is not restarting the application locally. The VCS resource connect to another server to be online all the time. There are more than 5 resources that depend on the master application. When the application team restarts the master application, all 5 of those resources go into a faulted state. How can we bring them online without manual intervention?

  • 1.  If the Application Team is mucking about with an application that is actively managed by VCS, then that needs to stop ASAP! -- provide the Application Team the appropriate access level via VCS GUI (or CLI) so that they can restart the application via supported VCS interfaces (either one of several GUI ones or via the CLI).

    2.  If several other applications are dependent upon this master application, AND this dependency is appropriately represented in the service group's hierarchy (via resource dependencies), then VCS will not let anyone restart the master application without first offlining the dependent application resources.


    You say:  "The VCS resource connect to another server to be online all the time" -- this makes no sense to me (sorry that I can't figure it out, but... I can't) and needs explaining in order for you to get the help you are looking for ...

     

    These "5 resources depend on master application" -- are all of these in the same service group, and linked appropriately?

    Please provide the names of the 5 dependent resources, and the output of

    hares -dep <master_Application_Resource_Name>

     

  • If the monitor for the dependent applications is failing and the applications are still actually up, then freeze application service group(s) before offlining master so they do not fault and when the master application is back online the dependent applications will automatically show as online (and then unfreeze groups).

    If the dependent applications actually fail when the master is taken down, then you can set RestartLimit on the resources so they will try to restart automatically. If the master application is down for a while, then you will need to set an appropriate OnlineTimeout and/or set the OnlineRetryLimit.

    Note the RestartLimit is used when the application is up and it dies, and the OnlineRetryLimit is used when the application is down and it is trying to start, so you will probably want to set the RestartLimit to 1 and the OnlineRetryLimit along with the OnlineTimeout depending on how long the master application is down. For example if it is down for 15 mins then you could set OnlineRetryLimit to 3 with OnlineTimeout = 300 seconds (5 mins x 3 = 15 mins)
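
    The arithmetic in that example can be written as a small helper that follows the same rule of thumb (retry limit = expected downtime divided by OnlineTimeout, rounded up). The function name online_retry_limit is hypothetical, and how your agent version counts the first online attempt is worth checking in its documentation.

```shell
#!/bin/sh
# Sketch: rule-of-thumb OnlineRetryLimit from the post's example.
# Usage: online_retry_limit <expected_downtime_seconds> <online_timeout_seconds>
online_retry_limit() {
  # Integer ceiling of downtime / timeout.
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 15 minutes of downtime with OnlineTimeout = 300 seconds:
online_retry_limit 900 300    # prints 3
```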

    Note you set RestartLimit, OnlineRetryLimit and OnlineTimeout on the type (example type Application), not the resource, so for example you use:

    hatype -modify Application RestartLimit 1

    and all resources of type Application will then have a RestartLimit of 1, but if you have other resources of type Application that should not be restarted, then you can override the default type attributes using "hares -override" - example:

    hares -override resource_name1 RestartLimit
    hares -modify resource_name1 RestartLimit 1

    So if RestartLimit is 0 (the default) for type Application, then all resources of type Application will have a RestartLimit of 0, except for those which you have overridden.
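
    When several resources need the override, the commands can be generated rather than typed by hand. The sketch below only prints the commands so they can be reviewed before being run; the function name print_restart_override is hypothetical. It also brackets the changes with "haconf -makerw" and "haconf -dump -makero", since the cluster configuration has to be writable for the modifications to be accepted.

```shell
#!/bin/sh
# Sketch: print the override commands for a list of resources (dry run).
# Usage: print_restart_override <RestartLimit value> <resource>...
print_restart_override() {
  pro_value=$1
  shift
  echo "haconf -makerw"                       # open the configuration for writing
  for pro_res in "$@"; do
    echo "hares -override $pro_res RestartLimit"
    echo "hares -modify $pro_res RestartLimit $pro_value"
  done
  echo "haconf -dump -makero"                 # save and close the configuration
}

# Example: review the output, then pipe it to sh on a cluster node.
print_restart_override 1 resource_name1 resource_name2
```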

    Mike

  • I repeat my previous advice - train the App team to use VCS! 

    The GUI can be used to include dependencies.

  • Thank you very much all of you.

     

    Mike,

    As usual you are awesome !!!!