Forum Discussion

Mark777
Level 3
11 years ago

Monitor failing on inactive host for Application agent

Split off from https://www-secure.symantec.com/connect/forums/help-ha-app-and-dns-resource

The application agent seems to be running on both servers and the monitor is obviously failing on the secondary host.

What attribute do I need to set so that it only runs one monitor instance?

 

Thanks

  • You can set OfflineMonitorInterval for the Application to 0 (hatype -modify Application OfflineMonitorInterval 0) which will turn off Offline Monitoring, but you shouldn't need to do this:

    If you are using the MonitorProcesses or PidFiles attribute, then the resource should just report offline.  If you are using the MonitorProgram attribute, then your program should be on local storage and should report offline when the application is not running (as opposed to the monitor program failing to run when the application is not running).

    Mike
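For reference, the type-level change mentioned above can be wrapped so that the cluster configuration is opened for writing first and saved afterwards. This is only a minimal sketch: it assumes the standard VCS commands `haconf` and `hatype` are on the PATH and that the caller is allowed to modify the configuration. The function name `set_offline_monitor_interval` is hypothetical and not part of VCS.

```shell
#!/bin/sh
# Hypothetical helper: set OfflineMonitorInterval for a resource type.
# Assumes the VCS commands haconf and hatype are on PATH.
set_offline_monitor_interval() {
    type_name="$1"   # resource type, e.g. Application
    interval="$2"    # seconds; 0 turns offline monitoring off

    haconf -makerw                 # open the configuration read-write
    hatype -modify "$type_name" OfflineMonitorInterval "$interval"
    rc=$?                          # remember whether the modify worked
    haconf -dump -makero           # save and close the configuration
    return "$rc"
}
```

It would be called as `set_offline_monitor_interval Application 0`, and the result can be checked with `hatype -display Application | grep OfflineMonitorInterval`. Note that this changes the setting for every resource of that type, not only the one that is misbehaving.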

  • This setup is actually based on shared storage, so ideally it would be nice to link it to the host that CFS has as "active".

  • The norm for VCS agents is that the agent will run on all systems that are listed in the resource's Service Group's SystemList.

    As Mike says, it would normally return OFFLINE on the host where the Service Group is offline, and ONLINE on the host where the Service Group is online.  

    The idea is that monitoring on the offline host makes sure the resource has not come online on a system that should not be hosting the application, which protects against concurrency violations.

    In order for us to help further than what Mike has already provided, we need to hear details on exactly how the monitor is failing on the offline system.  

    If it is failing because the monitor program cannot be found on the offline host, then it sounds like the monitor program lives on shared storage that is only accessible on the system where the resource is online.  In such a case you need to move the monitor program to an area where it can always be accessed on each system, whether or not the service group is currently online on that system.

    A common place to put such programs would be in either of the two locations:

    /opt/VRTSvcs/bin/scripts/<yourApplicationName>/<yourMonitorProgram>
    /opt/VRTSagents/ha/bin/<yourApplicationName>/<yourMonitorProgram>

    If it makes sense in your case, a common design pattern is to check whether the shared storage is mounted locally before proceeding with the actual monitor logic.  If the shared storage is mounted locally, carry on with the normal monitor procedure; otherwise, return OFFLINE.
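The pattern described above could be sketched roughly as follows, for a script used as the MonitorProgram of the bundled Application agent. The exit codes 110 (online) and 100 (offline) are the values that agent expects from a monitor program; everything else here is an assumption for illustration: the function names are hypothetical, reading `/proc/mounts` is Linux-specific, and matching the process by its exact command name with `ps` is just one possible check.

```shell
#!/bin/sh
# Sketch of monitor logic for an Application agent MonitorProgram.
# Exit codes: 110 = online (full confidence), 100 = offline.
VCS_RES_ONLINE=110
VCS_RES_OFFLINE=100

# is_mounted MOUNT_POINT [MOUNT_TABLE]
# Succeeds if MOUNT_POINT is the second field of a line in MOUNT_TABLE.
# Defaulting to /proc/mounts is a Linux assumption.
is_mounted() {
    mount_table="${2:-/proc/mounts}"
    [ -r "$mount_table" ] || return 1
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "$mount_table"
}

# is_running PROCESS_NAME
# Succeeds if some process has exactly this command name.
is_running() {
    ps -e -o comm= 2>/dev/null | grep -qx -- "$1"
}

# monitor_status MOUNT_POINT PROCESS_NAME [MOUNT_TABLE]
# Returns the exit code the monitor program should report.
monitor_status() {
    # Storage not mounted here: report OFFLINE instead of failing.
    is_mounted "$1" "$3" || return "$VCS_RES_OFFLINE"
    is_running "$2" || return "$VCS_RES_OFFLINE"
    return "$VCS_RES_ONLINE"
}
```

A real monitor script would end with something like `monitor_status /shared/app myappd; exit $?`, where the mount point and process name are placeholders. The point of the design is that a missing mount produces a clean OFFLINE result rather than an error from the script itself.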

     

  • Thanks for the response.

     

    The agent is actually running on the shared storage and checks for a running process. Because it is shared storage the mount is online on both hosts, so not sure exactly what can be checked / "mounted locally"

     

    What I might do (for now anyway) is to check for the presence of a file and make that the dependency ...

  • Can you clarify a few things:

    1. You are using CFS so the same filesystem is mounted on both nodes at the same time - correct?
    2. You have a resource of type application configured in a failover service group - correct? - if resource is configured in a parallel group, like same group as CFS mounts, then it sounds like it should be in a failover group if it should only run on one node at a time
    3. Is your application resource using MonitorProcesses, PidFiles or MonitorProgram attribute
    4. Is your Monitor failing as opposed to reporting offline - can you give the entry in the engine log
    5. If your application resource is using MonitorProgram attribute - then what does your MonitorProgram do that causes it to fail

    Mike

  • Hi Mike,

    hatype -modify Application OfflineMonitorInterval 0 seems to have fixed the issue (I think) - just keen to know the best way. So comments inline below.

     

    1. You are using CFS so the same filesystem is mounted on both nodes at the same time - correct? [yes, a cfs cluster created with fencing using a cp server]
    2. You have a resource of type application configured in a failover service group - correct? - if resource is configured in a parallel group, like same group as CFS mounts, then it sounds like it should be in a failover group if it should only run on one node at a time [The "application" service group is created under the cfs cluster. I created it and used type Failover.]
    3. Is your application resource using MonitorProcesses, PidFiles or MonitorProgram attribute [it's a custom agent which looks for a running process using ps (I also don't really want to make any changes to agents)].
    4. Is your Monitor failing as opposed to reporting offline - can you give the entry in the engine log [after repeated attempts to monitor, it forces a clean]
    5. If your application resource is using MonitorProgram attribute - then what does your MonitorProgram do that causes it to fail [as per point 3 - custom app and agent]
  • I'm still not clear on point 4.  If a monitor fails, then "hastatus -sum" should show that the resource is unprobed on the system where the monitor fails - i.e. something like:

    -- RESOURCES NOT PROBED
    -- Group           Type                 Resource             System           
    
    E  failoversg      custom_agent         custom_res           Sys2

        

    If the agent is reporting offline on the inactive node then this is fine - this is what it should do.  I don't understand why it would call clean, as a clean is normally called when the resource has been online and then goes offline outside of VCS control, or if offline fails.

    Can you give an extract from the engine log of the issue you see when the OfflineMonitorInterval is not zero?

    Why do you need to use a custom agent - is there a reason you cannot use the application agent? A custom agent is normally only used if the application agent doesn't work - an example would be if an application cannot be uniquely identified by the first 80 characters shown by normal ps and so you need to use ucb ps to show more than 80 chars.

    Mike
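To make the "not probed" check above easier to repeat, the relevant section can be filtered out of the summary output. The following is a rough sketch only: the function name `not_probed` is hypothetical, and it assumes the section layout looks like the sample shown above (a "-- RESOURCES NOT PROBED" heading, a "-- Group ..." column header, then one five-field line per resource).

```shell
#!/bin/sh
# not_probed: read `hastatus -sum` output on stdin and print the
# group, type, resource and system of each unprobed resource.
not_probed() {
    awk '
        # Start of the section we care about.
        $1 == "--" && $2 == "RESOURCES" && $3 == "NOT" && $4 == "PROBED" {
            in_section = 1; next
        }
        # Any other section heading (not the column header) ends it.
        $1 == "--" && $2 != "Group" { in_section = 0 }
        # Data lines: a one-letter tag followed by four fields.
        in_section && NF == 5 && $1 != "--" { print $2, $3, $4, $5 }
    '
}
```

Typical use would be `hastatus -sum | not_probed`. For the engine log extract, something like `grep custom_res /var/VRTSvcs/log/engine_A.log | tail -20` should pull out the recent entries for the resource, assuming the log is in its default location and `custom_res` is replaced with the real resource name.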

     

     

  • Hi Mike,

     

    Firstly - the odd thing is that I haven't been able to reproduce it after changing the OfflineMonitorInterval back to non-zero, i.e. it shows the process is dead but doesn't run a clean, which it was doing before. I even offlined everything and checked nothing was running. Maybe something strange was in place before, after the agents were installed/configured.

     

    Anyway for future reference, current state below.

     

    Regarding "custom agents" - in short they were developed by someone else, and built for a DR/GCO setup and done for various applications using a standard approach, and each application type tests for more than just the process running, so that's the main reason.

    So we can probably park this for now.

    Thanks again.

    Mark

     

    hastatus -sum

    -- SYSTEM STATE
    -- System               State                Frozen

    A  host1                RUNNING              0
    A  host2                RUNNING              0

    -- GROUP STATE
    -- Group           System               Probed     AutoDisabled    State

    B  MyApplication   host1                Y          N               ONLINE
    B  MyApplication   host2                Y          N               OFFLINE
    B  cvm             host1                Y          N               ONLINE
    B  cvm             host2                Y          N               ONLINE
    B  vrts_vea_cfs_int_cfsmount1 host1                Y          N               ONLINE
    B  vrts_vea_cfs_int_cfsmount1 host2                Y          N               ONLINE
    B  vxfen           host1                Y          N               ONLINE
    B  vxfen           host2                Y          N               ONLINE

    ==============================================
    DEBUG [Thu Jul 10 11:47:34 2014] APP::monitor monitor enabled
    App APP is stopped
    ==============================================

     

    (where host, MyApplication and APP represent actual names)