Forum Discussion

mokkan
11 years ago
Solved

Application Agent PID file question

I have created a Unix script to do some task, and at the end of the task it creates a local file, /var/lock/abcd.lock.

I created an Application resource and put /var/lock/abcd.lock in the PidFiles attribute. The lock file is being created, but the resource is being marked as faulted.

I am not sure why it is being marked as faulted. If the lock file is there, shouldn't the Application agent bring the resource up?

 


  • What version of VCS?

    Have you read the "Symantec Cluster Server Bundled Agents Reference Guide" for your release and OS? 

    From memory, I am nearly certain that the PidFiles attribute of the Application type is meant to list files that contain PIDs (one per line???), which the agent then uses to check whether each PID is present in the OS's process table.

    Simply creating a "lock file", and handing that to the Application agent via the PidFiles attribute, will guarantee that the resource never comes online.

    But I could be wrong! -- read the manual...  Here's the SFHA 6.1 one for AIX:

    http://www.symantec.com/business/support/resources/sites/BUSINESS/content/live/DOCUMENTATION/6000/DOC6956/en_US/vcs_bundled_agents_61_aix.pdf
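
To illustrate the kind of check described above, here is a minimal sketch of a PID-file liveness test. It is not the agent's actual code; the function name `pid_file_alive` is hypothetical, and it assumes the file holds a single PID on its first line and that the caller is allowed to signal that process (for example, it runs as root).

```shell
#!/bin/bash
# Hypothetical helper, not part of VCS: succeeds only if the given file
# holds the PID of a process that is currently running.
pid_file_alive() {
    local pid_file="$1" pid
    [ -r "$pid_file" ] || return 1                 # no file: nothing to check
    # Read the first line; tolerate a missing trailing newline.
    read -r pid < "$pid_file" || [ -n "$pid" ] || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;                   # empty or not a number
    esac
    [ "$pid" -gt 0 ] 2>/dev/null || return 1       # PID 0 would mean "my process group"
    kill -0 "$pid" 2>/dev/null                     # signal 0 only tests for existence
}
```

With a check like this, a file containing the PID of a script that has already exited is reported as dead, which matches the behaviour seen in this thread: `pid_file_alive /var/lock/abcd.lock && echo online || echo offline`.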

     

     

  • If I understand correctly: if I get the PID and write it to the lock file, would that bring the resource up?

  • Here is the script for the Application resource. The problem is that the PID dies too quickly, because this script completes in 1 or 2 minutes, while the Application agent only checks every 5 minutes. How can I make it come online? Am I missing something here?

    #!/bin/bash -x
    # Wait until "copying" disappears from the log, then write this
    # script's PID to the lock file and exit.
    pid="$$"
    script_dir="/project1"
    log_dir="${script_dir}/logs"
    echo "$log_dir"
    status="not_completed"
    fc_state="copying"
    while [ "$status" != "completed" ]; do
        sleep 60
        if grep -q "$fc_state" "$log_dir/log1"; then
            echo "$fc_state exists in log"
        else
            echo "$fc_state does not exist in log"
            status="completed"
            echo "$pid" > /var/locks/acproject.lock
        fi
        echo "my PID is $pid"
    done

     

  • Mokkan --

    Yes, you are really missing something... Let's start with:

    VCS resources are meant to contain, or manage, a service or process that runs for the entire time that the service group containing it is up.

    IE:  A VCS resource is an integral component (service or process) of the overall service group.

    From a cursory look at your script, it seems that what you are trying to do is not something that runs the entire time the service group is up, and is therefore not an appropriate candidate for an Application resource (nor any other kind of resource).  It looks more like you are trying to do a one-time thing, or implement a sequencing mechanism.

    From the script: it appears that you are waiting for another process to remove the "copying" string from the log1 file, at which point you create a file (echo $pid > /var/locks/acproject.lock) and then exit the script.

    Presumably some other mechanism then reacts to the existence of this file (which looks more like a "flag" file than a "lock" file), in order to ????what????

    There seems little reason to insert the script's PID into the flag file, as the script then terminates and therefore that particular PID is no longer relevant to anything...

     

    So we really need to start at ground zero here:

    What is the problem you are trying to fix?  What is the objective here? 

     

  • I changed my mind and wrote a small script to fix the problem rather than using the PID file.  I wrote a monitor script that exits 0 if the lock file exists and exits 1 if it does not. That should solve my issue.

     

     

  • Well, you still haven't described what you are trying to do, so I hope you are right.  I am assuming you are still doing all of this within an Application agent?

    If so, under what criteria do you remove the lock file?  Have you written that into the Offline script that the Application agent runs?  Don't forget the clean entry point too...

    You may find other unintended consequences from such an arrangement in the event of a server failure, and in other cases, largely depending upon whether or not the lock file exists on a file system that moves with the Service Group...

    IE:  Let us say that the lock file exists on a file system that moves with the Service Group, and the group is online on nodeA.

    1. Let us say nodeA crashes (this leaves the lock file intact on the shared file system).

    2. Your Service Group is brought up on nodeB.

    3.  VCS runs the monitor entry point and notices, via your script, that the lock file exists. It "believes" the resource is already online, and therefore will not run the "wait for some external process to remove "copying" from the log file" step -- no, the resource will simply be considered already online by VCS.

    Is that OK for your requirements?
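
If a leftover flag from an earlier node or an earlier boot is not acceptable, one possible way around it is to record who wrote the flag and have the monitor honour it only when that matches the current node and boot. The following is only a sketch under assumptions: the function names are hypothetical, and the boot identifier is read from `/proc/sys/kernel/random/boot_id`, which exists on Linux but not on other platforms.

```shell
#!/bin/bash
# Hypothetical sketch: a flag file that is only honoured on the node and
# boot that wrote it. None of these names come from VCS.

# Identify "this node, this boot". The boot_id path is Linux-specific
# (assumption); elsewhere the second field is simply empty.
node_stamp() {
    printf '%s %s\n' "$(uname -n)" "$(cat /proc/sys/kernel/random/boot_id 2>/dev/null)"
}

# Called at the end of the one-time task instead of writing a PID.
write_flag() {
    node_stamp > "$1"
}

# Monitor check: returns 0 (online) only if the flag exists and was
# written by this node during the current boot.
flag_is_current() {
    local flag="$1" recorded
    [ -f "$flag" ] || return 1
    IFS= read -r recorded < "$flag" || [ -n "$recorded" ] || return 1
    [ "$recorded" = "$(node_stamp)" ]
}
```

A monitor script could then end with `flag_is_current "$lock_file"; exit $?`, so that a flag left behind by a crashed node, or by the same node before a reboot, is treated as offline. Whether that is the right behaviour still depends on what the task is actually for.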

     

  • Here you go: I am running a task, and if it runs successfully, I create a lock file.

     

    This is what I am doing now. It looks like it is working, and I need to do some clean-up:

    # $lock_file is assumed to be set earlier in the script.
    if [ "$1" = "start" ]; then
        # It will do the task and create a lock file.
        :
    elif [ "$1" = "stop" ]; then
        echo "stopping ........."
        /bin/rm -f "$lock_file"
    elif [ "$1" = "monitor" ]; then
        echo "monitoring......."
        if [ -f "$lock_file" ]; then
            exit 0
        else
            exit 1
        fi
    fi


    Application stop/start will be done manually.

    Clean/monitor will be governed by the resource time limit.

    Also, in clean I did /bin/rm -f $lockfile

    Does that make sense to you?

  • You have not said if the lock file is being created on shared storage that moves with the Service Group.

    If the lock file is not on shared storage, and NodeA panics/crashes whilst the SG was up on it, then...:

    1.  The SG fails-over to NodeB.  (OK so far...)

    2.  Some time later NodeA comes back online, and VCS runs the monitor entry point for all resources, in order to determine whether these are online or offline on NodeA. 

    3.  Because the lock file still exists locally on NodeA, your Application resource is considered online on NodeA.

    4.  But it is also online on NodeB, so VCS then complains about a Concurrency Violation (where a resource in a non-Parallel service group is online on more than one node).

    This will require administrative action to resolve -- you would need to manually remove the lock file on NodeA. 

    Usually, "one time" operations are done via the PreOnline and/or PostOnline triggers.  This also avoids the overhead of the node running the monitor routine every MonitorInterval seconds. 

    Otherwise, if this task needs to be done before another resource is brought online, then often it could/should be implemented in that resource's "start up procedures". 

    So the question is: 

    Does the task need to be performed after some resource comes online (such as the Disk Group resource) AND prior to another resource coming online? If so, can you add this task to be performed by modifying that resource's online procedures?

    At the end of the day, you need to test the various failure cases (server, application, network, etc.) to see how your implementation behaves...
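
As a starting point for that kind of testing, the crash case from steps 1 to 4 can be reproduced on a single machine without a cluster. This is a minimal sketch: the three functions are hypothetical stand-ins for the start/stop/monitor branches posted earlier in the thread, and a temporary path is used in place of the real lock file.

```shell
#!/bin/bash
# Hypothetical stand-ins for the start/stop/monitor branches of the script.
lock_file="${lock_file:-$(mktemp -u)}"

app_start()   { : > "$lock_file"; }       # task finished, create the flag
app_stop()    { rm -f "$lock_file"; }     # what offline and clean would do
app_monitor() { [ -f "$lock_file" ]; }    # 0 = online, 1 = offline

# Case 1: clean stop. The flag is removed, so monitor reports offline.
app_start
app_stop
app_monitor && clean_stop="online" || clean_stop="offline"

# Case 2: simulated crash. stop/clean never run, so the flag survives
# and monitor still reports online when the node comes back.
app_start
app_monitor && after_crash="online" || after_crash="offline"

echo "after clean stop: $clean_stop"
echo "after crash:      $after_crash"
rm -f "$lock_file"
```

The second case is the one that leads to the Concurrency Violation: nothing removed the flag, so the rebooted node reports the resource as online while it is also online elsewhere.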