
Inactive node is reporting that my resources have failed.

GeorgeC
Level 4

System Details
SUN T5140 running Solaris 10, s10s_u7wos_08 SPARC
I'm running Veritas Cluster File System HA 5.1.

I'm having a problem with my two node failover cluster.
I have a service group and its resources running on node 1; however, node 2 (where they are not running) is reporting that the resources have failed. I'm a bit confused as to why this is happening. It seems that my monitor program is running on node 2 when it shouldn't be. I have a few other questions as well.
Hopefully, this is a configuration/settings issue.

Here is a snippet from the /var/adm/messages file on node 2. This output is being generated by my monitoring program, /usr/local/bin/slstatus, which is called by the cluster. The same program runs on node 1, where the service group is running, and works normally there. If I fail the service group over to node 2, then node 1 starts reporting that the resources have failed.
root@net-log-02.ns.pitt.edu # tail -f /var/adm/messages
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22850]: [ID 702911 user.crit] Syslog process for eh-core-2 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22854]: [ID 702911 user.crit] Syslog process for fq-core-2 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22876]: [ID 702911 user.crit] Syslog process for fr-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22877]: [ID 702911 user.crit] Syslog process for bw-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22878]: [ID 702911 user.crit] Syslog process for gbg-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22879]: [ID 702911 user.crit] Syslog process for cl-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22880]: [ID 702911 user.crit] Syslog process for jhn-core-2 has failed
May 27 16:51:49 net-log-02.ns.pitt.edu xntpd[23386]: [ID 854739 daemon.info] synchronized to 136.142.5.75, stratum=2
May 27 16:51:47 net-log-02.ns.pitt.edu xntpd[23386]: [ID 774427 daemon.notice] time reset (step) -1.263488 s
May 27 16:51:47 net-log-02.ns.pitt.edu xntpd[23386]: [ID 204180 daemon.info] synchronisation lost
May 27 16:53:34 net-log-02.ns.pitt.edu SYSLOG-NG[23043]: [ID 702911 user.crit] Syslog process for rd-dev-core-514 has failed
May 27 16:55:52 net-log-02.ns.pitt.edu SYSLOG-NG[23221]: [ID 702911 user.crit] Syslog process for cl-core-2 has failed
May 27 16:56:25 net-log-02.ns.pitt.edu SYSLOG-NG[23284]: [ID 702911 user.crit] Syslog process for rd-wan3 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23330]: [ID 702911 user.crit] Syslog process for all-ios has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23331]: [ID 702911 user.crit] Syslog process for rd-dev-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23332]: [ID 702911 user.crit] Syslog process for all-asa has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23333]: [ID 702911 user.crit] Syslog process for ps-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23334]: [ID 702911 user.crit] Syslog process for rd-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23335]: [ID 702911 user.crit] Syslog process for sc-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23336]: [ID 702911 user.crit] Syslog process for bs795-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23337]: [ID 702911 user.crit] Syslog process for mc-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23338]: [ID 702911 user.crit] Syslog process for sc-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23345]: [ID 702911 user.crit] Syslog process for jhn-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23380]: [ID 702911 user.crit] Syslog process for fr-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23381]: [ID 702911 user.crit] Syslog process for gbg-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23382]: [ID 702911 user.crit] Syslog process for cl-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23383]: [ID 702911 user.crit] Syslog process for fq-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23385]: [ID 702911 user.crit] Syslog process for eh-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23384]: [ID 702911 user.crit] Syslog process for bw-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23387]: [ID 702911 user.crit] Syslog process for brd-core-1 has failed
May 27 16:57:10 net-log-02.ns.pitt.edu xntpd[23386]: [ID 854739 daemon.info] synchronized to 136.142.5.76, stratum=2
May 27 16:57:09 net-log-02.ns.pitt.edu xntpd[23386]: [ID 774427 daemon.notice] time reset (step) -1.053500 s
May 27 16:57:09 net-log-02.ns.pitt.edu xntpd[23386]: [ID 204180 daemon.info] synchronisation lost

 

15 REPLIES

vcs_man
Level 4
Employee Accredited Certified
Hi George,

Could you please send us your /etc/VRTSvcs/conf/config/main.cf file along with a snippet of your /var/VRTSvcs/log/engine_A.log file?
We would also like to know more details about your monitor script.

Thanks,
Mandar

Marianne
Level 6
Partner    VIP    Accredited Certified
I agree - we need to see your cluster config and cluster log.
The messages seem to be coming from SYSLOG-NG, not VCS.

GeorgeC
Level 4

The error messages are coming from my monitor script, /usr/local/bin/slstatus; I've included the script below. This script is run by VCS to test whether a syslog-ng process is running or not. The question is, why is it running on the inactive node at all?

/usr/local/bin/slstatus
#!/bin/sh
CONFIGDIR=/fwsm-logs/conf
LD_LIBRARY_PATH=/usr/sfw/lib:/usr/local/lib
DAEMON=/usr/local/sbin/syslog-ng
LOGGER=/usr/bin/logger
export LD_LIBRARY_PATH

# Function to log error messages to syslog
#
# Log <program> <severity> <text>
#
Log()
{
  $LOGGER -t SYSLOG-NG -i -p user.$1 "$2"
}
FWname=$1
INST=$1
CONFFILE=$FWname.conf
PIDFILE=/var/run/syslog-ng.$FWname.pid
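# VCS interprets the monitor exit code: 100 = resource offline, 110 = resource online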
/usr/ucb/ps -auwwx | grep syslog-ng | grep $INST > /dev/null 2>&1
RET=$?
if [ $RET -ne 0 ]; then
        if [ -f $PIDFILE ]; then
                rm $PIDFILE
        fi
        Log crit "Syslog process for $INST has failed"
        exit 100
fi
if [ ! -f  $PIDFILE ]; then
        Log crit "Pid file for Syslog process $INST is missing"
        exit 100
fi
exit 110

Marianne
Level 6
Partner    VIP    Accredited Certified

I still don't see this as a VCS problem.
Please post your main.cf as well as Engine_A log.

GeorgeC
Level 4
Marianne,
     Hmm. I'm not sure if I am explaining this correctly, so I will try again.
I have a failover service group set up that contains a VIP resource, a cluster mount point, and about 20 application resources that each run a syslog-ng process.

If the service group is running on node 1 and my monitor program on node 1 says all my processes are running fine, why is node 2 trying to monitor processes that it shouldn't be monitoring? My monitor script is only called by VCS; it is not run manually or by cron. VCS is the only application that runs this script. So why, when my service group is running on node 1, is node 2 monitoring processes that it should not be?

GeorgeC
Level 4
Mandar
     I have cleared all of my logs in /var/VRTSvcs/log/ and rebooted both of my cluster nodes. The logs I had were there since April and contained a lot of data from my testing of resources and scripts. I'm hoping that by letting the cluster run for a bit, the new logs will contain data that is useful for my problem.

GeorgeC
Level 4

Here is my main.cf. Basically, the cluster is running a number of syslog-ng processes that listen on different network ports (usually one syslog-ng per network device), all of which write to the same file, fwsm.log. I took the simple approach and set up one service group with many syslog-ng application resources.
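As a rough sketch of what one of those syslog-ng Application resources might look like in main.cf (the group name, node names, and start/stop scripts here are placeholders, not the real config; only slstatus and the fr-core-1 instance come from my setup):

group sysloggrp (
        SystemList = { net-log-01 = 0, net-log-02 = 1 }
        AutoStartList = { net-log-01 }
        )

        Application syslog_fr_core_1 (
                User = root
                StartProgram = "/usr/local/bin/slstart fr-core-1"
                StopProgram = "/usr/local/bin/slstop fr-core-1"
                MonitorProgram = "/usr/local/bin/slstatus fr-core-1"
                )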

 

GeorgeC
Level 4
Mandar,
     Here is the /var/VRTSvcs/log/engine_A.log from node 1, where my service group should be (and is) running. This is from after I rebooted both of my cluster nodes.
This is long..... Sorry.

GeorgeC
Level 4
Here is the /var/VRTSvcs/log/engine_A.log from node 2

Leigh_Brown
Level 3

Hi George,

VCS monitors resources on both nodes, at all times. This is normal behaviour.

So, I think you should remove the alerting from your monitor script and let VCS do the monitoring for you (that's its job, after all). You can use the VCS notification facilities to alert you if VCS detects a failure.
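As a rough sketch (the SMTP host and recipient below are placeholders; check the bundled agents guide for your release for the exact attributes), the NotifierMngr resource that drives VCS email/SNMP notification looks something like this in main.cf:

NotifierMngr ntfr (
        SmtpServer = "smtp.example.edu"
        SmtpRecipients = { "admin@example.edu" = Error }
        )

It is usually configured in the ClusterService group, and the notifier daemon it manages sends the alerts when VCS detects a resource fault, so your monitor script doesn't have to.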

Regards,

Leigh.

cshoesmith
Level 3
Employee
I agree with Leigh.

Take a look at the bundled agent "FileOnOnly" monitor as an example of how this is normally achieved. You are getting caught out by actively reporting during 'offline monitoring'. For VCS to monitor for concurrency violations, it needs to monitor the resource on all nodes in the resource's SystemList, irrespective of where it is online.

Example: /opt/VRTSvcs/bin/FileOnOnly/monitor

# start of monitor
#!/bin/sh
# REMOVED HEADER COMMENTS

RESNAME=$1
shift;

. "../ag_i18n_inc.sh";

VCSAG_GET_ATTR_VALUE "PathName" -1 1 "$@" ; PathName=${VCSAG_ATTR_VALUE};
if [ $? != $VCSAG_SUCCESS ] ; then exit $VCSAG_RES_UNKNOWN  ; fi;

if [ -z "$PathName" ]
then
   exit 100
else
   if [ -f $PathName ]; then exit 110;
   else exit 100;
   fi
fi

# end of monitor


This example should help you get over your issue.


Availability Products Unix Backline Support.
Sydney Australia.

avsrini
Level 4
Employee Accredited Certified
Hi George,

Regarding your question about why VCS monitors resources on all of the nodes configured in the service group:
VCS checks the status of the resources on all the configured nodes in the cluster to detect a "concurrency violation".

That is, if a resource is part of a failover service group, it is supposed to be online on only one node in the cluster at a time.

If someone manually brings the resource online on another node, without knowing it is already running elsewhere in the cluster, that can cause data corruption. VCS is therefore designed to check the status of resources on all the configured nodes in the cluster. If VCS detects that a resource has been brought online manually, it will call the clean entry point to offline the resource on the new node and prevent data corruption.

This concurrency-violation check does not apply to parallel service groups, because their resources are supposed to be online on all the configured nodes.
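As a quick way to see this from the command line, hares -state shows the per-system state of a resource. The resource and system names below are placeholders and the output is only illustrative:

hares -state syslog_fr_core_1
#Resource           Attribute   System       Value
syslog_fr_core_1    State       net-log-01   ONLINE
syslog_fr_core_1    State       net-log-02   OFFLINE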

Hope this clarifies your doubt.

Regards
Srini



lennart_norrby1
Not applicable
Partner

Hi George,
The double logging is a bug in the 5.1 version. There is an existing fix for it, but I don't think it is public yet. Contact support and open a case, and they will provide you with the fix.

Regards

Lennart Norrby

GeorgeC
Level 4

Leigh,
    My cluster background is with Solaris Cluster. Under SC, with a failover service group, the monitor script would run on only the active node. As you and several others pointed out, under VCS it runs on all nodes in the cluster. This is actually a better method, since with Solaris Cluster it is entirely possible to start up a resource manually on the inactive node and hammer a file system.

Thank you for the explanation.

GeorgeC
Level 4

It was explained to me that monitoring under VCS takes place on all nodes of the cluster. This is to check for and guard against concurrency violations, among other things. My script, which does its own logging via the syslog facility, was reporting that my resources were offline on the inactive node (which is correct, by the way, since it is a failover resource that was running on the other node). The simple solution is to disable logging from within my script and let VCS handle the alerts to /var/adm/messages.
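For illustration, here is a stripped-down version of the monitor along those lines: the same checks as my script above, but silent, leaving all alerting to VCS.

#!/bin/sh
# Quiet VCS monitor: report status via exit code only
# (100 = resource offline, 110 = resource online); no syslog messages from here.
INST=$1
PIDFILE=/var/run/syslog-ng.$INST.pid

/usr/ucb/ps -auwwx | grep syslog-ng | grep $INST > /dev/null 2>&1
if [ $? -ne 0 ]; then
        [ -f $PIDFILE ] && rm $PIDFILE
        exit 100
fi
if [ ! -f $PIDFILE ]; then
        exit 100
fi
exit 110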

Thank you, one and all, for your help and replies.
George