
Inactive node is reporting that my resources have failed.

GeorgeC
Level 4

System Details
SUN T5140 running Solaris 10, s10s_u7wos_08 SPARC
I'm running Veritas Cluster File System HA 5.1.

I'm having a problem with my two node failover cluster.
I have a service group and its resources running on node 1; however, node 2 (where they are not running) is reporting that the resources have failed. I'm a bit confused as to why this is happening. It seems that my monitor program is running on node 2 when it shouldn't be. I have a few other questions as well.
Hopefully, this is a configuration/settings issue.

Here is a snippet from the /var/adm/messages file on node 2. This output is being generated by my monitoring program, /usr/local/bin/slstatus, which is called by the cluster. The same program runs on node 1, where the service group is running, and works normally there. If I fail the service group over to node 2, then node 1 starts reporting that the resources have failed.
root@net-log-02.ns.pitt.edu # tail -f /var/adm/messages
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22850]: [ID 702911 user.crit] Syslog process for eh-core-2 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22854]: [ID 702911 user.crit] Syslog process for fq-core-2 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22876]: [ID 702911 user.crit] Syslog process for fr-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22877]: [ID 702911 user.crit] Syslog process for bw-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22878]: [ID 702911 user.crit] Syslog process for gbg-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22879]: [ID 702911 user.crit] Syslog process for cl-core-1 has failed
May 27 16:51:29 net-log-02.ns.pitt.edu SYSLOG-NG[22880]: [ID 702911 user.crit] Syslog process for jhn-core-2 has failed
May 27 16:51:49 net-log-02.ns.pitt.edu xntpd[23386]: [ID 854739 daemon.info] synchronized to 136.142.5.75, stratum=2
May 27 16:51:47 net-log-02.ns.pitt.edu xntpd[23386]: [ID 774427 daemon.notice] time reset (step) -1.263488 s
May 27 16:51:47 net-log-02.ns.pitt.edu xntpd[23386]: [ID 204180 daemon.info] synchronisation lost
May 27 16:53:34 net-log-02.ns.pitt.edu SYSLOG-NG[23043]: [ID 702911 user.crit] Syslog process for rd-dev-core-514 has failed
May 27 16:55:52 net-log-02.ns.pitt.edu SYSLOG-NG[23221]: [ID 702911 user.crit] Syslog process for cl-core-2 has failed
May 27 16:56:25 net-log-02.ns.pitt.edu SYSLOG-NG[23284]: [ID 702911 user.crit] Syslog process for rd-wan3 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23330]: [ID 702911 user.crit] Syslog process for all-ios has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23331]: [ID 702911 user.crit] Syslog process for rd-dev-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23332]: [ID 702911 user.crit] Syslog process for all-asa has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23333]: [ID 702911 user.crit] Syslog process for ps-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23334]: [ID 702911 user.crit] Syslog process for rd-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23335]: [ID 702911 user.crit] Syslog process for sc-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23336]: [ID 702911 user.crit] Syslog process for bs795-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23337]: [ID 702911 user.crit] Syslog process for mc-core-1 has failed
May 27 16:56:26 net-log-02.ns.pitt.edu SYSLOG-NG[23338]: [ID 702911 user.crit] Syslog process for sc-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23345]: [ID 702911 user.crit] Syslog process for jhn-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23380]: [ID 702911 user.crit] Syslog process for fr-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23381]: [ID 702911 user.crit] Syslog process for gbg-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23382]: [ID 702911 user.crit] Syslog process for cl-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23383]: [ID 702911 user.crit] Syslog process for fq-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23385]: [ID 702911 user.crit] Syslog process for eh-core-2 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23384]: [ID 702911 user.crit] Syslog process for bw-core-1 has failed
May 27 16:56:27 net-log-02.ns.pitt.edu SYSLOG-NG[23387]: [ID 702911 user.crit] Syslog process for brd-core-1 has failed
May 27 16:57:10 net-log-02.ns.pitt.edu xntpd[23386]: [ID 854739 daemon.info] synchronized to 136.142.5.76, stratum=2
May 27 16:57:09 net-log-02.ns.pitt.edu xntpd[23386]: [ID 774427 daemon.notice] time reset (step) -1.053500 s
May 27 16:57:09 net-log-02.ns.pitt.edu xntpd[23386]: [ID 204180 daemon.info] synchronisation lost

 

15 REPLIES

vcs_man
Level 4
Employee Accredited Certified
Hi George,

Could you please send us your /etc/VRTSvcs/conf/config/main.cf file along with a snippet of your /var/VRTSvcs/log/engine_A.log file?
We would also like to know more details about your monitor script.

Thanks,
Mandar

Marianne
Level 6
Partner    VIP    Accredited Certified
I agree - we need to see your cluster config and cluster log.
The messages seem to be coming from SYSLOG-NG, not VCS.

GeorgeC
Level 4

The error messages are coming from my monitor script, /usr/local/bin/slstatus; I've included the script below. This script is run by VCS to test whether a syslog-ng process is running or not. The question is, why is it running on the inactive node at all?

/usr/local/bin/slstatus
#!/bin/sh
CONFIGDIR=/fwsm-logs/conf
LD_LIBRARY_PATH=/usr/sfw/lib:/usr/local/lib
DAEMON=/usr/local/sbin/syslog-ng
LOGGER=/usr/bin/logger
export LD_LIBRARY_PATH

# Function to log error messages to syslog
#
# Log <program> <severity> <text>
#
Log()
{
  $LOGGER -t SYSLOG-NG -i -p user.$1 "$2"
}
FWname=$1
INST=$1
CONFFILE=$FWname.conf
PIDFILE=/var/run/syslog-ng.$FWname.pid
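# VCS interprets the monitor exit code: 100 = resource offline, 110 = resource online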
/usr/ucb/ps -auwwx | grep syslog-ng | grep $INST > /dev/null 2>&1
RET=$?
if [ $RET -ne 0 ]; then
        if [ -f $PIDFILE ]; then
                rm $PIDFILE
        fi
        Log crit "Syslog process for $INST has failed"
        exit 100
fi
if [ ! -f  $PIDFILE ]; then
        Log crit "Pid file for Syslog process $INST is missing"
        exit 100
fi
exit 110

Marianne
Level 6
Partner    VIP    Accredited Certified

I still don't see this as a VCS problem.
Please post your main.cf as well as Engine_A log.

GeorgeC
Level 4
Marianne,
     Hmm. I'm not sure if I am explaining this correctly, so I will try again.
I have a failover service group set up that contains a VIP resource, a cluster mount point, and about 20 application resources that each run a syslog-ng process.

If the service group is running on node 1 and my monitor program on node 1 says all my processes are running fine, why is node 2 trying to monitor processes that it shouldn't be monitoring? My monitor script is only called by VCS; it is not run manually or by cron. VCS is the only application that runs this script. So why, when my service group is running on node 1, is node 2 monitoring processes that it should not be?

GeorgeC
Level 4
Mandar
     I have cleared all of my logs in /var/VRTSvcs/log/ and rebooted both of my cluster nodes. The logs I had were there since April and contained a lot of data from my testing of resources and scripts. I'm hoping that by letting the cluster run for a bit, the new logs will contain data that is useful for my problem.

GeorgeC
Level 4

Here is my main.cf. Basically, the cluster is running a number of syslog-ng processes that listen on different network ports (usually one syslog-ng per network device), all of which write to the same file, fwsm.log. I took the simple approach and set up one service group with many syslog-ng application resources.
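As a rough sketch of what one of those syslog-ng Application resources might look like in main.cf (the group name, node names, and start/stop scripts here are placeholders, not the real config; only slstatus and the fr-core-1 instance come from my setup):

group sysloggrp (
        SystemList = { net-log-01 = 0, net-log-02 = 1 }
        AutoStartList = { net-log-01 }
        )

        Application syslog_fr_core_1 (
                User = root
                StartProgram = "/usr/local/bin/slstart fr-core-1"
                StopProgram = "/usr/local/bin/slstop fr-core-1"
                MonitorProgram = "/usr/local/bin/slstatus fr-core-1"
                )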

 

GeorgeC
Level 4
Mandar,
     Here is the /var/VRTSvcs/log/engine_A.log from node 1, where my service group should be (and is) running. This is from after I rebooted both of my cluster nodes.
This is long..... Sorry.

GeorgeC
Level 4
Here is the /var/VRTSvcs/log/engine_A.log from node 2

Leigh_Brown
Level 3

Hi George,

VCS monitors resources on both nodes, at all times. This is normal behaviour.

So, I think you should remove the alerting from your monitor script and let VCS do the monitoring for you (that's its job, after all). You can use the VCS notification facilities to alert you if VCS detects a failure.
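As a rough sketch (the SMTP host and recipient below are placeholders; check the bundled agents guide for your release for the exact attributes), the NotifierMngr resource that drives VCS email/SNMP notification looks something like this in main.cf:

NotifierMngr ntfr (
        SmtpServer = "smtp.example.edu"
        SmtpRecipients = { "admin@example.edu" = Error }
        )

It is usually configured in the ClusterService group, and the notifier daemon it manages sends the alerts when VCS detects a resource fault, so your monitor script doesn't have to.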

Regards,

Leigh.

cshoesmith
Level 3
Employee
I agree with Leigh.

Take a look at the bundled agent "FileOnOnly" monitor as an example of how this is normally achieved. You are getting caught out by actively reporting during 'offline monitoring'. For VCS to monitor for concurrency violations, it needs to monitor the resource on all nodes in the resource's SystemList, irrespective of where it is online.

Example: /opt/VRTSvcs/bin/FileOnOnly/monitor

# start of monitor
#!/bin/sh
# REMOVED HEADER COMMENTS

RESNAME=$1
shift;

. "../ag_i18n_inc.sh";

VCSAG_GET_ATTR_VALUE "PathName" -1 1 "$@" ; PathName=${VCSAG_ATTR_VALUE};
if [ $? != $VCSAG_SUCCESS ] ; then exit $VCSAG_RES_UNKNOWN  ; fi;

if [ -z "$PathName" ]
then
   exit 100
else
   if [ -f $PathName ]; then exit 110;
   else exit 100;
   fi
fi

# end of monitor


This example should help you get over your issue.


Availability Products Unix Backline Support.
Sydney Australia.

avsrini
Level 4
Employee Accredited Certified
Hi George,

Regarding your question about why VCS monitors resources on all of the nodes configured in the service group:
VCS checks the status of the resources on all the configured nodes in the cluster to detect a "concurrency violation".

That is, if a resource is part of a failover service group, it is supposed to be online on only one node in the cluster at a time.

If someone manually brings the resource online on another node, without knowing it is already running elsewhere in the cluster, that can cause data corruption. VCS is therefore designed to check the status of resources on all the configured nodes in the cluster. If VCS detects that a resource has been brought online manually, it will call the clean entry point to offline the resource on the new node and prevent data corruption.

This concurrency-violation check does not apply to parallel service groups, because their resources are supposed to be online on all the configured nodes.
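As a quick way to see this from the command line, hares -state shows the per-system state of a resource. The resource and system names below are placeholders and the output is only illustrative:

hares -state syslog_fr_core_1
#Resource           Attribute   System       Value
syslog_fr_core_1    State       net-log-01   ONLINE
syslog_fr_core_1    State       net-log-02   OFFLINE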

Hope this clarifies your doubt.

Regards
Srini



lennart_norrby1
Not applicable
Partner

Hi George,
The double logging is a bug in the 5.1 version. There is an existing fix for it, but I don't think it is public yet. Contact support and open a case, and they will provide you with the fix.

Regards

Lennart Norrby

GeorgeC
Level 4

Leigh,
    My cluster background is with Solaris Cluster. Under SC, with a failover service group, the monitor script would run on only the active node. As you and several others pointed out, under VCS it runs on all nodes in the cluster. This is actually a better method, since with Solaris Cluster it is entirely possible to start up a resource manually on the inactive node and hammer a file system.

Thank you for the explanation.

GeorgeC
Level 4

It was explained to me that monitoring under VCS takes place on all nodes of the cluster. This is to check for and guard against concurrency violations, among other things. My script, which does its own logging via the syslog facility, was reporting that my resources were offline on the inactive node (which is correct, by the way, since it is a failover resource that was running on the other node). The simple solution is to disable logging from within my script and let VCS handle the alerts to /var/adm/messages.
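For illustration, here is a stripped-down version of the monitor along those lines: the same checks as my script above, but silent, leaving all alerting to VCS.

#!/bin/sh
# Quiet VCS monitor: report status via exit code only
# (100 = resource offline, 110 = resource online); no syslog messages from here.
INST=$1
PIDFILE=/var/run/syslog-ng.$INST.pid

/usr/ucb/ps -auwwx | grep syslog-ng | grep $INST > /dev/null 2>&1
if [ $? -ne 0 ]; then
        [ -f $PIDFILE ] && rm $PIDFILE
        exit 100
fi
if [ ! -f $PIDFILE ]; then
        exit 100
fi
exit 110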

Thank you, one and all, for your help and replies.
George