Forum Discussion

Shivam_HCL
Level 3
12 years ago

Service group does not fail over to another node on forced power-down

VCS 6.0.1

Hi, I have configured a two-node cluster with local storage, running two service groups. Both run fine, and I can switch them over to either node in the cluster. However, when I forcibly power down the node on which both service groups are active, only one service group fails over to the other node; the one running the Apache resource faults and does not fail over.

The contents of the main.cf file are pasted below.

==========================================

cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"
 
cluster mycluster (
        UserNames = { admin = IJKcJEjGKfKKiSKeJH, root = ejkEjiIhjKjeJh }
        ClusterAddress = "192.168.25.101"
        Administrators = { admin, root }
        )
 
system server3 (
        )
 
system server4 (
        )
 
group ClusterService (
        SystemList = { server3 = 0, server4 = 1 }
        AutoStartList = { server3, server4 }
        OnlineRetryLimit = 3
        OnlineRetryInterval = 120
        )
 
        IP webip (
                Device = eth0
                Address = "192.168.25.101"
                NetMask = "255.255.255.0"
                )
 
        NIC csgnic (
                Device = eth0
                )
 
        webip requires csgnic
 
 
        // resource dependency tree
        //
        //      group ClusterService
        //      {
        //      IP webip
        //          {
        //          NIC csgnic
        //          }
        //      }
 
 
group httpsg (
        SystemList = { server3 = 0, server4 = 1 }
        AutoStartList = { server3, server4 }
        OnlineRetryLimit = 3
        OnlineRetryInterval = 15
        )
 
        Apache apachenew (
                httpdDir = "/usr/sbin"
                ConfigFile = "/etc/httpd/conf/httpd.conf"
                )
 
        IP ipresource (
                Device = eth0
                Address = "192.168.25.102"
                NetMask = "255.255.255.0"
                )
 
        apachenew requires ipresource
 
 
        // resource dependency tree
        //
        //      group httpsg
        //      {
        //      Apache apachenew
        //          {
        //          IP ipresource
        //          }
        //      }
=====================

The engine log at the time of the power-down shows:


2013/03/12 16:33:02 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 16:33:02 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x1, DDNA: 0x0
2013/03/12 16:33:02 VCS ERROR V-16-1-10079 System server4 (Node '1') is in Down State - Membership: 0x1
2013/03/12 16:33:02 VCS ERROR V-16-1-10322 System server4 (Node '1') changed state from RUNNING to FAULTED
2013/03/12 16:33:02 VCS NOTICE V-16-1-10449 Group httpsg autodisabled on node server4 until it is probed
2013/03/12 16:33:02 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node server4 until it is probed
2013/03/12 16:33:02 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 16:33:02 VCS NOTICE V-16-1-10446 Group httpsg is offline on system server4
2013/03/12 16:33:02 VCS ERROR V-16-1-10205 Group ClusterService is faulted on system server4
2013/03/12 16:33:02 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 16:33:02 VCS INFO V-16-1-10493 Evaluating server3 as potential target node for group ClusterService
2013/03/12 16:33:02 VCS INFO V-16-1-10493 Evaluating server4 as potential target node for group ClusterService
2013/03/12 16:33:02 VCS INFO V-16-1-10494 System server4 not in RUNNING state
2013/03/12 16:33:02 VCS NOTICE V-16-1-10301 Initiating Online of Resource webip (Owner: Unspecified, Group: ClusterService) on System server3
2013/03/12 16:33:02 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP; Current status =eth1, DOWN.
2013/03/12 16:33:02 VCS INFO V-16-6-15015 (server3) hatrigger:/opt/VRTSvcs/bin/triggers/sysoffline is not a trigger scripts directory or can not be executed
2013/03/12 16:33:14 VCS INFO V-16-1-10298 Resource webip (Owner: Unspecified, Group: ClusterService) is online on server3 (VCS initiated)
2013/03/12 16:33:14 VCS NOTICE V-16-1-10447 Group ClusterService is online on system server3
 
As the logs above show, the default service group ClusterService failed over to the other node, but the service group httpsg failed.

Please suggest a fix.
 
Thanks....
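
The engine log quoted above can also be summarized mechanically. The following is only a minimal sketch: it assumes the message IDs V-16-1-10449 (group autodisabled) and V-16-1-10447 (group online) always have the wording seen in this log, and the function name `summarize` is hypothetical. It separates the groups that were autodisabled on the lost node from the groups that actually came online elsewhere.

```python
import re

# Sample lines copied from the engine log quoted in the post.
ENGINE_LOG = """\
2013/03/12 16:33:02 VCS NOTICE V-16-1-10449 Group httpsg autodisabled on node server4 until it is probed
2013/03/12 16:33:02 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node server4 until it is probed
2013/03/12 16:33:02 VCS ERROR V-16-1-10205 Group ClusterService is faulted on system server4
2013/03/12 16:33:14 VCS NOTICE V-16-1-10447 Group ClusterService is online on system server3
"""

# Patterns are based only on the wording seen in this log (an assumption).
AUTODISABLED = re.compile(r"V-16-1-10449 Group (\S+) autodisabled on node (\S+)")
ONLINE = re.compile(r"V-16-1-10447 Group (\S+) is online on system (\S+)")


def summarize(log_text):
    """Return (autodisabled, online) lists of (group, system) tuples."""
    autodisabled, online = [], []
    for line in log_text.splitlines():
        match = AUTODISABLED.search(line)
        if match:
            autodisabled.append(match.groups())
            continue
        match = ONLINE.search(line)
        if match:
            online.append(match.groups())
    return autodisabled, online


autodisabled, online = summarize(ENGINE_LOG)
for group, system in autodisabled:
    print(f"{group}: autodisabled on {system}, not failed over")
for group, system in online:
    print(f"{group}: online on {system}")
```

In this log, httpsg appears only in the autodisabled list, while ClusterService is the only group reported online on server3.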
 
 
  • My guess is that you have defined only one heartbeat link (eth1) in /etc/llttab, and that is why the group does not fail over. If that is the case, adding eth0 as a low-priority (lowpri) heartbeat will resolve your issue. Can you provide the contents of /etc/llttab?

    Mike

  • Just missed your email. I can now see that you have defined only one heartbeat, which means VCS cannot distinguish between an eth1 failure and a system failure, and therefore it will not fail over any service groups (apart from ClusterService). You need at least 2 heartbeats. In a live cluster they need to be independent (i.e. not two ports on a dual-port card), although a dual-port card is OK for testing.

    Mike

  • Sure, Mike,

    [root@server3 log]# cat /etc/llttab
    set-node server3
    set-cluster 36349
    link eth1 eth-00:0c:29:76:c4:c0 - ether - -
    [root@server3 log]#
     
    [root@server4 ~]# cat /etc/llttab
    set-node server4
    set-cluster 36349
    link eth1 eth-00:0c:29:02:73:38 - ether - -
    [root@server4 ~]#
     
    Sorry if this question sounds strange, but how does having only one heartbeat affect failover when I pull the power cable out of a running server, which powers the server down completely and removes it from the cluster? Even if I configure two heartbeats, pulling the cable would cause both links to fail.
     
    Shivam

     

  • Hi Shivam,

    VCS clustering is designed to have 2 or more heartbeats.  If you are down to a single heartbeat, the cluster goes into jeopardy membership and assumes that the node has not faulted but that some other issue is going on.  Because the cause of the failure is unknown, VCS decides to do nothing.

    However, if 2 or more heartbeats are lost at the same time, VCS decides that the node is dead and marks service groups on that node as "Faulted".  The surviving nodes then attempt to online the faulted service groups.

    Basically, if there is only 1 heartbeat then VCS will not react when it is lost.  However, if there are 2 or more heartbeats that are lost at the same time, then VCS will react because it assumes that the node is dead.

    In your case, you need to configure a second heartbeat so that VCS comes out of Jeopardy membership.  Then it will react when the power is pulled from the active node.

    Thank you,

    Wally

  • If your 2 heartbeats are independent, they should never fail at the same time. So if a node sees both heartbeat links go down together, it assumes the other node has gone down, because with truly independent heartbeats a simultaneous failure is very unlikely.  With only one heartbeat, when a node sees that it is gone, it has no way of telling whether the link went down (e.g. a NIC or switch failure) or the node went down.  This is why the cluster goes into jeopardy when you have only one link.

    Mike
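
The rule Wally and Mike describe can be written down as a small decision function. This is a toy model of the explanation in this thread, not VCS code: the function name `react_to_link_loss` and the result strings are hypothetical, and the special handling of ClusterService mentioned above is ignored.

```python
# Hypothetical result labels for this sketch.
NORMAL = "normal membership"
JEOPARDY = "jeopardy: peer still visible over one link"
FAILOVER = "peer assumed dead: fail over its service groups"
NO_FAILOVER = "cause unknown: no failover"


def react_to_link_loss(links_before, links_lost_together):
    """Model the reaction when some heartbeat links to a peer drop at once."""
    if links_before < 1 or not 0 <= links_lost_together <= links_before:
        raise ValueError("invalid link counts")
    remaining = links_before - links_lost_together
    if remaining >= 2:
        return NORMAL
    if remaining == 1:
        return JEOPARDY
    # No links remain. Losing two or more links at the same moment is
    # treated as a node failure; losing the only link is ambiguous
    # (NIC, switch, or node), so nothing is failed over.
    return FAILOVER if links_before >= 2 else NO_FAILOVER


for before, lost in [(1, 1), (2, 2), (2, 1)]:
    print(f"{before} link(s), {lost} lost together -> "
          f"{react_to_link_loss(before, lost)}")
```

The first case, (1, 1), is the configuration in the original post: the only link disappears, so the model reports no failover. The second, (2, 2), is the same power-off with two links configured.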

  • Thanks Mike & Wally.

     

    Creating another heartbeat link with the method below solved the issue.

     

    added this line in /etc/llttab file. (Taken help from 

    link-lowpri eth0 eth-00:50:56:91:03:30 - ether - -

    using the MAC address of eth0 on each system.

     

    Thanks a lot to all for the quick help.

    Shivam
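
A quick way to catch this configuration before testing a failover is to count the heartbeat links in /etc/llttab. The sketch below is illustrative only: the helper name `heartbeat_links` is hypothetical, it treats lines starting with `#` as comments, and the "after" text simply appends the `link-lowpri` line quoted above to server3's file, although each system would use the MAC address of its own eth0.

```python
# server3's /etc/llttab as posted in the thread.
LLTTAB_BEFORE = """\
set-node server3
set-cluster 36349
link eth1 eth-00:0c:29:76:c4:c0 - ether - -
"""

# Illustration only: the link-lowpri line from the fix, appended as-is.
LLTTAB_AFTER = LLTTAB_BEFORE + "link-lowpri eth0 eth-00:50:56:91:03:30 - ether - -\n"


def heartbeat_links(llttab_text):
    """Return (directive, tag) pairs for each link / link-lowpri line."""
    links = []
    for raw in llttab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        fields = line.split()
        if fields[0] in ("link", "link-lowpri") and len(fields) >= 2:
            links.append((fields[0], fields[1]))
    return links


for name, text in [("before", LLTTAB_BEFORE), ("after", LLTTAB_AFTER)]:
    links = heartbeat_links(text)
    status = "OK" if len(links) >= 2 else "only one heartbeat link"
    print(f"{name}: {links} -> {status}")
```

To check a live node, the same function could be given the contents of /etc/llttab read from disk.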