
CVM master node does not join cluster following a reboot

timus
Level 4

Problem description -

Before reboot -

[root@node1 ~]# vxdctl -c mode
mode: enabled: cluster active - MASTER
master: node1

[root@node2 ~]# vxdctl -c mode
mode: enabled: cluster active - SLAVE
master: node1
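As an aside, the role line in that output can be parsed mechanically when scripting health checks; a minimal sketch with the command output embedded as a string (so it runs without a live cluster — on a real node you would pipe `vxdctl -c mode` in instead):

```shell
# Extract the MASTER/SLAVE role from `vxdctl -c mode` output.
role_of() {
    printf '%s\n' "$1" | sed -n 's/.*cluster active - \(.*\)$/\1/p'
}

out='mode: enabled: cluster active - MASTER
master: node1'
role_of "$out"   # prints: MASTER
```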

In a two-node cluster, if the CVM master (node1) is rebooted, the second node (node2) takes over as master, but the rebooted node (node1) is not able to rejoin the cluster. The following error appears and the CVMCluster resource goes into a FAULTED state:

 VCS ERROR V-16-10031-1005 (node1) CVMCluster:???:monitor:node - state: out of cluster
reason: Cannot find disk on slave node: retry to add a node failed

 

However, if I do a hastop -all on node2 (which is now master) and then a hastart on node1 followed by a hastart on node2, both nodes join the cluster successfully.

Please find the cluster configuration below for your reference:

***********************************************

group cvm (
        SystemList = { node1 = 0, node2 = 1 }
        AutoFailOver = 0
        Parallel = 1
        AutoStartList = { node1, node2 }
        )

        CFSfsckd vxfsckd (
                ActivationMode @node1 = { vgCFSwcmctprd = sw }
                ActivationMode @node2 = { vgCFSwcmctprd = sw }
                )

        CVMCluster cvm_clus (
                CVMClustName = wcmct572
                CVMNodeId = { node1 = 1, node2 = 0 }
                CVMTransport = gab
                CVMTimeout = 200
                )

        CVMVxconfigd cvm_vxconfigd (
                Critical = 0
                CVMVxconfigdArgs = { syslog }
                )

        cvm_clus requires cvm_vxconfigd
        vxfsckd requires cvm_clus

***********************************

Also find below the output of lltstat and vxclustadm nidmap:

[root@node1 ~]# lltstat -nvv|head
LLT node information:
    Node                 State    Link  Status  Address
     0 node2              OPEN
                                  eth1   UP      00:17:A4:77:18:5A
                                  eth2   UP      00:17:A4:77:18:5C
                                  eth0   UP      00:17:A4:77:18:58
   * 1 node1              OPEN
                                  eth1   UP      00:17:A4:77:1C:5A
                                  eth2   UP      00:17:A4:77:1C:5C
                                  eth0   UP      00:17:A4:77:1C:58

 

[root@node1 ~]# vxclustadm nidmap
Name                             CVM Nid    CM Nid     State
node1                            0          1          Joined: Master
node2                            1          0          Joined: Slave

 

Can someone please tell me the reason for this? Is there a problem in my CVM configuration?
Is it because of a node ID mismatch? If so, what are the steps to resolve it?


Gaurav_S
Moderator

Hello,

As you can see from the error message above, I believe there is an issue with the shared disk count.

Remember, CVM is very particular about shared disks: the number of shared disks seen by both nodes must be equal, or the node will not join the CVM cluster. The error above indicates exactly that.

Check this on both nodes:

# vxdisk list | grep -i shared | wc -l

 

Unless the count is equal on both nodes, the CVM cluster will not form.
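Since the counts have to be compared across nodes, the check can be scripted; a minimal sketch (the ssh/vxdisk lines are illustrative only, assuming hypothetical hostnames and passwordless ssh, so just the comparison itself is exercised here):

```shell
# Compare the shared-disk counts reported by each node; a mismatch is
# exactly the condition that blocks a CVM join.
count_matches() {
    # $1 = count on node1, $2 = count on node2
    if [ "$1" -eq "$2" ]; then
        echo "shared disk counts match ($1)"
    else
        echo "MISMATCH: node1=$1 node2=$2 - CVM join will fail"
    fi
}

# In practice the counts would come from each node, e.g.:
#   n1=$(ssh node1 "vxdisk list | grep -i shared | wc -l")
#   n2=$(ssh node2 "vxdisk list | grep -i shared | wc -l")
#   count_matches "$n1" "$n2"
count_matches 16 16   # prints: shared disk counts match (16)
```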

 

Gaurav

timus
Level 4

Hello Gaurav,

 

The number of shared disks on both nodes is the same:

# vxdisk list | grep -i shared | wc -l
16

If there were a mismatch in the number of shared disks, the cluster would never have formed at all, which is not the case in my scenario. Please see the description above:

"However, if I do a hastop -all on node2 (which is now master) and then a hastart on node1 followed by a hastart on node2, both nodes join the cluster successfully."

Gaurav_S
Moderator

Hi,

I saw your note; my point was to ask: when node1 is trying to rejoin the cluster (and subsequently fails), have you checked the shared disk count at that time? Is it still the same?

OK, so when you do a hastop -all and then start had on node1 first, both nodes join successfully. What is different in this case is that you are doing a local build of the cluster configuration from node1. In any case, is there a difference between the main.cf files on node1 and node2? If the cluster configuration was saved properly they should be identical, but double-check anyway.

Next, do you use I/O fencing? Do you see any other disk-related error messages before these node-join errors?

Any hard errors or transport errors on any of the disks?

 

G

timus
Level 4

Please find my answers below:

When node1 is trying to rejoin the cluster (and subsequently fails), have you checked the shared disk count at that time? Is it still the same?

- Yes, the disk count is still the same when the node tries to rejoin the cluster and fails.

 

Is there a difference between the main.cf files on node1 and node2?

- No difference in main.cf on either node.

Next, do you use I/O fencing?

- No, fencing is in disabled mode:

# vxfenadm -d

I/O Fencing Cluster Information:
================================

 Fencing Protocol Version: 201
 Fencing Mode: Disabled
 Cluster Members:

Do you see any more disk-related error messages before these node-join errors?

- No such errors.

One more observation: if I restart node1 (the original master node) first and then node2, everything goes fine and the cluster comes up successfully.

As I mentioned above, does it have something to do with the node ID mismatch?

[root@node1 ~]# vxclustadm nidmap
Name                             CVM Nid    CM Nid     State
node1                            0          1          Joined: Master
node2                            1          0          Joined: Slave
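The CVM NID vs CM NID mismatch being asked about can be spotted mechanically; a minimal sketch with the nidmap output embedded as a string (so it runs without a live cluster — on a real node you would pipe `vxclustadm nidmap` in instead):

```shell
# Print the name of every node whose CVM NID (column 2) differs from
# its CM NID (column 3).
nidmap='Name                             CVM Nid    CM Nid     State
node1                            0          1          Joined: Master
node2                            1          0          Joined: Slave'

printf '%s\n' "$nidmap" | awk 'NR > 1 && $2 != $3 { print $1 }'
# prints: node1
#         node2
```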

timus
Level 4

JFYI -

#cat /etc/llthosts
0 node2
1 node1

whereas in main.cf the node IDs are reversed. Can this be the culprit? Is this kind of configuration proper?

group cvm (
        SystemList = { node1 = 0, node2 = 1 }

Gaurav_S
Moderator

I don't think the above has any role in this issue, unless the entries in /etc/llthosts differ between the nodes. If both hosts have the same contents, it is not a problem.

Regarding the CM NID and CVM NID: they can differ, and SystemList just provides priority. It shouldn't make a difference to the CVM cluster join, since main.cf is the same on both nodes.
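Checking that /etc/llthosts matches across nodes can be scripted; a minimal sketch (the ssh line is illustrative only, assuming hypothetical hostname node2 and passwordless ssh, so just the file comparison itself is exercised here):

```shell
# Compare two copies of /etc/llthosts; cmp -s exits 0 when the files
# are byte-identical, which is what LLT requires across the cluster.
llthosts_same() {
    # $1 = local copy, $2 = copy fetched from the peer node
    if cmp -s "$1" "$2"; then
        echo "llthosts consistent"
    else
        echo "llthosts DIFFER between nodes"
    fi
}

# In practice the peer copy would be fetched first, e.g.:
#   ssh node2 cat /etc/llthosts > /tmp/llthosts.node2
#   llthosts_same /etc/llthosts /tmp/llthosts.node2
```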

 

Gaurav

timus
Level 4

What is the probable cause of this issue? Any help or suggestions are highly appreciated.

Gaurav_S
Moderator

Well, I can't say right now what the issue is, because it's strange behaviour.

 

Can you attach the main.cf file and the engine_A.log from the day this issue happened?

 

G