04-14-2011 04:09 AM
Problem description:
Before reboot:
[root@node1 ~]# vxdctl -c mode
mode: enabled: cluster active - MASTER
master: node1
[root@node2 ~]# vxdctl -c mode
mode: enabled: cluster active - SLAVE
master: node2
In a 2-node cluster, if the CVM master (node1) is rebooted, the second node (node2) takes over the master role, but the rebooted node (node1) is unable to rejoin the cluster. The following error message appears and the CVMCluster resource goes into a FAULTED state:
VCS ERROR V-16-10031-1005 (node1) CVMCluster:???:monitor:node - state: out of cluster
reason: Cannot find disk on slave node: retry to add a node failed
However, if I run hastop -all on node2 (which is now master) and then hastart on node1, followed by hastart on node2, both nodes join the cluster successfully.
Please find the cluster configuration below for reference:
***********************************************
group cvm (
	SystemList = { node1 = 0, node2 = 1 }
	AutoFailOver = 0
	Parallel = 1
	AutoStartList = { node1, node2 }
	)

CFSfsckd vxfsckd (
	ActivationMode @node1 = { vgCFSwcmctprd = sw }
	ActivationMode @node2 = { vgCFSwcmctprd = sw }
	)

CVMCluster cvm_clus (
	CVMClustName = wcmct572
	CVMNodeId = { node1 = 1, node2 = 0 }
	CVMTransport = gab
	CVMTimeout = 200
	)

CVMVxconfigd cvm_vxconfigd (
	Critical = 0
	CVMVxconfigdArgs = { syslog }
	)

cvm_clus requires cvm_vxconfigd
vxfsckd requires cvm_clus
***********************************
Also find below the output of lltstat and vxclustadm nidmap:
[root@node1 ~]# lltstat -nvv|head
LLT node information:
Node State Link Status Address
  0 node2   OPEN
      eth1   UP   00:17:A4:77:18:5A
      eth2   UP   00:17:A4:77:18:5C
      eth0   UP   00:17:A4:77:18:58
* 1 node1   OPEN
      eth1   UP   00:17:A4:77:1C:5A
      eth2   UP   00:17:A4:77:1C:5C
      eth0   UP   00:17:A4:77:1C:58
[root@node1 ~]# vxclustadm nidmap
Name CVM Nid CM Nid State
node1 0 1 Joined: Master
node2 1 0 Joined: Slave
Can someone please tell me the reason for this? Is there a problem in my CVM configuration?
Is it because of a node-id mismatch? If so, what are the steps to resolve it?
04-14-2011 04:13 AM
Hello,
As the error message above shows, I believe there is an issue with the shared disk count.
Remember, CVM is very particular about shared disks: the number of shared disks must be equal on both nodes, otherwise a node will not join the CVM cluster, and the error above indicates exactly that.
check this:
# vxdisk list | grep -i shared | wc -l
Unless the count is equal on both nodes, the CVM cluster will not form.
Gaurav
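The count check above can be wrapped in a tiny helper so exactly the same filter runs on both nodes (a sketch; the `count_shared` name is mine):

```shell
# count_shared: count the lines mentioning "shared" (case-insensitive)
# on stdin -- the same filter as `vxdisk list | grep -i shared | wc -l`.
count_shared() {
    grep -ic shared
}
```

On each node you would then run `vxdisk list | count_shared` and compare the two numbers; per the point above, they must match for the CVM cluster to form.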
04-14-2011 04:21 AM
Hello Gaurav,
The number of shared disks is the same on both nodes:
vxdisk list | grep -i shared | wc -l
16
If there were a mismatch in the number of shared disks, the cluster would never have formed at all, which is not the case in my scenario. Please see the description above:
"However, if I run hastop -all on node2 (which is now master) and then hastart on node1, followed by hastart on node2, both nodes join the cluster successfully."
04-14-2011 04:28 AM
Hi,
I saw your note. My point was: when node1 is trying to rejoin the cluster (and subsequently fails), have you checked the shared disk count at that time? Is it still the same?
Now, when you do hastop -all and then start "had" on node1 first, both nodes join successfully. What is different in that case is that you are doing a local build of the cluster configuration from node1. Is there any difference between the main.cf files on node1 and node2? If the cluster configuration was saved properly they should be identical, but double-check anyway.
Next, do you use I/O fencing? Do you see any other disk-related error messages before the node-join errors?
Any hard errors / transport errors on any of the disks?
G
04-14-2011 05:14 AM
Please find my answers below -
I saw your note. My point was: when node1 is trying to rejoin the cluster (and subsequently fails), have you checked the shared disk count at that time? Is it still the same?
- Yes, the disk count is still the same when the node tries to join the cluster and fails.
Is there any difference between the main.cf files on node1 and node2?
- No difference in main.cf between the nodes.
Next, do you use I/O fencing?
- No, fencing is disabled:
# vxfenadm -d
I/O Fencing Cluster Information:
================================
Fencing Protocol Version: 201
Fencing Mode: Disabled
Cluster Members:
Do you see any other disk-related error messages before the node-join errors?
- No such errors.
One more observation: if I restart node1 (the original master node) first and then node2, everything goes fine and the cluster comes up successfully.
As I mentioned above, could this have something to do with the node-id mismatch?
[root@node1 ~]# vxclustadm nidmap
Name CVM Nid CM Nid State
node1 0 1 Joined: Master
node2 1 0 Joined: Slave
04-14-2011 05:28 AM
JFYI -
#cat /etc/llthosts
0 node2
1 node1
whereas in main.cf the node ids are reversed. Can this be the culprit? Is this kind of configuration proper?
group cvm (
SystemList = { node1 = 0, node2 = 1 }
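For a quick cross-check, the /etc/llthosts format is one "<id> <name>" pair per line, so a small helper (the `llt_nid` name is mine) can pull out the LLT node id for a given name and compare it against what main.cf and vxclustadm nidmap show:

```shell
# llt_nid: print the LLT node id for a given node name, reading
# /etc/llthosts-format lines ("<id> <name>") from stdin.
llt_nid() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Against the llthosts contents shown above:
#   printf '0 node2\n1 node1\n' | llt_nid node1   # -> 1
```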
04-14-2011 05:40 AM
I don't think the above plays any role in this issue, unless the /etc/llthosts entries differ between the two nodes; if both hosts have the same contents, it is not an issue.
Regarding the CM NID and CVM NID: they can be different, and SystemList just provides priority. It shouldn't make a difference to the CVM cluster join, since main.cf is the same on both nodes.
Gaurav
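To make that NID comparison concrete, the vxclustadm nidmap output earlier in the thread can be parsed to list the nodes whose CVM and CM ids differ (a sketch; the `flag_nid_mismatch` name is mine, and per the reply above such a difference is expected to be harmless):

```shell
# flag_nid_mismatch: read `vxclustadm nidmap` output on stdin and print
# the names of nodes whose CVM Nid ($2) and CM Nid ($3) columns differ.
flag_nid_mismatch() {
    awk 'NR > 1 && $2 != $3 { print $1 }'
}
```

Fed the nidmap output from this cluster, it prints both node1 and node2, i.e. the ids are consistently swapped rather than misconfigured on one node only.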
04-14-2011 05:50 AM
What is the probable cause of this issue? Any help/suggestions would be highly appreciated.
04-15-2011 12:39 AM
Well, I can't say right now what the issue is, because it's strange behaviour.
Can you attach the main.cf file and engine_A.log for the day this issue happened?
G