07-09-2013 05:28 AM
Hi
I get a "Faulted on node2" message when I turn off node1 to test whether failover works: my service group and its resources do not fail over to node2.
(The service group's state is ONLINE, and the SystemList for the service group includes both nodes.)
Please help.
Regards
07-09-2013 05:36 AM
Please provide hastatus -sum output from before you run the test and extract from engine_A.log and main.cf.
Mike
07-09-2013 06:49 AM
Please provide the logs, as these should say what the issue is. One probable cause is that you have only one heartbeat working, as this will stop failover, so please also provide the output of "lltstat -nvv". Unprobed resources can also prevent failover, so please provide the output of "hastatus -sum" as well.
As an aside, I would recommend setting AutoStartList for your service group so that it starts on a cold cluster start.
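The diagnostics and the AutoStartList change suggested above could be gathered and applied roughly as follows (the group and node names are placeholders, not taken from the poster's main.cf):

```shell
# Capture cluster state before the failover test
hastatus -sum                      # service group and resource states
lltstat -nvv                       # LLT heartbeat link status per node

# Recent engine activity (log location given later in the thread)
tail -100 /var/VRTSvcs/logs/engine_A.log

# Set AutoStartList so the group starts on a cold cluster boot
# (group/node names below are examples only)
haconf -makerw
hagrp -modify mygrp AutoStartList node1 node2
haconf -dump -makero
```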
Mike
07-09-2013 10:22 PM
I think the ClusterService group will still fail over even if we have only one LLT link active, as a fault of this group is special and handled differently from other groups.
07-10-2013 12:55 AM
Solom,
If, when you turn off node1, the ClusterService group fails over to node2 but the other groups do not, and if Amit is right that the ClusterService group fails over even with only one heartbeat (which makes sense, as the ClusterService group has no storage), then your issue is almost certainly the probable cause I mentioned in my last post: only one heartbeat is working. So please provide the output of "lltstat -nvv" requested earlier.
Mike
07-10-2013 06:39 AM
The engine logs can also be useful.
07-11-2013 07:27 AM
07-11-2013 07:47 AM
Logs are in /var/VRTSvcs/logs, but output of "lltstat -nvv" may show the issue is there is only one heartbeat working.
Mike
07-11-2013 01:09 PM
[root@tcpri-clu1 ~]# lltstat -nvv
LLT node information:
Node State Link Status Address
* 0 tcpri-clu1 OPEN
eth4 UP 3C:4A:92:EF:8B:7B
eth5 UP 3C:4A:92:EF:8B:7F
1 tcpri-clu2 OPEN
eth4 UP 3C:4A:92:EF:7B:73
eth5 UP 3C:4A:92:EF:7B:77
2 CONNWAIT
eth4 DOWN
eth5 DOWN
(nodes 3-63: identical, all CONNWAIT with eth4 DOWN and eth5 DOWN)
07-11-2013 03:33 PM
How are you turning off your node as from the logs I see:
2013/06/17 21:52:47 VCS NOTICE V-16-1-10322 System tcpri-clu1 (Node '0') changed state from RUNNING to LEAVING
as opposed to "changed state from RUNNING to FAULTED". "LEAVING" means VCS is being shut down, rather than the node being "killed". If RC scripts are being called: historically the RC scripts ran "hastop -evacuate", which meant service groups would be failed over, but since at least 5.1 the RC scripts run "hastop -sysoffline", and I am not sure whether this fails over service groups, as the "-sysoffline" flag is undocumented.
The best way to test a node failing is to power down system boards (if this server is part of a logical domain) or flick the power switch. The best O/S command I got to work for this test was "uadmin 2 0", and even this was sometimes not quick enough in bringing down the server: VCS knew the command had been run, so it saw the shutdown as administrative and didn't fail over the groups.
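"uadmin 2 0" is Solaris-specific. On Linux nodes such as the poster's RHEL systems, an abrupt failure can be simulated with the Magic SysRq interface, which bypasses the RC scripts entirely so VCS sees a real fault rather than an orderly shutdown (run only on a test node, as root):

```shell
# Ensure the Magic SysRq interface is enabled
echo 1 > /proc/sys/kernel/sysrq

# Immediate reboot: no sync, no unmount, no shutdown scripts
echo b > /proc/sysrq-trigger

# Or crash the kernel outright (triggers kdump if configured)
# echo c > /proc/sysrq-trigger
```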
Mike
07-11-2013 10:36 PM
Mike,
'hastop -sysoffline' is the same as 'hastop -local -evacuate -noautodisable', except that it is not interactive. In VCS 5.1 we made the 'hastop' command ask for confirmation; however, we needed to handle the reboot case, and hence introduced this special flag.
Thus 'hastop -sysoffline' will also evacuate the online groups to another node where possible. However, it is possible that, due to dependencies etc., the evacuation does not happen.
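Based on Amit's explanation, the equivalence can be summarised as follows (the "-sysoffline" flag is intended only for the RC scripts; the second form is the documented equivalent):

```shell
# What the O/S shutdown RC script effectively runs:
hastop -sysoffline

# Equivalent documented form: stop VCS on this node only, move its
# online service groups to another running node, and do not
# auto-disable the groups on the leaving node.
hastop -local -evacuate -noautodisable
```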
Solom: the best way to provide these logs/main.cf is to attach them, as they are handled better that way. I will look into the logs and then respond.
07-14-2013 04:59 PM
Please check attached file
07-14-2013 11:48 PM
Thanks Amit. Do you know why "-sysoffline" is not documented in the man pages or the VCS user guide? Also, do any of the VCS guides say that service groups will be evacuated by the O/S shutdown scripts? I couldn't find this info in the VCS user or install guides.
Mike
07-15-2013 01:56 AM
The 'hastop -sysoffline' command is not expected to be used outside the shutdown RC script; that is why it was not documented.
At reboot it is known that, while stopping VCS in the shutdown sequence, the online service groups will be evacuated where appropriate. This should be in the documentation, as the behaviour has been present since at least 4.1.
I will recheck the documentation.
07-18-2013 07:13 AM
Sorry for the cutoff; I was assigned temporarily to a different project.
My problem is still there, and I just realized something that could be its source. Previously, when I was running Veritas 5.1 on Red Hat 5.6, the disk groups were defined on all hosts of the cluster, and I could manually deport and import disk groups between servers in the cluster.
Now I can only see the disk groups on the primary node of the cluster, even though the LUNs are visible from all nodes. So I have a feeling my problem lies in the original setup of the disk groups or the cluster itself. I tried to delete the disk groups and create them again, but the option to share a disk group between hosts is greyed out.
Thoughts??
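The manual deport/import test Solom describes can be sketched with standard VxVM commands (the disk group name is hypothetical):

```shell
# On the node that currently owns the disk group (name is an example)
vxdg deport mydg

# On the other cluster node: rescan devices, then import
vxdctl enable
vxdg import mydg

# Verify which disks and disk groups each node can see
vxdisk -o alldgs list
```

If the import fails on the second node while the LUNs are visible there, that points at the disk group setup rather than at VCS itself.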
07-18-2013 08:38 AM
Two quick follow-ups:
1. I managed to do a manual deport and import between the nodes in the cluster using vxdiskadm, but not through VOM.
2. I ran a fire drill on ClusterService and it executed fine, but it failed on the service groups for the mounts.
07-18-2013 05:01 PM
The engine logs are also attached.
07-18-2013 05:03 PM
I am working in VOM. The ClusterService group is online on node1, and when I turn off node1 it goes online on node2. The problem is that the service groups and resources I created afterwards can't fail over to node2, even though the service group type is failover. My nodes run Red Hat 6.3.
main.cf
Attached
07-19-2013 01:18 AM
I have looked at logs and main.cf more closely now and I can see your issue. You have nested mounts:
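For nested mounts (one file system mounted inside another), VCS needs explicit resource dependencies so the outer mount comes online first and the inner one offlines first. A hypothetical sketch, since the actual mount points are in the attached main.cf and not shown here:

```shell
# Hypothetical resources: mnt_data mounts /data, mnt_data_app mounts
# /data/app (nested inside /data). Linking makes mnt_data_app the
# parent, so it requires mnt_data to be online first.
haconf -makerw
hares -link mnt_data_app mnt_data
haconf -dump -makero
```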
07-19-2013 08:03 AM
Thank you very much for the thorough answer; I really didn't think about it this way. I will give it a shot and get back to you.
Solom