03-12-2013 04:06 AM
VCS 6.0.1
Hi, I have configured a two-node cluster with local storage running two service groups. Both run fine and I am able to switch them over to either node in the cluster, but when I forcibly power down a node where both service groups are active, only one service group fails over to the other node; the one running the Apache resource gets faulted and does not fail over.
The contents of the main.cf file are pasted below.
==========================================
03-12-2013 04:24 AM
How do you power down your server? Commands from the O/S like "halt" and "reboot" are sometimes not severe enough: VCS knows this was done by the user, as opposed to a power outage, so it treats it as an administrative powerdown and does not fail over service groups. ClusterService, however, is a "special" service group, so it is failed over.
The best O/S command I got to work for this test was "uadmin 2 0", and even that was sometimes not quick enough at bringing the server down, so VCS knew a command had been run. The best approach is to power down the system boards if this server is part of a logical domain, or to flick the power switch.
If you are doing a severe powerdown and are still having issues, can you provide the output of "hastatus -sum" from before your powerdown test?
Mike
03-12-2013 04:48 AM
As an aside, the settings on httpsg are probably not what you intend:
OnlineRetryLimit = 3
OnlineRetryInterval = 15
This means that if the Apache process dies, VCS will try to restart the whole group locally, and if the group stays online for more than 15 seconds before faulting again, the previous fault is forgotten, which probably means it will never try the other node.
OnlineRetryLimit is normally only set at the service group level for the ClusterService group. When it is set for "normal" service groups, it is normally set to 1, which means VCS will try to restart the whole group locally once and then try another node.
If you want Apache to restart locally, you should set this at the resource type level, e.g. "hatype -modify Apache RestartLimit 1", so that JUST Apache is restarted, not all the resources in the service group.
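Put together, that change could look roughly like the following sketch (using the group and resource-type names from this thread; the configuration must be writable, and resetting the group-level retries is optional):

```shell
# Make the cluster configuration writable
haconf -makerw

# Let the Apache agent restart just the Apache resource once locally
# before faulting it
hatype -modify Apache RestartLimit 1

# Optionally remove the group-level retries set during troubleshooting
hagrp -modify httpsg OnlineRetryLimit 0
hagrp -modify httpsg OnlineRetryInterval 0

# Save the change and make the configuration read-only again
haconf -dump -makero
```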
Mike
03-12-2013 04:48 AM
I am actually working in a test environment before going live in production. Currently I have created just one service group, and it does not fail over when I intentionally pull the power cable out of the server. So I am not using any command to power down the server at the moment, although I did once try the reboot command as well, with no luck either.
03-12-2013 05:31 AM
For the powerdown I simply unplug the power cable. I am currently working in a test environment and will go live in production once it starts working as expected.
My intention is to keep httpsg available even if one node suddenly goes offline due to a hardware failure or panic. In my case ClusterService fails over fine, but unfortunately httpsg, which I created, does not fail over to the other node from the faulted server.
OnlineRetryLimit = 3
OnlineRetryInterval = 15
These two attributes were set by me while troubleshooting.
===============
[root@server3 log]# hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A server3 RUNNING 0
A server4 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService server3 Y N ONLINE
B ClusterService server4 Y N OFFLINE
B httpsg server3 Y N OFFLINE
B httpsg server4 Y N ONLINE
[root@server3 log]#
[root@server3 log]# hagrp -display | egrep -i "SystemList|FailOverPolicy|AutoStart|AutoFailOver"
ClusterService AutoFailOver global 1
ClusterService AutoStart global 1
ClusterService AutoStartIfPartial global 1
ClusterService AutoStartList global server3 server4
ClusterService AutoStartPolicy global Order
ClusterService ClusterFailOverPolicy global Manual
ClusterService FailOverPolicy global Priority
ClusterService SystemList global server3 0 server4 1
httpsg AutoFailOver global 1
httpsg AutoStart global 1
httpsg AutoStartIfPartial global 1
httpsg AutoStartList global server3 server4
httpsg AutoStartPolicy global Order
httpsg ClusterFailOverPolicy global Manual
httpsg FailOverPolicy global Priority
httpsg SystemList global server3 0 server4 1
[root@server3 log]#
[root@server3 log]# hares -display | grep -i critical
apachenew Critical global 1
csgnic Critical global 1
ipresource Critical global 1
webip Critical global 1
[root@server3 log]#
03-12-2013 05:45 AM
Hi Shivam,
Please confirm that before you reboot any of the nodes, the httpsg group and all of its resources are in a steady "OFFLINE" state in VCS. That means they are fully probed by VCS and detected in a steady OFFLINE state. Then try the tests.
Thanks,
Satish
03-12-2013 05:56 AM
The hastatus -sum you gave is not from before the test, as it shows the service groups on different systems. If you show the engine log from before the powerdown, I can work out what the state was before the test. I would need to see the engine log from the last instance of the message "Group httpsg is online on system" before you did the powerdown.
Mike
03-12-2013 06:44 AM
Hi Mike
Please find below the logs and the status of the service groups and resources during the test.
++++++++++++++++++++Before test+++++++++++++++++++++++++++++
[root@server3 log]# hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A server3 RUNNING 0
A server4 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService server3 Y N OFFLINE
B ClusterService server4 Y N ONLINE
B httpsg server3 Y N OFFLINE
B httpsg server4 Y N ONLINE
[root@server3 log]# hagrp -state
#Group Attribute System Value
ClusterService State server3 |OFFLINE|
ClusterService State server4 |ONLINE|
httpsg State server3 |OFFLINE|
httpsg State server4 |ONLINE|
[root@server3 log]# hares -state
#Resource Attribute System Value
apachenew State server3 OFFLINE
apachenew State server4 ONLINE
csgnic State server3 ONLINE
csgnic State server4 ONLINE
ipresource State server3 OFFLINE
ipresource State server4 ONLINE
webip State server3 OFFLINE
webip State server4 ONLINE
[root@server3 log]#
[root@server3 log]# hares -disp | grep -i group
apachenew Group global httpsg
csgnic Group global ClusterService
ipresource Group global httpsg
webip Group global ClusterService
[root@server3 log]#
============ While Powered off server4 =====================
(Both service groups should have failed over to server3 as expected, but only the service group "ClusterService" failed over, not "httpsg".)
[root@server3 log]# hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A server3 RUNNING 0
A server4 FAULTED 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService server3 Y N ONLINE
B httpsg server3 Y N OFFLINE
B httpsg server4 Y Y OFFLINE
[root@server3 log]# hares -state
#Resource Attribute System Value
apachenew State server3 OFFLINE
apachenew State server4 OFFLINE
csgnic State server3 ONLINE
csgnic State server4 ONLINE
ipresource State server3 OFFLINE
ipresource State server4 OFFLINE
webip State server3 ONLINE
webip State server4 OFFLINE
[root@server3 log]# hagrp -state
#Group Attribute System Value
ClusterService State server3 |ONLINE|
ClusterService State server4 |OFFLINE|
httpsg State server3 |OFFLINE|
httpsg State server4 |OFFLINE|
[root@server3 log]#
ENGINE LOG WHILE Server4 WAS DOWN (Collected from server3)
---------------------------------
2013/03/12 18:54:47 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, UP; Current status =eth1, DOWN.
2013/03/12 18:54:49 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 18:54:49 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x1, DDNA: 0x0
2013/03/12 18:54:49 VCS ERROR V-16-1-10079 System server4 (Node '1') is in Down State - Membership: 0x1
2013/03/12 18:54:49 VCS ERROR V-16-1-10322 System server4 (Node '1') changed state from RUNNING to FAULTED
2013/03/12 18:54:49 VCS NOTICE V-16-1-10449 Group httpsg autodisabled on node server4 until it is probed
2013/03/12 18:54:49 VCS NOTICE V-16-1-10449 Group VCShmg autodisabled on node server4 until it is probed
2013/03/12 18:54:49 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 18:54:49 VCS NOTICE V-16-1-10446 Group httpsg is offline on system server4
2013/03/12 18:54:49 VCS ERROR V-16-1-10205 Group ClusterService is faulted on system server4
2013/03/12 18:54:49 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system server4
2013/03/12 18:54:49 VCS INFO V-16-1-10493 Evaluating server3 as potential target node for group ClusterService
2013/03/12 18:54:49 VCS INFO V-16-1-10493 Evaluating server4 as potential target node for group ClusterService
2013/03/12 18:54:49 VCS INFO V-16-1-10494 System server4 not in RUNNING state
2013/03/12 18:54:49 VCS NOTICE V-16-1-10301 Initiating Online of Resource webip (Owner: Unspecified, Group: ClusterService) on System server3
2013/03/12 18:54:49 VCS INFO V-16-6-15015 (server3) hatrigger:/opt/VRTSvcs/bin/triggers/sysoffline is not a trigger scripts directory or can not be executed
2013/03/12 18:55:02 VCS INFO V-16-1-10298 Resource webip (Owner: Unspecified, Group: ClusterService) is online on server3 (VCS initiated)
2013/03/12 18:55:02 VCS NOTICE V-16-1-10447 Group ClusterService is online on system server3
============ Status after I powered on server4 =================
[root@server3 log]# hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A server3 RUNNING 0
A server4 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService server3 Y N ONLINE
B ClusterService server4 Y N OFFLINE
B httpsg server3 Y N ONLINE
B httpsg server4 Y N OFFLINE
[root@server3 log]# hares -state
#Resource Attribute System Value
apachenew State server3 ONLINE
apachenew State server4 OFFLINE
csgnic State server3 ONLINE
csgnic State server4 ONLINE
ipresource State server3 ONLINE
ipresource State server4 OFFLINE
webip State server3 ONLINE
webip State server4 OFFLINE
[root@server3 log]# hagrp -state
#Group Attribute System Value
ClusterService State server3 |ONLINE|
ClusterService State server4 |OFFLINE|
httpsg State server3 |ONLINE|
httpsg State server4 |OFFLINE|
[root@server3 log]#
ENGINE LOG while server4 was coming back up.
---------------------
2013/03/12 19:02:59 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 19:02:59 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x1, DDNA: 0x2
2013/03/12 19:02:59 VCS ERROR V-16-1-10113 System server4 (Node '1') is in DDNA Membership - Membership: 0x1, Visible: 0x0
2013/03/12 19:03:02 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth1, DOWN; Current status =eth1, UP.
2013/03/12 19:03:08 VCS INFO V-16-1-10077 Received new cluster membership
2013/03/12 19:03:08 VCS NOTICE V-16-1-10112 System (server3) - Membership: 0x3, DDNA: 0x2
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System (Node '1') changed state from UNKNOWN to INITING
2013/03/12 19:03:08 VCS ERROR V-16-1-10111 System server4 (Node '1') is in Regular and Jeopardy Memberships - Membership: 0x3, Jeopardy: 0x2
2013/03/12 19:03:08 VCS NOTICE V-16-1-10453 Node: 1 changed name from: 'server4' to: 'server4'
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from FAULTED to INITING
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from INITING to CURRENT_DISCOVER_WAIT
2013/03/12 19:03:08 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from CURRENT_DISCOVER_WAIT to REMOTE_BUILD
2013/03/12 19:03:09 VCS INFO V-16-1-10455 Sending snapshot to node membership: 0x2
2013/03/12 19:03:10 VCS NOTICE V-16-1-10322 System server4 (Node '1') changed state from REMOTE_BUILD to RUNNING
2013/03/12 19:03:12 VCS INFO V-16-1-10304 Resource ipresource (Owner: Unspecified, Group: httpsg) is offline on server4 (First probe)
2013/03/12 19:03:12 VCS INFO V-16-1-10304 Resource webip (Owner: Unspecified, Group: ClusterService) is offline on server4 (First probe)
2013/03/12 19:03:14 VCS INFO V-16-6-15015 (server4) hatrigger:/opt/VRTSvcs/bin/triggers/injeopardy is not a trigger scripts directory or can not be executed
2013/03/12 19:03:14 VCS INFO V-16-6-15015 (server4) hatrigger:/opt/VRTSvcs/bin/triggers/sysjoin is not a trigger scripts directory or can not be executed
2013/03/12 19:03:14 VCS INFO V-16-6-15023 (server4) dump_tunables:
########## VCS Environment Variables ##########
VCS_CONF=/etc/VRTSvcs
VCS_DIAG=/var/VRTSvcs
VCS_HOME=/opt/VRTSvcs
VCS_LOG_AGENT_NAME=
VCS_LOG_CATEGORY=6
VCS_LOG_SCRIPT_NAME=hatrigger
VCS_LOG=/var/VRTSvcs
########## Other Environment Variables ##########
CONSOLE=/dev/pts/0
HOME=/
INIT_VERSION=sysvinit-2.86
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/opt/VRTSvcs/lib:
PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/sbin:/usr/sbin:/bin:/usr/bin:/opt/VRTSvcs/bin
previous=N
PREVLEVEL=N
PWD=/var/VRTSvcs/diag/had
runlevel=5
RUNLEVEL=5
SELINUX_INIT=YES
SHLVL=5
TERM=linux
_=/usr/bin/env
2013/03/12 19:03:14 VCS INFO V-16-6-15002 (server4) hatrigger:hatrigger executed /opt/VRTSvcs/bin/internal_triggers/dump_tunables server4 1 successfully
2013/03/12 19:03:16 VCS NOTICE V-16-1-10438 Group ClusterService has been probed on system server4
2013/03/12 19:03:16 VCS NOTICE V-16-1-10438 Group VCShmg has been probed on system server4
2013/03/12 19:03:16 VCS NOTICE V-16-1-10435 Group VCShmg will not start automatically on System server4 as the system is not a part of AutoStartList attribute of the group.
2013/03/12 19:03:18 VCS INFO V-16-1-10304 Resource apachenew (Owner: Unspecified, Group: httpsg) is offline on server4 (First probe)
2013/03/12 19:03:18 VCS NOTICE V-16-1-10438 Group httpsg has been probed on system server4
2013/03/12 19:03:18 VCS INFO V-16-1-50007 Initiating auto-start online of group httpsg
2013/03/12 19:03:18 VCS INFO V-16-1-10493 Evaluating server3 as potential target node for group httpsg
2013/03/12 19:03:18 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group httpsg on all nodes
2013/03/12 19:03:18 VCS NOTICE V-16-1-10301 Initiating Online of Resource ipresource (Owner: Unspecified, Group: httpsg) on System server3
2013/03/12 19:03:30 VCS INFO V-16-1-10298 Resource ipresource (Owner: Unspecified, Group: httpsg) is online on server3 (VCS initiated)
2013/03/12 19:03:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource apachenew (Owner: Unspecified, Group: httpsg) on System server3
2013/03/12 19:03:30 VCS NOTICE V-16-10061-20494 (server3) Apache:apachenew:online:<Apache::Start> Command exit code [0]. Command output [Application started successfully.]
2013/03/12 19:03:42 VCS INFO V-16-1-10298 Resource apachenew (Owner: Unspecified, Group: httpsg) is online on server3 (VCS initiated)
2013/03/12 19:03:42 VCS NOTICE V-16-1-10447 Group httpsg is online on system server3
Shivam
03-12-2013 07:02 AM
Sorry, I should have spotted this earlier. You have the message:
System (server3) - Membership: 0x1, DDNA: 0x0
DDNA means "Daemon Down Node Alive", so VCS thinks server4 is still up but the "had" daemon is down, which is why the service groups are not failed over. But I don't quite understand why this is happening if you are pulling the power cable. Another possible issue is that you only have one heartbeat, so could you provide the output of "lltstat -nvv"?
You may find the following post useful:
https://www-secure.symantec.com/connect/forums/failover-secondary-system-upon-ungraceful-shutdown
Mike
03-12-2013 08:16 AM
Hi Mike and Shivam,
At the time of the power-off test there were no nodes in the DDNA membership. The hex code after DDNA (and the other memberships, for that matter) tells you which nodes are in that membership. In this case, 0x0 indicates that no node was in that specific membership at the time.
During the reboot of the powered-off node, you can see that the DDNA membership changes to 0x2. To decode this, convert the number from hex to binary. In binary, the positions of the 1s indicate which node IDs are in that membership. Each node takes up one bit and the node IDs start at 0, counting from right to left. So in this case hex 0x2 equals 0010 in binary, and the 1 in the second position corresponds to node ID 1.
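The decoding described above can be sketched in a few lines of bash (the function name and the 32-bit upper bound are just illustrative choices):

```shell
#!/usr/bin/env bash
# Decode a VCS membership bitmask (e.g. "DDNA: 0x2") into node IDs.
decode_membership() {
  local mask=$(( $1 ))      # accepts hex input such as 0x2
  local id
  for (( id = 0; id < 32; id++ )); do
    if (( (mask >> id) & 1 )); then
      echo "node $id"
    fi
  done
}

decode_membership 0x2   # prints "node 1"
decode_membership 0x3   # prints "node 0" then "node 1"
```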
I'm not sure why the httpsg group is not being marked as "Faulted" in the 18:54:49 timeframe. However, "ClusterService" is a special group that the cluster takes extra precautions to keep online and fail over when other groups are not.
During your startup of the node, I see that the cluster goes into Jeopardy. Jeopardy means you are down to a single heartbeat, and it prevents the cluster from failing over all service groups other than ClusterService.
Here is what I would recommend:
1. Increase the ShutdownTimeout value to 300 on all servers. This is a per-server setting but can be set for all servers from a connection to a single node. Be sure to save and close the cluster configuration when you are done making this change.
2. Ensure that the cluster is not in Jeopardy membership prior to running the power off/shutdown test.
3. Perform the power off/shutdown test again.
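Step 1 could be scripted roughly as follows (ShutdownTimeout is a per-system VCS attribute; confirm the exact attribute name on your release with "hasys -display | grep -i shutdown" before modifying):

```shell
# Make the cluster configuration writable
haconf -makerw

# ShutdownTimeout is per-system, so set it for each node
hasys -modify server3 ShutdownTimeout 300
hasys -modify server4 ShutdownTimeout 300

# Save and close the configuration
haconf -dump -makero
```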
Let us know how your testing goes.
Thank you,
Wally
03-12-2013 08:28 AM
Thanks Mike & Wally,
Below is the output of lltstat -nvv, as asked for by Mike.
03-12-2013 08:33 AM
My guess is that you have only defined one heartbeat (eth1) in /etc/llttab, and that is why the group does not fail over. If that is the case, adding eth0 as a low-priority heartbeat will resolve your issue. Can you provide the contents of /etc/llttab?
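A quick way to count the configured heartbeats is to look for "link" directives in llttab; a sketch, assuming the standard file location (the helper function is just for illustration):

```shell
# Count configured LLT heartbeat links (high- and low-priority)
# in an llttab file; defaults to the standard location.
count_llt_links() {
  grep -cE '^link(-lowpri)?[[:space:]]' "${1:-/etc/llttab}"
}

# On a live node you would run:
#   count_llt_links /etc/llttab
# Anything less than 2 means the cluster will sit in Jeopardy.
```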
Mike
03-12-2013 08:35 AM
Just missed your email. I can now see you have only defined one heartbeat, which means VCS cannot distinguish between an eth1 failure and a system failure, so it will not fail over any service groups (apart from ClusterService). You need at least two heartbeats, which must be independent in a live cluster (i.e. not a dual-port card, although that is OK for testing).
Mike
03-12-2013 08:39 AM
Sure, Mike.
03-12-2013 10:24 AM
Hi Shivam,
VCS clustering is designed to have two or more heartbeats. If you are down to a single heartbeat, the cluster goes into jeopardy membership and assumes that the node is not faulted but that some other issue is going on. Faced with an unknown failure in the environment, VCS decides to do nothing.
However, if two or more heartbeats are lost at the same time, VCS decides that the node is dead and marks the service groups on that node as "Faulted". The surviving nodes then attempt to online the faulted service groups.
Basically, if there is only one heartbeat, VCS will not react when it is lost. But if two or more heartbeats are lost at the same time, VCS will react, because it assumes the node is dead.
In your case, you need to configure a second heartbeat so that VCS comes out of Jeopardy membership. Then it will react when the power is pulled from the active node.
Thank you,
Wally
03-12-2013 10:24 AM
If your two heartbeats are independent, they should never fail at the same time. So if a node sees two heartbeat links go down simultaneously, it assumes the other node has gone down, because truly independent heartbeats are very unlikely to fail together. With only one heartbeat, when a node sees it disappear, it has no way of telling whether the link went down (like a NIC or switch failure) or the node went down. This is why the cluster goes into Jeopardy when you only have one link.
Mike
03-12-2013 02:58 PM
Thanks Mike & Wally.
Creating another heartbeat link with the method below solved the issue.
I added this line to the /etc/llttab file. (Taken help from
link-lowpri eth0 eth-00:50:56:91:03:30 - ether - -
with the MAC address for eth0 on each system.
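For reference, a complete two-heartbeat /etc/llttab might then look like this (the node name, cluster ID, and eth1 MAC address are placeholders; only the eth0 line comes from this thread):

```shell
# /etc/llttab -- illustrative example for server3
set-node server3
set-cluster 100

# Dedicated private heartbeat (placeholder MAC)
link eth1 eth-00:50:56:91:03:2f - ether - -

# Public interface doubling as a low-priority heartbeat
link-lowpri eth0 eth-00:50:56:91:03:30 - ether - -
```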
Thanks a lot to all for the quick help!
Shivam