VCS AutoStartList and ungraceful failover
I have a cluster set up, and everything seems to be working as expected except for one test case: an ungraceful shutdown.
With an ungraceful shutdown, the service group seems to be able to fail over in only one direction. The order seems to be determined by the AutoStartList: if the group is already running on the last system in the list, it will not go back to the first system.
System A is shut down ungracefully > System B sees the fault and starts the resources
System A is brought back online and all errors are cleared
System B is shut down ungracefully > System A sees the fault but does not start any of the resources
Is this correct? Is there a way to force it to try the first system in the list?
The 'AutoStartList' attribute is used only on a node-joining event. For example, for a failover group, when all of the nodes in its SystemList have joined the cluster and AutoStartList is set with the AutoStart attribute set to 1 (the default), the online of the service group is initiated. The AutoStartList order is considered at that point when choosing the group's possible target; it does not drive failover on a node fault.
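To see what the cluster would use on a node-join event, the relevant group attributes can be displayed from any running node. The sketch below is a dry run by default: it only prints the commands, and runs them when RUN=1 is set. The group name app_sg and the GROUP and RUN variables are placeholders for illustration, not names from the original post.

```shell
#!/bin/sh
# Hypothetical group name; replace with your own service group.
GROUP="${GROUP:-app_sg}"

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Show the attributes that matter when nodes join the cluster.
show_autostart_settings() {
    run hagrp -display "$GROUP" -attribute AutoStart
    run hagrp -display "$GROUP" -attribute AutoStartList
    run hagrp -display "$GROUP" -attribute SystemList
}

show_autostart_settings
```

Reviewing the printed commands first and then re-running with RUN=1 on a cluster node keeps the check read-only and easy to audit.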
On the node fault (System B above), if System A has not brought the resources online, the more likely explanation is that the fault was detected after ShutdownTimeout had expired.
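One way to check this on the surviving node is to look at the ShutdownTimeout value of the faulted system, the current GAB port memberships, and whether the group is autodisabled. This is a minimal dry-run sketch: app_sg and sysB are placeholder names, and the commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical names; replace with your group and the system that faulted.
GROUP="${GROUP:-app_sg}"
FAULTED_SYS="${FAULTED_SYS:-sysB}"

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

check_node_fault_handling() {
    # ShutdownTimeout is a per-system attribute, in seconds.
    run hasys -display "$FAULTED_SYS" -attribute ShutdownTimeout
    # GAB port memberships: port a is GAB itself, port h is HAD.
    run gabconfig -a
    # Group state on each system, and whether it is autodisabled.
    run hagrp -state "$GROUP"
    run hagrp -display "$GROUP" -attribute AutoDisabled
}

check_node_fault_handling
```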
The groups are failed over to another node on a node fault only when the following happens:
1. Node A - port-h is closed ungracefully (HAD dies). The node goes into the DDNA (Daemon Dead, Node Alive) state, and all other nodes mark the service groups (SGs) configured on this node as 'AutoDisabled' to avoid any concurrency violation.
2. Node A leaves port-a membership within ShutdownTimeout seconds. At this point the other nodes consider Node A to be down, 'AutoEnable' the SGs configured on Node A, and start the failover action.
2.a - If the port-a membership is not dropped within ShutdownTimeout, VCS keeps the groups in the AutoDisabled state to protect against concurrency violations. To get out of this situation, first confirm that Node A is indeed down and that the applications are not running on it, then issue the 'hagrp -autoenable' command followed by the 'hagrp -online' command.
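The manual recovery in step 2.a could be scripted roughly as follows. This is a sketch under assumptions: app_sg, sysB (the node that went down) and sysA (the surviving node) are placeholder names mirroring the second test in the question, and the confirmed-down argument is a hypothetical guard added here so the commands are not issued before the operator has verified the node is really down. Commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical names; replace with your group and systems.
GROUP="${GROUP:-app_sg}"
DOWN_SYS="${DOWN_SYS:-sysB}"      # node that went down ungracefully
TARGET_SYS="${TARGET_SYS:-sysA}"  # surviving node to bring the group up on

# Print each command by default; execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

recover_autodisabled_group() {
    # Guard: the operator must state that the down node was verified.
    if [ "${1:-}" != "confirmed-down" ]; then
        echo "Refusing: verify that $DOWN_SYS is down and its applications are stopped." >&2
        return 1
    fi
    # Clear the autodisabled flag for the group on the down node.
    run hagrp -autoenable "$GROUP" -sys "$DOWN_SYS"
    # Bring the group online on the surviving node.
    run hagrp -online "$GROUP" -sys "$TARGET_SYS"
    # Show the resulting group state.
    run hagrp -state "$GROUP"
}

# Dry-run example: prints the three commands for review.
recover_autodisabled_group confirmed-down
```

The guard does not check anything by itself; it only forces the confirmation step from 2.a to be an explicit decision before 'hagrp -autoenable' is used.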