VCS AutoStartList ungracefully failover.

Question

I have a cluster set up, and everything seems to work as expected except for one test case: an ungraceful shutdown.
After an ungraceful shutdown, the service group seems able to fail over in only one direction. The order seems to be determined by the AutoStartList: if the group is already running on the last system in the list, it will not go back to the first system.
System A is shut down ungracefully -> System B sees the fault and starts the resources
System A is brought back online and all errors are cleared
System B is shut down ungracefully -> System A sees the fault but does not start any of the resources
Is this correct? Is there a way to force it to try the first system in the list?

arangari · Accepted Answer

The 'AutoStartList' is used only on a node-join event. For example, for a failover group, when all of the nodes in its SystemList have joined the cluster, and AutoStartList is set with the AutoStart attribute set to 1 (the default), the online of the service group is initiated. The AutoStartList order is considered when choosing the group's target system.
On the node fault (for System B above), if System A has not brought the resources online, it is most likely that the fault was detected after ShutdownTimeout.
The groups are failed over to another node on a node fault only when the following happens:
1. Node A: port-h is closed ungracefully (HAD dies). The node goes into the DDNA (Daemon Dead, Node Alive) state, and all other nodes mark the SGs configured on this node as 'AutoDisabled' to avoid any concurrency violation.
2. Node A leaves port-a membership within ShutdownTimeout seconds. At this point the other nodes consider node A down, 'AutoEnable' the SGs configured on node A, and start the failover action.
2.a - If node A does not leave port-a membership within ShutdownTimeout, VCS keeps the groups in the AutoDisabled state to protect against concurrency violations. To get out of this situation, confirm that Node A is indeed down and that the applications are not running on it, then issue the 'hagrp -autoenable' command followed by the 'hagrp -online' command.
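
The sequence above can be read as a small decision rule. The sketch below is only a toy model of that description, not VCS code: the names `NodeFault` and `group_action` are hypothetical, and the 600-second timeout is just an example value for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeFault:
    """Hypothetical record of what the surviving nodes observed."""
    had_died: bool                      # port-h closed ungracefully (HAD is gone)
    port_a_left_after: Optional[float]  # seconds until the node left port-a; None = never left


def group_action(fault: NodeFault, shutdown_timeout: float) -> str:
    """Return what happens to the groups configured on the faulted node."""
    if not fault.had_died:
        return "no action"
    # Step 1: HAD died, node is DDNA, its groups are AutoDisabled everywhere else.
    left_in_time = (
        fault.port_a_left_after is not None
        and fault.port_a_left_after <= shutdown_timeout
    )
    if left_in_time:
        # Step 2: node is considered down, groups are AutoEnabled and failed over.
        return "autoenable and fail over"
    # Step 2.a: groups stay AutoDisabled until an operator intervenes.
    return "stay AutoDisabled"


# Example value only; check the ShutdownTimeout configured on your own systems.
print(group_action(NodeFault(True, 5.0), 600.0))
print(group_action(NodeFault(True, None), 600.0))
```

In the second case the model ends in the state described in 2.a, which is where the manual 'hagrp -autoenable' and 'hagrp -online' commands come in.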

mikebounds · Answer

Could you post an extract from main.cf showing the service group attributes, and also an extract from the engine log?
Mike

anand_raj · Answer

Excerpts from the engine_A.log at the time of the shutdown of system B would definitely help to troubleshoot this. We would also need the main.cf snippet, as Mike asked earlier.

mkruer · Answer

Sorry about the delay, I got pulled into another project. I have been unable to reproduce the issue on any other box, so I am thinking that this has to do with the system. I am attaching the logs for the system, the main.cf, the ifconfig output and the llttab files. Something else that's sort of worrying is that when I run one process (a looping rsync process, running as a service), the system will crap out and reboot. I don't know if this is related. So far I have disabled that service and the system seems to be up and stable.

Both boxes are HP DL360 G6

anand_raj · Answer

I checked the AutoStartList and SystemList attributes in VCS. They are fine.
I'm assuming System A ==> app-49-56 and System B ==> app-49-59

2012/08/20 18:01:30 VCS ERROR V-16-1-10322 System app-49-59 (Node '1') changed state from RUNNING to FAULTED
2012/08/20 18:01:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource autorsync (Owner: Unspecified, Group: App_Cluster) on System app-49-56
2012/08/20 18:01:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource rcsscheduler (Owner: Unspecified, Group: App_Cluster) on System app-49-56

Here VCS failed the SG over to the other node, but the group then faulted there because of a resource fault:

2012/08/20 18:03:31 VCS ERROR V-16-2-13066 (app-49-56) Agent is calling clean for resource(autorsync) because the resource is not up even after online completed.
..
2012/08/20 18:03:43 VCS ERROR V-16-1-10205 Group App_Cluster is faulted on system app-49-56

Please let me know if there's any specific timeline that you want me to focus on. It would be very helpful.
Thanks.

mkruer · Answer

Can you please look at 2012/08/20 14:00 onwards? This is where I was seeing the systems seemingly reboot after I enabled my autorsync process, which seems to run fine for a while. If I am reading the log correctly, it looks like the system just died and rebooted to come back up. Let me know if you need any other logs.
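
For anyone wanting to narrow an engine log to a time window like this, a minimal sketch follows. It assumes the line layout seen in the excerpts quoted in this thread (timestamp, "VCS", severity, message ID, text); the helper name `entries_since` is hypothetical, and the sample lines are copied from the excerpts above rather than read from a file.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts in this thread:
# "YYYY/MM/DD HH:MM:SS VCS <SEVERITY> <V-x-x-x> <message>"
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS (\w+) (V-\d+-\d+-\d+) (.*)$"
)


def entries_since(lines, start, severities=("ERROR",)):
    """Return (timestamp, severity, msg_id, text) for matching lines at or after start."""
    start_ts = datetime.strptime(start, "%Y/%m/%d %H:%M")
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip continuation lines or anything in another layout
        ts = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if ts >= start_ts and match.group(2) in severities:
            found.append((ts, match.group(2), match.group(3), match.group(4)))
    return found


sample = [
    "2012/08/20 18:01:30 VCS ERROR V-16-1-10322 System app-49-59 (Node '1') changed state from RUNNING to FAULTED",
    "2012/08/20 18:01:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource autorsync (Owner: Unspecified, Group: App_Cluster) on System app-49-56",
    "2012/08/20 18:03:31 VCS ERROR V-16-2-13066 (app-49-56) Agent is calling clean for resource(autorsync) because the resource is not up even after online completed.",
    "2012/08/20 18:03:43 VCS ERROR V-16-1-10205 Group App_Cluster is faulted on system app-49-56",
]

for ts, severity, msg_id, text in entries_since(sample, "2012/08/20 14:00"):
    print(ts, severity, msg_id, text)
```

To use it on a real log, pass the lines of the file instead of `sample`, and widen `severities` if NOTICE or WARNING entries are also of interest.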
