VCS AutoStartList and failover after an ungraceful shutdown

mkruer
Level 4

I have a cluster set up and everything seems to be working as expected, except in one test case: an ungraceful shutdown.

After an ungraceful shutdown, the service group seems to be able to fail over in only one direction. The order seems to be determined by the AutoStartList. If the group is already running on the last system in the list, it will not go back to the first system.

System A is shut down ungracefully > System B sees the fault and starts the resources

System A is brought back online and all errors are cleared

System B is shut down ungracefully > System A sees the fault but does not start any of the resources

Is this the expected behavior? Is there a way to force VCS to try the first system in the list?

1 ACCEPTED SOLUTION

Accepted Solutions

arangari
Level 5

The 'AutoStartList' is used only when nodes join the cluster. For example, for a failover group, once all of the nodes in its SystemList have joined the cluster, and AutoStartList is set with the AutoStart attribute set to 1 (the default), VCS initiates the online of the service group. The AutoStartList order is used to choose the group's target system.

On the node fault (System B above), if System A did not bring the resources online, it is more likely that the fault was detected after ShutdownTimeout had expired.

The groups are failed over to another node on a node fault only when the following happens:

1. Node A - port-h is closed ungracefully (HAD dies). The node goes into the DDNA (Daemon Dead, Node Alive) state, and all other nodes mark the SGs configured on this node as 'AutoDisabled' to avoid any concurrency violation.

2. Node A leaves port-a membership within ShutdownTimeout seconds. At this point the other nodes consider node A to be down, 'AutoEnable' the SGs configured on node A, and start the failover action.

2.a - If the port-a membership is not dropped within ShutdownTimeout, VCS keeps the groups in the AutoDisabled state to protect against concurrency violations. To get out of this situation, confirm that node A is indeed down (or that the applications are not running on it), then issue the 'hagrp -autoenable' command followed by the 'hagrp -online' command.
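The manual recovery in 2.a can be sketched as a small shell helper. This is only an illustration under stated assumptions: the `run` and `recover_group` function names and the `DRY_RUN` switch are made up for this example, and the group and system names in the usage line are taken from the logs later in this thread. By default it only prints the two `hagrp` commands so they can be reviewed before anything touches the cluster.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 only after confirming the failed node is really down.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_group GROUP DOWN_SYS TARGET_SYS
# (hypothetical helper) Clears the AutoDisabled state that GROUP has for
# DOWN_SYS, then asks VCS to bring GROUP online on TARGET_SYS.
recover_group() {
    group="$1"; down_sys="$2"; target_sys="$3"
    run hagrp -autoenable "$group" -sys "$down_sys"
    run hagrp -online "$group" -sys "$target_sys"
}
```

For example, `recover_group App_Cluster app-49-59 app-49-56` prints the autoenable command for the node that went down and the online command for the surviving node.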

 

 


9 REPLIES 9

mikebounds
Level 6
Partner Accredited

Could you post an extract from main.cf showing the service group attributes, and also an extract from the engine log?

Mike

anand_raj
Level 3
Employee Accredited Certified

Excerpts from engine_A.log at the time System B was shut down would definitely help to troubleshoot this. We would also need the main.cf snippet that Mike asked for earlier.

mkruer
Level 4

Sorry about the delay, I got pulled into another project. I have been unable to reproduce the issue on any other box, so I am thinking that this has to do with the system. I am attaching the logs for the system, the main.cf, the ifconfig output and the llttab files.

Something else that is worrying: when I run one particular process (a looping rsync process, running as a service), the system will crash and reboot. I don't know if this is related. So far I have disabled that service and the system seems to be up and stable. Both boxes are HP DL360 G6.

anand_raj
Level 3
Employee Accredited Certified

I checked the AutoStartList and SystemList attributes in VCS. They are fine.
I'm assuming System A ==> app-49-56 and System B ==> app-49-59

2012/08/20 18:01:30 VCS ERROR V-16-1-10322 System app-49-59 (Node '1') changed state from RUNNING to FAULTED
2012/08/20 18:01:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource autorsync (Owner: Unspecified, Group: App_Cluster) on System app-49-56
2012/08/20 18:01:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource rcsscheduler (Owner: Unspecified, Group: App_Cluster) on System app-49-56

Here VCS failed over the SG to the other node, but the group faulted there because of a resource fault:

2012/08/20 18:03:31 VCS ERROR V-16-2-13066 (app-49-56) Agent is calling clean for resource(autorsync) because the resource is not up even after online completed.
..
2012/08/20 18:03:43 VCS ERROR V-16-1-10205 Group App_Cluster is faulted on system app-49-56

Please let me know if there's any specific time line that you want me to focus on. It would be very helpful.

Thanks.
 

mkruer
Level 4

Can you please look at 2012/08/20 14:00 onwards? This is where I was seeing the systems seemingly reboot after I enabled my autorsync process, which seems to run fine for a while. If I am reading the log correctly, it looks like the system just died and rebooted to come back up. Let me know if you need any other logs.

anand_raj
Level 3
Employee Accredited Certified

When 59 faulted here, VCS failed the group over to 56:


2012/08/20 18:01:30 VCS ERROR V-16-1-10322 System app-49-59 (Node '1') changed state from RUNNING to FAULTED
2012/08/20 18:01:30 VCS NOTICE V-16-1-10301 Initiating Online of Resource autorsync (Owner: Unspecified, Group: App_Cluster) on System app-49-56

But it faulted because a resource couldn't come up:
2012/08/20 18:03:31 VCS ERROR V-16-2-13066 (app-49-56) Agent is calling clean for resource(autorsync) because the resource is not up even after online completed.

VCS took the service group offline, since autorsync is a critical resource, and brought the group online again on 59 when that node came back up:

2012/08/20 18:05:23 VCS NOTICE V-16-1-10442 Initiating auto-start online of group CMSApp_Cluster on system app-49-59

The box crashed again:

2012/08/20 18:10:01 VCS NOTICE V-16-1-11022 VCS engine (had) started
2012/08/20 18:10:01 VCS NOTICE V-16-1-11050 VCS engine version=5.1

Since the service group fault wasn't cleared on 56, VCS didn't bring up the SG there:

2012/08/20 18:06:28 VCS ERROR V-16-1-10322 System app-49-59 (Node '1') changed state from RUNNING to FAULTED
2012/08/20 18:06:28 VCS NOTICE V-16-1-10446 Group App_Cluster is offline on system app-49-59
2012/08/20 18:06:28 VCS INFO V-16-1-10493 Evaluating app-49-56 as potential target node for group App_Cluster
2012/08/20 18:06:28 VCS INFO V-16-1-50010 Group App_Cluster is online or faulted on system app-49-56

VCS brought the group online again on 59, since a system start clears the fault flags:

2012/08/20 18:10:08 VCS NOTICE V-16-1-10438 Group CMSApp_Cluster has been probed on system app-49-59
2012/08/20 18:10:08 VCS NOTICE V-16-1-10442 Initiating auto-start online of group CMSApp_Cluster on system app-49-59

2012/08/20 18:10:24 VCS NOTICE V-16-1-10447 Group App_Cluster is online on system app-49-59
 

When 59 faulted again, the failover failed for the same reason:

2012/08/20 18:12:05 VCS ERROR V-16-1-10322 System app-49-59 (Node '1') changed state from RUNNING to FAULTED
2012/08/20 18:12:05 VCS NOTICE V-16-1-10446 Group App_Cluster is offline on system app-49-59
2012/08/20 18:12:05 VCS INFO V-16-1-10493 Evaluating app-49-56 as potential target node for group App_Cluster
2012/08/20 18:12:05 VCS INFO V-16-1-50010 Group App_Cluster is online or faulted on system app-49-56

If the fault had been cleared on the 56 box, VCS would have brought the service group online there. Hope this helps.
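To spot this situation in an engine log, something like the sketch below could help. It looks for the V-16-1-50010 message quoted above and prints, without running it, the `hagrp -clear` command for each system that was rejected as a failover target. The function names `rejected_targets` and `suggest_clear` are hypothetical, and the parsing assumes the message format is exactly as shown in these excerpts.

```shell
#!/bin/sh
# rejected_targets LOGFILE GROUP
# (hypothetical helper) Prints each system that was skipped as a failover
# target because GROUP was already "online or faulted" there (V-16-1-50010).
rejected_targets() {
    grep 'V-16-1-50010' "$1" \
        | grep "Group $2 is online or faulted on system" \
        | awk '{print $NF}' \
        | sort -u
}

# suggest_clear LOGFILE GROUP
# Prints (does not run) one "hagrp -clear" command per rejected system.
suggest_clear() {
    for sys in $(rejected_targets "$1" "$2"); do
        echo "hagrp -clear $2 -sys $sys"
    done
}
```

Run against the log on the 56 box, for example `suggest_clear /var/VRTSvcs/log/engine_A.log App_Cluster` (assuming the default log location), this would print `hagrp -clear App_Cluster -sys app-49-56` for the excerpts above. Fix the underlying resource problem before clearing the fault, otherwise the group will simply fault again on the next failover.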

anand_raj
Level 3
Employee Accredited Certified

Hi mkruer,

Any comments about my previous comment?
 

Thanks.

mkruer
Level 4

The script that was running had a bug in it: it checked the state of VCS and terminated if the expected state was not met. So on startup, while VCS was still bringing everything online, the script would check the state of VCS, see that nothing was up yet, and kill the process. VCS then queried the service, saw that the PID was not up (because the script had exited), and faulted the resource. As for the other issue, I think it's safe to say it's hardware related.
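For anyone hitting the same thing, one way to make such a check tolerate startup ordering is to retry for a while instead of exiting on the first failed check. This is not the actual fix applied to the script, just a minimal sketch; `wait_until` and the `vcs_ready` check mentioned below are hypothetical names.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECS INTERVAL_SECS COMMAND [ARGS...]
# (hypothetical helper) Re-runs COMMAND until it succeeds or roughly
# TIMEOUT_SECS of sleeping have elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
    timeout="$1"; interval="$2"; shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

The service script could then call `wait_until 300 5 vcs_ready || exit 1`, where `vcs_ready` stands for whatever state check the script already performs. The timeout should be chosen with the agent's online timeout in mind, which this sketch does not look at.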