Cluster Failover: Clearing VCS for Automatic Failover

mkruer
Level 4

Steps:

1. Fail over the cluster from App1 to App2 by shutting down App1.
2. Bring the App1 server back up, then fail over from App2 to App1.

When this scenario is repeated several times, we observe that at some point after a failover all the services are offline and the system doesn't come up.

How can I check the state of the other box once it is brought back up, and clear the error so that automatic failover can occur again?

-Matt-

ACCEPTED SOLUTION

mikebounds
Level 6
Partner Accredited

When you down a system, the service group fails over and then it gets autodisabled. When the downed system comes back, VCS is the very last thing to start (i.e. its rc script is S99). When VCS starts, it probes all the resources, which can take a minute, and it autoenables the service group once all resources are probed successfully.

So your issue is probably that you are downing the box too soon after it comes up, before the probes have finished. You need to check this with "hastatus -sum" before downing the box.
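
The check can be scripted rather than eyeballed. What follows is a minimal sketch, not a definitive implementation: it assumes that "hastatus -sum" prints one row per service group and system, prefixed with "B", with the columns Group, System, Probed, AutoDisabled, State. Verify that layout on your VCS version first. The function name groups_ready is hypothetical.

```shell
# groups_ready SYSTEM
# Reads `hastatus -sum` output on stdin. Exit status:
#   0  every group row for SYSTEM shows Probed=Y and AutoDisabled=N
#   1  at least one group row is not probed or is still autodisabled
#   2  no group rows were found for SYSTEM
# Assumed row layout (check it on your version):
#   B  <group>  <system>  <Probed>  <AutoDisabled>  <State>
groups_ready() {
    awk -v sys="$1" '
        $1 == "B" && $3 == sys {
            seen = 1
            if ($4 != "Y" || $5 != "N") {
                print "not ready: " $2 " on " $3 " (Probed=" $4 ", AutoDisabled=" $5 ")"
                bad = 1
            }
        }
        END {
            if (!seen) { print "no group rows found for " sys; exit 2 }
            exit (bad ? 1 : 0)
        }
    '
}
```

One way to use it before pulling the power on the second box is to poll until the recovered system is ready, for example "until hastatus -sum | groups_ready app1; do sleep 10; done". The 10 second interval is arbitrary.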

Mike


REPLIES

mikebounds
Level 6
Partner Accredited

If the service group fails due to a resource fault (for example, you kill an Oracle process or umount a filesystem), then you need to clear the fault before that system can be used for that service group again (hagrp -clear sg_name).

When a system comes back online, you must wait until all the resources are probed on that system before downing the system (use "hastatus -sum" to check that all resources are probed).

If a service group is offline when one system is up and the other is down, then the system needs to be in the AutoStartList for the offline service group in order for the service group to start when that system joins the cluster (so I recommend that both systems are in the AutoStartList).

If you are still having issues, then provide an extract from the log where you down a server and the services do not fail over.
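
To find out which faults need clearing, something like the sketch below may help. It assumes the same "hastatus -sum" layout as the standard summary (group rows prefixed with "B", State in the sixth column, a faulted group shown with a state such as OFFLINE|FAULTED); check this on your version. The function name print_clear_cmds is hypothetical. It only prints the "hagrp -clear" commands so they can be reviewed before anyone runs them.

```shell
# print_clear_cmds
# Reads `hastatus -sum` output on stdin and prints one `hagrp -clear`
# command for each group/system row whose State column contains FAULTED.
# Nothing is executed; the output is meant to be reviewed first.
# Assumed row layout (check it on your version):
#   B  <group>  <system>  <Probed>  <AutoDisabled>  <State>
print_clear_cmds() {
    awk '$1 == "B" && $6 ~ /FAULTED/ {
        print "hagrp -clear " $2 " -sys " $3
    }'
}
```

Typical use would be "hastatus -sum | print_clear_cmds". Fix the underlying cause of the fault before clearing it, otherwise the group is likely to fault again on the next online attempt.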
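
As a sketch of how the AutoStartList could be set, assuming the group is called sg_name and the systems are app1 and app2 (placeholders for your own names): the current value can be shown with "hagrp -display sg_name -attribute AutoStartList". The helper below wraps the usual sequence of opening the configuration, modifying the attribute, and saving it. The function name set_autostartlist and the RUN variable are hypothetical; setting RUN=echo prints the commands instead of running them.

```shell
# set_autostartlist GROUP SYSTEM...
# Replaces the group's AutoStartList with the given systems.
# Set RUN=echo for a dry run that only prints the commands.
# The systems are expected to already be in the group's SystemList.
set_autostartlist() {
    group=$1
    shift
    ${RUN:-} haconf -makerw &&                                # open config read-write
    ${RUN:-} hagrp -modify "$group" AutoStartList "$@" &&   # overwrite the list
    ${RUN:-} haconf -dump -makero                             # save and close config
}
```

For example, "RUN=echo set_autostartlist sg_name app1 app2" shows the three commands without touching the cluster. Drop RUN=echo to apply them.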

Mike

mkruer
Level 4

Mike,

The test in question is pulling the power on the active box, not a service group failure. Once power is restored to that system, the same action is performed on the second system (the one that was failed over to) without any checking of the system that went out. They are expecting it to fail over to the original system again.

mkruer
Level 4

I think what is happening is that sometimes, when the server comes back online, it sets AutoDisabled to True on the groups, preventing the groups from automatically failing over again.
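
One way to confirm this suspicion is to look at the attribute directly with "hagrp -display sg_name -attribute AutoDisabled". The sketch below filters that output. It assumes the display prints the columns Group, Attribute, System, Value with a header line starting with "#", and that an autodisabled group shows the value 1; check this against your version. The function name autodisabled_systems is hypothetical.

```shell
# autodisabled_systems
# Reads `hagrp -display <group> -attribute AutoDisabled` output on stdin
# and prints each system on which the group is currently autodisabled.
# Assumed layout (check it on your version):
#   #Group   Attribute      System   Value
#   sg_name  AutoDisabled   app1     1
autodisabled_systems() {
    awk '$1 !~ /^#/ && $2 == "AutoDisabled" && $4 == "1" { print $3 }'
}
```

Typical use would be "hagrp -display sg_name -attribute AutoDisabled | autodisabled_systems". VCS also has "hagrp -autoenable sg_name -sys app1" as a manual override, but it should only be used when you are certain the group is not online on that system.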
