Solved: Testing cluster fails to restart the failed server

mmccurdy · ‎01-08-2013

Two nodes with disk fencing and a heartbeat.

During a cluster test, a graceful reboot or manual cluster failover properly moves the services to the passive node. On the next test, I'm killing had and hashadow to emulate a failover. The server never reboots, and the failover doesn't occur. I found an option to force a panic in this scenario, and that does work, but I cannot determine how to make it simply reboot in an attempt to right itself.

Regards,

Mark McCurdy

Wally_Heim · ‎01-08-2013

Hi Mark,

The issue that you are having with killing HAD and HAShadow is that the surviving nodes are monitoring both Port A and Port H memberships on the heartbeat. Killing HAD and HAShadow only takes out Port H membership, not Port A membership. Port H is HAD and Port A is GAB. The ShutdownTimeOut value is used to time the exiting of these two port memberships and will affect failover. However, when only Port H membership is stopped, the ShutdownTimeOut value does not come into play.
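
As a rough illustration of the point above, here is a minimal Python sketch that reads `gabconfig -a` style output and reports which ports still have membership. The sample text, the generation numbers, and the helper names `port_memberships` and `describe` are assumptions for illustration only, not output captured from your cluster.

```python
import re

# Illustrative sample only: roughly what `gabconfig -a` style output might
# look like after HAD and HAShadow are killed (Port h gone, Port a still up).
SAMPLE_AFTER_KILL = """\
GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 01
"""

# Illustrative healthy state: both GAB (Port a) and HAD (Port h) are members.
SAMPLE_HEALTHY = SAMPLE_AFTER_KILL + "Port h gen fd570002 membership 01\n"

# Matches lines such as "Port a gen a36e0003 membership 01".
PORT_LINE = re.compile(r"^Port\s+(\w)\s+gen\s+\w+\s+membership\s+(\S+)")


def port_memberships(text):
    """Return {port_letter: membership_string} for each membership line."""
    ports = {}
    for line in text.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            ports[match.group(1)] = match.group(2)
    return ports


def describe(text):
    """Summarize which of Port a (GAB) and Port h (HAD) have membership."""
    ports = port_memberships(text)
    gab_up = "a" in ports
    had_up = "h" in ports
    if gab_up and had_up:
        return "GAB and HAD both have membership"
    if gab_up:
        return "GAB membership only: HAD is down, node still seen on the heartbeat"
    return "no GAB membership"


if __name__ == "__main__":
    print(describe(SAMPLE_HEALTHY))
    print(describe(SAMPLE_AFTER_KILL))
```

In the second case the node still holds Port A membership, which is why the surviving node does not treat it as gone.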

There are some GAB settings that you can put in the gabtab to have GAB panic a server if the heartbeat network is lost and then returns. It is called "Halt on Rejoin" and I think it is the -j option in the gabtab. I have not seen it used much in Windows and I'm not sure how often it is used in Linux or Unix versions of the product. But it might be what you are looking for to cause a panic of the server by playing with the heartbeats. I would not recommend forcing a panic on a production server.

The only other thing that I can think of is physically pulling the power cable to force the server down. Depending on the hardware being used, this might not be possible. But I would also not recommend this on a production server.

Thank you,

Wally

mmccurdy · ‎01-08-2013

Is there a better way to test the cluster before putting it into production besides hitting the virtual power button? I thought killing had and hashadow was enough. What if I killed gab at the same time? After ShutdownTimeout, would it move the resources to node B and reboot A?

Thanks for the reply,

Mark

Wally_Heim · ‎01-09-2013

Hi Mark,

It is very unlikely that 3 processes will terminate at the same time in the real world without something else causing the failure.

To answer your question, killing HAD, HAShadow and GAB at the same time (or very close) should cause the other node to try to online the service group. But it will not cause the problem node to reboot. The problem that you will have is that, without a reboot of the problem node or HAD being active to shut down the service groups, the surviving node will not be able to take over all resources successfully.

What problem/disaster situation are you trying to simulate in your testing? Maybe your testing needs to be changed to more accurately reflect the situation that you are trying to test.

Thank you,

Wally

mmccurdy · ‎01-09-2013

Thanks Wally. I'll skip that method of testing and use the built-in fire drill instead.

Regards,

Mark
