The hastatus shows OFFLINE|FAULTED

Kevin_Helmut
Level 2
Hi Folks,

We recently installed VCS 4.1 on a two-node cluster (mars / venus). We implemented the clean, monitor, offline, and online entry-point scripts in Perl.
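For a script-based agent like this, VCS reads the resource state from the monitor entry point's exit code (100 = offline, 101-110 = online, with 110 meaning fully confident). A minimal sketch, shown as shell for brevity; the process name app_server is illustrative:

```shell
#!/bin/sh
# Minimal monitor sketch for a script-based VCS agent.
# VCS convention: exit 100 = offline, 101-110 = online (110 = fully confident).
# "app_server" is an illustrative process name, not from the original post.
if pgrep -f app_server >/dev/null 2>&1; then
    exit 110        # resource is online
else
    exit 100        # resource is offline
fi
```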

So far so good. We created the Service Group (APP-RG) and the supporting resources (IP, DiskGroup, Volume, Mount points), plus an additional application resource whose type definition looks like this:

# cat AppSrvTypes.cf
type APPSrv (
        static int MonitorInterval = 180
        static int MonitorTimeout = 180
        static int OnlineRetryLimit = 1
        static int OnlineWaitLimit = 1
        static int RestartLimit = 2
        static str ArgList[] = { State, InstanceName, LogHostName, PrtStatus, DebugMode }
        str InstanceName
        str LogHostName
        str PrtStatus
        str DebugMode
)
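For context, an instance of this type in main.cf would look roughly like the following. The resource name matches the hastatus output later in the post, but all attribute values here are made up for illustration:

```
APPSrv application_server_resource (
        InstanceName = app01
        LogHostName = mars
        PrtStatus = ENABLED
        DebugMode = OFF
        )
```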

Then I created the required dependencies. Now over to testing:

Very Simple Test:

Fail the service group APP-RG over from mars to venus, and vice versa:

./hagrp -switch APP-RG -to mars
./hagrp -switch APP-RG -to venus

Everything works fine. I was able to fail over in both directions, and verified it via ./hagui and ./hastatus.


Second Test: (Forced Failover)

Here the problem arises. As you can see above, our AppSrvTypes.cf has static int RestartLimit = 2, so to force a failover we have to kill a monitored process more than twice(??).

So we tried to induce a failover by killing one of the monitored processes more than twice. The failover is triggered :). But :( the sad thing is that when I check ./hastatus -sum, the node on which the process was killed shows the group state as OFFLINE|FAULTED.

For example, the Service Group is currently on mars, and we forced the failover by killing the process on mars more than twice. This fails the group over to venus. But even though the Service Group is ONLINE on venus, its status on mars is OFFLINE|FAULTED.

Please help me, as we're planning to upgrade to VCS 5.0 eventually.

Are there any other parameters that need to be set? Or is our application busted?

# /opt/VRTSvcs/bin/hastatus -sum

-- SYSTEM STATE
-- System          State       Frozen

A  mars            RUNNING     0
A  venus           RUNNING     0

-- GROUP STATE
-- Group           System   Probed   AutoDisabled   State

B  ClusterService  mars     Y        N              OFFLINE
B  ClusterService  venus    Y        N              ONLINE
B  APP-RG          mars     Y        N              OFFLINE|FAULTED
B  APP-RG          venus    Y        N              ONLINE

-- RESOURCES FAILED
-- Group           Type     Resource                       System

C  APP-RG          AppSrv   application_server_resource    mars


Thanks very much in advance.

Gene_Henriksen
Level 6
Accredited Certified
It is working correctly from what you describe. RestartLimit = 2 tells the agent that if the resource goes offline unexpectedly, it should be restarted. It can restart twice within the time window specified by ConfInterval (default 600 seconds).
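The restart window can be inspected and tuned per type with hatype. A sketch, using the custom type name from the post; check your version's command reference for exact syntax:

```shell
# Show the current attribute values for the custom type:
hatype -display APPSrv

# Widen the window within which the two restarts must occur (default 600s):
haconf -makerw
hatype -modify APPSrv ConfInterval 900
haconf -dump -makero
```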

When you kill it to force the failover, VCS considers the app to have failed. It is now online on the other system, and offline with a faulted state where you killed it. That is correct. The idea is that you should research the reason it faulted and then clear the fault; that allows the group to fail back if it faults on the node where it is currently running.
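Clearing the fault after investigating is done with hagrp -clear; using the group and node names from the post:

```shell
# After fixing the underlying problem on mars, clear the fault
# so the group can fail back there again:
hagrp -clear APP-RG -sys mars

# Verify the state is now plain OFFLINE:
hastatus -sum
```

Faults can also be cleared per resource with hares -clear if you only want to clear one resource rather than the whole group.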

To prevent failures from producing a ping-pong of failovers bouncing back and forth, VCS will not fail a service group over to a node where a critical resource in that group is faulted.

If you plan to migrate the app from one server to another, use the hagrp -switch command. If the app fails multiple times, it will fail over to the other node. You can configure a notifier to send you email or SNMP traps so you are aware of the problem.
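Notification is typically set up with the bundled NotifierMngr resource in the ClusterService group. A rough sketch only: the resource name, mail server, and recipient below are placeholders, and the exact SmtpRecipients syntax varies between VCS versions, so verify against the bundled agents guide for your release:

```shell
haconf -makerw
hares -add ntfr NotifierMngr ClusterService
hares -modify ntfr SmtpServer smtp.example.com
hares -modify ntfr SmtpRecipients admin@example.com Warning
hares -modify ntfr Enabled 1
haconf -dump -makero
```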

What is it you expect VCS to do in this case?

Kevin_Helmut
Level 2
Hi Gene,

Thanks very much for your prompt response. Let me check that I understand you correctly.

The way VCS works is: if an application is made highly available on a two-node cluster and it crashes more times than the RestartLimit on Node 1, it fails over to Node 2 and stays online there, and Node 1, where the application crashed, is marked FAULTED. It stays that way until the cause of the crash is figured out, and once the cause has been analyzed, the administrator has to clear the FAULTED flag explicitly? Hmmm, so in a two-node cluster, if the application then also crashes on Node 2, the application is faulted on both nodes and offline on all nodes of the cluster? Correct?

We achieve HA with Sun Cluster as well as Veritas Cluster Server. I was confused because Sun Cluster does not mark any node as faulted: it keeps shuttling the application between the nodes of the cluster and, depending on certain properties, gives up after a certain number of attempts.

Thank You
-

Gene_Henriksen
Level 6
Accredited Certified
VCS assumes that if a resource faults, there may be an underlying reason such as a memory issue, a disk controller, a NIC, etc.

Failing back and forth could cause an app to corrupt data. One student told me of an MSCS cluster where their SQL failed back and forth all night and corrupted the file system.

With notification, someone should be aware of the problem when it first occurs. That person should then investigate it. If they can fix the cause of the problem, they should clear the fault.

You could do the following, though I in no way recommend it: create a resfault trigger that automatically sends notification and clears the fault. That would allow the group to bounce back and forth. Or, less drastically, implement a nofailover trigger. This fires only when the service group faults on the last possible node and has nowhere to go; you could then decide, programmatically, whether to attempt to bring it back up.
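A nofailover trigger is just an executable dropped into the triggers directory. A hedged sketch, assuming the argument order described in the VCS triggers documentation (verify against your version before relying on it):

```shell
#!/bin/sh
# /opt/VRTSvcs/bin/triggers/nofailover
# Fired when a service group faults on the last available node.
# Assumed arguments: $1 = system the group was last online on, $2 = group name.
SYSTEM="$1"
GROUP="$2"

# Record the event somewhere an operator will see it (log path is illustrative):
echo "`date`: $GROUP has no failover target (last online on $SYSTEM)" \
    >> /var/VRTSvcs/log/nofailover_trigger.log

# You could decide programmatically to retry; NOT recommended as a default:
# hagrp -clear "$GROUP" -sys "$SYSTEM" && hagrp -online "$GROUP" -sys "$SYSTEM"
```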

So what would Sun Cluster do with a corrupt file system? Perhaps fsck with -y? If so, and you repeatedly did that after app faults, could it remove data?

Hywel_Mallett
Level 6
Certified
I agree with Gene that what you're seeing is as it should be.
Off topic, but I wasn't aware that MSCS doesn't mark resources as faulted. Seems a bit of an oversight! The way VCS works is always designed with reliability in mind. It basically performs lots of very simple tasks, based on very simple rules. It's only the number of rules and variables in use that make it seem complicated at first glance. If you can understand the rules it uses, you're half-way to understanding VCS.
Also, if your two cluster nodes are on Mars and Venus perhaps you should be looking at the Inter-planet Cluster Option... it's like the Global Cluster Option, but bigger ;) :) :)

Gene_Henriksen
Level 6
Accredited Certified
Apparently, MSCS has an option not to fail over to a node where the group has faulted.

Kevin_Helmut
Level 2
Right now we're exploring different HA solutions so that our QA team can come up with a test plan.

We currently don't have any customers on Mars or Venus ;-). Once we do, we'll definitely explore the Inter-planet Cluster Option. Thanks for your answers, Gene.

_KH