02-20-2012 06:28 AM
Recently one of the cluster nodes was rebooted because all heartbeat networks went down (due to some changes on a switch; the outage lasted approximately 60 seconds).
We informed the network team about the reboot, and they in turn suggested changing the heartbeat timeout value to 60 seconds.
Requesting your help: is it advisable to change the heartbeat timeout value to 60 seconds?
I think the default value is 15 seconds. If we change the value from the default, what are the consequences?
02-21-2012 06:06 AM
It is possible to set the heartbeat timeout to 60 seconds. However, on the Windows platform we don't recommend setting the heartbeat value above 30 seconds.
If you are concerned about the reboot, there are several switches in the heartbeat configuration that control whether the node reboots in certain situations. It sounds like you have one or more of these switches set. You could check whether disabling the reboot option is closer to what you are looking for.
02-22-2012 06:40 AM
I wouldn't really recommend that value, for a couple of reasons:
1. Manually increasing the timeout value increases the time the cluster takes to detect a fault, which means delayed fault detection and delayed corrective action. The business may not be able to tolerate that; if the running apps are mission critical, even 30s may matter.
2. We are talking about 30s on the heartbeat, so in a split-brain situation you are intentionally delaying the cluster from taking action, which could be serious (I hope you have I/O fencing in place).
LLT, or heartbeat, is a very crucial part of the cluster. In a running cluster, heartbeats are exchanged every 1 second to track the status of the other nodes. A total of 15s of LLT timeout + 15s of GAB timeout gives 30s of failover detection, which I believe matters a great deal for stability and resilience.
In my opinion, it would not be a wise idea.
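To make the arithmetic above concrete, here is a minimal Python sketch. It only uses the figures quoted in this thread (15s LLT timeout, 15s GAB timeout, a proposed 60s value); the function names are hypothetical, the 60s case assumes the whole detection window is stretched to that value, and none of this reads or changes a real cluster configuration.

```python
# Illustrative only: the numbers come from this thread, not from a
# verified product default. Function names are hypothetical.

def detection_window(llt_timeout_s, gab_timeout_s):
    """Total time before the cluster acts on lost heartbeats."""
    return llt_timeout_s + gab_timeout_s


def outage_triggers_action(outage_s, llt_timeout_s, gab_timeout_s):
    """True if a heartbeat outage lasts at least as long as the detection window."""
    return outage_s >= detection_window(llt_timeout_s, gab_timeout_s)


if __name__ == "__main__":
    current = detection_window(15, 15)
    print("Detection window with the quoted values:", current, "s")
    # A real node failure would go unnoticed for this many extra seconds
    # if the window were stretched to the proposed 60s.
    print("Extra delay at 60s:", 60 - current, "s")
    print("60s outage triggers action:", outage_triggers_action(60, 15, 15))
```

The point of the sketch is the trade-off: a longer window rides out a longer network outage, but every genuine failure is also detected that much later.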
02-24-2012 06:49 AM
Thanks for your input.
Due to a spanning tree problem, the network engineer asked us to change the value to 60 seconds to avoid cluster failover.
The customer also does not want this failover :(
They are saying the spanning tree issue may take 45 to 60 seconds to resolve.
Could you please confirm what the default heartbeat timeout value is: 15 seconds or 30 seconds?
02-27-2012 06:37 AM
I agree with the above comments that increasing the timeout value is not advisable.
- To avoid failover, freezing the service group (SG) is an option.
- However, if LLT is completely down, the node will go down.
If you are using network switches between LLT links, there should be two switches for the two high-priority LLT links. Also, it is advisable to make changes on only one switch at a time.
Having a single switch for all LLT links is again a risk: it is a single point of failure for the LLT links.
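As a rough way to reason about the switch layout described above, the following Python sketch checks which LLT links would stay up while one switch is being changed. The link and switch names are made up for illustration, and the mapping is entered by hand; it is not read from any cluster tool.

```python
# Hypothetical link-to-switch mapping, entered by hand for illustration.

def links_surviving(link_to_switch, switch_under_change):
    """Return the links that do not pass through the switch being changed."""
    return sorted(
        link for link, switch in link_to_switch.items()
        if switch != switch_under_change
    )


def is_single_point_of_failure(link_to_switch, switch):
    """True if taking this switch down would drop every heartbeat link."""
    return len(links_surviving(link_to_switch, switch)) == 0


if __name__ == "__main__":
    two_switches = {"link1": "switchA", "link2": "switchB"}
    one_switch = {"link1": "switchA", "link2": "switchA"}

    print(links_surviving(two_switches, "switchA"))           # one link remains
    print(is_single_point_of_failure(one_switch, "switchA"))  # all links lost
```

With two switches, a change on either one still leaves a heartbeat link up; with a single switch, the same change takes every link down at once.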