Network connectivity loss solution

eu22106 · ‎07-03-2012

Hi all,

Some time ago we experienced a total network failure between our two datacenters. The failure only lasted a couple of seconds, but it left us with one active node and a second node that was trying to start all instances. When connectivity was restored, both nodes stopped working (to prevent split brain).

My question: Is there a way to increase the time interval after which the nodes take action in such a case? E.g. so that the inactive node only starts when connectivity has been lost for more than 5 minutes.

Ivo

Marianne · ‎07-03-2012

To prevent split brain, Symantec recommends 2 heartbeat links on completely separate infrastructure.
If this cannot be guaranteed, it is best to change failover to manual and rely on notification for administrator intervention.
Read up in VCS Admin Guide on Controlling VCS behavior.
My problem with '5 minutes' is that network outages can last for longer than that. At one of our customers, a construction worker dug too deep and completely severed network cables between 2 buildings...

On Unix clusters, I/O fencing can be used, but this is not available on Windows clusters.


joybanerjee81 · ‎07-05-2012

I agree with the above statement: we need two heartbeat networks when configuring a VCS cluster for Windows.

How to Configure Veritas Global Cluster?

Follow this link: www.aikitsupport.com/HowtoConfigureVERITASGlobalClusterServer

On a Unix OS we have I/O fencing as a mechanism to prevent split brain. For Windows VCS with multiple global clusters, the alternative is to configure a Steward server on the network.

eu22106 · ‎07-06-2012

Correct. We have heartbeats configured (2x LLT and one on the public IP). We also use dependencies to prevent split brain, e.g. the VMDg does not start without a majority of the disks, and the IP is dependent on the VMDg. In this particular case our datacenter switch (a redundant setup) went down completely because of a bug, for a minute or so. Lately we have seen this more often. That is why I would like to increase that time period without actually creating a manual-failover cluster; a 5-minute delay before things switch is acceptable.

mikebounds · ‎07-06-2012

If you lose all network connections at the same time, a node running VCS cannot distinguish this from the other node dying. So you can increase the timeouts, but this means that if you lose a node, it will take longer for VCS on the other node to take action. The timer you need to change is the LLT peer inactive timeout. I think by default this is set to 16 seconds, so if you want to change it to, say, 20, you would add the following line to the %VCS_HOME%\comms\llt\llttab.txt file on each node (set in ms):

set-timer peerinact:20000

and then you need to restart LLT. This means stopping VCS, but you can leave the apps up:

hastop -all -force (on one node)

net stop llt (on all nodes - this will stop all dependent services - gab and VCSComms)

net start had (on all nodes - this will start dependent services - llt, gab and VCSComms)
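If the llttab change has to be made on several nodes, the edit itself could be scripted. The following is only a minimal sketch: the helper name set_peerinact is made up for illustration, it works on the file contents as a string, and it passes the timer value through unchanged without interpreting its unit. Reading and writing the actual file, and the LLT restart above, are left out.

```python
def set_peerinact(llttab_text: str, value: int) -> str:
    """Return llttab contents containing exactly one peerinact set-timer line.

    Hypothetical helper, not part of VCS. 'value' is written as given;
    check the unit your LLT version expects before using it.
    """
    directive = f"set-timer peerinact:{value}"
    result = []
    replaced = False
    for line in llttab_text.splitlines():
        stripped = line.strip()
        # Match an existing (uncommented) peerinact timer directive.
        if stripped.startswith("set-timer") and "peerinact:" in stripped:
            if not replaced:
                result.append(directive)  # update the first occurrence
                replaced = True
            continue  # drop any further duplicates
        result.append(line)
    if not replaced:
        result.append(directive)  # no existing entry: append one
    return "\n".join(result) + "\n"
```

Running the helper twice with the same value gives the same result, so it should be safe to re-apply on a node that has already been changed.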

On VCS for UNIX you can run:

lltconfig -T peerinact:20000

lltconfig -T query (to check peerinact is set)

This sets it straight away, and you also add it to llttab so that it is persistent. This may work on Windows too; I couldn't find it in the Windows VCS admin guide, but it is worth trying to see if the command works.

I would not advise setting this timeout to 5 mins, as this will affect failover time when a node dies, as mentioned earlier. You should instead investigate why all networks can go down together, as they should be independent.
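The trade-off can be made concrete with a deliberately simplified model. It assumes the surviving node declares its peer dead once heartbeats have been silent for the full peer inactive timeout, and it ignores everything else VCS does. The names Outcome and evaluate are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    peer_declared_dead: bool
    seconds_until_action: Optional[float]  # None if no action is taken


def evaluate(silence_seconds: float, peerinact_seconds: float) -> Outcome:
    """Simplified model: the peer is declared dead only if heartbeats stay
    silent for at least the peer inactive timeout."""
    if silence_seconds >= peerinact_seconds:
        return Outcome(True, peerinact_seconds)
    return Outcome(False, None)


if __name__ == "__main__":
    scenarios = {
        "60 s switch outage": 60.0,
        "node really dies": float("inf"),  # heartbeats never return
    }
    for timeout in (16.0, 300.0):
        for name, silence in scenarios.items():
            print(timeout, name, evaluate(silence, timeout))
```

In this model a 300-second timeout rides out the 60-second switch outage, but a node that really dies is also not acted on until 300 seconds have passed, which is the cost described above.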

Regarding split brain, do you mean VMDg depends on IP (IP comes up first) as opposed to IP depends on VMDg? Only the former will help to protect against split brain, although if the network is down, the online of the IP may succeed even though the IP is up on the other node, as the IP can't be seen on the network.

Mike
