
SFW HA 5.1 cluster doesn't work when the LLT heartbeats are disconnected

khlow
Level 2
Hi All,
 
In my case, I have SFW HA 5.1 installed and the configuration is almost complete; we have now reached the UAT stage. One of our UAT tests is to simulate a fault on the LLT heartbeat interface and the public interface. When the LLT heartbeat or public interface is disconnected, the cluster initiates a service group failover to the other node successfully, but while the service group is being brought online on that node, the VMDg and MountV resources fail to come online with the error "Agent is calling clean for resource because the resource is not up even after online completed." The partially online service group then hangs, and a while later its status changes to faulted on the passive node.
 
I suspect this is because the VMDg is still locked by the active node when the cable is disconnected, so the service group cannot come online on the passive node: the agent is unable to clean up the VMDg and MountV resources that are still held by the original node. Please advise. Appreciate it...
9 REPLIES

M__Braun
Level 5
Employee

Hi,

 

> the VMDg and MountV resources fail to come online with the error

 

Could you please explain why the VMDg and MountV resources fail to come online?

 

The MountV resources depend on the VMDg resources. A simplified resource dependency tree would look like this:

 

Application

|

MountV

|

VMDg 
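In main.cf terms, a minimal sketch of such a group could look like the following (the group, resource, and attribute values are illustrative placeholders, not taken from your configuration):

group AppSG (
    SystemList = { NODE1 = 0, NODE2 = 1 }
    AutoStartList = { NODE1 }
    )

    VMDg data_dg (
        DiskGroupName = AppDG
        )

    MountV data_mnt (
        MountPath = "E:"
        VolumeName = DataVol
        VMDGResName = data_dg
        )

    GenericService app_svc (
        ServiceName = MyAppService
        )

    // MountV depends on VMDg, and the application depends on MountV
    data_mnt requires data_dg
    app_svc requires data_mnt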

 

I hope this helps.

 

Regards

 

Manuel

khlow
Level 2

Hi Manuel,

 

First of all, thanks for reply...

 

I am still trying to find out why the VMDg and MountV resources fail to come online on the passive node when the LLT heartbeat connection is dropped (to simulate a faulty NIC). The only error I get says "Agent is calling clean for resource because the resource is not up even after online completed"; when that happens, the service group changes state to faulted.

I have tested cluster failover by switching the service group between nodes manually and by shutting down the active node to simulate a server failure. In both cases the cluster fails over successfully and brings the service group online on the other node without any issue.
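(For reference, a manual switchover like this is typically done either from the Cluster Manager GUI or with the standard VCS command line, roughly as follows; the group and system names are placeholders:)

REM switch the service group to the other node
hagrp -switch AppSG -to NODE2

REM confirm the group and resource states afterwards
hastatus -sum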


I understand the resource dependency; the MountV resource probably failed because its VMDg dependency did not come online in time... My question is: when the LLT heartbeat is disconnected, the cluster detects the fault and initiates a failover to bring the service group online on the passive node. It should work that way, but in my case it did not work as expected...

 

What workarounds can I use to troubleshoot this so that the service group fails over successfully to the other node when the LLT heartbeat is detected as faulty? Please advise.

 

many thanks.

M__Braun
Level 5
Employee

Hi,

 

I have already asked some colleagues who are more savvy with VCS on Windows for help.

 

Sorry for not being of more help.

 

Regards

 

Manuel

 

pbelk
Not applicable
Employee

Hello,

 

I apologize for the delay in responding. Could you please set the log level on the VMDg and MountV resources to the DBG_AGDEBUG level? This will write additional log information to the log files under %VCS_HOME%/log for the two resources.
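A minimal sketch of turning this on at the resource-type level from the command line, assuming the standard ha utilities (please verify the exact attribute keys against your release's documentation):

haconf -makerw
REM enable agent-framework debug logging for both agents
hatype -modify VMDg LogDbg DBG_AGDEBUG
hatype -modify MountV LogDbg DBG_AGDEBUG
haconf -dump -makero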

 

Has a support case been opened for this issue? If not, that may be the best way to ensure a review of the configuration, attributes, and service group definition for your environment.

 

Please let us know the results of testing with the additional log level and whether a support case will be pursued.

 

Regards,

 

 

Paul

 

Joost
Level 3

Hi Khlow,

 

Try setting these attributes:

 

For the VMDg resource attributes:

ForceDeport:  true

ForceImport:  true

 

For the MountV resource attributes:

ForceUnmount:  ALL
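(A minimal sketch of setting these from the command line, assuming the standard VCS CLI; the resource names below are placeholders for your own VMDg and MountV resources:)

haconf -makerw
REM boolean attributes take 1 (true) / 0 (false)
hares -modify data_dg ForceDeport 1
hares -modify data_dg ForceImport 1
hares -modify data_mnt ForceUnmount ALL
haconf -dump -makero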

 

Joost

TomerG
Level 6
Partner Employee Accredited Certified

Note that by disconnecting all network interfaces simultaneously you are simulating "Split-brain", or network partition, which is a condition that you should never allow your cluster to go into. VCS uses the term "concurrency violation" for any application that has gotten into this state of running simultaneously on more than one machine.

All clustering products hope never to get into this situation, because it can lead to data corruption if the application is allowed to run on both sides of the cluster simultaneously.

Some clustering products use quorum disks to help prevent this; however, quorum disks usually become points of failure themselves and often cause more downtime than they actually prevent.

Some clustering products use disk heartbeats, but again, depending on how this is implemented, they can also cause more downtime due to disk failure than they actually resolve.

Also, neither quorum disks nor disk heartbeats prevent certain types of split-brain, like those caused by intermittent system hangs (or on Solaris: "Stop-A" followed by "go" a minute later).

The only feature I've seen that absolutely prevents split-brain in all these conditions is I/O Fencing, which is a VCS feature, but only on UNIX platforms, and only if you have modern disk arrays that support SCSI-3 Persistent Reservations.

I recommend having as many heartbeats as possible, specifically having low priority LLT (VCS) heartbeats on the public interface. The chance that all your private and public networks fail simultaneously (but the fiber cables and storage remain) is next to zero.

In my opinion, public low-priority heartbeats are the simplest thing you can do: they involve the least work and risk while giving the greatest benefit in preventing split-brain.
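On the UNIX platforms the low-priority link is simply an extra entry in /etc/llttab alongside the private links; a minimal sketch (node, cluster ID, and device names are examples only) looks like this:

set-node node1
set-cluster 100
link e1000g2 /dev/e1000g:2 - ether - -          # private heartbeat link
link-lowpri e1000g0 /dev/e1000g:0 - ether - -   # low-priority heartbeat over the public NIC

On Windows (SFW HA) the equivalent is normally selected in the cluster configuration wizard when you designate one of the adapters as a low-priority link.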

Mohamed_Magdy
Level 2
Partner Accredited Certified
Hi ALL,

I am also running VCS + NetBackup, and I would like to know the steps for performing UAT on this setup and the expected behaviour for each test.

Thanks in advance.

regards

pachai
Level 2
Sorry to resurrect this old thread, but it points directly at a current issue for me.
We run a couple of older VCS clusters, but are having trouble getting LLT working on a new cluster (VCS 5.1).
VCS seems to be more stringent than in the past.

>>>I recommend having as many heartbeats as possible, specifically having low priority LLT (VCS) heartbeats on the public interface.
>>>The chance that all your private and public networks fail simultaneously (but the fiber cables and storage remain) is next to zero

We are presently waiting for the bureaucracy to approve a second, truly private network,
but it would be great to get started on testing.
This thread suggests that we could still use the "public" network - and in fact that we should.

e1000g0 is on a public network and has a 10.x.y.z address
e1000g1 is on the same vlan as g0, has a 192.168 IP - can rsh back and forth
e1000g2 is connected via crossover cable and has an IP in 192.168 - can rsh back and forth

The link records in /etc/llttab are identical on both hosts:
link e1000g2 /dev/e1000g:2 - ether - -
link-lowpri e1000g1 /dev/e1000g:1 - udp - - - - -

I would appreciate any suggestions on how to get to the next step (automated failover),
which won't work without LLT etc. We would like to have as many heartbeats as possible.
But in a T5120 there are only 3 slots - 2 for HBAs and 1 for SCSI - so... only 4 Ethernet ports.
(Parenthetical note: it's a really good idea to use 2U machines for the extra slots :)

Thanks
Seth
PS: does LLT log its success/failure?

pachai
Level 2

>>>link-lowpri e1000g1 /dev/e1000g:1 - udp - - - - -

I found documentation in the VCS Installation Manual, Appendix F, and updated this entry...

link-lowpri e1000g1 /dev/udp - udp 50000 - 192.168.155.55 192.168.155.255
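
A quick way to confirm that LLT now sees both links (assuming the standard LLT/GAB utilities are in the path) is something like:

lltstat -nvv | more     # per-node link status; both links should show UP
lltconfig -a list       # addresses LLT is using on each configured link
gabconfig -a            # GAB port membership once LLT is up on both nodes

(LLT itself also reports link up/down events to the system log, e.g. /var/adm/messages on Solaris.)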

Has anyone gotten this working as one of the 2 heartbeats?
We are working with Symantec, but the tech cannot answer this question.
Appendix F says it can be done.

Thanks