Veritas High Availability 5.0 Data Loss & Split-Brain scenario

bob1_skew
Level 2
 

Hi,

 

I am relatively new to Veritas clustering; my experience so far is limited to MSCS clusters and AutoStart. I am hoping this forum can answer a few questions.

 

I have been given the task of managing a set of clusters with the following high-level configuration:

 

  • 4 VCS Clusters hosted on IBM Blades in two data centers - 1 node of each cluster in each data center
  • Veritas HA 5.0 (VCS and Storage Foundations); Windows Server 2003 Standard
  • Private and Public VLANs share the same physical network
  • IP Heartbeats go over two dedicated fibre links between two switches on the chassis
  • SAN-based storage. Each host is zoned and masked to LUNs at each site; each volume has a mirror at each site, with access to the remote array via Inter-Switch Links (ISLs). Brocade switches with ISL trunking; IBM arrays.

 

My questions are:

 

1. Is my understanding of an ISL link failure correct with VCS/VXIO?

For example, suppose all ISL communication between the sites is lost while the application is active at the Production site. In this case, all resource groups will continue to function normally because the storage is alive at the Production site, but no data is replicated to the DR site while the ISLs are down.

An hour after this ISL failure, the production node (or indeed the whole site) fails, which causes its groups to fail over to the DR node. At this point, what does the DR node do? Will it bring the group online and import the disks from the DR array, and thus be missing the last hour's transactions, since no ISLs were available to update the remote array?

 

What setting within VCS/VXIO controls automated site failover, and is there anything to handle this scenario, given that it effectively results in data loss?

 

2. Split-Brain. If both fibre links used for heartbeats fail between the sites, will the cluster split-brain, or does that require a complete communications failure?

 

1 ACCEPTED SOLUTION

Accepted Solutions

H__Shannon
Level 3

Bob,

 

You can post your question about ISLs to either the Storage Foundation Family or the Volume Manager forum of the Symantec Technology Network.

 

I would phrase your question like this:

In a VCS stretched cluster using VxVM to mirror data between the sites, will VxVM provide a message, warning or error when the ISLs between the sites are broken?  How long should this notification take?

Regards,

Hugh

5 REPLIES 5

H__Shannon
Level 3

Hey Bob,

 

In your description you mentioned both mirrors and replication between the sites.  Which is it?

 

For question #1, your understanding is correct.  The VCS service groups will remain up and running.  However, you should be getting warnings that the link is broken.  The question is where those warnings will come from; most likely not from VCS.  That depends on whether mirroring or replication is used to move data from the first data center to the second.  Synchronous mirroring or synchronous replication should be the default in this configuration so that no data is lost on an automatic failover, so it is unlikely that the environment will be unaware of the broken ISL for much longer than a few seconds.

 

For question #2, a split-brain occurs in VCS when all communications links that are configured for VCS are broken.  So if you have 2 private heartbeat networks and a lowpri link running over the public network (the recommended configuration), you must lose all 3 links before a split-brain occurs.  In this configuration if you lose just the heartbeat links, the lowpri link gets promoted automatically and is used as the heartbeat link - no split-brain.
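To make the rule above concrete, here is a minimal sketch in Python. It is only a toy model of the condition I described, not anything VCS actually runs; the `Link` class, the link names and the `cluster_state` function are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str      # hypothetical label, e.g. "hb1"
    lowpri: bool   # True for the low-priority link on the public network
    up: bool       # current link state


def cluster_state(links):
    """Classify the heartbeat situation from the set of configured links."""
    up_links = [link for link in links if link.up]
    if not up_links:
        # Every configured link is gone: the nodes can no longer see each other.
        return "split-brain risk"
    if all(link.lowpri for link in up_links):
        # Only the lowpri link survives, so it carries the heartbeat.
        return "running on promoted lowpri link"
    return "normal"


links = [Link("hb1", False, True), Link("hb2", False, True), Link("public", True, True)]
print(cluster_state(links))
```

Marking `hb1` and `hb2` as down moves the result to the promoted-lowpri state; only when `public` is also down does the function report a split-brain risk.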

 

Regards,

Hugh

bob1_skew
Level 2

Hi,

 

Many thanks for your reply. As I am new to Veritas please excuse my ignorance of the product set.

 

The volumes have been presented as Dynamic, Mirrored Concatenated, and in VCS the VMDg force import setting is set to true. Does this mean that if all storage ISLs failed (and were not repaired), say, an hour before a complete failure of the active site, the automatic failover and forced import of the disk group at the DR site would bring the group online with data that is one hour old?

 

I have checked the HB configuration, and along with the two dedicated HB fibre links there is a lowpri configuration on the public VLAN. If all fibre and IP links fail, does this still result in a split-brain cluster? I know Microsoft has an MNS design where, if you have a third site, you can use a file share witness. Does VCS have something similar?

 

Thanks

 

 

H__Shannon
Level 3

Bob,

 

I think that you need to post a message (you'll have to get yourself added to the distribution list if you want to see the responses directly, or ask people to reply to you directly) to the following:

 

DL-SYMC-SymIQ-StorageFoundation@symantec.com

 

asking about how VxVM will respond in a situation where you are mirroring data between the sites, and you lose the ISLs carrying all the mirroring traffic.  I'm pretty sure that VxVM ought to be letting you know that the mirror is out of sync the first time that this happens, and not wait for an hour.  VCS expects the underlying storage to be functioning.  It has no visibility into a mirroring error.  As long as the application can write to the local storage, VCS will be very happy.
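To illustrate the scenario you are worried about, here is a toy Python model. It is not VxVM code and it does not claim to reproduce how VxVM handles a detached mirror; `MirroredVolume` and its methods are hypothetical names, and the model simply assumes that writes stop reaching the remote copy while the ISLs are down.

```python
class MirroredVolume:
    """Toy model: one copy of the data per site, linked by the ISLs."""

    def __init__(self):
        self.local = []     # writes that reached the production-site copy
        self.remote = []    # writes that reached the DR-site copy
        self.isl_up = True  # assumed state of the inter-site links

    def write(self, record):
        # The application only needs the local write to succeed.
        self.local.append(record)
        if self.isl_up:
            self.remote.append(record)

    def missing_on_remote(self):
        # What a forced import at the DR site would not see in this model.
        return [r for r in self.local if r not in self.remote]


vol = MirroredVolume()
vol.write("txn-1")
vol.isl_up = False   # ISLs fail; the application keeps running
vol.write("txn-2")
print(vol.missing_on_remote())
```

In this model the application never notices the ISL failure, which is the same view VCS has; whatever detects the stale copy has to sit at the volume manager level.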

 

Regarding your split-brain question: if you have 2 private networks and a lowpri network, and you lose both the fibre connections carrying the private networks and the IP connectivity carrying the lowpri network, you will have a split-brain situation.  The only way to mitigate this is to use I/O Fencing, which is very loosely analogous to MS's quorum disk.  In the event of a total communications failure between the 2 nodes, they will both initiate a race to obtain the keys placed on the coordinator disks (typically there are 3 of them).  The node that obtains the majority of the keys wins, and the other node will panic itself, even if it is running the application.  This way, the two nodes will not both try to write to the storage at the same time, each thinking that the other node is down.  The challenge with I/O Fencing in stretch clusters is that one site will get preferential treatment in the race, because you'll place 2 coordinator drives at that site, and the second site will get 1 coordinator drive.  Each site will only have visibility to the coordinator drives at that site.
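The race and the site bias can be sketched in a few lines of Python. This is a simplified model of the majority rule described above, not the actual fencing implementation; the function name, the node names and the disk names are assumptions for illustration.

```python
def fencing_race(visible, order, total=3):
    """Return the node that wins a majority of coordinator disks, or None.

    visible: dict mapping node name -> set of coordinator disks it can reach
    order:   node names in the order they reach the disks (first one wins a disk)
    total:   number of coordinator disks configured
    """
    won = {node: 0 for node in visible}
    all_disks = set().union(*visible.values())
    for disk in sorted(all_disks):
        for node in order:
            if disk in visible[node]:
                won[node] += 1
                break  # this disk is taken by the first node that can reach it
    for node, count in won.items():
        if count > total // 2:
            return node  # majority holder survives; the other node panics
    return None


# Stretch cluster after a total inter-site failure: each site sees only its own disks.
visible = {"prod": {"c1", "c2"}, "dr": {"c3"}}
print(fencing_race(visible, order=["dr", "prod"]))
```

With 2 coordinator disks at one site and 1 at the other, the site holding 2 wins regardless of who starts the race first, which is the preferential treatment mentioned above.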

 

Regards,

Hugh

H__Shannon
Level 3

Bob,

 

If you are not an employee of Symantec, you will not be able to join the distribution group that I mentioned.  That is an internal only group.

 

Regards,

Hugh
