Solved: Split Brain in GCO

vipulbhadra · 01-29-2012

I was just testing this in VMware Workstation.

I configured two nodes as my primary DC servers, providing HA for SQL 2005 on Windows 2008 R2 (x64).

I also configured a single-node cluster as my DR site. Both clusters are connected using the Global Cluster Option. Data is being replicated using Veritas Volume Replicator.

All of the above run Windows 2008 R2 Standard edition as the OS. I have configured HA-DR using Storage Foundation - High Availability & Disaster Recovery 6.0.

I'm facing an issue with the following scenario:

1) I'm able to switch the Service Group over to the DR site, and also switch it back to the primary site.

2) I keep the Service Group on the primary nodes and then forcefully shut down the OS. At the DR site, VCS prompts me to take action and brings the Service Group online at the DR site.

3) Now I manually bring up the nodes at the primary site, but I'm confused about what steps to take to make the "previously primary" site act as the new secondary. If I just manually start the SG, I end up making both sites primary. How do I do the role reversal after the primary site has come back up?

Regards.

Wally_Heim · 01-30-2012

Hi vipulbhadra,

The answer to your question is a simple one. After doing the takeover as you describe and bringing the original primary site back online, go into VEA -> switch to Replication Network, find the RVG for the RDS in question that has the "Resync Secondaries" option, and select it. "Resync Secondaries" will switch the original primary, which is now an acting secondary, to a full secondary and begin the resync process to bring it up to date with the changes made at the DR site.

Thanks,

Wally

joseph_dangelo · 01-29-2012

There are a few concepts that you should familiarize yourself with prior to configuring HA/DR for SQL. When you configure Global Clusters for SQL Server, Primary and Secondary denote which site has the active data volumes and which site is receiving the updates from whatever replication technology you are using. For example, in the case of VVR (the Replicator Option for Storage Foundation), VCS will handle all of the migration controls (Primary/Secondary) using the VVR agents bundled with SFWHA/DR. You may also choose any one of the supported hardware-based replication agents (EMC, IBM, HDS, NetApp, etc.).

What's important to recognize here is that the replication agents assist in maintaining consistency between the two sites so as to prevent two concurrent primaries. In the event communication between the sites is severed but the systems themselves remain online, VCS offers the Steward Process to act as a third point of arbitration. This will prevent Global Split Brain.
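
To make the arbitration idea concrete, here is a minimal Python sketch of the decision a cluster could make when it stops hearing from the other site. This is only an illustration of the logic, not VCS code: the names `RemoteState` and `classify_remote` are made up for this example, and it assumes the Steward simply reports whether it can still reach the other site.

```python
from enum import Enum


class RemoteState(Enum):
    RUNNING = "running"          # inter-site heartbeat is fine
    PARTITIONED = "partitioned"  # link is down, but the remote site is still up
    FAULTED = "faulted"          # neither we nor the third party can reach it


def classify_remote(heartbeat_ok: bool, steward_reaches_remote: bool) -> RemoteState:
    """Classify the remote site using a third point of arbitration.

    heartbeat_ok: True if this cluster still receives inter-site heartbeats.
    steward_reaches_remote: True if the (hypothetical) steward query says the
        remote site still answers. Only consulted when the heartbeat is lost.
    """
    if heartbeat_ok:
        return RemoteState.RUNNING
    if steward_reaches_remote:
        # Only the link between the sites failed. Taking over here would
        # create two concurrent primaries, i.e. global split brain.
        return RemoteState.PARTITIONED
    return RemoteState.FAULTED


# Link severed but both sites alive: do not take over.
print(classify_remote(heartbeat_ok=False, steward_reaches_remote=True).value)
```

The point of the third party is the middle branch: without it, a lost heartbeat alone cannot distinguish a dead site from a dead link.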

If you are simply trying to demo the functionality of the Global Cluster Option using VMware Workstation, you can configure VVR to do this; however, you can also create Global Service Groups that do not use replication.

When you configure Global Service Groups, there are different failover options you may set. My suspicion is that you have it set to Automatic rather than Manual. Automatic will forcibly attempt to bring Global Service Groups online at the other site when the primary site fails. As a best practice, this option should be set to Manual.

Hope this helps,

Joe D

mikebounds · 01-29-2012

When you perform step 3, the old primary should automatically become an acting secondary. It is only an "acting" secondary: some views (CLI or GUI) show this, but others show a Primary-Primary configuration. In this state you need to do a fast failback resync (assuming you have DCMs configured).

You can do this in at least 3 ways:

Run "vxrds -g dgname fbsync rvgname" (I think you run this from the new primary).
Right-click the RVGPrimary resource in VCS, select actions, and run the fbsync action.
Right-click the RDS or RVG in VEA and choose fastback resync.

Note that in order to have any of these options, you must NOT have selected the reason "disaster" when VCS prompted you in step 2. By selecting "disaster" you are telling GCO that the site is gone for good, so there is no point tracking changes: the old primary will never come back and will need restoring from scratch when a new server is built. I think this is a bit misleading, and I selected "disaster" a few times when testing before I realised it was disabling the fbsync option. So I think you are supposed to select something like "Network outage", but if you did select "disaster" you will need to resync from scratch.
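
The rule above can be sketched in a few lines of Python. This is just a model of the reasoning in this post, not something VCS or VVR exposes: the function name `resync_plan` and the reason strings are hypothetical, and treating "no DCM" as requiring a full resync is an assumption drawn from the DCM caveat earlier.

```python
def resync_plan(takeover_reason: str, dcm_configured: bool) -> str:
    """Return which resync the old primary needs once it comes back.

    takeover_reason: the reason chosen at the takeover prompt (illustrative
        strings such as "disaster" or "network outage").
    dcm_configured: whether DCMs are set up, so changes made at the new
        primary after the takeover can be tracked.
    """
    if takeover_reason.strip().lower() == "disaster":
        # The site was declared gone for good, so no changes were tracked.
        return "full resync"
    if not dcm_configured:
        # Assumption: without DCMs there is nothing to replay incrementally.
        return "full resync"
    return "fbsync"


print(resync_plan("Network outage", dcm_configured=True))
print(resync_plan("disaster", dcm_configured=True))
```

So in a test like the one described in the question, picking anything other than "disaster" at the prompt keeps the quick resync path open.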

Mike
