Failover failed

bpdown
Level 4

VCS v5.1 (Windows)

Hello, 

We are using VCS for an NBU cluster. The catalog is on a SAN disk that is mirrored between 2 sites in a campus cluster.

For testing, the SAN mirroring link was cut first, then 10 minutes later the IP network was cut between the 2 sites. The failover started after the IP network was cut, but failed because the 2nd node 'failed to import cluster dynamic disk group' (from the SFW logs).

I have heard this may have occurred because the active node's disk had the most up-to-date catalog information, and because the mirroring link had been cut the secondary node's catalog disk was 'out of date', so it was not brought online in case of data loss. Basically we ended up with one node offline and the other faulted.

I hope that made sense. Has anyone heard of this happening before?

Is there a way in SFW to force an 'out of date' catalog disk online after a failover has faulted?

Any help would be appreciated.

2 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified

Your problem is not with VCS. The problem is at the Volume Manager level and with the way that the "SAN mirroring link was cut".

Please tell us more about this step:
1. What exactly was done?
2. What was the purpose of this step?
3. What was the effect on both nodes?
4. What visibility did the 2nd node have of the SAN disks at that point?

Please also save the Event Viewer Application and System logs on both nodes as text files and upload them here as file attachments (see the example commands below), and let us know the date and time that the link was cut.
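If it helps, on Windows Server 2008 and later you can dump both logs to text from a command prompt on each node (this assumes C:\Temp exists; adjust the path as you like):

wevtutil qe Application /f:text > C:\Temp\Application.txt
wevtutil qe System /f:text > C:\Temp\System.txt

On older Windows versions you can instead right-click each log in Event Viewer and use "Save Log File As", choosing the .txt format.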

mikebounds
Level 6
Partner Accredited

For this test to work you need to set the ForceImport attribute on the VmDg resource to 1 (it is 0 by default) - see this extract from the Bundled Agents guide:

Defines whether the agent forcibly imports the disk group when exactly half the disks are available. The value 1 indicates the agent imports the configured disk group when half the disks are available. The value 0 indicates it does not. Default is 0. This means that the disk group will be imported only when SFW acquires control over majority of the disks.

Note: Set this attribute to 1 only after verifying the integrity of your data. If due caution is not exercised before setting this attribute to 1, you risk a split-brain condition, leading to potential data loss.
Basically, if only half the disks can be seen and a node fails, VCS doesn't know whether there is a network partition of both IP and SAN, so that it simply can't see the other node and the other half of the disks. If that were the case, the other node COULD still have the disk group imported, which means that if this node took control it would cause a split-brain. Setting ForceImport to 1 allows VCS to force the import when only half the disks are visible, but this could cause a split-brain if the situation is actually a network partition (IP and SAN) as opposed to a SAN failure.
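To make it concrete: if you have verified the data and decide to accept the risk, it is just a normal VCS attribute change. A rough example, assuming your disk group resource is called NBU_VMDg (substitute your actual resource name):

haconf -makerw
hares -modify NBU_VMDg ForceImport 1
haconf -dump -makero

You can confirm the setting afterwards with: hares -value NBU_VMDg ForceImport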

Mike