Failover failed

bpdown
Level 4

VCS v5.1 (Windows)

Hello, 

We are using VCS for an NBU cluster. The catalog is on a SAN disk that is mirrored between 2 sites in a campus cluster.

For testing, the SAN mirroring link was cut first, then 10 minutes later the IP network was cut between the 2 sites. The failover started after the IP network was cut, but failed because the 2nd node 'failed to import cluster dynamic disk group' (from the SFW logs).

I have heard this may have occurred because the active node's disk had the most up-to-date catalog information, and because the mirroring link had been cut the secondary node's catalog disk was 'out of date', so it was not brought online in case of data loss. Basically we ended up with one node offline and the other faulted.

I hope that made sense. Has anyone heard of this happening before?

Is there a way in SFW to force an 'out of date' catalog disk online after a failover has faulted?

Any help would be appreciated.

2 REPLIES

Marianne
Level 6
Partner    VIP    Accredited Certified

Your problem is not with VCS. The problem is at the Volume Manager level and with the way that the "SAN mirroring link was cut".

Please tell us more about this step:
1. What exactly was done?
2. What was the purpose of this step?
3. What was the effect on both nodes?
4. What visibility did the 2nd node have of the SAN disks at that point?

Please also save the Event Viewer Application and System logs on both nodes as text files and upload them here as file attachments (see the example commands below), and let us know the date and time that the link was cut.
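If it helps, on Windows Server 2008 and later you can dump both logs to text from a command prompt on each node (this assumes C:\Temp exists; adjust the path as you like):

wevtutil qe Application /f:text > C:\Temp\Application.txt
wevtutil qe System /f:text > C:\Temp\System.txt

On older Windows versions you can instead right-click each log in Event Viewer and use "Save Log File As", choosing the .txt format.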

mikebounds
Level 6
Partner Accredited

For this test to work you need to set the ForceImport attribute on the VmDg resource to 1 (it is 0 by default) - see this extract from the Bundled Agents guide:

Defines whether the agent forcibly imports the disk group when exactly half the disks are available. The value 1 indicates the agent imports the configured disk group when half the disks are available. The value 0 indicates it does not. Default is 0. This means that the disk group will be imported only when SFW acquires control over majority of the disks.

Note: Set this attribute to 1 only after verifying the integrity of your data. If due caution is not exercised before setting this attribute to 1, you risk a split-brain condition, leading to potential data loss.
Basically, if only half the disks can be seen and a node fails, VCS doesn't know whether there is a network partition of both IP and SAN, so that it simply can't see the other node and the other half of the disks. If that were the case, the other node COULD still have the disk group imported, which means that if this node took control it would cause a split-brain. Setting ForceImport to 1 allows VCS to force the import when only half the disks are visible, but this could cause a split-brain if the situation is actually a network partition (IP and SAN) as opposed to a SAN failure.
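To make it concrete: if you have verified the data and decide to accept the risk, it is just a normal VCS attribute change. A rough example, assuming your disk group resource is called NBU_VMDg (substitute your actual resource name):

haconf -makerw
hares -modify NBU_VMDg ForceImport 1
haconf -dump -makero

You can confirm the setting afterwards with: hares -value NBU_VMDg ForceImport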

Mike