Forum Discussion

ousor
Level 4
11 years ago

split brain

Hi, if I have a 2-node VCS cluster with fencing devices and I experience "split brain", the fenced node will be taken out of VCS. This node will panic and then reboot, but will not rejoin VCS. So in this case the node is up and running but VCS is not started, right? Also, LLT and GAB are loaded, right? If there are no fencing devices on the 2 nodes in VCS and one node is fenced, what is the best way to work on this node in order to repair it? Do I boot this node in single-user mode?
  • Hi,

    In a two-node setup, if one node gets fenced out and panicked by vxfen, it will reboot. Once the node comes back up, it will load LLT and GAB, but GAB will keep waiting to seed, because /etc/gabtab would define "gabconfig -c -n2", i.e. seed once 2 nodes are available. So it will keep waiting for the other node. Because GAB has not formed a membership, vxfen and VCS will not start.
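The seeding state described here can be checked on the rebooted node. What follows is a minimal sketch, assuming that `gabconfig -a` prints one `Port <x> gen ... membership ...` line per seeded port. The function name `gab_port_seeded` is hypothetical, and it reads the command output on stdin so it can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given GAB port shows a membership
# line in `gabconfig -a` output supplied on stdin.
# Port a = GAB itself, port b = I/O fencing, port h = VCS engine (had).
gab_port_seeded() {
    port="$1"
    grep -q "^Port ${port} gen .* membership "
}

# On the fenced node one would typically run (not executed here):
#   cat /etc/gabtab        # expected to contain: /sbin/gabconfig -c -n2
#   gabconfig -a | gab_port_seeded a && echo "GAB seeded" || echo "GAB still waiting to seed"
```

If port a is missing from the output, GAB has not seeded yet, which matches the state in which vxfen and VCS stay down.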

    If there is no fencing, how would a node get fenced? There is no protection mechanism. If both private interconnects break, each node will think the other node is dead (this is called split brain), and both nodes will try to online the service groups, which may result in data corruption.

     

    G

  • Hi,

    I only wish to verify how I should repair things when I have split brain in a 2-node VCS cluster. Suppose I shut down a node; should I then boot this node in single-user mode to perform the repair? Thanks a million.

  • Hi,

    With a 2-node cluster with fencing enabled, if split brain occurs, the fenced node will be waiting on GAB to seed, as explained above. I would recommend the following actions:

    1. Stop GAB on the fenced node (/etc/init.d/gab stop, or use SMF on Solaris 10/11).

    2. Stop LLT on the fenced node (/etc/init.d/llt stop, or use SMF on Solaris 10/11).

    Or, as an alternative to the first two steps, you can bring the server to single-user mode.

    Next:

    3. Fix the heartbeat links and test them using dlpiping or lltping.

    4. Start LLT and GAB on the fenced node, or bring the server to multi-user mode.

    With the heartbeats working, LLT and GAB should start normally.
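The steps above can be sketched as a small script for the init-script variant. This is only an illustration under stated assumptions: the helper names `run` and `recover_fenced_node` and the `DRY_RUN` switch are hypothetical, and the closing `lltstat -nvv` and `gabconfig -a` checks are an addition for verification, not part of the steps as posted. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the recovery sequence on the fenced node (init-script variant).
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_fenced_node() {
    run /etc/init.d/gab stop     # step 1: stop GAB first
    run /etc/init.d/llt stop     # step 2: then stop LLT
    # step 3 is manual: repair the links and test them
    echo "fix the heartbeat links, then test with dlpiping or lltping"
    run /etc/init.d/llt start    # step 4: start LLT, then GAB
    run /etc/init.d/gab start
    run lltstat -nvv             # added check: link state per node
    run gabconfig -a             # added check: port a membership
}
```

Calling `recover_fenced_node` prints the plan; setting `DRY_RUN=0` before calling it would execute the commands, which only makes sense on the fenced node once the interconnect has been repaired.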

    G

     

  • Well, with fencing in place, you should never experience split brain. 
    Fencing is implemented to prevent split brain.

    Split brain is when both nodes believe they have exclusive access to the storage and both nodes mount and write to storage.
    This leads to filesystem corruption in almost all cases.
    (I have seen this happen where customers did not have fencing in place and had all heartbeats on the same infrastructure. Believe me, it is not pretty: lots of unhappy users.)
    The first step to fix this is to recreate the filesystem and restore from backup.
    Cluster membership and VCS startup only become a consideration once the filesystem is repaired and the data restored.

    Please go through this section in VCS Admin Guide: 

    About communications, membership, and data protection in the cluster 

    This chapter includes the following topics:
    ■ About cluster communications
    ■ About cluster membership
    ■ About membership arbitration
    ■ About data protection
    ■ About I/O fencing configuration files
    ■ Examples of VCS operation with I/O fencing
    ■ About cluster membership and data protection without I/O fencing
    ■ Examples of VCS operation without I/O fencing
    ■ Summary of best practices for cluster communications