Forum Discussion

Zahid_Haseeb
Moderator
13 years ago
Solved

Some questions Regarding Heartbeat

Kindly correct me if I am wrong.

 

1.) What happens when both heartbeat links fail?

A split-brain situation.

 

2.) What do we do to get rid of a split-brain situation?

Add a low-priority heartbeat, but the service group will be auto-disabled and the clustered application keeps running.

 

3.) What happens when both heartbeats fail and only the low-priority heartbeat remains?

The service group will remain up and running on the same node but can't fail over.

 

4.) When one heartbeat fails while the other heartbeat and the low-priority heartbeat are still running, can we fail over?

The application is able to fail over and the service group is not auto-disabled.

 

5.) If you have two heartbeats and one low-priority heartbeat (three heartbeats in total) and any one of them fails so that two remain, can the service group fail over?

Yes

    1. If both heartbeats fail at the same time, you will get split-brain.
       
    2. To help prevent split-brain you can:
      a. Add more than 2 heartbeats (high-pri or low-pri). These should be completely independent: different NICs going to different switches.
      b. Use I/O and/or server-based (CPS) fencing.
      c. Make the diskgroup dependent on the application IP in the service group, so the IP comes up first. This means that, even if you don't configure the virtual App IP in VCS as a low-pri HB, if both heartbeats fail and the App network is OK, then when the SG tries to online it will fail at the IP resource (the address is still up on the other node). This prevents split-brain, because the diskgroup is never imported and the storage cannot be corrupted. It also protects against human error if you force stop the cluster and then start VCS up on the inactive node.
       
    3. When there is only one heartbeat left and a node fails, service groups will not fail over. If a resource faults and the node stays up, the service group can still fail over.
       
    4. (and 5) If you have ANY 2 heartbeats remaining (including low-pri), then node failure results in service group failover as usual, i.e. losing heartbeats while there are still 2 left has no effect on failover of service groups.

    Mike
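
Points 3 and 4 above can be summarised as a small decision function. This is only a simplified model of the behaviour described in this thread, not VCS code; the names `Event` and `can_fail_over` are made up for illustration. The zero-link case is left out on purpose, because its outcome depends on how the links were lost.

```python
from enum import Enum


class Event(Enum):
    NODE_FAILURE = "node_failure"      # the whole node goes down
    RESOURCE_FAULT = "resource_fault"  # a resource faults, the node stays up


def can_fail_over(links_up: int, event: Event) -> bool:
    """Simplified model of points 3 and 4 (hypothetical, not VCS code).

    links_up is the number of heartbeat links (high-pri or low-pri)
    that were still working just before the event.
    """
    if links_up < 1:
        raise ValueError("this sketch only covers one or more working links")
    if links_up == 1:
        # Jeopardy: a silent peer is ambiguous, so a node failure does not
        # trigger failover. A resource fault is still reported over the
        # last link, so the service group can move.
        return event is Event.RESOURCE_FAULT
    # Two or more links left: failover behaves as usual.
    return True
```

For example, `can_fail_over(1, Event.NODE_FAILURE)` is `False`, while `can_fail_over(2, Event.NODE_FAILURE)` is `True`.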


  • Thanks Mike

     

    Regarding point 3:

    "When there is only one heartbeat left, if Node fails, then service groups will not fail over, but if a resource faults and node stays up, then service group can still fail over"

    Is there any logic behind why a resource fault can trigger failover but a server/node failure can't?

  • When there is only one heartbeat left, VCS cannot distinguish between network failure and node failure. If the NIC of the last remaining HB fails on Node B, or if Node B fails completely, Node A simply loses the last heartbeat to Node B and doesn't know which of the two has occurred. This situation is called "jeopardy", and VCS will not fail over service groups that were running on Node B to Node A, as VCS on Node A can't be sure what has happened to Node B. If a resource fails and the nodes stay up, then VCS knows what is going on because it can still communicate over the last heartbeat, and therefore it can fail the service group over.

    Mike

  • "i.e if NIC of last remaining HB fails on Node B, or if Node B fails completely, from Node A's perspective it loses the last heartbeat to Node B so doesn't know which of the 2 has occurred."

    Why can't VCS recognize whether the HB failed or the node failed? For example, suppose that under normal circumstances everything is running fine and one node fails completely in a two-node cluster. In this condition VCS normally fails over the service group.

    How does VCS recognize at this point that the two HBs have not faulted and that it is the node itself that has faulted? In both cases (the system fails completely, OR both HB NICs of the active node fail) the partner/idle node will get no response from the active node.

  • Both heartbeats must be independent, and therefore the chances of 2 heartbeats failing at the same time (within 15 seconds of each other with default settings) are millions to one. VCS therefore interprets 2 heartbeats being lost at the same time as node failure. Although the chances are millions to one, it can happen. It tends to be human error, like an administrator pulling out both cables because they are standing at the wrong rack, as 2 pieces of hardware failing within 15 seconds of each other really is rare. This is why you can set up I/O fencing to protect against this rare scenario.

    Mike
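
The timing argument above can be sketched as follows. This is a toy model, not how VCS is implemented: the function name `interpret_total_loss` is hypothetical, and the 15-second window is simply the figure quoted in this thread rather than a value read from any configuration.

```python
def interpret_total_loss(loss_times, window=15.0):
    """Classify how a surviving node would read the loss of every link.

    loss_times: times in seconds at which each heartbeat link to the
    peer went silent (one entry per link, at least two links).
    window: assumed detection window in seconds (figure from the thread).
    """
    if len(loss_times) < 2:
        raise ValueError("need at least two heartbeat links")
    ordered = sorted(loss_times)
    last, previous = ordered[-1], ordered[-2]
    if last - previous <= window:
        # The final links vanished together, which independent links
        # almost never do, so the peer node is presumed dead.
        return "node failure assumed: fail over"
    # The cluster sat on a single link for a while (jeopardy) before the
    # last one went, so the cause of the silence is ambiguous.
    return "jeopardy: no failover"
```

With two links lost at 100 s and 105 s the model reports an assumed node failure; with losses at 0 s and 100 s it reports jeopardy and no failover.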

  • Thanks Mike

    That's the point I am asking about: "Therefore VCS interprets 2 heartbeats being lost at the same time as node failure"

    So when VCS is in jeopardy, it is programmed to assume that either thing may have happened (node faulted or HB failed), and that's why it doesn't fail over.

  • One last discussion point:

    "If a resource fails and nodes stay up, then VCS knows what is going on because it can still communicate on last heartbeat and therefore can fail service group over."

    Suppose both hi-pri HBs have failed and only the low-pri HB (which is on the public NIC) is up. That means only one HB is left, so VCS is in jeopardy.

    Question: What happens when the resource, i.e. the public NIC, faults? How does VCS recognize that the NIC resource has faulted and that the SG has to be failed over?

  • Supposing the NIC fails on Node B, then from:

    Node A's perspective:

    The last remaining heartbeat is lost while in jeopardy, so it does nothing.

    Node B's perspective:

    The last remaining heartbeat is lost while in jeopardy (no action to take) AND the public NIC faults, so VCS will fault the service group, but cannot fail it over anywhere as this node has no communication left with other nodes.

     

    Note: do not get too hung up on these scenarios. Losing one network doesn't happen often, and when it does you will usually have fixed it before another network fails. If you haven't, the chances of then losing a third network before you have fixed one of the first 2 are billions to 1, assuming that all your networks are truly independent.

    Mike