08-12-2010 09:35 PM
09-24-2010 05:41 AM
Yes, I have got the answer. HAD monitors the agents to see the status of the resources and replicates all updates to the other nodes of the cluster to keep them in sync with the resource status. If HAD stops, how can it distribute the resource status to the other nodes? And if HAD is not able to give the status, then how can the other nodes come to know that a failover is required? :)
Thanks, all friends.
09-24-2010 06:50 AM
If HAD hangs, GAB will time out communicating with HAD, and hence GAB will initiate a panic of the node in trouble... this will initiate the groups to come online on another node (evacuation of resources)
Gaurav
09-28-2010 10:30 PM
Actually, HAD replicates the status to the other nodes, so how does GAB initiate a panic of the node in trouble?
HAD is responsible for sending resource updates to the other nodes of the cluster, so how can GAB take on this responsibility?
09-28-2010 11:58 PM
Hello Zahid,
In short -
This is per the basic VCS design, where HAD communicates with its agents and with HAD on the other cluster nodes through the LLT channel. However, to keep HAD in its expected running state, HAD and GAB also communicate with each other. When this communication does not happen for a specified duration, GAB panics the host and reboots it to protect the cluster and data integrity. Since GAB sits at the lowest level of the cluster port stack, the developers would have given this extra responsibility to GAB.
I hope this helps a bit.
Thanx
Rajesh
09-29-2010 10:40 PM
"the developers would have given this extra responsibility to GAB."
Would you please give me a reference from a VCS guide which says that GAB also communicates with other cluster nodes?
09-29-2010 11:02 PM
Zahid,
GAB is a kernel module while HAD is a realtime process. GAB acts as a translator of messages coming from other nodes via LLT and passes the updates to HAD... & yes, it is an ideal design to assign this responsibility to GAB since HAD already has many other jobs to do...
If you want to know more about GAB/LLT, read through the article below. It also answers your question above, though not in exactly the same words.
http://www.filibeto.org/unix/hp-ux/lib/cluster/vcs/vcs-gab-llt-basics.pdf
Gaurav
09-30-2010 04:30 AM
Zahid,
GAB communicates with other cluster nodes, as described in Veritas Cluster Server 5.1 Administrator’s Guide (Windows) - http://sfdoccentral.symantec.com/sf/5.1/windows/pdf/VCS_Admin.pdf
Introducing Veritas Cluster Server -> Logical components of VCS -> About cluster control, communications, and membership (p27)
--------------------
About Group Membership Services/Atomic Broadcast (GAB)
The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for cluster membership and cluster communications.
• Cluster Membership
GAB maintains cluster membership by receiving input on the status of the heartbeat from each node by LLT. When a system no longer receives heartbeats from a peer, it marks the peer as DOWN and excludes the peer from the cluster. In VCS, memberships are sets of systems participating in the cluster. VCS has the following types of membership:
> A regular membership includes systems that communicate with each other across one or more network channels.
> A jeopardy membership includes systems that have only one private communication link.
> Daemon Down Node Alive (DDNA) is a condition in which the VCS high availability daemon (HAD) on a node fails, but the node is running. In a DDNA condition, VCS does not have information about the state of service groups on the node. So, VCS places all service groups that were online on the affected node in the autodisabled state. The service groups that were online on the node cannot fail over. Manual intervention is required to enable failover of autodisabled service groups. The administrator must release the resources running on the affected node, clear resource faults, and bring the service groups online on another node.
• Cluster Communications
GAB's second function is reliable cluster communications. GAB provides guaranteed delivery of point-to-point and broadcast messages to all nodes. The VCS engine uses a private IOCTL (provided by GAB) to tell GAB that it is alive.
--------------------
i.e., "receiving input on the status of the heartbeat from each node by LLT" necessarily implies that GAB must communicate with other cluster nodes (it is receiving info from other nodes over LLT)
Also, in response to your earlier question "how the GAB initiate a panic of node in trouble",
see: VCS performance considerations -> How cluster operations affect performance -> When a system panics (pp667-668)
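To make the membership types in the quoted section easier to compare, here is a minimal Python sketch. It is only an illustration, not VCS code: the function names and state strings are hypothetical, and treating exactly one working private link as jeopardy and two or more as regular is an assumption made to keep the two cases distinct.

```python
from typing import Dict


def classify_membership(private_links_up: int, had_running: bool) -> str:
    """Toy classification of how peers see a node (hypothetical helper)."""
    if private_links_up < 0:
        raise ValueError("private_links_up must be >= 0")
    if private_links_up == 0:
        # No heartbeats reach the peers: they mark the node DOWN and exclude it.
        return "DOWN"
    if not had_running:
        # Node still heartbeats over LLT, but HAD has failed on it.
        return "DDNA"
    if private_links_up == 1:
        # Assumption: exactly one private link left means jeopardy.
        return "JEOPARDY"
    return "REGULAR"


def group_states_after(membership: str, groups: Dict[str, str]) -> Dict[str, str]:
    """In DDNA, groups that were online on the node become autodisabled."""
    if membership != "DDNA":
        return dict(groups)
    return {
        name: ("AUTODISABLED" if state == "ONLINE" else state)
        for name, state in groups.items()
    }


if __name__ == "__main__":
    m = classify_membership(private_links_up=2, had_running=False)
    print(m, group_states_after(m, {"websg": "ONLINE", "dbsg": "OFFLINE"}))
```

The point of the sketch is that DDNA is a membership state reported by GAB, not by HAD, which is why the other nodes still learn about it when HAD is gone.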
--------------------
When a system panics
There are several instances in which GAB will intentionally panic a system, including if it detects an internal protocol error or discovers an LLT node-ID conflict. Three other instances are described below.
Client process failure
If a client process fails to heartbeat to GAB, the process is killed. If the process hangs in the kernel and cannot be killed, GAB halts the system. If the -k option is used in the gabconfig command, GAB tries to kill the client process until successful, which may have an impact on the entire cluster. If the -b option is used in gabconfig, GAB does not try to kill the client process. Instead, it panics the system when the client process fails to heartbeat. This option cannot be turned off once set.
HAD heartbeats with GAB at regular intervals. The heartbeat timeout is specified by HAD when it registers with GAB; the default is 15 seconds. If HAD gets stuck within the kernel and cannot heartbeat with GAB within the specified timeout, GAB tries to kill HAD by sending a SIGABRT signal. If it does not succeed, GAB sends a SIGKILL and closes the port. By default, GAB tries to kill HAD five times before closing the port. The number of times GAB tries to kill HAD is a kernel tunable parameter, gab_kill_ntries, and is configurable. The minimum value for this tunable is 3 and the maximum is 10.
This is an indication to other nodes that HAD on this node has been killed. Should HAD recover from its stuck state, it first processes pending signals. Here it will receive the SIGKILL first and get killed.
After sending a SIGKILL, GAB waits for a specific amount of time for HAD to get killed. If HAD survives beyond this time limit, GAB panics the system. This time limit is a kernel tunable parameter, gab_isolate_time and is configurable. The minimum value for this timer is 16 seconds and maximum is 4 minutes.
--------------------
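The escalation described in the quoted section (missed heartbeat, SIGABRT attempts, SIGKILL and port close, then panic) can be traced with a small Python model. This is a sketch under stated assumptions, not GAB's implementation: the function and parameter names are hypothetical, the 120-second isolate time is just an illustrative value inside the documented range, and the exact ordering of retries is my reading of the guide's wording.

```python
from typing import List, Optional

HAD_HEARTBEAT_TIMEOUT_S = 15          # default per the quoted guide
GAB_KILL_NTRIES_RANGE = (3, 10)       # gab_kill_ntries min/max per the guide
GAB_ISOLATE_TIME_RANGE_S = (16, 240)  # gab_isolate_time: 16 s to 4 minutes


def gab_response(
    silent_for_s: float,
    abrt_succeeds_on: Optional[int] = None,  # 1-based SIGABRT attempt that kills HAD
    dies_after_sigkill: bool = True,
    timeout_s: float = HAD_HEARTBEAT_TIMEOUT_S,
    kill_ntries: int = 5,                    # default number of tries per the guide
    isolate_time_s: int = 120,               # illustrative value, an assumption
) -> List[str]:
    """Return the sequence of actions this toy model of GAB would take."""
    lo, hi = GAB_KILL_NTRIES_RANGE
    if not lo <= kill_ntries <= hi:
        raise ValueError("kill_ntries out of documented range")
    lo, hi = GAB_ISOLATE_TIME_RANGE_S
    if not lo <= isolate_time_s <= hi:
        raise ValueError("isolate_time_s out of documented range")

    if silent_for_s <= timeout_s:
        return ["ok"]  # HAD heartbeated in time, nothing to do

    actions: List[str] = []
    for attempt in range(1, kill_ntries + 1):
        actions.append(f"SIGABRT #{attempt}")
        if abrt_succeeds_on == attempt:
            actions.append("HAD killed")
            return actions

    # SIGABRT did not work: send SIGKILL and close the port, which tells
    # the other nodes that HAD on this node is gone.
    actions += ["SIGKILL", "close port"]
    if dies_after_sigkill:
        actions.append("HAD killed")
    else:
        actions += [f"wait {isolate_time_s}s", "panic"]
    return actions


if __name__ == "__main__":
    print(gab_response(20, dies_after_sigkill=False))
```

So a panic is the last step, reached only if HAD survives both the kill attempts and the isolate time; in the milder cases HAD is killed and the node ends up in the DDNA condition instead.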
10-01-2010 05:40 AM
Here is what I have understood; correct me if I am wrong...
""How GAB works in two scenarios""
##When resources fail##
HAD monitors the agents to get the status of resources and hands it to GAB for transfer via LLT, so the status is finally replicated to HAD on the other nodes of the cluster.
##When HAD fails##
If HAD is killed by GAB or the node is in DDNA membership, HAD does not have the status of resources and cannot deliver resource updates to GAB, so GAB cannot replicate the resource status to the other nodes. However, GAB does replicate the node's current state (DDNA) to the other nodes of the cluster via LLT, so they know that the specific node is in DDNA membership.