Solved: HAD errors

Zahid_Haseeb · ‎08-12-2010

OS = win2003
sfha = 5.0 with rp1a

I am seeing HAD error and information messages in Event Viewer. I have attached screenshots for reference.

Gaurav_S · ‎08-12-2010

Hi Zahid,

These messages relate to the cluster receiving a new membership, so the HAD process on some node died and then restarted. The messages indicate that node DBxxx -SEC is joining the cluster.

However, two of the messages are concerning:
a) Image 213723 - There is an error in a registry value. Can you check the registry value described in that image for "SingleNode"? If it is set to "yes" or "1", the cluster is defined as a single-node cluster!

Though I am not a Windows expert, that makes little sense to me.

b) Last image, 213712 - That clearly indicates insufficient memory on the server, so diagnose what is wrong with the server and check whether there is enough real/virtual memory available.

Gaurav

Gaurav_S · ‎08-12-2010

Also, isn't this a duplicate of the following post?

https://www-secure.symantec.com/connect/forums/vcs-errors-event-viewer-0

Gaurav

Gaurav_S · ‎08-12-2010

Zahid,

When a node joins, it starts HAD, right? So it joins the cluster (which was already formed by the other node), and hence the message "receiving new cluster membership". It means the node is joining the cluster with the other node, which is already up.

It doesn't mean HAD needs to be killed or restarted.

Secondly, you say you did not change the registry value, so what is the current value? Can you paste it here?

Also, as Marianne asked in the previous post, did you check the server's performance to see why memory utilization is high?

Gaurav

Zahid_Haseeb · ‎08-12-2010

Thanks for your kind reply

1.) "These messages relate to the cluster receiving a new membership, so the HAD process on some node died and then restarted"

First, I have a two-node cluster and I did not join any node. And as you said, when a new member tries to join the cluster, does HAD have to restart?

2.) "If it is set to "yes" or "1", the cluster is defined as a single-node cluster!"

I did not change any registry entry.

Zahid_Haseeb · ‎08-12-2010

Gaurav, what I am actually looking at is that the cluster resources went down, as you can see in the screenshot below. But before those errors I cannot see any error that tells me why these resources went down (maybe because of insufficient memory). Even after the resources went down, the cluster is still up on the same node. Why did it not perform a failover?

Gaurav_S · ‎08-12-2010

As said before, if your server is short of resources (lack of memory), you might not get any indication in the logs, since there is not enough memory to write them. That is also why it looks like all your agents died (again, no memory for the agents to work).

So to find the root cause, check whether the server is busy now or was busy at the time this happened. You won't get any hint from these logs. If you have another utility that captures performance stats, it will give you a clue to move ahead.

Gaurav

Zahid_Haseeb · ‎08-12-2010

Thanks, Gaurav, for all your comments. I understand that the cause of the agents going down could be insufficient memory.

But that node did not perform any failover. Is that not suspicious? The active server is still the same.

Gaurav_S · ‎08-13-2010

If the agents and HAD don't have sufficient memory, how would they perform a failover? For example, if your disk agent died, who will detect the disks? Who will perform the disk failover? In the same way, all your other agents died, so HAD never knew the status of the resources, and hence there was no failover. Hope it makes sense.
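The reasoning above can be sketched as a toy model. This is only an illustration of the explanation, not VCS code: the function name `decide_failover`, the state strings, and the use of `None` for a dead agent are all assumptions made for the example.

```python
from typing import Dict, Optional


def decide_failover(agent_reports: Dict[str, Optional[str]]) -> bool:
    """Toy model: return True if the engine would start a failover.

    agent_reports maps a (hypothetical) resource name to the last state its
    agent reported: "ONLINE", "FAULTED", or None when the agent itself is
    dead and therefore reports nothing.
    """
    # In this model a failover is triggered only by an explicit fault report.
    # A dead agent sends no report, so the engine never sees a fault.
    return any(state == "FAULTED" for state in agent_reports.values())


if __name__ == "__main__":
    # A healthy agent reports a fault, so a failover is triggered.
    print(decide_failover({"disk": "FAULTED", "ip": "ONLINE"}))
    # All agents are dead: nothing is reported, so no failover happens.
    print(decide_failover({"disk": None, "ip": None}))
```

The point of the sketch is that "no report" and "fault reported" are different inputs, and only the second one leads to a failover decision.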

Gaurav

Gaurav_S · ‎08-19-2010

Zahid,

Did you manage to get an answer for this one?

Gaurav

Zahid_Haseeb · ‎09-24-2010

Yes, I have got the answer. HAD monitors the agents to see the resource states and replicates all updates to the other cluster nodes so that they stay in sync with the resource status. If HAD stops, how can it distribute the resource status to the other nodes? And if HAD cannot give the status, how can the other node know that a failover is required? :)

thanks all friends.

Gaurav_S · ‎09-24-2010

If HAD hangs, GAB will time out communicating with HAD, and GAB will then initiate a panic of the node in trouble. This will cause the groups to come online on the other node (evacuation of resources).

Gaurav

Zahid_Haseeb · ‎09-28-2010

But HAD replicates the status to the other nodes, so how does GAB initiate a panic of the node in trouble?

HAD is responsible for sending resource updates to the other cluster nodes. How can GAB take on this responsibility?

rregunta · ‎09-28-2010

Hello Zahid,

In short:

This is part of the basic VCS design, where HAD communicates with its agents and with HAD on the other cluster nodes through the LLT channel. However, to keep HAD in its expected running state, there is HAD-to-GAB communication too. When this communication does not happen for a specified duration, GAB panics the host and reboots it to protect the cluster and data integrity. Since the GAB port is at the lowest level of the cluster port stack, the developers would have given this extra responsibility to GAB.

I hope this helps a bit.

Thanx

Rajesh

Zahid_Haseeb · ‎09-29-2010

"the developers would have given this extra responsibility to GAB."

Would you please give me a reference in a VCS guide which says that GAB also communicates with other cluster nodes?

Gaurav_S · ‎09-29-2010

Zahid,

GAB is a kernel module while HAD is a realtime process. GAB acts as a translator of messages coming from the other node via LLT and passes the updates to HAD. And yes, it is an ideal design to assign this responsibility to GAB, since HAD already has many other jobs to do.

If you want to know more about GAB/LLT, read the article below. It also answers your question above, though not in exactly the same words.

http://www.filibeto.org/unix/hp-ux/lib/cluster/vcs/vcs-gab-llt-basics.pdf

Gaurav

g_lee · ‎09-30-2010

Zahid,

GAB communicates with other cluster nodes, as described in Veritas Cluster Server 5.1 Administrator’s Guide (Windows) - http://sfdoccentral.symantec.com/sf/5.1/windows/pdf/VCS_Admin.pdf

Introducing Veritas Cluster Server -> Logical components of VCS -> About cluster control, communications, and membership (p27)
--------------------
About Group Membership Services/Atomic Broadcast (GAB)
The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for cluster membership and cluster communications.
• Cluster Membership
GAB maintains cluster membership by receiving input on the status of the heartbeat from each node by LLT. When a system no longer receives heartbeats from a peer, it marks the peer as DOWN and excludes the peer from the cluster. In VCS, memberships are sets of systems participating in the cluster. VCS has the following types of membership:
> A regular membership includes systems that communicate with each other across one or more network channels.
> A jeopardy membership includes systems that have only one private communication link.
> Daemon Down Node Alive (DDNA) is a condition in which the VCS high availability daemon (HAD) on a node fails, but the node is running. In a DDNA condition, VCS does not have information about the state of service groups on the node. So, VCS places all service groups that were online on the affected node in the autodisabled state. The service groups that were online on the node cannot fail over. Manual intervention is required to enable failover of autodisabled service groups. The administrator must release the resources running on the affected node, clear resource faults, and bring the service groups online on another node.
• Cluster Communications
GAB's second function is reliable cluster communications. GAB provides guaranteed delivery of point-to-point and broadcast messages to all nodes. The VCS engine uses a private IOCTL (provided by GAB) to tell GAB that it is alive.
--------------------

i.e., "receiving input on the status of the heartbeat from each node by LLT" necessarily implies that GAB must communicate with other cluster nodes (it is receiving info from other nodes over LLT).
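As a rough illustration of the membership types quoted above, here is a small Python sketch. It is not VCS code: the function names, the state labels, and the choice to treat exactly one private link as jeopardy and two or more as regular are assumptions made for the example, seen from a peer's point of view.

```python
def classify_membership(node_alive: bool, had_running: bool, private_links_up: int) -> str:
    """Classify a node as its peers would see it (illustrative only)."""
    if not node_alive or private_links_up == 0:
        # No heartbeats arrive, so the peer is marked DOWN and excluded.
        return "DOWN"
    if not had_running:
        # Daemon Down Node Alive: heartbeats still arrive, but HAD has failed.
        return "DDNA"
    if private_links_up == 1:
        # Only one private communication link is left.
        return "JEOPARDY"
    return "REGULAR"


def groups_can_fail_over(membership: str) -> bool:
    """Per the quoted guide, groups on a DDNA node are autodisabled and do
    not fail over without manual intervention. A node that is DOWN is
    assumed here to allow normal failover."""
    return membership != "DDNA"


if __name__ == "__main__":
    for args in [(True, True, 2), (True, True, 1), (True, False, 2), (False, False, 0)]:
        state = classify_membership(*args)
        print(args, state, groups_can_fail_over(state))
```

The case discussed in this thread (HAD gone, node still up) corresponds to the DDNA branch, which is why the service groups stayed where they were.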

Also, in response to your earlier question, "how the GAB initiate a panic of node in trouble":

see: VCS performance considerations -> How cluster operations affect performance -> When a system panics (pp667-668)
--------------------
When a system panics
There are several instances in which GAB will intentionally panic a system, including if it detects an internal protocol error or discovers an LLT node-ID conflict. Three other instances are described below.

Client process failure
If a client process fails to heartbeat to GAB, the process is killed. If the process hangs in the kernel and cannot be killed, GAB halts the system. If the -k option is used in the gabconfig command, GAB tries to kill the client process until successful, which may have an impact on the entire cluster. If the -b option is used in gabconfig, GAB does not try to kill the client process. Instead, it panics the system when the client process fails to heartbeat. This option cannot be turned off once set.
HAD heartbeats with GAB at regular intervals. The heartbeat timeout is specified by HAD when it registers with GAB; the default is 15 seconds. If HAD gets stuck within the kernel and cannot heartbeat with GAB within the specified timeout, GAB tries to kill HAD by sending a SIGABRT signal. If it does not succeed, GAB sends a SIGKILL and closes the port. By default, GAB tries to kill HAD five times before closing the port. The number of times GAB tries to kill HAD is a kernel tunable parameter, gab_kill_ntries, and is configurable. The minimum value for this tunable is 3 and the maximum is 10.
This is an indication to other nodes that HAD on this node has been killed. Should HAD recover from its stuck state, it first processes pending signals. Here it will receive the SIGKILL first and get killed.
After sending a SIGKILL, GAB waits for a specific amount of time for HAD to get killed. If HAD survives beyond this time limit, GAB panics the system. This time limit is a kernel tunable parameter, gab_isolate_time and is configurable. The minimum value for this timer is 16 seconds and maximum is 4 minutes.
--------------------
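The escalation described in the quoted passage can be walked through with a small simulation. This is a sketch under one reading of the text (a number of SIGABRT attempts equal to gab_kill_ntries, then a SIGKILL, then a panic if HAD outlives gab_isolate_time). The function name, parameters, and event strings are hypothetical, and nothing here calls GAB or sends real signals.

```python
from typing import List, Optional

# Bounds quoted in the guide for the gab_kill_ntries tunable.
KILL_NTRIES_MIN, KILL_NTRIES_MAX = 3, 10


def gab_escalation(abort_succeeds_on: Optional[int],
                   sigkill_works: bool,
                   kill_ntries: int = 5) -> List[str]:
    """Return the ordered list of events after HAD misses its heartbeat.

    abort_succeeds_on: attempt number on which SIGABRT kills HAD, or None
                       if HAD is stuck and never reacts to SIGABRT.
    sigkill_works:     whether HAD dies within gab_isolate_time after SIGKILL.
    """
    if not KILL_NTRIES_MIN <= kill_ntries <= KILL_NTRIES_MAX:
        raise ValueError("kill_ntries must be between 3 and 10")

    events = ["heartbeat timeout"]
    for attempt in range(1, kill_ntries + 1):
        events.append(f"SIGABRT attempt {attempt}")
        if abort_succeeds_on == attempt:
            events.append("HAD killed")
            return events

    # SIGABRT never worked: send SIGKILL and close the port.
    events.append("SIGKILL sent, port closed")
    if sigkill_works:
        events.append("HAD killed")
    else:
        # HAD survived beyond gab_isolate_time, so the node is panicked.
        events.append("gab_isolate_time expired, panic")
    return events


if __name__ == "__main__":
    print(gab_escalation(abort_succeeds_on=2, sigkill_works=True))
    print(gab_escalation(abort_succeeds_on=None, sigkill_works=False, kill_ntries=3))
```

The second call is the worst case: HAD is stuck in the kernel, no signal gets through, and the sequence ends with the node being panicked.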

Zahid_Haseeb · ‎10-01-2010

What I have come to understand (correct me if I am wrong):

""How GAB works in two scenarios""

##When resources fail##

HAD monitors the agents to get the status of resources and gives it to GAB for transfer via LLT, and the status is finally replicated to HAD on the other cluster nodes.

##When HAD fails##

If HAD is killed by GAB, or the node is in DDNA membership, HAD does not have the status of resources and cannot deliver resource updates to GAB, so GAB cannot replicate the resource status to the other nodes. However, GAB still replicates the current state of that node (DDNA) to the other cluster nodes via LLT, so they know that the specific node is in DDNA membership.
