10-11-2011 03:16 PM
Hi Folks,
I've got a bit of a problem with a new cluster installation that I have not been able to find a solution to. I'm somewhat new in this arena as this is only my 3rd cluster install.
Firstly some important details:
OS: HP-UX B.11.31 March 10 Release
VCS: 5.0.31.0
This is a brand new install of VCS. The configuration is very basic, only what the installer asks for. There are two nodes in the cluster and I've confirmed that they can talk to each other over the private links.
Everything seems to look OK in llttab, gabtab, sysname and main.cf.
When the cluster starts, I get the following errors:
Oct 11 21:45:29 xxxx Had[9242]: VCS WARNING V-16-1-51047 HAD Self Check: Excessive delay in the HAD heartbeat to GAB (10 seconds)
Oct 11 21:45:33 xxxx Had[9242]: VCS WARNING V-16-1-53034 HAD Signal SIGABRT received
Oct 11 21:45:33 xxxx Had[9242]: VCS NOTICE V-16-1-53038 Beginning execution of the diagnostics script
This occurs on both nodes.
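For anyone who wants to see how often the self-check warning fires and how long the reported delays are, here is a minimal sketch. It assumes the message format is exactly the one in the lines quoted above; the function name `heartbeat_delays` is made up for illustration, and it takes any iterable of log lines because the thread does not say which log file the messages came from.

```python
import re

# Matches the HAD self-check warning quoted above (message ID V-16-1-51047)
# and captures the delay in seconds from the trailing "(N seconds)".
DELAY_RE = re.compile(r"V-16-1-51047 .*?\((\d+) seconds\)")


def heartbeat_delays(lines):
    """Return the delay (in seconds) from each HAD self-check warning found."""
    delays = []
    for line in lines:
        match = DELAY_RE.search(line)
        if match:
            delays.append(int(match.group(1)))
    return delays


# The two sample lines are copied from the messages in this post.
sample = [
    "Oct 11 21:45:29 xxxx Had[9242]: VCS WARNING V-16-1-51047 HAD Self Check: "
    "Excessive delay in the HAD heartbeat to GAB (10 seconds)",
    "Oct 11 21:45:33 xxxx Had[9242]: VCS WARNING V-16-1-53034 HAD Signal SIGABRT received",
]

if __name__ == "__main__":
    print(heartbeat_delays(sample))
```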
Everything I've found so far suggests that the issue can occur if the system is under heavy load. In my case, the systems are basically idle and the cluster is in the most basic state. Could this be some sort of patch issue between VCS and HP-UX?
Any pointers would be greatly appreciated.
Regards,
W
10-11-2011 04:23 PM
Check the priority of the HAD process. Even though the load may not be high, some other process may be stealing time or resources from HAD, which could be affecting the timeliness of its heartbeat to GAB.
10-12-2011 08:29 AM
Hi,
Thanks for the suggestion, but no luck. The same issue is occurring. I'm going to try setting up a single-node cluster with GAB and LLT enabled to see if the issue happens when only one node is present.
W
10-12-2011 01:17 PM
Hi,
I've found the solution to this issue. It appears that a patch (PHKL_41700) causes problems with high-resolution timers. The workaround is to enable the following in the kernel:
kctune hires_timeout_enable=1
The problem shows up when the HAD daemon makes select() calls, which slow down its communication with other cluster components. That communication can take longer than 10 seconds, so HAD cannot send its heartbeat to GAB in time, hence the warning message and the daemon aborting.
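If you want a rough feel for whether select() timeouts on a box are returning late, something like the sketch below might help. It is not how HAD works internally and it is not a reproduction of the PHKL_41700 issue; it only times a plain select() with a short timeout on an idle socket pair and reports the worst overshoot. The function name `measure_select_overshoot` and the 50 ms timeout are made up for illustration, and it assumes a Python 3 interpreter is available on the host.

```python
import select
import socket
import time


def measure_select_overshoot(timeout_s=0.05, rounds=20):
    """Return the worst observed (elapsed - requested) select() time in seconds."""
    # Nobody writes to this socket pair, so select() can only return by timing out.
    reader, writer = socket.socketpair()
    worst = 0.0
    try:
        for _ in range(rounds):
            start = time.monotonic()
            ready = select.select([reader], [], [], timeout_s)[0]
            elapsed = time.monotonic() - start
            if not ready:  # timed out, as expected on an idle socket
                worst = max(worst, elapsed - timeout_s)
    finally:
        reader.close()
        writer.close()
    return worst


if __name__ == "__main__":
    overshoot = measure_select_overshoot()
    print(f"worst select() overshoot: {overshoot * 1000:.1f} ms")
```

On a healthy, idle system the overshoot should be small compared with the requested timeout. A consistently large overshoot, before versus after changing the tunable, would at least point in the same direction as the explanation above.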
Hope this helps others.
Regards,
W