Forum Discussion

Zahid_Haseeb
Moderator
11 years ago

System unresponsive for a while when LLT/GAB error prints in /var/log/messages

Environment

Linux RHEL = 6.2

SFHA/DR = 6.0.2

Query

I installed SFHA 6.0.2 and configured it. The cluster has only one node at this time; I installed LLT and GAB as well so that I can add a second node to the cluster in the future. I noticed that the system becomes unresponsive for a few minutes and then responsive again. On one Linux terminal I ran tail -f /var/log/messages and waited for the system to become unresponsive. I noticed that the cluster node becomes unresponsive as the messages below are printed in the log.

Feb 27 16:11:46 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 5 secs (5274 ticks). Send out of context hbs to peers from llt_deliver. 174 secs more to go
Feb 27 16:11:46 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 5275 ticks
Feb 27 16:11:49 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1213 ticks
Feb 27 16:11:49 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1850 ticks
Feb 27 16:11:57 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 4246 ticks
Feb 27 16:11:57 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1676 ticks
Feb 27 16:12:02 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2592 ticks
Feb 27 16:12:07 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 3 secs (3528 ticks). Send out of context hbs to peers from llt_deliver. 176 secs more to go
Feb 27 16:12:07 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 3529 ticks
Feb 27 16:12:17 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 9895 ticks
Feb 27 16:12:17 CLUSTER-NODE1 kernel: GAB INFO V-15-1-20124 timer not called for 10 seconds
Feb 27 16:12:19 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1833 ticks
Feb 27 16:12:21 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1151 ticks
Feb 27 16:12:25 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1318 ticks
Feb 27 16:12:32 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2451 ticks
Feb 27 16:12:43 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 4 secs (4513 ticks). Send out of context hbs to peers from llt_deliver. 175 secs more to go
Feb 27 16:12:43 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 4514 ticks
Feb 27 16:12:45 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1357 ticks
Feb 27 16:12:48 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2017 ticks
Feb 27 16:12:54 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 2 secs (2303 ticks). Send out of context hbs to peers from llt_deliver. 177 secs more to go
Feb 27 16:12:54 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2304 ticks
Feb 27 16:12:55 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1711 ticks
Feb 27 16:13:13 CLUSTER-NODE1 rtkit-daemon[4047]: The canary thread is apparently starving. Taking action.
Feb 27 16:13:13 CLUSTER-NODE1 rtkit-daemon[4047]: Demoting known real-time threads.
Feb 27 16:13:13 CLUSTER-NODE1 rtkit-daemon[4047]: Demoted 0 threads.

Feb 27 16:13:13 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 17538 ticks
Feb 27 16:13:13 CLUSTER-NODE1 kernel: GAB INFO V-15-1-20124 timer not called for 18 seconds
Feb 27 16:13:16 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1499 ticks
Feb 27 16:13:19 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1912 ticks
Feb 27 16:13:24 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2438 ticks
Feb 27 16:13:35 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 7 secs (7464 ticks). Send out of context hbs to peers from llt_deliver. 172 secs more to go
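
A simple way to line these hang windows up with overall system load is to log the load average and vmstat output in the background alongside the tail (this is just a sketch; the /tmp paths below are arbitrary):

while true; do echo "$(date) $(cat /proc/loadavg)"; sleep 5; done >> /tmp/loadavg.log &
vmstat -n 5 >> /tmp/vmstat.log &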

7 Replies

  • Hi,

    the highlighted messages are Linux messages and not from VCS. I did some googling and found the following:

    The term "canary" as used here comes from coal mining originally. Coal miners used canaries to detect dangerous gases (if the canary they carried with them died, they knew they had to get out of the shaft/mine ASAP). As a result the term "canary" is now often used for anything that you use to get an (early) warning about a dangerous situation.

    In this case it seems like 'rtkit' starts a "normal" thread to test if the threads that get "real time" priorities are "starving" other threads (& processes), where "starving" means that they get too little processor time. This is a safety measure to make sure that processes/threads that have access to real time priorities don't use up so much CPU time that other tasks get none anymore.

    So apparently some thread(s) that got real-time priorities from rtkit is/are misbehaving and trying to monopolize the CPU; rtkit detects this with its "canary thread" and takes away the real-time priorities.

    To me, it appears that the above messages are the result of the system becoming busy, rather than the cause of the system becoming unresponsive; all of the above messages are symptoms.

    All the LLT messages also indicate that the system is heavily loaded, so you need to troubleshoot from the OS end to find out what is happening.
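
    For example, the usual OS-level starting points look something like this (just a sketch; sar needs the sysstat package, and the intervals/counts are arbitrary):

    vmstat 5 12                 # run queue, CPU and swap activity
    sar -u -P ALL 5 12          # per-CPU utilization, including %iowait
    sar -r 5 12                 # memory and swap usage over time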

     

    G

  • It seems RHEL 6.2 is buggy or not compatible with some hardware.

     

    I feel that we cannot go with RHEL 6.2. To investigate the issue I installed 5.3 again (as it was working fine before), just to make sure whether the issue is related to the OS or to SFHA. I thought I would go with RHEL 5.8 instead of going with 6.x.

  • As per Symantec, RHEL 6.2 is very much supported, so I would say it is worth diagnosing what is wrong with your setup rather than reverting back.

    Have you visited the hardware technote or the compatibility lists to ensure you have everything right, with the right tunables?
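
    For a quick look at what LLT and GAB currently report on this node, something like the following should work (assuming the standard VCS commands are in the path):

    lltstat -nvv | head -20    # link and node status as LLT sees it
    lltstat -c                 # current LLT configuration values
    gabconfig -a               # GAB port memberships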

    G

  • That's alright; my suggestion is to diagnose. If you have to deliver the system early and don't have time to troubleshoot, then you can think of switching back to 5.x, as it has been tested by you and is working correctly.

    G

  • Thanks for your kind words, Gaurav.

     

    I saw that the hardware is certified for RHEL and listed in the document below.

    https://hardware.redhat.com/show.cgi?id=632150

    I wonder what the tunables could be, as I am only able to see the below in /var/log/messages. Moreover, nothing is printed by the top command when the system becomes unresponsive.

    rtkit-daemon[4047]:
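
    Perhaps capturing top in batch mode would help, since the output then goes to a file even if the terminal itself hangs (the /tmp path is just a placeholder):

    top -b -d 5 > /tmp/top-batch.log &
    dmesg | tail -50           # check for kernel-level complaints after a hang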

  • Try looking at open files to see if they are consuming resources; otherwise the best option would be to collect a crash dump and get it analyzed.
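
    A rough sketch of both checks (kdump must already be configured for the crash-dump part, and the forced dump reboots the node, so use it only if asked to):

    lsof | wc -l                    # rough count of open files system-wide
    cat /proc/sys/fs/file-nr        # allocated, free and maximum file handles
    service kdump status            # confirm the kdump service is running
    # echo c > /proc/sysrq-trigger  # forces a crash dump - DESTRUCTIVE, reboots the node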

    G