Forum Discussion

Zahid_Haseeb
Moderator
11 years ago

System unresponsive for a while when LLT/GAB error prints in /var/log/messages

Environment

Linux RHEL = 6.2

SFHA/DR = 6.0.2

Query

I installed SFHA 6.0.2 and configured it. The cluster has only one node at this time; I installed LLT and GAB as well so that I can add a second node to the cluster in the future. I noticed that the system becomes unresponsive for a few minutes and then responsive again. On one Linux terminal I ran tail -f /var/log/messages and waited for the system to become unresponsive. I noticed that the cluster node becomes unresponsive as the messages below are printed in the log.

Feb 27 16:11:46 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 5 secs (5274 ticks). Send out of context hbs to peers from llt_deliver. 174 secs more to go
Feb 27 16:11:46 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 5275 ticks
Feb 27 16:11:49 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1213 ticks
Feb 27 16:11:49 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1850 ticks
Feb 27 16:11:57 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 4246 ticks
Feb 27 16:11:57 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1676 ticks
Feb 27 16:12:02 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2592 ticks
Feb 27 16:12:07 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 3 secs (3528 ticks). Send out of context hbs to peers from llt_deliver. 176 secs more to go
Feb 27 16:12:07 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 3529 ticks
Feb 27 16:12:17 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 9895 ticks
Feb 27 16:12:17 CLUSTER-NODE1 kernel: GAB INFO V-15-1-20124 timer not called for 10 seconds
Feb 27 16:12:19 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1833 ticks
Feb 27 16:12:21 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1151 ticks
Feb 27 16:12:25 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1318 ticks
Feb 27 16:12:32 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2451 ticks
Feb 27 16:12:43 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 4 secs (4513 ticks). Send out of context hbs to peers from llt_deliver. 175 secs more to go
Feb 27 16:12:43 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 4514 ticks
Feb 27 16:12:45 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1357 ticks
Feb 27 16:12:48 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2017 ticks
Feb 27 16:12:54 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 2 secs (2303 ticks). Send out of context hbs to peers from llt_deliver. 177 secs more to go
Feb 27 16:12:54 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2304 ticks
Feb 27 16:12:55 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1711 ticks
Feb 27 16:13:13 CLUSTER-NODE1 rtkit-daemon[4047]: The canary thread is apparently starving. Taking action.
Feb 27 16:13:13 CLUSTER-NODE1 rtkit-daemon[4047]: Demoting known real-time threads.
Feb 27 16:13:13 CLUSTER-NODE1 rtkit-daemon[4047]: Demoted 0 threads.

Feb 27 16:13:13 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 17538 ticks
Feb 27 16:13:13 CLUSTER-NODE1 kernel: GAB INFO V-15-1-20124 timer not called for 18 seconds
Feb 27 16:13:16 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1499 ticks
Feb 27 16:13:19 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 1912 ticks
Feb 27 16:13:24 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10035 timer not called for 2438 ticks
Feb 27 16:13:35 CLUSTER-NODE1 kernel: LLT INFO V-14-1-10541 llt_send_hb: timer not called for 7 secs (7464 ticks). Send out of context hbs to peers from llt_deliver. 172 secs more to go
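
A simple way to line these hang windows up with overall system load is to log the load average and vmstat output in the background alongside the tail (this is just a sketch; the /tmp paths below are arbitrary):

while true; do echo "$(date) $(cat /proc/loadavg)"; sleep 5; done >> /tmp/loadavg.log &
vmstat -n 5 >> /tmp/vmstat.log &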

7 Replies

  • Hi,

    the highlighted messages are Linux messages and not from VCS. I did some googling and found the following:

    The term "canary" as used here comes from coal mining originally. Coal miners used canaries to detect dangerous gases (if the canary they carried with them died, they knew they had to get out of the shaft/mine ASAP). As a result the term "canary" is now often used for anything that you use to get an (early) warning about a dangerous situation.

    In this case it seems like 'rtkit' starts a "normal" thread to test if the threads that get "real time" priorities are "starving" other threads (& processes), where "starving" means that they get too little processor time. This is a safety measure to make sure that processes/threads that have access to real time priorities don't use up so much CPU time that other tasks get none anymore.

    So apparently some thread(s) that got real-time priorities from rtkit is/are misbehaving and trying to monopolize the CPU; rtkit detects this with its "canary thread" and takes away the real-time priorities.

    To me, it appears that the above messages are the result of the system becoming busy, rather than the cause of the system becoming unresponsive; all of the above messages are symptoms.

    All the LLT messages also indicate that the system is heavily loaded, so you need to troubleshoot from the OS end to find out what is happening.
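
    For example, the usual OS-level starting points look something like this (just a sketch; sar needs the sysstat package, and the intervals/counts are arbitrary):

    vmstat 5 12                 # run queue, CPU and swap activity
    sar -u -P ALL 5 12          # per-CPU utilization, including %iowait
    sar -r 5 12                 # memory and swap usage over time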

     

    G

  • It seems RHEL 6.2 is buggy or not compatible with some hardware.

     

    I feel that we cannot go with RHEL 6.2. To investigate the issue I installed 5.3 again (as it was working fine before), just to make sure whether the issue is related to the OS or to SFHA. I thought I would go with RHEL 5.8 instead of going with 6.x.

  • As per Symantec, RHEL 6.2 is very much supported, so I would say it is worth diagnosing what is wrong with your setup rather than reverting back.

    Have you visited the hardware technote or the compatibility lists to ensure you have everything right, with the right tunables?
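
    For a quick look at what LLT and GAB currently report on this node, something like the following should work (assuming the standard VCS commands are in the path):

    lltstat -nvv | head -20    # link and node status as LLT sees it
    lltstat -c                 # current LLT configuration values
    gabconfig -a               # GAB port memberships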

    G

  • That's alright; my suggestion is to diagnose. If you have to deliver the system early and don't have time to troubleshoot, then you can think of switching back to 5.x, as it has been tested by you and is working correctly.

    G

  • Thanks for your kind words, Gaurav.

     

    I saw that the hardware is certified for RHEL and listed in the document below.

    https://hardware.redhat.com/show.cgi?id=632150

    I wonder what the tunables could be, as I am only able to see the below in /var/log/messages. Moreover, nothing is printed by the top command when the system becomes unresponsive.

    rtkit-daemon[4047]:
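
    Perhaps capturing top in batch mode would help, since the output then goes to a file even if the terminal itself hangs (the /tmp path is just a placeholder):

    top -b -d 5 > /tmp/top-batch.log &
    dmesg | tail -50           # check for kernel-level complaints after a hang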

  • Try looking at open files to see if they are consuming resources; otherwise the best option would be to collect a crash dump and get it analyzed.
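
    A rough sketch of both checks (kdump must already be configured for the crash-dump part, and the forced dump reboots the node, so use it only if asked to):

    lsof | wc -l                    # rough count of open files system-wide
    cat /proc/sys/fs/file-nr        # allocated, free and maximum file handles
    service kdump status            # confirm the kdump service is running
    # echo c > /proc/sysrq-trigger  # forces a crash dump - DESTRUCTIVE, reboots the node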

    G