Forum Discussion

barani_129
Level 3
9 years ago

HAD unregistration problem with GAB (failed) : VCS 6.0

Dear Team,

I'm a newbie here. 

We have a VCS setup on two Linux (6.4) nodes running SFCFSHA 6.0. When the customer tried to reboot the master node, error messages appeared on the iLOM console confirming that HAD had failed to unregister with GAB ("retry 1 ... FAILED"), and it caused a panic on the slave node too. The node is up now and everything looks okay, but we have to find out the cause of this error message.

Please check the attached file for a screenshot of the error message.

Looking forward to your help.

 

Thanks and Regards

Barani

 

 

  • Hi Barani,

    We analyzed main.cf and the engine log. Our findings are as follows:

    1.    Error while shutting down VCS: HAD failed to unregister with GAB.

    During a graceful shutdown/restart of a node, the flow of events is as follows: evacuate the active service groups on the node --> VCS exits --> HAD (port h) unregisters with GAB --> the node shuts down/restarts. Each event is executed only after the previous one succeeds. In this case, the first event failed in the validation phase. This behavior is expected and as per design. The service group dependencies are as follows (a sketch of how such dependencies are declared appears after the list):

    #Parent      Child                   Relationship
    ec1-sg       cme-platform-sg         online global firm
    ec2-sg       cme-platform-sg         online global firm
    ec3-sg       cme-platform-sg         online global firm
    ec4-sg       cme-platform-sg         online global firm
    ec5-sg       cme-platform-sg         online global firm
    ec6-sg       cme-platform-sg         online global firm
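
    For reference, dependencies like these are declared in main.cf with a "requires group" statement under each parent group, or linked at runtime with hagrp -link. Below is a minimal sketch using ec1-sg as an example; the SystemList/AutoStartList values are assumptions, and the resource definitions are omitted:

    // main.cf excerpt (sketch; attribute values are assumed)
    group ec1-sg (
        SystemList = { CGF01 = 0, CGF02 = 1 }
        AutoStartList = { CGF01 }
        )

        // resources omitted ...

        requires group cme-platform-sg online global firm

    # Equivalent runtime command:
    # hagrp -link <parent> <child> <category> <location> <type>
    hagrp -link ec1-sg cme-platform-sg online global firm

    # Display the configured dependencies for the child group
    hagrp -dep cme-platform-sg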

    By the definition of an online-global-firm dependency, the child service group must be online somewhere in the cluster (not necessarily on the same node) in order to bring a parent service group online. Thus, the cme-platform-sg group must remain online as long as any ec*-sg group is online anywhere in the cluster. On 3rd August at 12:00, cme-platform-sg and all of the ec*-sg service groups were online on system CGF01 (the command sketch after the log excerpt shows how group placement can be confirmed).

    2015/08/03 12:00:17 VCS NOTICE V-16-1-10447 Group cme-platform-sg is online on system CGF01
    2015/08/03 12:00:43 VCS NOTICE V-16-1-10447 Group ec3-sg is online on system CGF01
    2015/08/03 12:00:50 VCS NOTICE V-16-1-10447 Group ec4-sg is online on system CGF01
    2015/08/03 12:01:20 VCS NOTICE V-16-1-10447 Group ec1-sg is online on system CGF01
    2015/08/03 12:01:22 VCS NOTICE V-16-1-10447 Group ec5-sg is online on system CGF01
    2015/08/03 12:01:42 VCS NOTICE V-16-1-10447 Group ec6-sg is online on system CGF01
    2015/08/03 12:01:42 VCS NOTICE V-16-1-10447 Group ec2-sg is online on system CGF01
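
    Group placement at any point in time can be confirmed with the standard CLI; a quick sketch (output formats vary slightly by version):

    # Summary of cluster, system, and service group states
    hastatus -sum

    # Per-system state of a specific group, e.g. where ec4-sg is online
    hagrp -state ec4-sg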

    About 20 minutes later, the user manually switched the ec4-sg and ec5-sg service groups over from system CGF01 to system CGF02.

    2015/08/03 12:19:01 VCS NOTICE V-16-1-10208 Initiating switch of group ec4-sg from system CGF01 to system CGF02
    2015/08/03 12:19:32 VCS NOTICE V-16-1-10447 Group ec4-sg is online on system CGF02
    2015/08/03 12:19:06 VCS NOTICE V-16-1-10208 Initiating switch of group ec5-sg from system CGF01 to system CGF02
    2015/08/03 12:21:08 VCS NOTICE V-16-1-10447 Group ec5-sg is online on system CGF02

    A graceful reboot was attempted on 3rd August at 15:14:20. It failed in the validation phase itself: offlining cme-platform-sg would have violated the online-global-firm dependency, as the parent service groups ec4-sg and ec5-sg were still online on system CGF02. The error was seen while shutting down VCS: "VCS WARNING V-16-1-10483 Offlining system CGF01 would result in group dependency being violated for one of the groups online on CGF01; offline the parent of such a group first". Thus VCS did not initiate its stop sequence, and system CGF01 remained in the RUNNING state. Even after 12 retries, HAD did not unregister with GAB.
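
    As the warning itself suggests, the parent groups must be taken offline (or switched back to the rebooting node) before the node hosting the child can be stopped gracefully. A minimal sketch using the group and system names from this cluster; verify the intended placement before running anything like this in production:

    # Offline the parent groups that are online on the other node
    hagrp -offline ec4-sg -sys CGF02
    hagrp -offline ec5-sg -sys CGF02

    # Now stop VCS on CGF01; -evacuate fails its remaining groups
    # over to CGF02 where possible before HAD exits
    hastop -sys CGF01 -evacuate

    # Once HAD (port h) has unregistered, gabconfig -a should no longer
    # show port h membership for CGF01, and the reboot can proceed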

    2.    CGF02 node panicked

    As can be observed in the log snippet below, both LLT heartbeat links went down, which led to a split-brain condition in the cluster. I/O fencing intervened and system CGF02 was fenced out; hence node CGF02 panicked. (A sketch of the relevant health checks follows the snippet.)

    2015/08/03 15:17:40 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth2, UP, eth6, UP; Current status =eth2, DOWN, eth6, DOWN.
    2015/08/03 15:17:41 VCS INFO V-16-1-10077 Received new cluster membership
    2015/08/03 15:17:41 VCS NOTICE V-16-1-10112 System (CGF01) - Membership: 0x1, DDNA: 0x0
    2015/08/03 15:17:41 VCS NOTICE V-16-1-10034 RECONFIG received. VCS waiting for I/O fencing to be completed
    2015/08/03 15:17:42 VCS NOTICE V-16-1-10036 I/O fencing completed
    2015/08/03 15:17:42 VCS ERROR V-16-1-10079 System CGF02 (Node '1') is in Down State - Membership: 0x1
    2015/08/03 15:17:42 VCS ERROR V-16-1-10322 System CGF02 (Node '1') changed state from RUNNING to FAULTED
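
    LLT, GAB, and fencing health can be checked on each node with the stack's standard commands; a small sketch (the eth2/eth6 link names are taken from the log above):

    # LLT link status; both private links (eth2, eth6) should show UP
    lltstat -nvv

    # GAB port membership; port a = GAB, port b = I/O fencing, port h = HAD
    gabconfig -a

    # I/O fencing mode and membership as the fencing driver sees it
    vxfenadm -d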

     

    Hopefully this answers all your queries. Please let us know if any further assistance is needed.

    Thanks & Regards,
    Sunil Y

     

15 Replies