
HAD unregistration problem with GAB (failed): VCS 6.0

barani_129
Level 3

Dear Team,

I'm a newbie here.

We have a VCS setup on two Linux (6.4) nodes (SFCFSHA 6.0). When the customer tried to reboot the master node, error messages reporting that HAD had failed to unregister with GAB (retry 1 FAILED) appeared on the iLOM console, and the slave node panicked as well. The node is now up and everything looks okay, but we need to find out the cause of this error message.

Please check the attached file for screenshot of the error message.

Looking forward to your help.

 

Thanks and Regards

Barani

 

 



RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified

Please post your main.cf / engine_A.log if you don't mind. It seems a service group dependency is configured that was preventing the service groups on CGF01 from going offline.

Sunil_Yadav
Level 4
Employee

Hi Barani,

Before HAD is stopped, active/online service groups are evacuated from the node. While evacuating the active/online SGs from the node going down, VCS takes care of dependencies, stop order, etc. However, there was an error about a dependency violation while the SGs were going down on CGF01. The dependency type and the SGs' states may give us a clue about this error. We need more evidence (at least main.cf and the engine log) for an RCA of the issue.
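For reference, these are the default locations of those files on a VCS node (standard paths; adjust if your installation differs):

cat /etc/VRTSvcs/conf/config/main.cf     # cluster configuration
less /var/VRTSvcs/log/engine_A.log       # HAD engine log
hastatus -sum                            # quick summary of current system/group states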

Thanks & Regards,

Sunil Y

barani_129
Level 3

Hi Riaan/Sunil,

Thanks for your reply. Please find the attached main.cf and engine_A.log files. We also need to identify why the second node (CGF02) got rebooted at that time.

Looking forward to your support.

Thanks and Regards,

Barani

mikebounds
Level 6
Partner Accredited

The cluster was up and down a lot on the 4th Aug:

2015/08/04 14:17:29 VCS NOTICE V-16-1-10322 System CGF01 (Node '0') changed state from RUNNING to LEAVING
2015/08/04 14:17:29 VCS NOTICE V-16-1-10322 System CGF02 (Node '1') changed state from RUNNING to LEAVING

2015/08/04 15:35:14 VCS NOTICE V-16-1-10322 System CGF02 (Node '1') changed state from RUNNING to LEAVING

2015/08/04 18:27:19 VCS INFO V-16-1-50135 User root fired command: MSG_CLUSTER_STOP_ALL from localhost
2015/08/04 18:27:19 VCS NOTICE V-16-1-10322 System CGF01 (Node '0') changed state from RUNNING to LEAVING
2015/08/04 18:27:19 VCS NOTICE V-16-1-10322 System CGF02 (Node '1') changed state from RUNNING to LEAVING

 

2015/08/04 18:29:09 VCS NOTICE V-16-1-10322 System CGF02 (Node '1') changed state from LEAVING to EXITING
2015/08/04 18:29:09 VCS NOTICE V-16-1-10322 System CGF02 (Node '1') changed state from EXITING to EXITED

So can you provide the following:

  1. Which node did you reboot (CGF01 or CGF02) when you had the issue, and did you reboot because of a problem or just for maintenance?
  2. What command did you run to reboot the node?
  3. On what day and at approximately what time did you shut down the node?
  4. Approximately how long after that did the other node panic (10 seconds, 5 minutes, 30 minutes)?
  5. The engine log from the other node (you only provided the log from one node)
  6. The system log from the node that panicked

 

Mike

RiaanBadenhorst
Moderator
Partner    VIP    Accredited Certified

There seem to be some issues with your agent scripts.

 

2015/07/27 22:23:38 VCS INFO V-16-2-13716 (CGF02) Resource(GTPReceiver2-wf-res): Output of the completed operation (clean)
==============================================
/opt/VRTSvcs/bin/Workflow/clean: line 67: hagrp: command not found
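That "hagrp: command not found" failure usually just means the script's environment is missing the VCS binary directory. A minimal sketch of the kind of fix, assuming a missing PATH entry really is the cause (the actual script contents are not shown here):

# Near the top of /opt/VRTSvcs/bin/Workflow/clean (sketch, assuming a PATH problem)
PATH=$PATH:/opt/VRTSvcs/bin
export PATH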

 

2015/07/28 19:43:12 VCS ERROR V-16-2-13067 (CGF01) Agent is calling clean for resource(wf-lister-res) because the resource became OFFLINE unexpectedly, on its own.
2015/07/28 19:43:12 VCS INFO V-16-10031-504 (CGF01) Application:wf-lister-res:clean:Executed /opt/cmd/Mediate/script/ha/cleanWfLister.sh as user cmd
2015/07/28 19:43:22 VCS WARNING V-16-10031-542 (CGF01) Application:wf-lister-res:clean:PidFile </var/opt/cmd/run/WfLister.sh.pid> does not exist, process will not be killed

 

There were a lot of instances where the SGs went up and down (as Mike said), and they kept flapping between the nodes.

barani_129
Level 3

Hello Riaan/Mike,

Node CGF01 was rebooted on 3rd August at 15:14. That is when this error message appeared and node CGF02 panicked.

 

Br/Barani

mikebounds
Level 6
Partner Accredited

We need the engine log and system log from CGF02.

Mike

mikebounds
Level 6
Partner Accredited

From the engine log on CGF01, it looks like you had a network issue at the same time you shut down the node:

 2015/08/03 15:17:27 VCS INFO V-16-1-50135 User root fired command: MSG_CLUSTER_STOP_SYS from localhost
2015/08/03 15:17:37 VCS ERROR V-16-1-54031 Resource nic_om (Owner: Unspecified, Group: net-om-sg) is FAULTED on sys CGF01
2015/08/03 15:17:37 VCS NOTICE V-16-1-10300 Initiating Offline of Resource net-om-phantom-res (Owner: Unspecified, Group: net-om-sg) on System CGF01
2015/08/03 15:17:37 VCS ERROR V-16-1-54031 Resource nic_charging (Owner: Unspecified, Group: net-charging-sg) is FAULTED on sys CGF01
2015/08/03 15:17:37 VCS NOTICE V-16-1-10300 Initiating Offline of Resource net-charging-phantom-res (Owner: Unspecified, Group: net-charging-sg) on System CGF01
2015/08/03 15:17:37 VCS ERROR V-16-2-13067 (CGF01) Agent is calling clean for resource(ec3-ip-res) because the resource became OFFLINE unexpectedly, on its own.
2015/08/03 15:17:37 VCS INFO V-16-2-13068 (CGF01) Resource(ec3-ip-res) - clean completed successfully.
2015/08/03 15:17:37 VCS INFO V-16-1-10307 Resource ec3-ip-res (Owner: Unspecified, Group: ec3-sg) is offline on CGF01 (Not initiated by VCS)
2015/08/03 15:17:37 VCS NOTICE V-16-1-10300 Initiating Offline of Resource ec3-ec-res (Owner: Unspecified, Group: ec3-sg) on System CGF01
2015/08/03 15:17:37 VCS ERROR V-16-2-13067 (CGF01) Agent is calling clean for resource(ne3s-ip-res) because the resource became OFFLINE unexpectedly, on its own.
2015/08/03 15:17:37 VCS INFO V-16-2-13068 (CGF01) Resource(ne3s-ip-res) - clean completed successfully.
2015/08/03 15:17:37 VCS INFO V-16-1-10307 Resource ne3s-ip-res (Owner: Unspecified, Group: esymac-sg) is offline on CGF01 (Not initiated by VCS)

 

So here we see the network resources nic_om, ec3-ip-res and ne3s-ip-res faulting. They should not have faulted just because the node was shutting down; in particular, nic_om is a persistent resource, so VCS cannot offline it, it only monitors it. So if there was a network issue, the LLT network may have failed too, in which case fencing kicks in and the losing node panics.

So have a look in the system log on both nodes to see if there are network errors.
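For reference, a couple of standard checks for the private interconnect (a sketch; run on both nodes):

lltstat -nvv                                      # LLT link status (UP/DOWN) per node and link
grep -iE "eth2|eth6|link down" /var/log/messages  # look for NIC/link errors around the time of the reboot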

Mike

 

barani_129
Level 3

Thanks Mike, really appreciate your support. :)

Unfortunately both nodes have now been upgraded to the latest application version (yes, I wasn't informed about this).

So we have lost all the logs. I will pass your findings to the customer and try to close this case.

@Riaan/Sunil: Thanks for your support too. :)

If we can't proceed without any other log files, then we can consider this answered.

Br/Barani

Sunil_Yadav
Level 4
Employee

Hi Barani,

We analyzed the main.cf and the engine log. Our findings are as follows:

1. Error while shutting down VCS: HAD failed to unregister with GAB.

In a graceful shutdown/restart of a node, the flow of events is: evacuate active service groups on the node --> VCS exits --> HAD (port h) unregisters with GAB --> node shuts down/restarts. Each event is executed only after the previous one succeeds. In this case, the first event failed in its validation phase. This behavior is expected and by design. The service group dependency is:

#Parent      Child                   Relationship
ec1-sg       cme-platform-sg         online global firm
ec2-sg       cme-platform-sg         online global firm
ec3-sg       cme-platform-sg         online global firm
ec4-sg       cme-platform-sg         online global firm
ec5-sg       cme-platform-sg         online global firm
ec6-sg       cme-platform-sg         online global firm

By definition of the online-global-firm dependency, the child service group must be online somewhere in the cluster (not necessarily on the same node) in order to bring a parent service group online. Thus, cme-platform-sg must remain online as long as any ec*-sg group is online anywhere in the cluster. On 3rd August at 12:00, cme-platform-sg and all the ec*-sg service groups were online on system CGF01.

2015/08/03 12:00:17 VCS NOTICE V-16-1-10447 Group cme-platform-sg is online on system CGF01
2015/08/03 12:00:43 VCS NOTICE V-16-1-10447 Group ec3-sg is online on system CGF01
2015/08/03 12:00:50 VCS NOTICE V-16-1-10447 Group ec4-sg is online on system CGF01
2015/08/03 12:01:20 VCS NOTICE V-16-1-10447 Group ec1-sg is online on system CGF01
2015/08/03 12:01:22 VCS NOTICE V-16-1-10447 Group ec5-sg is online on system CGF01
2015/08/03 12:01:42 VCS NOTICE V-16-1-10447 Group ec6-sg is online on system CGF01
2015/08/03 12:01:42 VCS NOTICE V-16-1-10447 Group ec2-sg is online on system CGF01

About 20 minutes later, the user manually switched the ec4-sg and ec5-sg service groups from system CGF01 to system CGF02.

2015/08/03 12:19:01 VCS NOTICE V-16-1-10208 Initiating switch of group ec4-sg from system CGF01 to system CGF02
2015/08/03 12:19:32 VCS NOTICE V-16-1-10447 Group ec4-sg is online on system CGF02
2015/08/03 12:19:06 VCS NOTICE V-16-1-10208 Initiating switch of group ec5-sg from system CGF01 to system CGF02
2015/08/03 12:21:08 VCS NOTICE V-16-1-10447 Group ec5-sg is online on system CGF02

A graceful reboot was attempted on 3rd August at 15:14:20. It failed in the validation phase itself: offlining cme-platform-sg would have violated the online-global-firm dependency, because the parent SGs ec4-sg and ec5-sg were still online on system CGF02. This is the error seen while shutting down VCS: "VCS WARNING V-16-1-10483 Offlining system CGF01 would result in group dependency being violated for one of the groups online on CGF01; offline the parent of such a group first". As a result, VCS did not initiate its stop sequence, system CGF01 remained in the RUNNING state, and after 12 retries HAD had not unregistered with GAB.
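For reference, the dependency and group placement that blocked the shutdown could have been checked beforehand with the standard CLI (a sketch):

hagrp -dep cme-platform-sg   # lists the ec*-sg parents linked online global firm
hagrp -state                 # shows ec4-sg/ec5-sg online on CGF02 and cme-platform-sg on CGF01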

2. Node CGF02 panicked

As can be observed in the log snippet below, both LLT heartbeat links went down, which led to a split brain in the cluster. I/O fencing intervened and system CGF02 was fenced out; hence node CGF02 panicked.

2015/08/03 15:17:40 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth2, UP, eth6, UP; Current status =eth2, DOWN, eth6, DOWN.
2015/08/03 15:17:41 VCS INFO V-16-1-10077 Received new cluster membership
2015/08/03 15:17:41 VCS NOTICE V-16-1-10112 System (CGF01) - Membership: 0x1, DDNA: 0x0
2015/08/03 15:17:41 VCS NOTICE V-16-1-10034 RECONFIG received. VCS waiting for I/O fencing to be completed
2015/08/03 15:17:42 VCS NOTICE V-16-1-10036 I/O fencing completed
2015/08/03 15:17:42 VCS ERROR V-16-1-10079 System CGF02 (Node '1') is in Down State - Membership: 0x1
2015/08/03 15:17:42 VCS ERROR V-16-1-10322 System CGF02 (Node '1') changed state from RUNNING to FAULTED
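Should a similar event occur again, the fencing and membership state can be confirmed with the standard utilities (a sketch):

vxfenadm -d    # I/O fencing mode and current cluster membership
gabconfig -a   # GAB port membership (port a = GAB, port b = fencing, port h = HAD)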

 

Hopefully this answers all your queries. Please let us know if any further assistance is needed.

Thanks & Regards,
Sunil Y

 

barani_129
Level 3

Hi Sunil,

Thanks for such a detailed analysis. So should we have all the ECs running on the master node (where cme-platform-sg is running) when we shut down the platform? How do we remove this dependency in the main.cf file?

Thanks and Regards,

Barani

 

barani_129
Level 3

Hello Sunil,

We would also like to know how to avoid this kind of incident in the future.

Thanks and Regards,

Barani

Sunil_Yadav
Level 4
Employee

Hi Barani,

By the nature/definition of the online-global-firm dependency between ec*-sg and cme-platform-sg, it is not mandatory to keep all of them online on the same node. So we do not require ec*-sg and cme-platform-sg to be kept together on the master node when shutting down/rebooting. Before initiating the shutdown/reboot, the user can manually offline any ec*-sg groups that are online on other systems. Then the dependency will not be violated when cme-platform-sg is taken offline as part of the shutdown/reboot.
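For example, with the groups from this thread (a sketch; adjust group/system names to the actual state at the time):

hagrp -state                       # confirm where ec*-sg and cme-platform-sg are online
hagrp -offline ec4-sg -sys CGF02   # offline the parents that are online on the other node
hagrp -offline ec5-sg -sys CGF02
# ...then reboot CGF01; taking cme-platform-sg offline no longer violates the dependency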

There is no issue at all with the dependency and you needn't remove it.
Nevertheless, just for reference: when the cluster is stopped, you can remove the dependency by deleting the "requires group cme-platform-sg online global firm" clause from the ec*-sg group definitions in main.cf. While the cluster is running, you can use the "hagrp -unlink" CLI.
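A minimal sketch of the online method (haconf opens the running configuration for writes and dumps it back to disk; repeat the unlink for each ec*-sg group):

haconf -makerw                         # make the running configuration writable
hagrp -unlink ec1-sg cme-platform-sg   # remove the parent/child link (repeat for ec2-sg .. ec6-sg)
haconf -dump -makero                   # write main.cf to disk and make it read-only again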

Thanks & Regards,
Sunil Y

barani_129
Level 3

Thanks Sunil. All my queries are answered. :)

 

Br/Barani

Sunil_Yadav
Level 4
Employee

It’s my pleasure.

Thanks & Regards,
Sunil Y