Forum Discussion

barani_129
Level 3
9 years ago

HAD unregistration problem with GAB (failed) : VCS 6.0

Dear Team,

I'm a newbie here. 

We have a VCS setup on two Linux (6.4) nodes running SFCFSHA 6.0. When the customer tried to reboot the master node, error messages appeared on the iLOM console confirming that HAD had failed to unregister with GAB ("retry 1 ... FAILED"), and it caused a panic on the slave node too. The node is up now and everything looks okay, but we have to find out the cause of this error message.

Please check the attached file for a screenshot of the error message.

Looking forward to your help.

 

Thanks and Regards

Barani

 

 

  • Hi Barani,

    We analyzed main.cf and the engine log. Our findings are as follows:

    1.    Error while shutting down VCS: HAD failed to unregister with GAB.

    During a graceful shutdown/restart of a node, the flow of events is as follows: evacuate the active service groups on the node --> VCS exits --> HAD (port h) unregisters with GAB --> the node shuts down/restarts. Each event is executed only after the previous one succeeds. In this case, the first event failed in the validation phase. This behavior is expected and as per design. The service group dependencies are as follows (a sketch of how such dependencies are declared appears after the list):

    #Parent      Child                   Relationship
    ec1-sg       cme-platform-sg         online global firm
    ec2-sg       cme-platform-sg         online global firm
    ec3-sg       cme-platform-sg         online global firm
    ec4-sg       cme-platform-sg         online global firm
    ec5-sg       cme-platform-sg         online global firm
    ec6-sg       cme-platform-sg         online global firm
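
    For reference, dependencies like these are declared in main.cf with a "requires group" statement under each parent group, or linked at runtime with hagrp -link. Below is a minimal sketch using ec1-sg as an example; the SystemList/AutoStartList values are assumptions, and the resource definitions are omitted:

    // main.cf excerpt (sketch; attribute values are assumed)
    group ec1-sg (
        SystemList = { CGF01 = 0, CGF02 = 1 }
        AutoStartList = { CGF01 }
        )

        // resources omitted ...

        requires group cme-platform-sg online global firm

    # Equivalent runtime command:
    # hagrp -link <parent> <child> <category> <location> <type>
    hagrp -link ec1-sg cme-platform-sg online global firm

    # Display the configured dependencies for the child group
    hagrp -dep cme-platform-sg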

    By the definition of an online-global-firm dependency, the child service group must be online somewhere in the cluster (not necessarily on the same node) in order to bring a parent service group online. Thus, the cme-platform-sg group must remain online as long as any ec*-sg group is online anywhere in the cluster. On 3rd August at 12:00, cme-platform-sg and all of the ec*-sg service groups were online on system CGF01 (the command sketch after the log excerpt shows how group placement can be confirmed).

    2015/08/03 12:00:17 VCS NOTICE V-16-1-10447 Group cme-platform-sg is online on system CGF01
    2015/08/03 12:00:43 VCS NOTICE V-16-1-10447 Group ec3-sg is online on system CGF01
    2015/08/03 12:00:50 VCS NOTICE V-16-1-10447 Group ec4-sg is online on system CGF01
    2015/08/03 12:01:20 VCS NOTICE V-16-1-10447 Group ec1-sg is online on system CGF01
    2015/08/03 12:01:22 VCS NOTICE V-16-1-10447 Group ec5-sg is online on system CGF01
    2015/08/03 12:01:42 VCS NOTICE V-16-1-10447 Group ec6-sg is online on system CGF01
    2015/08/03 12:01:42 VCS NOTICE V-16-1-10447 Group ec2-sg is online on system CGF01
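
    Group placement at any point in time can be confirmed with the standard CLI; a quick sketch (output formats vary slightly by version):

    # Summary of cluster, system, and service group states
    hastatus -sum

    # Per-system state of a specific group, e.g. where ec4-sg is online
    hagrp -state ec4-sg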

    About 20 minutes later, the user manually switched the ec4-sg and ec5-sg service groups over from system CGF01 to system CGF02.

    2015/08/03 12:19:01 VCS NOTICE V-16-1-10208 Initiating switch of group ec4-sg from system CGF01 to system CGF02
    2015/08/03 12:19:32 VCS NOTICE V-16-1-10447 Group ec4-sg is online on system CGF02
    2015/08/03 12:19:06 VCS NOTICE V-16-1-10208 Initiating switch of group ec5-sg from system CGF01 to system CGF02
    2015/08/03 12:21:08 VCS NOTICE V-16-1-10447 Group ec5-sg is online on system CGF02

    A graceful reboot was attempted on 3rd August at 15:14:20. It failed in the validation phase itself: offlining cme-platform-sg would have violated the online-global-firm dependency, as the parent service groups ec4-sg and ec5-sg were still online on system CGF02. The error was seen while shutting down VCS: "VCS WARNING V-16-1-10483 Offlining system CGF01 would result in group dependency being violated for one of the groups online on CGF01; offline the parent of such a group first". Thus VCS did not initiate its stop sequence, and system CGF01 remained in the RUNNING state. Even after 12 retries, HAD did not unregister with GAB.
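
    As the warning itself suggests, the parent groups must be taken offline (or switched back to the rebooting node) before the node hosting the child can be stopped gracefully. A minimal sketch using the group and system names from this cluster; verify the intended placement before running anything like this in production:

    # Offline the parent groups that are online on the other node
    hagrp -offline ec4-sg -sys CGF02
    hagrp -offline ec5-sg -sys CGF02

    # Now stop VCS on CGF01; -evacuate fails its remaining groups
    # over to CGF02 where possible before HAD exits
    hastop -sys CGF01 -evacuate

    # Once HAD (port h) has unregistered, gabconfig -a should no longer
    # show port h membership for CGF01, and the reboot can proceed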

    2.    CGF02 node panicked

    As can be observed in the log snippet below, both LLT heartbeat links went down, which led to a split-brain condition in the cluster. I/O fencing intervened and system CGF02 was fenced out; hence node CGF02 panicked. (A sketch of the relevant health checks follows the snippet.)

    2015/08/03 15:17:40 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth2, UP, eth6, UP; Current status =eth2, DOWN, eth6, DOWN.
    2015/08/03 15:17:41 VCS INFO V-16-1-10077 Received new cluster membership
    2015/08/03 15:17:41 VCS NOTICE V-16-1-10112 System (CGF01) - Membership: 0x1, DDNA: 0x0
    2015/08/03 15:17:41 VCS NOTICE V-16-1-10034 RECONFIG received. VCS waiting for I/O fencing to be completed
    2015/08/03 15:17:42 VCS NOTICE V-16-1-10036 I/O fencing completed
    2015/08/03 15:17:42 VCS ERROR V-16-1-10079 System CGF02 (Node '1') is in Down State - Membership: 0x1
    2015/08/03 15:17:42 VCS ERROR V-16-1-10322 System CGF02 (Node '1') changed state from RUNNING to FAULTED
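
    LLT, GAB, and fencing health can be checked on each node with the stack's standard commands; a small sketch (the eth2/eth6 link names are taken from the log above):

    # LLT link status; both private links (eth2, eth6) should show UP
    lltstat -nvv

    # GAB port membership; port a = GAB, port b = I/O fencing, port h = HAD
    gabconfig -a

    # I/O fencing mode and membership as the fencing driver sees it
    vxfenadm -d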

     

    Hopefully this answers all your queries. Please let us know if any further assistance is needed.

    Thanks & Regards,
    Sunil Y

     

15 Replies