06-04-2015 05:51 AM
06-13-2015 06:32 AM
Hi,
We analyzed the config and log files. Our findings are below:
1. We observed the testing activity. For service group failures, failover appears to be working correctly.
On MILWB02S:
2015/05/29 11:19:15 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/05/29 11:42:55 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/05/29 12:46:29 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/06/04 11:09:51 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/06/09 10:41:39 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/06/09 14:19:44 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
On MILWB03S:
2015/05/29 11:19:15 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/05/29 11:42:55 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/05/29 12:46:29 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/06/04 11:09:51 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/06/09 10:41:39 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/06/09 14:19:44 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2. There is no issue with ClusterFailOverPolicy. The value "ClusterFailOverPolicy = Auto" is the same across clusters. However, ClusterList is not the same across clusters.
On MILWB02S:
ClusterList = { MILWB03SCluster = 0, MILWB02SCluster = 1 }
On MILWB03S:
ClusterList = { MILWB02SCluster = 0, MILWB03SCluster = 1 }
This value must also be the same across clusters. Otherwise, it can cause a concurrency violation during Auto-Start. This discrepancy hasn't caused any issue so far, but you should rectify it to avoid future problems.
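As a rough illustration of the check described above, here is a minimal Python sketch that parses the two ClusterList lines quoted in this post and reports which cluster priorities differ. The helper name parse_cluster_list is hypothetical and is not part of VCS; the sketch only handles the single-line form shown here, not a full main.cf.

```python
import re

def parse_cluster_list(line):
    """Parse 'ClusterList = { A = 0, B = 1 }' into {'A': 0, 'B': 1}."""
    body = re.search(r"ClusterList\s*=\s*\{(.*?)\}", line)
    if body is None:
        raise ValueError("no ClusterList attribute found")
    pairs = re.findall(r"(\w+)\s*=\s*(\d+)", body.group(1))
    return {name: int(priority) for name, priority in pairs}

# The two values quoted above, one per cluster.
on_02s = parse_cluster_list("ClusterList = { MILWB03SCluster = 0, MILWB02SCluster = 1 }")
on_03s = parse_cluster_list("ClusterList = { MILWB02SCluster = 0, MILWB03SCluster = 1 }")

# Collect every cluster whose priority differs between the two sides.
mismatched = {
    name: (on_02s.get(name), on_03s.get(name))
    for name in sorted(set(on_02s) | set(on_03s))
    if on_02s.get(name) != on_03s.get(name)
}
print(mismatched)
```

An empty result would mean both sides agree; here both cluster names are reported because their priorities are swapped.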
3. Why didn't cross-cluster failover happen when MILWB02S went down?
On MILWB03S, we observed that MILWB02SCluster had exited (not faulted):
2015/05/29 13:31:58 VCS NOTICE V-16-1-50514 Remote cluster 'MILWB02SCluster' has exited
2015/05/29 13:31:58 VCS INFO V-16-3-18309 (MILWB03S) Cluster MILWB02SCluster exited
2015/05/29 13:42:09 VCS ERROR V-16-3-18211 (MILWB03S) Cluster MILWB03SCluster lost heartbeat Icmp to cluster MILWB02SCluster
The same state transition was confirmed by 'hasys -state':
# hasys -state
# System Attribute Value
MILWB02SCluster:MILWB02S SysState EXITED
localclus:MILWB03S SysState RUNNING
Cross-cluster failover happens only when the remote cluster FAULTS. In this case, the cluster didn't fault; it EXITED. For a cluster fault, the expected log message is:
9999/99/99 23:59:59 VCS CRITICAL V-16-1-50513 Remote cluster 'Xxxx' has faulted
As there was no cluster fault, there was no cross-cluster failover.
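To make the EXITED versus FAULTED distinction easier to spot in a long engine log, a small Python sketch along these lines could be used. It relies only on the two message IDs quoted in this post (V-16-1-50514 for "has exited", V-16-1-50513 for "has faulted"); the function name and the line pattern are assumptions based on the excerpts above, not a complete parser for engine_A.log.

```python
import re

# Message IDs taken from the log excerpts in this thread.
REMOTE_CLUSTER_EVENTS = {"V-16-1-50514": "EXITED", "V-16-1-50513": "FAULTED"}

# Timestamp, severity, message ID, then the quoted remote cluster name.
PATTERN = re.compile(r"(\S+ \S+) VCS \w+ (V-\d+-\d+-\d+) Remote cluster '([^']+)'")

def remote_cluster_events(log_lines):
    """Return (timestamp, cluster, state) for each remote-cluster exit or fault."""
    events = []
    for line in log_lines:
        match = PATTERN.match(line)
        if match and match.group(2) in REMOTE_CLUSTER_EVENTS:
            state = REMOTE_CLUSTER_EVENTS[match.group(2)]
            events.append((match.group(1), match.group(3), state))
    return events

sample = [
    "2015/05/29 13:31:58 VCS NOTICE V-16-1-50514 Remote cluster 'MILWB02SCluster' has exited",
    "2015/05/29 13:31:58 VCS INFO V-16-3-18309 (MILWB03S) Cluster MILWB02SCluster exited",
]
events = remote_cluster_events(sample)
print(events)
```

Only events reported as FAULTED would be expected to lead to a cross-cluster failover under ClusterFailOverPolicy = Auto, per the explanation above.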
4. Was there an automated switchover of AppService from MILWB03S to MILWB02S?
We verified all occasions when AppService went online on MILWB02S. Every time, it was a user-initiated action. AppService never switched over automatically from MILWB03S to MILWB02S.
# 1
2015/05/29 14:33:32 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService MILWB02S from localhost
.
.
.
2015/05/29 14:35:11 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 2
2015/05/29 15:49:30 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService MILWB02S from localhost
.
.
.
2015/05/29 15:50:58 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 3
2015/06/01 08:39:34 VCS INFO V-16-1-50135 User root fired command: hagrp -switch AppService MILWB02S MILWB02SCluster from localhost
2015/06/01 08:39:34 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/01 08:42:00 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 4
2015/06/03 11:09:32 VCS INFO V-16-1-50135 User root fired command: hagrp -flush AppService MILWB02S 0 from localhost
.
.
.
2015/06/03 11:10:58 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 5
2015/06/03 16:38:36 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/03 16:45:10 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 6
2015/06/04 10:44:50 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService MILWB02S from localhost
.
.
.
2015/06/04 10:46:24 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 7
2015/06/09 09:38:40 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/09 09:41:07 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
# 8
2015/06/12 08:11:00 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/12 08:13:25 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
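The check in point 4 can be approximated with a short Python sketch that pairs each "is online" notice with the most recent preceding request. This is a heuristic under stated assumptions: it uses only the three message IDs quoted above, the function name online_triggers is hypothetical, and it does not correlate group names or time windows as a careful analysis of a full log would.

```python
import re

USER_COMMAND = "V-16-1-50135"    # "User root fired command: ..."
SWITCH_REQUEST = "V-16-1-50803"  # "Received request to switch group ..."
GROUP_ONLINE = "V-16-1-10447"    # "Group ... is online on system ..."

LINE = re.compile(r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS \w+ (V-\d+-\d+-\d+) (.*)")

def online_triggers(log_lines):
    """Pair each group-online notice with the most recent preceding request."""
    last_trigger = None
    result = []
    for line in log_lines:
        match = LINE.match(line)
        if not match:
            continue  # skip elided or unrelated lines
        stamp, msg_id, text = match.groups()
        if msg_id in (USER_COMMAND, SWITCH_REQUEST):
            last_trigger = text
        elif msg_id == GROUP_ONLINE:
            result.append((stamp, last_trigger))
            last_trigger = None  # each request accounts for one online event
    return result

sample = [
    "2015/05/29 14:33:32 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService MILWB02S from localhost",
    "2015/05/29 14:35:11 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S",
    "2015/06/03 16:38:36 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S",
    "2015/06/03 16:45:10 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S",
]
triggers = online_triggers(sample)
for stamp, trigger in triggers:
    print(stamp, "<-", trigger)
```

An online event paired with None would have no recorded request before it and would be the one to look at more closely as a possible automatic action.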
Hopefully, we have addressed all your queries. Please let us know if any further assistance is needed.
Thanks & Regards,
Sunil Y
06-09-2015 11:16 PM
Looks like something is wrong in the configuration. Can you share the main.cf and engine_A.log for the period of your test?
Regards,
Venkat
06-10-2015 12:15 AM
On the 1st occasion, cross-cluster failover didn't happen when MILWB02S went down. You had to manually online AppService on MILWB03S.
On the 2nd occasion, automated cross-cluster failover seems to have kicked in. AppService automatically switched over from MILWB03S to MILWB02S.
Prima facie, there seems to be an issue with the ClusterFailOverPolicy attribute. For the AppService service group, can you make sure the value of ClusterFailOverPolicy is the same on both clusters? Execute 'hagrp -display AppService -attribute ClusterFailOverPolicy' on both clusters. The same value must be shown for both clusters on both sides.
However, for an accurate RCA, please share main.cf and the engine log (from both clusters, i.e. MILWB02S and MILWB03S).
Thanks & Regards,
Sunil Y
06-12-2015 12:40 AM
Hi all.
the second main.cf file.
BR
Tiziano
06-12-2015 12:41 AM
Hi all.
the first main.cf file.
root@MILWB02S # hagrp -display AppService -attribute ClusterFailOverPolicy
#Group Attribute System Value
AppService ClusterFailOverPolicy MILWB03SCluster Auto
AppService ClusterFailOverPolicy localclus Auto
root@MILWB03S # hagrp -display AppService -attribute ClusterFailOverPolicy
#Group Attribute System Value
AppService ClusterFailOverPolicy MILWB02SCluster Auto
AppService ClusterFailOverPolicy localclus Auto
BR
Tiziano
06-12-2015 12:49 AM
Hi.
Tiziano
06-12-2015 12:51 AM
Hi.
BR
Tiziano
06-12-2015 12:52 AM
Hi all.
the first main.cf file.
BR
Tiziano
06-12-2015 06:39 AM
Hi
In the log file you can see a lot of activity on the global cluster, because I tested HA by simulating many failures over the last few days.
Tiziano
06-16-2015 03:34 PM
Hi Sunil.
Thank you very much for your hint. I now understand that the Solaris command 'halt' is not a good way to simulate a cluster fault. You told me that EXITED status doesn't mean a fault (I'm sorry for the misunderstanding). Now I'm going to find a command to simulate a real fault so that the cluster status becomes FAULTED, and after that I'll wait for a node switch.
Now I'm going to analyze the values of the ClusterList attribute.
Best Regards.
Tiziano
06-17-2015 01:00 AM
Yes, LLT/GAB runs in the kernel, so halt is seen by VCS and the cluster shows as EXITED. Use "uadmin 2 1": this is very quick and usually is not seen by VCS, so the other cluster will show it as FAULTED. Alternatively, you can power down the system from the ALOM.
Mike
06-24-2015 03:31 PM
Hi Mike.
I'm on vacation now. As soon as possible, I'll test the primary node failure with the 'uadmin' command and give you feedback.
Thanks in advance.
Tiziano
07-03-2015 12:31 AM
Hi Mike.
After my vacation I used your hint to simulate a node fault, and I saw a FAULTED state instead of an EXITED state. Obvious for you, I know. I'm sorry for my misunderstanding.
Thank you very much.
Tiziano