fault of primary node in a global cluster

Hi guys,

I have a global cluster formed by two mini-clusters of one node each, synchronized in asynchronous mode. I wanted to simulate a fault of the primary node with the Solaris command 'halt'.

root@MILWB02S # hagrp -state AppService
#Group       Attribute    System                      Value
AppService   State        MILWB03SCluster:MILWB03S    |OFFLINE|
AppService   State        localclus:MILWB02S          |ONLINE|

After 'halt' on the primary node MILWB02S, we have:

root@MILWB03S # hagrp -state AppService
#Group       Attribute    System                      Value
AppService   State        MILWB02SCluster:MILWB02S    |OFFLINE|
AppService   State        localclus:MILWB03S          |OFFLINE|

root@MILWB03S # hasys -state
#System                     Attribute    Value
MILWB02SCluster:MILWB02S    SysState     EXITED
localclus:MILWB03S          SysState     RUNNING

root@MILWB03S # vradmin -g datadg repstatus datarvg
VxVM VVR vradmin INFO V-5-52-1205 Primary is unreachable or RDS has configuration error. Displayed status information is from Secondary and can be out-of-date.
Replicated Data Set: datarvg
Primary:
  Host name:          10.66.28.53
  RVG name:           datarvg
  DG name:            datadg
  RVG state:          enabled for I/O
  Data volumes:       1
  VSets:              0
  SRL name:           srl_vol
  SRL size:           1.00 G
  Total secondaries:  1
Secondary:
  Host name:          10.66.28.54
  RVG name:           datarvg
  DG name:            datadg
  Data status:        consistent, up-to-date
  Replication status: paused due to network disconnection
  Current mode:       asynchronous
  Logging to:         SRL (0 updates behind, last update ID 5730.50511)
  Timestamp Information: behind by 0h 0m 0s
    Last Update on Primary:     May 29 13:32:06
    Secondary up-to-date as of: May 29 13:32:06
Config Errors:
  10.66.28.53: Pri or Sec IP not available or vradmind not running, stale information

Is this situation correct?

Since MILWB02S was down, I decided to manually start the service group (AppService) on the secondary node:

root@MILWB03S # hagrp -online -force AppService -sys MILWB03S

root@MILWB03S # hagrp -state AppService
#Group       Attribute    System                      Value
AppService   State        MILWB02SCluster:MILWB02S    |OFFLINE|
AppService   State        localclus:MILWB03S          |ONLINE|

root@MILWB03S # vradmin -g datadg repstatus datarvg
Replicated Data Set: datarvg
Primary:
  Host name:          10.66.28.54
  RVG name:           datarvg
  DG name:            datadg
  RVG state:          enabled for I/O
  Data volumes:       1
  VSets:              0
  SRL name:           srl_vol
  SRL size:           1.00 G
  Total secondaries:  1
Config Errors:
  10.66.28.53: Pri or Sec IP not available or vradmind not running

After quite some time, I booted the downed server again, and I noticed an automatic switch of the service from MILWB03S to MILWB02S:

root@MILWB02S # hagrp -state AppService
#Group       Attribute    System                      Value
AppService   State        MILWB03SCluster:MILWB03S    |OFFLINE|
AppService   State        localclus:MILWB02S          |ONLINE|

root@MILWB02S # vradmin -g datadg repstatus datarvg
Replicated Data Set: datarvg
Primary:
  Host name:          10.66.28.53
  RVG name:           datarvg
  DG name:            datadg
  RVG state:          enabled for I/O
  Data volumes:       1
  VSets:              0
  SRL name:           srl_vol
  SRL size:           1.00 G
  Total secondaries:  1
Config Errors:
  10.66.28.54: Primary-Primary configuration

Is this situation correct? Why did the cluster switch the service?

13 Replies

Looks like something is wrong in the configuration. Can you share the main.cf and engine_A.log for the period of your test?

Regards,

Venkat


On the first occasion, cross-cluster failover didn't happen when MILWB02S went down; you had to manually online AppService on MILWB03S.

On the second occasion, automated cross-cluster failover seems to have kicked in: AppService automatically switched over from MILWB03S to MILWB02S.

Prima facie, there seems to be an issue with the ClusterFailOverPolicy attribute. For the AppService service group, please make sure the value of ClusterFailOverPolicy is the same for both clusters: execute "hagrp -display AppService -attribute ClusterFailOverPolicy" on both clusters. The same value must be present for both clusters on both sides.

However, for an accurate RCA, please share main.cf and the engine log (from both clusters, i.e. MILWB02S and MILWB03S).

 

Thanks & Regards,

Sunil Y


Hi all,

The second main.cf file.

 

BR

Tiziano


Hi all,

The first main.cf file.

 

root@MILWB02S # hagrp -display AppService -attribute ClusterFailOverPolicy
#Group       Attribute             System                    Value
AppService   ClusterFailOverPolicy MILWB03SCluster           Auto
AppService   ClusterFailOverPolicy localclus                 Auto

 

root@MILWB03S # hagrp -display AppService -attribute ClusterFailOverPolicy
#Group       Attribute             System                    Value
AppService   ClusterFailOverPolicy MILWB02SCluster           Auto
AppService   ClusterFailOverPolicy localclus                 Auto

 

BR

Tiziano


Hi all,

The first main.cf file.

BR

Tiziano


Hi,

In the log file you can see a lot of activity on the global cluster, because I tested the HA by simulating many failures over the last few days.

 

Tiziano

Accepted Solution!

Hi,

 

We analyzed the config and log files. Our findings are below:

 

1.    We observed your testing activities. For service group failures, failover seems to be working correctly.

On MILWB02S:

2015/05/29 11:19:15 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/05/29 11:42:55 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/05/29 12:46:29 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/06/04 11:09:51 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/06/09 10:41:39 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
2015/06/09 14:19:44 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]

On MILWB03S:

2015/05/29 11:19:15 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/05/29 11:42:55 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/05/29 12:46:29 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/06/04 11:09:51 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/06/09 10:41:39 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
2015/06/09 14:19:44 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster

 


2.    There is no issue with ClusterFailOverPolicy. The value "ClusterFailOverPolicy = Auto" is the same across clusters. However, ClusterList is not the same across clusters.

On MILWB02S 
    ClusterList = { MILWB03SCluster = 0, MILWB02SCluster = 1 }

On MILWB03S 
    ClusterList = { MILWB02SCluster = 0, MILWB03SCluster = 1 }

This value too must be the same across clusters; otherwise, it can create a concurrency violation during auto-start. The discrepancy hasn't caused any issue so far, but you should rectify it to avoid future problems.
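
A minimal sketch of the fix, assuming the ordering on MILWB03S is the intended one (verify the desired priority order first; the exact -update syntax may vary by VCS version):

root@MILWB02S # haconf -makerw
root@MILWB02S # hagrp -modify AppService ClusterList -update MILWB02SCluster 0
root@MILWB02S # hagrp -modify AppService ClusterList -update MILWB03SCluster 1
root@MILWB02S # haconf -dump -makero

Afterwards, "hagrp -value AppService ClusterList" on both clusters should report the same association.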

 

 

3.    Why didn't cross-cluster failover happen when MILWB02S went down?

On MILWB03S, we observed that MILWB02SCluster had exited (not faulted):

2015/05/29 13:31:58 VCS NOTICE V-16-1-50514 Remote cluster 'MILWB02SCluster' has exited
2015/05/29 13:31:58 VCS INFO V-16-3-18309 (MILWB03S) Cluster MILWB02SCluster exited
2015/05/29 13:42:09 VCS ERROR V-16-3-18211 (MILWB03S) Cluster MILWB03SCluster lost heartbeat Icmp to cluster MILWB02SCluster

The same state transition was confirmed by "hasys -state":

# hasys -state
# System                    Attribute    Value
MILWB02SCluster:MILWB02S    SysState    EXITED
localclus:MILWB03S          SysState    RUNNING

Cross-cluster failover happens only in the case of a cluster FAULT. In this case, the cluster didn't fault; it EXITED. In the case of a cluster fault, the expected log message is:

9999/99/99 23:59:59 VCS CRITICAL V-16-1-50513 Remote cluster 'Xxxx' has faulted

As there was no cluster fault, there was no cross-cluster failover.
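
For reference, a quick check on the surviving node is the remote cluster state, which is what drives the [ClusterFailOverPolicy = Auto] decision (a sketch; exact columns may differ by VCS version):

# haclus -state
#Cluster            Attribute    Value
MILWB02SCluster     ClusState    EXITED
MILWB03SCluster     ClusState    RUNNING

Only a FAULTED ClusState for the remote cluster triggers the automatic cross-cluster failover.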

 

 

4.    Was AppService automatically switched over from MILWB03S to MILWB02S?


We verified all occasions when AppService went online on MILWB02S. Every time, it was a user-initiated action; AppService never automatically switched over from MILWB03S to MILWB02S.

# 1
2015/05/29 14:33:32 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService  MILWB02S  from localhost
.
.
.
2015/05/29 14:35:11 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 2
2015/05/29 15:49:30 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService  MILWB02S  from localhost
.
.
.
2015/05/29 15:50:58 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 3
2015/06/01 08:39:34 VCS INFO V-16-1-50135 User root fired command: hagrp -switch AppService  MILWB02S  MILWB02SCluster  from localhost
2015/06/01 08:39:34 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/01 08:42:00 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 4
2015/06/03 11:09:32 VCS INFO V-16-1-50135 User root fired command: hagrp -flush AppService  MILWB02S  0  from localhost
.
.
.
2015/06/03 11:10:58 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 5
2015/06/03 16:38:36 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/03 16:45:10 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 6
2015/06/04 10:44:50 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService  MILWB02S  from localhost.
.
.
.
2015/06/04 10:46:24 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 7
2015/06/09 09:38:40 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/09 09:41:07 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

# 8
2015/06/12 08:11:00 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
.
.
.
2015/06/12 08:13:25 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S

 

Hopefully we have addressed all your queries. Please let us know if any further assistance is needed.

Thanks & Regards,
Sunil Y


Hi Sunil,

Thank you very much for your hints. I understand now that the Solaris command 'halt' is not a good way to simulate a cluster fault: as you explained, the EXITED status doesn't mean a fault (I'm sorry for the misunderstanding). Now I'm going to find a command that simulates a real fault, so that the cluster status becomes FAULTED, and then wait for the node switch.

I'm also going to analyze the values of the ClusterList attribute.

Best Regards.

Tiziano


Yes, LLT/GAB runs in the kernel, so a graceful 'halt' is seen by VCS and the cluster shows as EXITED. Use "uadmin 2 1" instead: it is very quick and usually is not seen by VCS, so the other cluster will show it as FAULTED. Alternatively, you can power down the system from the ALOM.
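
A minimal test sequence, assuming MILWB02S is the current primary (run 'uadmin 2 1' only on a test box, as it drops the node abruptly):

root@MILWB02S # uadmin 2 1

Then, on the surviving node:

root@MILWB03S # haclus -state              (remote cluster should show FAULTED)
root@MILWB03S # hagrp -state AppService    (group should come online locally, per ClusterFailOverPolicy = Auto)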

 

Mike


Hi Mike,

I'm on vacation now. As soon as possible, I'll test the primary node failure with the 'uadmin' command and give you feedback.

Thanks in advance.

Tiziano


Hi Mike,

After my vacation ... I followed your hint to simulate a node fault, and this time I saw a FAULTED state instead of an EXITED state. Obvious to you, I'm sure; I'm sorry for my misunderstanding.

Thank you very much.

Tiziano