Forum Discussion

tgenova
Level 4
10 years ago

fault of primary node in a global cluster

Hi guys. I have a global cluster formed by two single-node clusters, replicating between each other in asynchronous mode. I wanted to simulate a fault of the primary node with the Solaris command 'halt'.

root@MILWB02S # hagrp -state AppService
#Group      Attribute  System                     Value
AppService  State      MILWB03SCluster:MILWB03S   |OFFLINE|
AppService  State      localclus:MILWB02S         |ONLINE|

After 'halt' on the primary node MILWB02S, we have:

root@MILWB03S # hagrp -state AppService
#Group      Attribute  System                     Value
AppService  State      MILWB02SCluster:MILWB02S   |OFFLINE|
AppService  State      localclus:MILWB03S         |OFFLINE|

root@MILWB03S # hasys -state
#System                    Attribute  Value
MILWB02SCluster:MILWB02S   SysState   EXITED
localclus:MILWB03S         SysState   RUNNING

root@MILWB03S # vradmin -g datadg repstatus datarvg
VxVM VVR vradmin INFO V-5-52-1205 Primary is unreachable or RDS has configuration error. Displayed status information is from Secondary and can be out-of-date.
Replicated Data Set: datarvg
Primary:
  Host name:          10.66.28.53
  RVG name:           datarvg
  DG name:            datadg
  RVG state:          enabled for I/O
  Data volumes:       1
  VSets:              0
  SRL name:           srl_vol
  SRL size:           1.00 G
  Total secondaries:  1
Secondary:
  Host name:          10.66.28.54
  RVG name:           datarvg
  DG name:            datadg
  Data status:        consistent, up-to-date
  Replication status: paused due to network disconnection
  Current mode:       asynchronous
  Logging to:         SRL (0 updates behind, last update ID 5730.50511)
  Timestamp Information: behind by 0h 0m 0s
  Last Update on Primary:     May 29 13:32:06
  Secondary up-to-date as of: May 29 13:32:06
Config Errors:
  10.66.28.53: Pri or Sec IP not available or vradmind not running, stale information

Is this situation correct?

I decided to manually start the service group (AppService) on the secondary node, because MILWB02S is down:

root@MILWB03S # hagrp -online -force AppService -sys MILWB03S
root@MILWB03S # hagrp -state AppService
#Group      Attribute  System                     Value
AppService  State      MILWB02SCluster:MILWB02S   |OFFLINE|
AppService  State      localclus:MILWB03S         |ONLINE|

root@MILWB03S # vradmin -g datadg repstatus datarvg
Replicated Data Set: datarvg
Primary:
  Host name:          10.66.28.54
  RVG name:           datarvg
  DG name:            datadg
  RVG state:          enabled for I/O
  Data volumes:       1
  VSets:              0
  SRL name:           srl_vol
  SRL size:           1.00 G
  Total secondaries:  1
Config Errors:
  10.66.28.53: Pri or Sec IP not available or vradmind not running

After quite some time, I booted the downed server (MILWB02S) back up, and I noticed an automatic switch of the service from MILWB03S to MILWB02S:

root@MILWB02S # hagrp -state AppService
#Group      Attribute  System                     Value
AppService  State      MILWB03SCluster:MILWB03S   |OFFLINE|
AppService  State      localclus:MILWB02S         |ONLINE|

root@MILWB02S # vradmin -g datadg repstatus datarvg
Replicated Data Set: datarvg
Primary:
  Host name:          10.66.28.53
  RVG name:           datarvg
  DG name:            datadg
  RVG state:          enabled for I/O
  Data volumes:       1
  VSets:              0
  SRL name:           srl_vol
  SRL size:           1.00 G
  Total secondaries:  1
Config Errors:
  10.66.28.54: Primary-Primary configuration

Is this situation correct? Why did the cluster switch the service?
  • Hi,

     

    We tried analyzing config and log files. Our findings are as below:

     

    1.    We observed your testing activities in the logs. For service-group failures, it appears to be working correctly.

    On MILWB02S:

    2015/05/29 11:19:15 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
    2015/05/29 11:42:55 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
    2015/05/29 12:46:29 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
    2015/06/04 11:09:51 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
    2015/06/09 10:41:39 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
    2015/06/09 14:19:44 VCS WARNING V-16-1-50911 Unable to fail over global group AppService in local cluster. Attempting to fail group over to a remote cluster [ClusterFailoverPolicy = Auto]
    

    On MILWB03S:

    2015/05/29 11:19:15 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
    2015/05/29 11:42:55 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
    2015/05/29 12:46:29 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
    2015/06/04 11:09:51 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
    2015/06/09 10:41:39 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
    2015/06/09 14:19:44 VCS INFO V-16-1-50925 Proceeding to online group AppService on the best possible system in the local cluster
    

     


    2.    There is no issue with ClusterFailoverPolicy; the value "ClusterFailoverPolicy = Auto" is the same across both clusters. However, ClusterList is not the same across the clusters.

    On MILWB02S 
        ClusterList = { MILWB03SCluster = 0, MILWB02SCluster = 1 }

    On MILWB03S 
        ClusterList = { MILWB02SCluster = 0, MILWB03SCluster = 1 }

    This value must also be the same across clusters; otherwise it can cause a concurrency violation during auto-start. The discrepancy hasn't caused any issue so far, but you should rectify it to avoid future problems. A quick way to check and align the value is sketched below.
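
    This is only a sketch; the priority ordering shown is illustrative, and the main.cf path assumes the default configuration directory. Use whatever ordering you actually intend, as long as both clusters carry the identical entry:

    # Run on a node in each cluster and compare the output:
    hagrp -value AppService ClusterList

    # Example of an aligned entry in main.cf on BOTH clusters
    # (the priorities shown here are illustrative only):
    #   ClusterList = { MILWB02SCluster = 0, MILWB03SCluster = 1 }

    # After editing main.cf, verify the configuration before restarting VCS:
    hacf -verify /etc/VRTSvcs/conf/config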

     

     

    3.    Why didn't cross-cluster failover happen when MILWB02S went down?

    On MILWB03S, we observed that MILWB02SCluster had exited (not faulted):

    2015/05/29 13:31:58 VCS NOTICE V-16-1-50514 Remote cluster 'MILWB02SCluster' has exited
    2015/05/29 13:31:58 VCS INFO V-16-3-18309 (MILWB03S) Cluster MILWB02SCluster exited
    2015/05/29 13:42:09 VCS ERROR V-16-3-18211 (MILWB03S) Cluster MILWB03SCluster lost heartbeat Icmp to cluster MILWB02SCluster

    The same state transition was confirmed by "hasys -state":

    # hasys -state
    # System                    Attribute    Value
    MILWB02SCluster:MILWB02S    SysState    EXITED
    localclus:MILWB03S          SysState    RUNNING

    Cross-cluster failover happens only in the case of a cluster FAULT. In this case the cluster didn't fault; it EXITED. In the case of a cluster fault, the expected log message is:

    9999/99/99 23:59:59 VCS CRITICAL V-16-1-50513 Remote cluster 'Xxxx' has faulted

    Since there was no cluster fault, there was no cross-cluster failover.
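
    As a quick way to see how VCS classified the remote cluster, you can query its state from the surviving node. This is a sketch; the exact output layout may differ between versions:

    # On MILWB03S:
    haclus -state

    # EXITED  -> the remote cluster announced a graceful stop (e.g. 'halt');
    #            no cross-cluster failover is triggered.
    # FAULTED -> the remote cluster was lost abruptly (heartbeats timed out);
    #            this is what triggers the failover governed by ClusterFailoverPolicy.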

     

     

    4.    Was the switchover of AppService from MILWB03S to MILWB02S automatic?


    We verified every occasion on which AppService went online on MILWB02S. Each time it was a user-initiated action; AppService never automatically switched over from MILWB03S to MILWB02S. The relevant log excerpts are listed below, and a way to extract them yourself is sketched after them.

    # 1
    2015/05/29 14:33:32 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService  MILWB02S  from localhost
    .
    .
    .
    2015/05/29 14:35:11 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 2
    2015/05/29 15:49:30 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService  MILWB02S  from localhost
    .
    .
    .
    2015/05/29 15:50:58 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 3
    2015/06/01 08:39:34 VCS INFO V-16-1-50135 User root fired command: hagrp -switch AppService  MILWB02S  MILWB02SCluster  from localhost
    2015/06/01 08:39:34 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
    .
    .
    .
    2015/06/01 08:42:00 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 4
    2015/06/03 11:09:32 VCS INFO V-16-1-50135 User root fired command: hagrp -flush AppService  MILWB02S  0  from localhost
    .
    .
    .
    2015/06/03 11:10:58 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 5
    2015/06/03 16:38:36 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
    .
    .
    .
    2015/06/03 16:45:10 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 6
    2015/06/04 10:44:50 VCS INFO V-16-1-50135 User root fired command: hagrp -online AppService  MILWB02S  from localhost.
    .
    .
    .
    2015/06/04 10:46:24 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 7
    2015/06/09 09:38:40 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
    .
    .
    .
    2015/06/09 09:41:07 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
    
    # 8
    2015/06/12 08:11:00 VCS INFO V-16-1-50803 Received request to switch group AppService from remote system MILWB03S to local system MILWB02S
    .
    .
    .
    2015/06/12 08:13:25 VCS NOTICE V-16-1-10447 Group AppService is online on system MILWB02S
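
    If you want to repeat this audit yourself, the user-fired commands and the subsequent online events can be pulled straight from the engine log. A minimal sketch, assuming the default log location /var/VRTSvcs/log/engine_A.log:

    # Every command a user fired against VCS (message ID V-16-1-50135):
    grep "V-16-1-50135" /var/VRTSvcs/log/engine_A.log

    # Every time AppService came online on a system (message ID V-16-1-10447):
    grep "V-16-1-10447" /var/VRTSvcs/log/engine_A.log | grep AppService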

     

    Hopefully, we have addressed all your queries. Please let us know if any further assistance is needed.

    Thanks & Regards,
    Sunil Y

13 Replies

  • Yes, LLT/GAB run in the kernel, so a 'halt' is seen by VCS and the other cluster shows the node as EXITED. Use "uadmin 2 1" - this is very quick and usually is not seen by VCS, so the other cluster will show it as FAULTED. Alternatively, you can power down the system from the ALOM.
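
    For the record, the two test variants would look like this (a sketch based on the behaviour described above):

    # Graceful stop: LLT/GAB announce the shutdown, the remote cluster sees EXITED
    halt

    # Abrupt stop: very quick, usually not seen by VCS, the remote cluster sees FAULTED
    uadmin 2 1

    # Alternatively, power the system off from the ALOM service processor.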

     

    Mike

  • Hi Mike.

    I'm on vacation at the moment. As soon as possible, I'll test the primary-node failure with the 'uadmin' command and give you feedback.

    thanks in advance.

    Tiziano

  • Hi Mike.

    After my vacation ... I used your hint to simulate a node fault and saw a FAULTED state instead of an EXITED state. Obvious to you, of course. I'm sorry for my misunderstanding.

    Thank you very much.

    Tiziano