Forum Discussion

kongzzzz
Level 3
10 years ago

Why the service group can't be taken online automatically after fixing split-brain

The title of this thread was changed from "what's means of "restart mode"" to "Why the service group can't be taken online automatically after fixing split-brain".

------------------------------------------------


We are running VCS 6.0.2 on RHEL 6.5. Our cluster configuration is as follows:

Heartbeat links: eth3, eth4
Low-priority heartbeat link: not enabled
Fencing: not enabled
The cluster contains 2 servers: jarry-crf1, jarry-crf2.

Service groups:
   TestGrp1 contains a "FileOnOff" resource and is a parallel group.
   TestGrp2 contains a "FileOnOff" resource, depends on TestGrp1, and is a failover group.

Test steps:

1. Take TestGrp1 online on both servers and take TestGrp2 online on server "jarry-crf2".
2. Stop both heartbeat links on server "jarry-crf1" with the command "ifdown eth3; sleep 60; ifdown eth4".
3. Restore the heartbeat links with the command "ifup eth3; ifup eth4".

We then found that the "had" process had been restarted on server "jarry-crf2", and we found the following messages in engine_A.log:

2015/02/10 00:48:43 VCS NOTICE V-16-1-10433 Group TestGrp2 will not start automatically on System jarry-crf2 as the system is in restart mode.
2015/02/10 00:48:43 VCS NOTICE V-16-1-10433 Group TestGrp1 will not start automatically on System jarry-crf2 as the system is in restart mode.
2015/02/10 00:48:43 VCS NOTICE V-16-1-10445 Group TestGrp1 will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.
2015/02/10 00:48:47 VCS NOTICE V-16-1-10433 Group VCShmg will not start automatically on System jarry-crf2 as the system is in restart mode.
2015/02/10 00:48:47 VCS NOTICE V-16-1-10445 Group VCShmg will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.


My questions:

1. Why is the "had" process restarted after the heartbeat links are restored?
2. What does "restart mode" mean, and how can we get the service groups out of "restart mode" so that they start automatically?

 

Thanks in advance!

  • Hello,

    Thanks for the detailed information, this makes things much clearer.

    So your question is: once VCS has recovered, ideally it should start the resource and the group again, since a clean close of the process was called earlier.

    I came across the technote below, which says that VCS will not bring resources online when HAD is restarted by the "hashadow" process. This leads me to believe that this is the default behaviour of VCS.

    http://www.symantec.com/docs/HOWTO79931

    However, there are a couple of things I can suggest:

    1. See if you can mark that resource as "critical" and check whether that makes any difference to the group's behaviour (this is just a test).

    2. To solve this problem, as suggested before, you can use preonline triggers, which let you run your own scripts.
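
As a rough illustration of the kind of helper script that could assist here (my own sketch, not something taken from the technote): it scans engine_A.log lines for the V-16-1-10433 "restart mode" notice quoted in this thread and prints a matching `hagrp -online <group> -sys <system>` command for each skipped group. The function names `skipped_groups` and `online_commands` are hypothetical, and the script only prints commands; it does not run them or look at group dependencies.

```python
import re

# Pattern for the V-16-1-10433 notice as it appears in the logs quoted in this thread.
RESTART_MODE = re.compile(
    r"V-16-1-10433 Group (\S+) will not start automatically "
    r"on System (\S+) as the system is in restart mode"
)

def skipped_groups(log_lines):
    """Return unique (group, system) pairs that were not auto-started, in log order."""
    pairs = []
    for line in log_lines:
        match = RESTART_MODE.search(line)
        if match and match.groups() not in pairs:
            pairs.append(match.groups())
    return pairs

def online_commands(pairs):
    """Build one `hagrp -online` command line per skipped (group, system) pair."""
    return [f"hagrp -online {group} -sys {system}" for group, system in pairs]

# Sample lines copied from the engine_A.log excerpt in the original post.
sample = [
    "2015/02/10 00:48:43 VCS NOTICE V-16-1-10433 Group TestGrp2 will not start automatically on System jarry-crf2 as the system is in restart mode.",
    "2015/02/10 00:48:43 VCS NOTICE V-16-1-10433 Group TestGrp1 will not start automatically on System jarry-crf2 as the system is in restart mode.",
    "2015/02/10 00:48:43 VCS NOTICE V-16-1-10445 Group TestGrp1 will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.",
]

for command in online_commands(skipped_groups(sample)):
    print(command)  # review before running; nothing is executed here
```

Since TestGrp2 depends on TestGrp1, you would want to review the order of the printed commands before using them.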

     

    G

  • Hello Gaurav,

    Now we know this is the default behaviour of VCS.

    For suggestion 1, our application resource is already marked as "critical", so what we see is already the behaviour of a critical resource.

    For suggestion 2, we will try it.

    Thank you for your help!

     

     

  • Hello Gaurav

    Maybe the FileOnOff agent makes this case confusing, so let me introduce our real scenario and try to describe my case more clearly.

    We have an application named "CRF" running on the servers. We configured one service group, "CrfGrp", with one resource for our application "CRF".

    The configuration in main.cf is:

    group CrfGrp (
            SystemList = { jarry-crf1 = 0, jarry-crf2 = 1 }
            Parallel = 1
            AutoStartList = { jarry-crf1, jarry-crf2 }
            )
    
            CrfMonitor CrfRes (
                    )

                    
    We implemented the following script-based entry points for CrfRes:

    online: start "CRF"
    offline: stop "CRF"
    monitor: use "ps" command to check whether "CRF" is running
    clean: stop "CRF"
    close: stop "CRF"
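
To make the monitor entry point a bit more concrete, here is a minimal sketch of a ps-based check. It is only an illustration and not our actual agent script: the helper names are hypothetical, Python is used just for readability, and the exit codes (100 for offline, 110 for online) are what I understand the VCS script-agent convention to be, so please verify them against the agent developer documentation.

```python
import subprocess

# Assumed VCS script-agent convention: monitor exits 100 for offline, 110 for online.
OFFLINE, ONLINE = 100, 110

def monitor_exit_code(process_names, target="CRF"):
    """Return the monitor exit code given the command names of running processes."""
    return ONLINE if target in process_names else OFFLINE

def running_process_names():
    """List the command names of all running processes using `ps`."""
    result = subprocess.run(["ps", "-e", "-o", "comm="],
                            capture_output=True, text=True, check=True)
    return [name.strip() for name in result.stdout.splitlines()]

# Demonstration with fixed lists, so the outcome does not depend on the local machine.
print(monitor_exit_code(["bash", "CRF"]))   # CRF present
print(monitor_exit_code(["bash", "sshd"]))  # CRF absent
```

A real monitor script would end with something like `sys.exit(monitor_exit_code(running_process_names()))` so that the agent framework receives the exit code.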

     

    Test steps:

    1. Execute the command "hastatus -sum". The result is:

    # hastatus -sum
    
    -- SYSTEM STATE
    -- System               State                Frozen              
    
    A  jarry-crf1           RUNNING              0                    
    A  jarry-crf2           RUNNING              0                    
    
    -- GROUP STATE
    -- Group           System               Probed     AutoDisabled    State          
    
    B  ClusterService  jarry-crf1           Y          N               ONLINE         
    B  ClusterService  jarry-crf2           Y          N               OFFLINE        
    B  CrfGrp          jarry-crf1           Y          N               ONLINE         
    B  CrfGrp          jarry-crf2           Y          N               ONLINE 


    2. On server "jarry-crf1", execute the command "ifdown eth3; sleep 60; ifdown eth4; sleep 60;ifup eth3; ifup eth4".

    3. We can observe that the "close" entry point of CrfRes is called at 01:59:33, so the application "CRF" is stopped by VCS at that point.

    4. Several minutes later, HAD has restarted completely, but the state of "CrfGrp" on "jarry-crf2" is still OFFLINE.


    In summary, there are 3 phases:

    before split-brain: CrfGrp is ONLINE on jarry-crf2
    during split-brain: CrfGrp is ONLINE on jarry-crf2
    split-brain fixed:  CrfGrp is OFFLINE on jarry-crf2
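
The before/after comparison can also be done mechanically. The following is a minimal sketch that parses the GROUP STATE rows (the lines starting with "B") of two "hastatus -sum" outputs and reports which groups went from ONLINE to OFFLINE. The function names are hypothetical, and the sample text is trimmed from the outputs in this post.

```python
def group_states(hastatus_output):
    """Map (group, system) -> state from the 'B' rows of `hastatus -sum` output."""
    states = {}
    for line in hastatus_output.splitlines():
        fields = line.split()
        # Group rows look like: B <group> <system> <probed> <autodisabled> <state>
        if len(fields) == 6 and fields[0] == "B":
            _, group, system, _probed, _autodisabled, state = fields
            states[(group, system)] = state
    return states

def went_offline(before, after):
    """Return sorted (group, system) pairs that were ONLINE before and OFFLINE after."""
    old, new = group_states(before), group_states(after)
    return sorted(key for key, state in old.items()
                  if state == "ONLINE" and new.get(key) == "OFFLINE")

before = """
-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State
B  ClusterService  jarry-crf1           Y          N               ONLINE
B  ClusterService  jarry-crf2           Y          N               OFFLINE
B  CrfGrp          jarry-crf1           Y          N               ONLINE
B  CrfGrp          jarry-crf2           Y          N               ONLINE
"""

after = """
-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State
B  ClusterService  jarry-crf1           Y          N               ONLINE
B  ClusterService  jarry-crf2           Y          N               OFFLINE
B  CrfGrp          jarry-crf1           Y          N               ONLINE
B  CrfGrp          jarry-crf2           Y          N               OFFLINE
"""

print(went_offline(before, after))  # [('CrfGrp', 'jarry-crf2')]
```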


    Our question: why is "CrfGrp" not taken online automatically on jarry-crf2?


    The related service group status, resource status, and logs are listed below (captured after HAD had restarted completely):

    # hastatus -sum
    
    -- SYSTEM STATE
    -- System               State                Frozen              
    
    A  jarry-crf1           RUNNING              0                    
    A  jarry-crf2           RUNNING              0                    
    
    -- GROUP STATE
    -- Group           System               Probed     AutoDisabled    State          
    
    B  ClusterService  jarry-crf1           Y          N               ONLINE         
    B  ClusterService  jarry-crf2           Y          N               OFFLINE        
    B  CrfGrp          jarry-crf1           Y          N               ONLINE         
    B  CrfGrp          jarry-crf2           Y          N               OFFLINE
    
    
    # hagrp -display CrfGrp
    #Group       Attribute             System     Value
    CrfGrp       AdministratorGroups   global     
    CrfGrp       Administrators        global     
    CrfGrp       Authority             global     0
    CrfGrp       AutoFailOver          global     1
    CrfGrp       AutoRestart           global     1
    CrfGrp       AutoStart             global     1
    CrfGrp       AutoStartIfPartial    global     1
    CrfGrp       AutoStartList         global     jarry-crf1    jarry-crf2
    CrfGrp       AutoStartPolicy       global     Order
    CrfGrp       ClusterFailOverPolicy global     Manual
    CrfGrp       ClusterList           global     
    CrfGrp       ContainerInfo         global     
    CrfGrp       DisableFaultMessages  global     0
    CrfGrp       Evacuate              global     1
    CrfGrp       ExtMonApp             global     
    CrfGrp       ExtMonArgs            global     
    CrfGrp       FailOverPolicy        global     Priority
    CrfGrp       FaultPropagation      global     1
    CrfGrp       Frozen                global     0
    CrfGrp       GroupOwner            global     
    CrfGrp       GroupRecipients       global     
    CrfGrp       Guests                global     
    CrfGrp       Load                  global     0
    CrfGrp       ManageFaults          global     ALL
    CrfGrp       ManualOps             global     1
    CrfGrp       OnlineClearParent     global     0
    CrfGrp       OnlineRetryInterval   global     0
    CrfGrp       OnlineRetryLimit      global     0
    CrfGrp       OperatorGroups        global     
    CrfGrp       Operators             global     
    CrfGrp       Parallel              global     1
    CrfGrp       PreOnline             global     0
    CrfGrp       PreOnlining           global     0
    CrfGrp       PreSwitch             global     0
    CrfGrp       PreSwitching          global     0
    CrfGrp       PreonlineTimeout      global     300
    CrfGrp       Prerequisites         global     
    CrfGrp       PrintTree             global     1
    CrfGrp       Priority              global     0
    CrfGrp       ProPCV                global     0
    CrfGrp       SourceFile            global     ./main.cf
    CrfGrp       SysDownPolicy         global     
    CrfGrp       SystemList            global     jarry-crf1    0    jarry-crf2    1
    CrfGrp       SystemZones           global     
    CrfGrp       TFrozen               global     0
    CrfGrp       Tag                   global     
    CrfGrp       TriggerEvent          global     1
    CrfGrp       TriggerPath           global     
    CrfGrp       TriggerResFault       global     1
    CrfGrp       TriggerResRestart     global     0
    CrfGrp       TriggerResStateChange global     0
    CrfGrp       TriggersEnabled       global     
    CrfGrp       TypeDependencies      global     
    CrfGrp       UserAssoc             global     
    CrfGrp       UserIntGlobal         global     0
    CrfGrp       UserStrGlobal         global     
    CrfGrp       AutoDisabled          jarry-crf1 0
    CrfGrp       AutoDisabled          jarry-crf2 0
    CrfGrp       Enabled               jarry-crf1 1
    CrfGrp       Enabled               jarry-crf2 1
    CrfGrp       IntentOnline          jarry-crf1 1
    CrfGrp       IntentOnline          jarry-crf2 0
    CrfGrp       NumRetries            jarry-crf1 0
    CrfGrp       NumRetries            jarry-crf2 0
    CrfGrp       PCVAllowOnline        jarry-crf1 1
    CrfGrp       PCVAllowOnline        jarry-crf2 1
    CrfGrp       Probed                jarry-crf1 1
    CrfGrp       Probed                jarry-crf2 1
    CrfGrp       ProbesPending         jarry-crf1 0
    CrfGrp       ProbesPending         jarry-crf2 0
    CrfGrp       Restart               jarry-crf1 0
    CrfGrp       Restart               jarry-crf2 0
    CrfGrp       State                 jarry-crf1 |ONLINE|
    CrfGrp       State                 jarry-crf2 |OFFLINE|
    CrfGrp       UserIntLocal          jarry-crf1 0
    CrfGrp       UserIntLocal          jarry-crf2 0
    CrfGrp       UserStrLocal          jarry-crf1 
    CrfGrp       UserStrLocal          jarry-crf2 
    CrfGrp       VCSi3Info             jarry-crf1 
    CrfGrp       VCSi3Info             jarry-crf2 
    
    
    # hares -display CrfRes
    #Resource    Attribute             System     Value
    CrfRes       Group                 global     CrfGrp
    CrfRes       Type                  global     CrfMonitor
    CrfRes       AutoStart             global     1
    CrfRes       Critical              global     1
    CrfRes       Enabled               global     1
    CrfRes       LastOnline            global     jarry-crf2
    CrfRes       MonitorOnly           global     0
    CrfRes       ResourceOwner         global     
    CrfRes       TriggerEvent          global     0
    CrfRes       ArgListValues         jarry-crf1 ""
    CrfRes       ArgListValues         jarry-crf2 ""
    CrfRes       ConfidenceLevel       jarry-crf1 100
    CrfRes       ConfidenceLevel       jarry-crf2 0
    CrfRes       ConfidenceMsg         jarry-crf1 
    CrfRes       ConfidenceMsg         jarry-crf2 
    CrfRes       Flags                 jarry-crf1 
    CrfRes       Flags                 jarry-crf2 
    CrfRes       IState                jarry-crf1 not waiting
    CrfRes       IState                jarry-crf2 not waiting
    CrfRes       MonitorMethod         jarry-crf1 Traditional
    CrfRes       MonitorMethod         jarry-crf2 Traditional
    CrfRes       Probed                jarry-crf1 1
    CrfRes       Probed                jarry-crf2 1
    CrfRes       Start                 jarry-crf1 1
    CrfRes       Start                 jarry-crf2 0
    CrfRes       State                 jarry-crf1 ONLINE
    CrfRes       State                 jarry-crf2 OFFLINE
    CrfRes       ComputeStats          global     0
    CrfRes       ContainerInfo         global     Type        Name        Enabled    
    CrfRes       ResContainerInfo      global     Type        Name        Enabled    
    CrfRes       ResourceRecipients    global         
    CrfRes       TriggerPath           global     
    CrfRes       TriggerResRestart     global     0
    CrfRes       TriggerResStateChange global     0
    CrfRes       TriggersEnabled       global     
    CrfRes       dummy                 global     
    CrfRes       MonitorTimeStats      jarry-crf1 Avg    0    TS    
    CrfRes       MonitorTimeStats      jarry-crf2 Avg    0    TS    
    CrfRes       ResourceInfo          jarry-crf1 State    Valid    Msg        TS    
    CrfRes       ResourceInfo          jarry-crf2 State    Valid    Msg        TS    
    
    
    
    # hasys -display jarry-crf2
    #System    Attribute          Value
    jarry-crf2 AgentsStopped      0
    jarry-crf2 AvailableCapacity  100
    jarry-crf2 CPUThresholdLevel  Critical    90    Warning    80    Note    70    Info    60
    jarry-crf2 CPUUsage           0
    jarry-crf2 CPUUsageMonitoring Enabled    0    ActionThreshold    0    ActionTimeLimit    0    Action    NONE    NotifyThreshold    0    NotifyTimeLimit    0
    jarry-crf2 Capacity           100
    jarry-crf2 ConfigBlockCount   299
    jarry-crf2 ConfigCheckSum     42524
    jarry-crf2 ConfigDiskState    CURRENT
    jarry-crf2 ConfigFile         /etc/VRTSvcs/conf/config
    jarry-crf2 ConfigInfoCnt      0
    jarry-crf2 ConfigModDate      Thu 12 Feb 2015 02:19:30 AM EST
    jarry-crf2 ConnectorState     Down
    jarry-crf2 CurrentLimits      
    jarry-crf2 DiskHbStatus       
    jarry-crf2 DynamicLoad        0
    jarry-crf2 EngineRestarted    0
    jarry-crf2 EngineVersion      6.0.30.0
    jarry-crf2 FencingWeight      0
    jarry-crf2 Frozen             0
    jarry-crf2 GUIIPAddr          
    jarry-crf2 HostUtilization    CPU    0    Swap    0
    jarry-crf2 LLTNodeId          1
    jarry-crf2 LicenseType        PERMANENT_SITE
    jarry-crf2 Limits             
    jarry-crf2 LinkHbStatus       eth3    UP    eth4    UP
    jarry-crf2 LoadTimeCounter    0
    jarry-crf2 LoadTimeThreshold  600
    jarry-crf2 LoadWarningLevel   80
    jarry-crf2 NoAutoDisable      0
    jarry-crf2 NodeId             1
    jarry-crf2 OnGrpCnt           1
    jarry-crf2 PhysicalServer     
    jarry-crf2 ShutdownTimeout    600
    jarry-crf2 SourceFile         ./main.cf
    jarry-crf2 SwapThresholdLevel Critical    90    Warning    80    Note    70    Info    60
    jarry-crf2 SysInfo            Linux:jarry-crf2,#1 SMP Sun Jul 27 15:55:46 EDT 2014,2.6.32-431.29.2.el6.x86_64,x86_64
    jarry-crf2 SysName            jarry-crf2
    jarry-crf2 SysState           RUNNING
    jarry-crf2 SystemLocation     
    jarry-crf2 SystemOwner        
    jarry-crf2 SystemRecipients   
    jarry-crf2 TFrozen            0
    jarry-crf2 TRSE               0
    jarry-crf2 UpDownState        Up
    jarry-crf2 UserInt            0
    jarry-crf2 UserStr            
    jarry-crf2 VCSFeatures        NONE
    jarry-crf2 VCSMode            VCS
    
    
    
    2015/02/12 01:59:33 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth3, DOWN, eth4, DOWN; Current status =eth3, UP, eth4, UP.
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11022 VCS engine (had) started
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11027 VCS engine startup arguments=-restart 
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11050 VCS engine version=6.0
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11051 VCS engine join version=6.0.30.0
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11052 VCS engine pstamp=6.0.300.000-GA-2013-01-10-16.00.01
    2015/02/12 01:59:43 VCS NOTICE V-16-1-10114 Opening GAB library
    2015/02/12 01:59:43 VCS NOTICE V-16-1-10619 'HAD' starting on: jarry-crf2
    2015/02/12 01:59:43 VCS INFO V-16-1-10196 Cluster logger started
    2015/02/12 01:59:43 VCS INFO V-16-1-10125 GAB timeout set to 30000 ms
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11057 GAB registration monitoring timeout set to 200000 ms
    2015/02/12 01:59:43 VCS NOTICE V-16-1-11059 GAB registration monitoring action set to log system message
    2015/02/12 01:59:43 VCS INFO V-16-1-53504 VCS Engine Alive message!!
    2015/02/12 01:59:48 VCS INFO V-16-1-10077 Received new cluster membership
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10112 System (jarry-crf2) - Membership: 0x3, DDNA: 0x0
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System  (Node '0') changed state from UNKNOWN to INITING
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10086 System  (Node '0') is in Regular Membership - Membership: 0x3
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10086 System jarry-crf2 (Node '1') is in Regular Membership - Membership: 0x3
    2015/02/12 01:59:48 VCS WARNING V-16-1-50129 Operation 'haclus -modify' rejected as the node is in CURRENT_DISCOVER_WAIT state
    2015/02/12 01:59:48 VCS WARNING V-16-1-50129 Operation 'haclus -modify' rejected as the node is in CURRENT_DISCOVER_WAIT state
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10453 Node: 0 changed name from: '' to: 'jarry-crf1'
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System jarry-crf1 (Node '0') changed state from INITING to RUNNING
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System jarry-crf2 (Node '1') changed state from CURRENT_DISCOVER_WAIT to REMOTE_BUILD
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10464 Requesting snapshot from node: 0
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10465 Getting snapshot.  snapped_membership: 0x3 current_membership: 0x3 current_jeopardy_membership: 0x0
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10181 Group ClusterService AutoRestart set to 1
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10181 Group CrfGrp AutoRestart set to 1
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10181 Group VCShmg AutoRestart set to 1
    2015/02/12 01:59:48 VCS INFO V-16-1-10466 End of snapshot received from node: 0.  snapped_membership: 0x3 current_membership: 0x3 current_jeopardy_membership: 0x0
    2015/02/12 01:59:48 VCS WARNING V-16-1-10030 UseFence=NONE. Hence do not need fencing
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10467 Replaying broadcast queue. snapped_membership: 0x3 current_membership: 0x3 current_jeopardy_membership: 0x0
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System jarry-crf2 (Node '1') changed state from REMOTE_BUILD to RUNNING
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/CrfMonitor/CrfMonitorAgent for resource type CrfMonitor successfully started at Thu Feb 12 01:59:48 2015
    
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/NIC/NICAgent for resource type NIC successfully started at Thu Feb 12 01:59:48 2015
    
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/NotifierMngr/NotifierMngrAgent for resource type NotifierMngr successfully started at Thu Feb 12 01:59:48 2015
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/HostMonitor for resource type HostMonitor successfully started at Thu Feb 12 01:59:48 2015
    
    2015/02/12 01:59:48 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =UNKNOWN; Current status =eth3, UP, eth4, UP.
    2015/02/12 01:59:48 VCS INFO V-16-6-15015 (jarry-crf2) hatrigger:/opt/VRTSvcs/bin/triggers/sysjoin is not a trigger scripts directory or can not be executed
    2015/02/12 01:59:48 VCS INFO V-16-1-10297 Resource ntfr (Owner: Unspecified, Group: ClusterService) is online on jarry-crf2 (First probe)
    2015/02/12 01:59:48 VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group ClusterService
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group ClusterService on all nodes
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10447 Group ClusterService is online on system jarry-crf2
    2015/02/12 01:59:48 VCS WARNING V-16-6-15034 (jarry-crf2) violation:Offlining group ClusterService on system jarry-crf2
    2015/02/12 01:59:48 VCS INFO V-16-1-50135 User root fired command: hagrp -offline -force ClusterService  jarry-crf2  from localhost
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10167 Initiating manual offline of group ClusterService on system jarry-crf2
    2015/02/12 01:59:48 VCS NOTICE V-16-1-10300 Initiating Offline of Resource ntfr (Owner: Unspecified, Group: ClusterService) on System jarry-crf2
    2015/02/12 01:59:48 VCS INFO V-16-6-15002 (jarry-crf2) hatrigger:hatrigger executed /opt/VRTSvcs/bin/internal_triggers/violation jarry-crf2 ClusterService   successfully
    2015/02/12 01:59:49 VCS INFO V-16-1-10305 Resource ntfr (Owner: Unspecified, Group: ClusterService) is offline on jarry-crf2 (VCS initiated)
    2015/02/12 01:59:49 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system jarry-crf2
    2015/02/12 01:59:49 VCS NOTICE V-16-1-10438 Group ClusterService has been probed on system jarry-crf2
    2015/02/12 01:59:49 VCS NOTICE V-16-1-10433 Group ClusterService will not start automatically on System jarry-crf2 as the system is in restart mode.
    2015/02/12 01:59:49 VCS ERROR V-16-10031-10001 (jarry-crf2) CrfMonitor:CrfMonitorRes:monitor:Failed to get CRF monitor's PID.
    2015/02/12 01:59:50 VCS INFO V-16-1-10304 Resource CrfRes (Owner: Unspecified, Group: CrfGrp) is offline on jarry-crf2 (First probe)
    2015/02/12 01:59:50 VCS NOTICE V-16-1-10438 Group CrfGrp has been probed on system jarry-crf2
    2015/02/12 01:59:50 VCS NOTICE V-16-1-10433 Group CrfGrp will not start automatically on System jarry-crf2 as the system is in restart mode.
    2015/02/12 01:59:50 VCS NOTICE V-16-1-10445 Group CrfGrp will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.
    2015/02/12 01:59:52 VCS NOTICE V-16-1-10438 Group VCShmg has been probed on system jarry-crf2
    2015/02/12 01:59:52 VCS NOTICE V-16-1-10433 Group VCShmg will not start automatically on System jarry-crf2 as the system is in restart mode.
    2015/02/12 01:59:52 VCS NOTICE V-16-1-10445 Group VCShmg will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.
  • Hello,

    Thanks for defining the case clearly.

    Well, I would say VCS is still behaving as expected here. FileOnOff is a testing agent and has one required attribute. What I understand from your case is that the required attribute is defined in the VCS config, but you deleted the flag file manually. That means VCS would have tried to bring the resource online but could not find the file, because it had been deleted manually. To confirm this theory, engine_A.log for the above test case should show that VCS did try to bring the resource online but couldn't find the file.

    To overcome this situation, I would suggest planning the use of triggers, especially the "preonline" trigger. In this trigger, you would create the test file before the group is brought online.

    More info about the preonline trigger can be found here:

    https://sort.symantec.com/public/documents/sfha/6.1/solaris/productguides/html/vcs_admin/ch14s03s09.htm

    Hope this helps

    G

     


  • Hi Gaurav

    Thanks for your support!

    There is neither a coordinator disk nor a coordination point server in our product deployment, so we want to test how VCS behaves when both heartbeat links are broken.

    Today we simplified the test scenario: only one parallel group, "TestGrp1", is used.

    The test steps are:

    1. Execute the command "hastatus -sum". The result is:

    # hastatus -sum
    
    -- SYSTEM STATE
    -- System               State                Frozen              
    
    A  jarry-crf1           RUNNING              0                    
    A  jarry-crf2           RUNNING              0                    
    
    -- GROUP STATE
    -- Group           System               Probed     AutoDisabled    State          
    
    B  ClusterService  jarry-crf1           Y          N               ONLINE         
    B  ClusterService  jarry-crf2           Y          N               OFFLINE        
    B  TestGrp1        jarry-crf1           Y          N               ONLINE         
    B  TestGrp1        jarry-crf2           Y          N               ONLINE 


    2. On server "jarry-crf1", execute the command "ifdown eth3; sleep 60; ifdown eth4; sleep 60;ifup eth3; ifup eth4".

    3. On server "jarry-crf2", while the "HAD" process is being restarted (after had shuts down and before it starts again), we delete the flag file of FileOnOff to take the resource offline intentionally.

    4. Wait several minutes, then execute the command "hastatus -sum". The result is:

    # hastatus -sum
    
    -- SYSTEM STATE
    -- System               State                Frozen              
    
    A  jarry-crf1           RUNNING              0                    
    A  jarry-crf2           RUNNING              0                    
    
    -- GROUP STATE
    -- Group           System               Probed     AutoDisabled    State          
    
    B  ClusterService  jarry-crf1           Y          N               ONLINE         
    B  ClusterService  jarry-crf2           Y          N               OFFLINE        
    B  TestGrp1        jarry-crf1           Y          N               ONLINE         
    B  TestGrp1        jarry-crf2           Y          N               OFFLINE 


    "TestGrp1" on "jarry-crf2" is still OFFLINE. We want to know why TestGrp1 can't be taken online automatically.

    The reason we delete the FileOnOff flag file while HAD is being restarted is this: our product deployment implements the "close" entry point. When the HAD process exits, the "close" entry point is called and our application exits along with HAD. After HAD has restarted, our application is not taken online automatically. Deleting the FileOnOff flag file simulates our application's behaviour.

     

     

  • Hello,

    You shut the heartbeat links down for 60 sec, and the timeouts for LLT and GAB are shorter than that (15 sec for LLT and another 15 for GAB). In the logs, you should see messages about LLT ticks timing out, followed by GAB timing out.

    That is why the HAD process got restarted. I presume these logs are from the time when the HAD process was restarting. The node that was healthy would generate these messages because the group would have landed in the "autodisabled" state once LLT/GAB/HAD went down on node 2, which is why you get the above messages. You can confirm the AutoDisabled flag in the output of "hastatus -sum".
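
To spell out the arithmetic, here is a tiny sketch. The two 15-second values are simply the figures I quoted above, hard-coded as assumptions; the actual timeouts on your cluster depend on its LLT and GAB tunables, so check those before relying on the numbers.

```python
# Figures quoted in this reply; they are not read from a live cluster.
LLT_TIMEOUT_SEC = 15
GAB_TIMEOUT_SEC = 15

def outage_exceeds_timeouts(outage_sec, llt=LLT_TIMEOUT_SEC, gab=GAB_TIMEOUT_SEC):
    """True if the links were down longer than the combined LLT and GAB timeouts."""
    return outage_sec > llt + gab

print(outage_exceeds_timeouts(60))  # True: 60 sec is longer than 15 + 15 = 30 sec
print(outage_exceeds_timeouts(20))  # False: 20 sec is shorter than 30 sec
```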

    Now the question is:

    After waiting for some time, do you still see the group not coming online (does it stay in the autodisabled state)? Or does the group come online after a while, once both nodes start talking to each other again?

    What exactly is the objective of the testing? Do you want to see failover or dependency behaviour?

    Also, a recommendation: in production clusters, always use I/O fencing for data protection.

     

    G