Hello Gaurav
Perhaps the FileOnOff agent made this case confusing, so let me introduce the real scenario and try to describe my case more clearly.
We have an application named "CRF" running on the server. We configured one service group, "CrfGrp", with one resource for our application "CRF".
The configuration in main.cf is:
group CrfGrp (
    SystemList = { jarry-crf1 = 0, jarry-crf2 = 1 }
    Parallel = 1
    AutoStartList = { jarry-crf1, jarry-crf2 }
    )

    CrfMonitor CrfRes (
        )
We implemented the following script-based entry points for CrfRes:
online: start "CRF"
offline: stop "CRF"
monitor: use "ps" command to check whether "CRF" is running
clean: stop "CRF"
close: stop "CRF"
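As an illustration only, a "ps"-based monitor entry point of this kind could look roughly like the sketch below. This is not our exact script: the function name crf_monitor_code and the process name "CRF" are assumptions made for the example. It relies on the usual convention for script-based agents that the monitor exit code 100 means offline and 110 means online.

```shell
#!/bin/sh
# Sketch of a script-based monitor entry point (illustrative, not the real
# CrfMonitor script). Script agents report state through the exit code:
#   100 = resource offline, 110 = resource online (full confidence).

# crf_monitor_code NAME
# Reads one command name per line on stdin (for example from
# "ps -e -o comm=") and prints 110 if NAME is present, otherwise 100.
crf_monitor_code() {
    if grep -qxF -- "$1"; then
        echo 110
    else
        echo 100
    fi
}

# In a real entry point this would be wired up roughly as:
#   exit "$(ps -e -o comm= | crf_monitor_code CRF)"
```

Keeping the decision in a function that reads the process list from stdin makes it possible to check the logic without the application running.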
Start testing:
1. Execute the command "hastatus -sum". The result is:
# hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A jarry-crf1 RUNNING 0
A jarry-crf2 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService jarry-crf1 Y N ONLINE
B ClusterService jarry-crf2 Y N OFFLINE
B CrfGrp jarry-crf1 Y N ONLINE
B CrfGrp jarry-crf2 Y N ONLINE
2. On server "jarry-crf1", execute the command "ifdown eth3; sleep 60; ifdown eth4; sleep 60; ifup eth3; ifup eth4".
3. We can observe that the "close" entry point of CrfRes is called at 01:59:33, so the application "CRF" is stopped by VCS at that time.
4. Several minutes later, after HAD has restarted completely, the status of "CrfGrp" on "jarry-crf2" is still OFFLINE.
To summarize, there are three phases:
Before the split-brain: CrfGrp is ONLINE on jarry-crf2
During the split-brain: CrfGrp is ONLINE on jarry-crf2
After the split-brain is resolved: CrfGrp is OFFLINE on jarry-crf2
Our question: why is "CrfGrp" not brought online automatically on jarry-crf2?
The related service group status, resource status, and logs are listed below (captured after HAD had restarted completely):
# hastatus -sum
-- SYSTEM STATE
-- System State Frozen
A jarry-crf1 RUNNING 0
A jarry-crf2 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService jarry-crf1 Y N ONLINE
B ClusterService jarry-crf2 Y N OFFLINE
B CrfGrp jarry-crf1 Y N ONLINE
B CrfGrp jarry-crf2 Y N OFFLINE
# hagrp -display CrfGrp
#Group Attribute System Value
CrfGrp AdministratorGroups global
CrfGrp Administrators global
CrfGrp Authority global 0
CrfGrp AutoFailOver global 1
CrfGrp AutoRestart global 1
CrfGrp AutoStart global 1
CrfGrp AutoStartIfPartial global 1
CrfGrp AutoStartList global jarry-crf1 jarry-crf2
CrfGrp AutoStartPolicy global Order
CrfGrp ClusterFailOverPolicy global Manual
CrfGrp ClusterList global
CrfGrp ContainerInfo global
CrfGrp DisableFaultMessages global 0
CrfGrp Evacuate global 1
CrfGrp ExtMonApp global
CrfGrp ExtMonArgs global
CrfGrp FailOverPolicy global Priority
CrfGrp FaultPropagation global 1
CrfGrp Frozen global 0
CrfGrp GroupOwner global
CrfGrp GroupRecipients global
CrfGrp Guests global
CrfGrp Load global 0
CrfGrp ManageFaults global ALL
CrfGrp ManualOps global 1
CrfGrp OnlineClearParent global 0
CrfGrp OnlineRetryInterval global 0
CrfGrp OnlineRetryLimit global 0
CrfGrp OperatorGroups global
CrfGrp Operators global
CrfGrp Parallel global 1
CrfGrp PreOnline global 0
CrfGrp PreOnlining global 0
CrfGrp PreSwitch global 0
CrfGrp PreSwitching global 0
CrfGrp PreonlineTimeout global 300
CrfGrp Prerequisites global
CrfGrp PrintTree global 1
CrfGrp Priority global 0
CrfGrp ProPCV global 0
CrfGrp SourceFile global ./main.cf
CrfGrp SysDownPolicy global
CrfGrp SystemList global jarry-crf1 0 jarry-crf2 1
CrfGrp SystemZones global
CrfGrp TFrozen global 0
CrfGrp Tag global
CrfGrp TriggerEvent global 1
CrfGrp TriggerPath global
CrfGrp TriggerResFault global 1
CrfGrp TriggerResRestart global 0
CrfGrp TriggerResStateChange global 0
CrfGrp TriggersEnabled global
CrfGrp TypeDependencies global
CrfGrp UserAssoc global
CrfGrp UserIntGlobal global 0
CrfGrp UserStrGlobal global
CrfGrp AutoDisabled jarry-crf1 0
CrfGrp AutoDisabled jarry-crf2 0
CrfGrp Enabled jarry-crf1 1
CrfGrp Enabled jarry-crf2 1
CrfGrp IntentOnline jarry-crf1 1
CrfGrp IntentOnline jarry-crf2 0
CrfGrp NumRetries jarry-crf1 0
CrfGrp NumRetries jarry-crf2 0
CrfGrp PCVAllowOnline jarry-crf1 1
CrfGrp PCVAllowOnline jarry-crf2 1
CrfGrp Probed jarry-crf1 1
CrfGrp Probed jarry-crf2 1
CrfGrp ProbesPending jarry-crf1 0
CrfGrp ProbesPending jarry-crf2 0
CrfGrp Restart jarry-crf1 0
CrfGrp Restart jarry-crf2 0
CrfGrp State jarry-crf1 |ONLINE|
CrfGrp State jarry-crf2 |OFFLINE|
CrfGrp UserIntLocal jarry-crf1 0
CrfGrp UserIntLocal jarry-crf2 0
CrfGrp UserStrLocal jarry-crf1
CrfGrp UserStrLocal jarry-crf2
CrfGrp VCSi3Info jarry-crf1
CrfGrp VCSi3Info jarry-crf2
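Since the output above is long, a small filter can help pick out the per-system values that differ between the two nodes (for example IntentOnline and State). The sketch below is only a convenience for reading "hagrp -display" output. The function name grp_attr is an assumption for the example and is not part of VCS.

```shell
# grp_attr ATTRIBUTE SYSTEM
# Reads "hagrp -display <group>" output on stdin and prints the value
# column(s) of ATTRIBUTE for SYSTEM ("global" for cluster-wide attributes).
grp_attr() {
    awk -v attr="$1" -v sys="$2" '
        $1 !~ /^#/ && $2 == attr && $3 == sys {
            out = ""
            # Join everything after the third column, since values may contain spaces.
            for (i = 4; i <= NF; i++) out = out (i > 4 ? " " : "") $i
            print out
        }'
}
```

Example usage: hagrp -display CrfGrp | grp_attr IntentOnline jarry-crf2. With the output listed above this prints 0, while the same query for jarry-crf1 prints 1.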
# hares -display CrfRes
#Resource Attribute System Value
CrfRes Group global CrfGrp
CrfRes Type global CrfMonitor
CrfRes AutoStart global 1
CrfRes Critical global 1
CrfRes Enabled global 1
CrfRes LastOnline global jarry-crf2
CrfRes MonitorOnly global 0
CrfRes ResourceOwner global
CrfRes TriggerEvent global 0
CrfRes ArgListValues jarry-crf1 ""
CrfRes ArgListValues jarry-crf2 ""
CrfRes ConfidenceLevel jarry-crf1 100
CrfRes ConfidenceLevel jarry-crf2 0
CrfRes ConfidenceMsg jarry-crf1
CrfRes ConfidenceMsg jarry-crf2
CrfRes Flags jarry-crf1
CrfRes Flags jarry-crf2
CrfRes IState jarry-crf1 not waiting
CrfRes IState jarry-crf2 not waiting
CrfRes MonitorMethod jarry-crf1 Traditional
CrfRes MonitorMethod jarry-crf2 Traditional
CrfRes Probed jarry-crf1 1
CrfRes Probed jarry-crf2 1
CrfRes Start jarry-crf1 1
CrfRes Start jarry-crf2 0
CrfRes State jarry-crf1 ONLINE
CrfRes State jarry-crf2 OFFLINE
CrfRes ComputeStats global 0
CrfRes ContainerInfo global Type Name Enabled
CrfRes ResContainerInfo global Type Name Enabled
CrfRes ResourceRecipients global
CrfRes TriggerPath global
CrfRes TriggerResRestart global 0
CrfRes TriggerResStateChange global 0
CrfRes TriggersEnabled global
CrfRes dummy global
CrfRes MonitorTimeStats jarry-crf1 Avg 0 TS
CrfRes MonitorTimeStats jarry-crf2 Avg 0 TS
CrfRes ResourceInfo jarry-crf1 State Valid Msg TS
CrfRes ResourceInfo jarry-crf2 State Valid Msg TS
# hasys -display jarry-crf2
#System Attribute Value
jarry-crf2 AgentsStopped 0
jarry-crf2 AvailableCapacity 100
jarry-crf2 CPUThresholdLevel Critical 90 Warning 80 Note 70 Info 60
jarry-crf2 CPUUsage 0
jarry-crf2 CPUUsageMonitoring Enabled 0 ActionThreshold 0 ActionTimeLimit 0 Action NONE NotifyThreshold 0 NotifyTimeLimit 0
jarry-crf2 Capacity 100
jarry-crf2 ConfigBlockCount 299
jarry-crf2 ConfigCheckSum 42524
jarry-crf2 ConfigDiskState CURRENT
jarry-crf2 ConfigFile /etc/VRTSvcs/conf/config
jarry-crf2 ConfigInfoCnt 0
jarry-crf2 ConfigModDate Thu 12 Feb 2015 02:19:30 AM EST
jarry-crf2 ConnectorState Down
jarry-crf2 CurrentLimits
jarry-crf2 DiskHbStatus
jarry-crf2 DynamicLoad 0
jarry-crf2 EngineRestarted 0
jarry-crf2 EngineVersion 6.0.30.0
jarry-crf2 FencingWeight 0
jarry-crf2 Frozen 0
jarry-crf2 GUIIPAddr
jarry-crf2 HostUtilization CPU 0 Swap 0
jarry-crf2 LLTNodeId 1
jarry-crf2 LicenseType PERMANENT_SITE
jarry-crf2 Limits
jarry-crf2 LinkHbStatus eth3 UP eth4 UP
jarry-crf2 LoadTimeCounter 0
jarry-crf2 LoadTimeThreshold 600
jarry-crf2 LoadWarningLevel 80
jarry-crf2 NoAutoDisable 0
jarry-crf2 NodeId 1
jarry-crf2 OnGrpCnt 1
jarry-crf2 PhysicalServer
jarry-crf2 ShutdownTimeout 600
jarry-crf2 SourceFile ./main.cf
jarry-crf2 SwapThresholdLevel Critical 90 Warning 80 Note 70 Info 60
jarry-crf2 SysInfo Linux:jarry-crf2,#1 SMP Sun Jul 27 15:55:46 EDT 2014,2.6.32-431.29.2.el6.x86_64,x86_64
jarry-crf2 SysName jarry-crf2
jarry-crf2 SysState RUNNING
jarry-crf2 SystemLocation
jarry-crf2 SystemOwner
jarry-crf2 SystemRecipients
jarry-crf2 TFrozen 0
jarry-crf2 TRSE 0
jarry-crf2 UpDownState Up
jarry-crf2 UserInt 0
jarry-crf2 UserStr
jarry-crf2 VCSFeatures NONE
jarry-crf2 VCSMode VCS
2015/02/12 01:59:33 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =eth3, DOWN, eth4, DOWN; Current status =eth3, UP, eth4, UP.
2015/02/12 01:59:43 VCS NOTICE V-16-1-11022 VCS engine (had) started
2015/02/12 01:59:43 VCS NOTICE V-16-1-11027 VCS engine startup arguments=-restart
2015/02/12 01:59:43 VCS NOTICE V-16-1-11050 VCS engine version=6.0
2015/02/12 01:59:43 VCS NOTICE V-16-1-11051 VCS engine join version=6.0.30.0
2015/02/12 01:59:43 VCS NOTICE V-16-1-11052 VCS engine pstamp=6.0.300.000-GA-2013-01-10-16.00.01
2015/02/12 01:59:43 VCS NOTICE V-16-1-10114 Opening GAB library
2015/02/12 01:59:43 VCS NOTICE V-16-1-10619 'HAD' starting on: jarry-crf2
2015/02/12 01:59:43 VCS INFO V-16-1-10196 Cluster logger started
2015/02/12 01:59:43 VCS INFO V-16-1-10125 GAB timeout set to 30000 ms
2015/02/12 01:59:43 VCS NOTICE V-16-1-11057 GAB registration monitoring timeout set to 200000 ms
2015/02/12 01:59:43 VCS NOTICE V-16-1-11059 GAB registration monitoring action set to log system message
2015/02/12 01:59:43 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2015/02/12 01:59:48 VCS INFO V-16-1-10077 Received new cluster membership
2015/02/12 01:59:48 VCS NOTICE V-16-1-10112 System (jarry-crf2) - Membership: 0x3, DDNA: 0x0
2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System (Node '0') changed state from UNKNOWN to INITING
2015/02/12 01:59:48 VCS NOTICE V-16-1-10086 System (Node '0') is in Regular Membership - Membership: 0x3
2015/02/12 01:59:48 VCS NOTICE V-16-1-10086 System jarry-crf2 (Node '1') is in Regular Membership - Membership: 0x3
2015/02/12 01:59:48 VCS WARNING V-16-1-50129 Operation 'haclus -modify' rejected as the node is in CURRENT_DISCOVER_WAIT state
2015/02/12 01:59:48 VCS WARNING V-16-1-50129 Operation 'haclus -modify' rejected as the node is in CURRENT_DISCOVER_WAIT state
2015/02/12 01:59:48 VCS NOTICE V-16-1-10453 Node: 0 changed name from: '' to: 'jarry-crf1'
2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System jarry-crf1 (Node '0') changed state from INITING to RUNNING
2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System jarry-crf2 (Node '1') changed state from CURRENT_DISCOVER_WAIT to REMOTE_BUILD
2015/02/12 01:59:48 VCS NOTICE V-16-1-10464 Requesting snapshot from node: 0
2015/02/12 01:59:48 VCS NOTICE V-16-1-10465 Getting snapshot. snapped_membership: 0x3 current_membership: 0x3 current_jeopardy_membership: 0x0
2015/02/12 01:59:48 VCS NOTICE V-16-1-10181 Group ClusterService AutoRestart set to 1
2015/02/12 01:59:48 VCS NOTICE V-16-1-10181 Group CrfGrp AutoRestart set to 1
2015/02/12 01:59:48 VCS NOTICE V-16-1-10181 Group VCShmg AutoRestart set to 1
2015/02/12 01:59:48 VCS INFO V-16-1-10466 End of snapshot received from node: 0. snapped_membership: 0x3 current_membership: 0x3 current_jeopardy_membership: 0x0
2015/02/12 01:59:48 VCS WARNING V-16-1-10030 UseFence=NONE. Hence do not need fencing
2015/02/12 01:59:48 VCS NOTICE V-16-1-10467 Replaying broadcast queue. snapped_membership: 0x3 current_membership: 0x3 current_jeopardy_membership: 0x0
2015/02/12 01:59:48 VCS NOTICE V-16-1-10322 System jarry-crf2 (Node '1') changed state from REMOTE_BUILD to RUNNING
2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/CrfMonitor/CrfMonitorAgent for resource type CrfMonitor successfully started at Thu Feb 12 01:59:48 2015
2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/NIC/NICAgent for resource type NIC successfully started at Thu Feb 12 01:59:48 2015
2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/NotifierMngr/NotifierMngrAgent for resource type NotifierMngr successfully started at Thu Feb 12 01:59:48 2015
2015/02/12 01:59:48 VCS NOTICE V-16-1-10016 Agent /opt/VRTSvcs/bin/HostMonitor for resource type HostMonitor successfully started at Thu Feb 12 01:59:48 2015
2015/02/12 01:59:48 VCS WARNING V-16-1-11141 LLT heartbeat link status changed. Previous status =UNKNOWN; Current status =eth3, UP, eth4, UP.
2015/02/12 01:59:48 VCS INFO V-16-6-15015 (jarry-crf2) hatrigger:/opt/VRTSvcs/bin/triggers/sysjoin is not a trigger scripts directory or can not be executed
2015/02/12 01:59:48 VCS INFO V-16-1-10297 Resource ntfr (Owner: Unspecified, Group: ClusterService) is online on jarry-crf2 (First probe)
2015/02/12 01:59:48 VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group ClusterService
2015/02/12 01:59:48 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group ClusterService on all nodes
2015/02/12 01:59:48 VCS NOTICE V-16-1-10447 Group ClusterService is online on system jarry-crf2
2015/02/12 01:59:48 VCS WARNING V-16-6-15034 (jarry-crf2) violation:Offlining group ClusterService on system jarry-crf2
2015/02/12 01:59:48 VCS INFO V-16-1-50135 User root fired command: hagrp -offline -force ClusterService jarry-crf2 from localhost
2015/02/12 01:59:48 VCS NOTICE V-16-1-10167 Initiating manual offline of group ClusterService on system jarry-crf2
2015/02/12 01:59:48 VCS NOTICE V-16-1-10300 Initiating Offline of Resource ntfr (Owner: Unspecified, Group: ClusterService) on System jarry-crf2
2015/02/12 01:59:48 VCS INFO V-16-6-15002 (jarry-crf2) hatrigger:hatrigger executed /opt/VRTSvcs/bin/internal_triggers/violation jarry-crf2 ClusterService successfully
2015/02/12 01:59:49 VCS INFO V-16-1-10305 Resource ntfr (Owner: Unspecified, Group: ClusterService) is offline on jarry-crf2 (VCS initiated)
2015/02/12 01:59:49 VCS NOTICE V-16-1-10446 Group ClusterService is offline on system jarry-crf2
2015/02/12 01:59:49 VCS NOTICE V-16-1-10438 Group ClusterService has been probed on system jarry-crf2
2015/02/12 01:59:49 VCS NOTICE V-16-1-10433 Group ClusterService will not start automatically on System jarry-crf2 as the system is in restart mode.
2015/02/12 01:59:49 VCS ERROR V-16-10031-10001 (jarry-crf2) CrfMonitor:CrfMonitorRes:monitor:Failed to get CRF monitor's PID.
2015/02/12 01:59:50 VCS INFO V-16-1-10304 Resource CrfRes (Owner: Unspecified, Group: CrfGrp) is offline on jarry-crf2 (First probe)
2015/02/12 01:59:50 VCS NOTICE V-16-1-10438 Group CrfGrp has been probed on system jarry-crf2
2015/02/12 01:59:50 VCS NOTICE V-16-1-10433 Group CrfGrp will not start automatically on System jarry-crf2 as the system is in restart mode.
2015/02/12 01:59:50 VCS NOTICE V-16-1-10445 Group CrfGrp will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.
2015/02/12 01:59:52 VCS NOTICE V-16-1-10438 Group VCShmg has been probed on system jarry-crf2
2015/02/12 01:59:52 VCS NOTICE V-16-1-10433 Group VCShmg will not start automatically on System jarry-crf2 as the system is in restart mode.
2015/02/12 01:59:52 VCS NOTICE V-16-1-10445 Group VCShmg will not start automatically as atleast one system in the SystemList attribute of the group is in restart mode.
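The entries that appear most relevant to our question are the NOTICE messages V-16-1-10433 and V-16-1-10445, which say the group will not start automatically because the system is in restart mode (the engine was started with "-restart"). A minimal sketch for pulling only those lines out of the engine log for one group is shown below. The function name autostart_skips is an assumption for the example, and the log path in the usage line is the usual default and may differ on other installations.

```shell
# autostart_skips GROUP
# Reads engine log lines on stdin and prints only the messages
# V-16-1-10433 / V-16-1-10445 that refer to "Group GROUP".
autostart_skips() {
    grep -E 'V-16-1-(10433|10445) ' | grep -F -- "Group $1 "
}
```

Example usage: autostart_skips CrfGrp < /var/VRTSvcs/log/engine_A.log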