09-07-2018 06:13 AM - edited 09-07-2018 06:45 AM
Environment
OS = RHEL 7.4
InfoScale HighAvailability = 7.4
SAN = iSCSI SAN (targetcli), connected through a separate Ethernet card
Cluster nodes = 2
Resources = LVM-VolumeGroup > LVM-LogicalVolume > Mount > Redis(custom agent) > Application(custom agent) > VIP
Incident
I hard-booted the active node and expected a successful failover. However, one resource (the Redis server, which is a custom agent) got stuck with a red exclamation mark, and I suspect this made the failover hang midway. My assumption is that the Redis resource could not be stopped because its child resource had already become unavailable/crashed/faulted, and that this is the real reason for the unsuccessful failover (image attached for reference). Why does the red exclamation mark hang the failover?
09-07-2018 06:47 AM
What do you mean by hard boot?
09-07-2018 07:00 AM
Power unplugged
09-08-2018 10:39 PM
Share the engine logs as well as the main.cf
09-09-2018 10:03 PM
STANDBY NODE
2018/09/10 09:51:38 VCS INFO V-16-6-15002 (COM2-XXXXX) hatrigger:hatrigger executed /opt/VRTSvcs/bin/internal_triggers/dump_tunables COM2-XXXXX 1 successfully
2018/09/10 09:51:41 VCS NOTICE V-16-1-10438 Group VCShmg has been probed on system COM2-XXXXX
2018/09/10 09:51:41 VCS NOTICE V-16-1-10435 Group VCShmg will not start automatically on System COM2-XXXXX as the system is not a part of AutoStartList attribute of the group.
2018/09/10 09:52:49 VCS NOTICE V-16-1-10322 System COM1-XXXXX (Node '0') changed state from RUNNING to LEAVING
2018/09/10 09:52:49 VCS NOTICE V-16-1-10300 Initiating Offline of Resource VIP (Owner: Unspecified, Group: SG) on System COM1-XXXXX
2018/09/10 09:52:50 VCS INFO V-16-1-10305 Resource VIP (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (VCS initiated)
2018/09/10 09:52:50 VCS NOTICE V-16-1-10300 Initiating Offline of Resource xxxx (Owner: Unspecified, Group: SG) on System COM1-XXXXX
2018/09/10 09:52:50 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(mount_lv) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:52:50 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(mount_lv) - clean completed successfully.
2018/09/10 09:52:50 VCS INFO V-16-1-10307 Resource mount_lv (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (Not initiated by VCS)
2018/09/10 09:52:50 VCS INFO V-16-6-15015 (COM1-XXXXX) hatrigger:/opt/VRTSvcs/bin/triggers/resfault is not a trigger scripts directory or can not be executed
2018/09/10 09:52:50 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(xxxx) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:52:52 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(xxxx) - clean completed successfully.
2018/09/10 09:52:53 VCS INFO V-16-1-10305 Resource xxxx (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (VCS initiated)
2018/09/10 09:52:53 VCS NOTICE V-16-1-10300 Initiating Offline of Resource REDIS (Owner: Unspecified, Group: SG) on System COM1-XXXXX
2018/09/10 09:52:53 VCS INFO V-16-1-55029 Resource xxxx in offline state received recurring offline message on system COM1-XXXXX
2018/09/10 09:52:56 VCS ERROR V-16-2-13064 (COM1-XXXXX) Agent is calling clean for resource(REDIS) because the resource is up even after offline completed.
2018/09/10 09:52:57 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(REDIS) - clean completed successfully.
2018/09/10 09:52:59 VCS ERROR V-16-2-13077 (COM1-XXXXX) Agent is unable to offline resource(REDIS). Administrative intervention may be required.
2018/09/10 09:53:16 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(VG) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:53:17 VCS ERROR V-16-2-13069 (COM1-XXXXX) Resource(VG) - clean failed.
2018/09/10 09:53:22 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(LV) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:53:22 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(LV) - clean completed successfully.
2018/09/10 09:53:22 VCS INFO V-16-1-10307 Resource LV (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (Not initiated by VCS)
2018/09/10 09:53:22 VCS INFO V-16-6-15015 (COM1-XXXXX) hatrigger:/opt/VRTSvcs/bin/triggers/resfault is not a trigger scripts directory or can not be executed
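A quick way to see which resources account for the ERROR lines is to count them per resource. The sketch below is only an illustration: the function name `vcs_error_summary` is made up, it assumes the `resource(NAME)` / `Resource(NAME)` format seen in the excerpt above, and the log path in the usage comment is the usual default, which may differ on your system.

```shell
#!/bin/sh
# Count "VCS ERROR" lines per resource from an engine log read on stdin.
# Assumes each relevant line contains "resource(NAME)" or "Resource(NAME)".
vcs_error_summary() {
    grep 'VCS ERROR' \
        | sed -n 's/.*[Rr]esource(\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Example (path is an assumption; adjust to your installation):
#   vcs_error_summary < /var/VRTSvcs/log/engine_A.log
```

Run against the excerpt above, REDIS and VG each account for two ERROR lines, and mount_lv, xxxx and LV for one each.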
MAIN.CF FILE
cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"
cluster CLUSTER-XXXXX (
	UserNames = { admin = cpqIpkPmqLqqOyqKpn }
	Administrators = { admin }
	)

system COM1-XXXXX (
	)

system COM2-XXXXX (
	)

group SG (
	SystemList = { COM1-XXXXX = 0, COM2-XXXXX = 1 }
	AutoStartList = { COM1-XXXXX, COM2-XXXXX }
	)

	Application xxxx (
		Critical = 0
		User = xxxx
		StartProgram = "/home/xxxx/HA/xxxx.sh start"
		StopProgram = "/home/xxxx/HA/xxxx.sh stop"
		MonitorProcesses = { "xxxx.exe COMMS" }
		UseSUDash = 1
		)

	Application REDIS (
		StartProgram = "/home/xxxx/HA/redis-start.sh"
		StopProgram = "/home/xxxx/HA/redis-stop.sh"
		MonitorProcesses = { "/usr/bin/redis-server 192.168.2.12:8999" }
		)

	IP REDIS-VIPVIP (
		Device = eno16780032
		Address = "192.168.2.12"
		NetMask = "255.255.248.0"
		)

	IP VIP (
		Device = eno16780032
		Address = "192.168.2.20"
		NetMask = "255.255.248.0"
		)

	LVMLogicalVolume LV (
		LogicalVolume = xxxx-VG-VOL1
		VolumeGroup = xxxx-VG
		)

	LVMVolumeGroup VG (
		VolumeGroup = xxxx-VG
		StartVolumes = 1
		ScanDevices = 1
		)

	Mount mount_lv (
		MountPoint = "/home/xxxx"
		BlockDevice = "/dev/mapper/xxxx--VG-xxxx--VG--VOL1"
		FSType = ext4
		FsckOpt = "-y"
		)

	xxxx requires REDIS
	LV requires VG
	REDIS requires REDIS-VIPVIP
	REDIS requires mount_lv
	VIP requires xxxx
	mount_lv requires LV
	// resource dependency tree
	//
	//	group SG
	//	{
	//	IP VIP
	//	    {
	//	    Application xxxx
	//	        {
	//	        Application REDIS
	//	            {
	//	            Mount mount_lv
	//	                {
	//	                LVMLogicalVolume LV
	//	                    {
	//	                    LVMVolumeGroup VG
	//	                    }
	//	                }
	//	            IP REDIS-VIPVIP
	//	            }
	//	        }
	//	    }
	//	}
09-12-2018 04:25 AM
Hello,
Can you please run the commands below and post the output?
1. hastatus -sum
2. gabconfig -a
3. hares -display | egrep -i "fail|err|fault"
4. vxdisk -o alldgs list
5. vxprint -ht | egrep -i "disable|err|fail"
6. haclus -display | grep -i vers
7. rpm -aq | egrep -i "vxc|llt|gab|vx"
Please run commands 4, 5 and 7 on both nodes.
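If it is easier, the output can be captured into a single file per node and attached. The following is just a convenience sketch: the `collect` helper and the output file name are made up for illustration, and each command line is run through `sh -c` so that pipes work.

```shell
#!/bin/sh
# collect OUTFILE "COMMAND LINE"
# Appends a header plus the command's stdout/stderr to OUTFILE.
collect() {
    {
        printf '===== %s =====\n' "$2"
        sh -c "$2" 2>&1
        printf '\n'
    } >> "$1"
}

# Example (file name is an assumption):
#   OUT=/tmp/vcs_diag_$(hostname).txt
#   collect "$OUT" 'hastatus -sum'
#   collect "$OUT" 'gabconfig -a'
#   collect "$OUT" 'hares -display | egrep -i "fail|err|fault"'
#   collect "$OUT" 'haclus -display | grep -i vers'
```

A failing command does not stop the collection; its error text simply ends up in the file under its header.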
09-28-2018 06:21 AM
You need to check how your custom agent is configured.
How is the resource monitored? How does the agent take it offline or clean it, and how does it detect the offline state?
Have a look here:
2018/09/10 09:52:59 VCS ERROR V-16-2-13077 (COM1-XXXXX) Agent is unable to offline resource(REDIS). Administrative intervention may be required.
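Just before that line, V-16-2-13064 reports that REDIS was "up even after offline completed". One possible reading is that redis-stop.sh returned while the redis-server process listed in MonitorProcesses was still running. The script was not posted, so this is only a guess. If that turns out to be the case, the stop script could wait for the process to disappear before it exits. A minimal sketch follows, in which the helper name `wait_until_stopped` and the 30-second timeout are assumptions:

```shell
#!/bin/sh
# Hypothetical helper for a stop script such as redis-stop.sh.
# wait_until_stopped TIMEOUT CHECK_CMD [ARGS...]
# CHECK_CMD must exit 0 for as long as the process is still running.
wait_until_stopped() {
    timeout=$1
    shift
    elapsed=0
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # still running after TIMEOUT seconds
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0            # check failed, so the process is gone
}

# Possible use inside the stop script (address and port taken from
# MonitorProcesses in the posted main.cf; the timeout is an assumption):
#   redis-cli -h 192.168.2.12 -p 8999 shutdown
#   wait_until_stopped 30 pgrep -f "/usr/bin/redis-server 192.168.2.12:8999" || exit 1
```

The check command is passed in as an argument, so the same helper works whether the process is detected with pgrep, a PID file, or something else.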