Red exclamation mark makes the cluster hang / failover stuck

Zahid_Haseeb
Moderator

Environment

OS = RHEL 7.4

InfoScale HighAvailability = 7.4

SAN = iSCSI SAN (targetcli) connected through a separate Ethernet card

Cluster nodes = 2

Resources = LVM-VolumeGroup > LVM-LogicalVolume > Mount > Redis(custom agent) > Application(custom agent) > VIP

Incident

I hard-booted the active node and expected a successful failover. However, one resource (the Redis server, which is a custom agent) got stuck with a red exclamation mark. I suspect this made the failover hang midway. My assumption is that the Redis resource could not be stopped because its child resource had already become unavailable/crashed/faulted, and that this is the real reason for the unsuccessful failover (image attached for reference). Why does the red exclamation mark hang the failover?

6 REPLIES

RiaanBadenhorst
Moderator

What do you mean by hard boot?

Zahid_Haseeb
Moderator

Power unplugged

RiaanBadenhorst
Moderator

Share the engine logs as well as the main.cf

Zahid_Haseeb
Moderator

STANDBY NODE

2018/09/10 09:51:38 VCS INFO V-16-6-15002 (COM2-XXXXX) hatrigger:hatrigger executed /opt/VRTSvcs/bin/internal_triggers/dump_tunables COM2-XXXXX 1 successfully
2018/09/10 09:51:41 VCS NOTICE V-16-1-10438 Group VCShmg has been probed on system COM2-XXXXX
2018/09/10 09:51:41 VCS NOTICE V-16-1-10435 Group VCShmg will not start automatically on System COM2-XXXXX as the system is not a part of AutoStartList attribute of the group.
2018/09/10 09:52:49 VCS NOTICE V-16-1-10322 System COM1-XXXXX (Node '0') changed state from RUNNING to LEAVING
2018/09/10 09:52:49 VCS NOTICE V-16-1-10300 Initiating Offline of Resource VIP (Owner: Unspecified, Group: SG) on System COM1-XXXXX
2018/09/10 09:52:50 VCS INFO V-16-1-10305 Resource VIP (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (VCS initiated)
2018/09/10 09:52:50 VCS NOTICE V-16-1-10300 Initiating Offline of Resource xxxx (Owner: Unspecified, Group: SG) on System COM1-XXXXX
2018/09/10 09:52:50 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(mount_lv) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:52:50 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(mount_lv) - clean completed successfully.
2018/09/10 09:52:50 VCS INFO V-16-1-10307 Resource mount_lv (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (Not initiated by VCS)
2018/09/10 09:52:50 VCS INFO V-16-6-15015 (COM1-XXXXX) hatrigger:/opt/VRTSvcs/bin/triggers/resfault is not a trigger scripts directory or can not be executed
2018/09/10 09:52:50 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(xxxx) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:52:52 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(xxxx) - clean completed successfully.
2018/09/10 09:52:53 VCS INFO V-16-1-10305 Resource xxxx (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (VCS initiated)
2018/09/10 09:52:53 VCS NOTICE V-16-1-10300 Initiating Offline of Resource REDIS (Owner: Unspecified, Group: SG) on System COM1-XXXXX
2018/09/10 09:52:53 VCS INFO V-16-1-55029 Resource xxxx in offline state received recurring offline message on system COM1-XXXXX
2018/09/10 09:52:56 VCS ERROR V-16-2-13064 (COM1-XXXXX) Agent is calling clean for resource(REDIS) because the resource is up even after offline completed.
2018/09/10 09:52:57 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(REDIS) - clean completed successfully.
2018/09/10 09:52:59 VCS ERROR V-16-2-13077 (COM1-XXXXX) Agent is unable to offline resource(REDIS). Administrative intervention may be required.
2018/09/10 09:53:16 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(VG) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:53:17 VCS ERROR V-16-2-13069 (COM1-XXXXX) Resource(VG) - clean failed.
2018/09/10 09:53:22 VCS ERROR V-16-2-13067 (COM1-XXXXX) Agent is calling clean for resource(LV) because the resource became OFFLINE unexpectedly, on its own.
2018/09/10 09:53:22 VCS INFO V-16-2-13068 (COM1-XXXXX) Resource(LV) - clean completed successfully.
2018/09/10 09:53:22 VCS INFO V-16-1-10307 Resource LV (Owner: Unspecified, Group: SG) is offline on COM1-XXXXX (Not initiated by VCS)
2018/09/10 09:53:22 VCS INFO V-16-6-15015 (COM1-XXXXX) hatrigger:/opt/VRTSvcs/bin/triggers/resfault is not a trigger scripts directory or can not be executed
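
To isolate the agent-level failures for one resource from an engine log like the one above, a small filter can help. This is only a minimal sketch: the function name `agent_errors` is hypothetical, and it assumes log lines in the format shown above.

```shell
#!/bin/bash
# agent_errors RESOURCE (hypothetical helper, not part of VCS):
# read engine-log lines on stdin and print only the "VCS ERROR" lines
# that mention "resource(RESOURCE)" or "Resource(RESOURCE)".
agent_errors() {
    # First keep ERROR-level lines, then match the resource name literally
    # (-F) and case-insensitively (-i), since the log uses both spellings.
    grep 'VCS ERROR' | grep -i -F "resource($1)"
}
```

Assuming the default engine log location, it could be used as `agent_errors REDIS < /var/VRTSvcs/log/engine_A.log`.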

MAIN.CF FILE

 cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"

cluster CLUSTER-XXXXX (
        UserNames = { admin = cpqIpkPmqLqqOyqKpn }
        Administrators = { admin }
        )

system COM1-XXXXX (
        )

system COM2-XXXXX (
        )

group SG (
        SystemList = { COM1-XXXXX = 0, COM2-XXXXX = 1 }
        AutoStartList = { COM1-XXXXX, COM2-XXXXX }
        )

        Application xxxx (
                Critical = 0
                User = xxxx
                StartProgram = "/home/xxxx/HA/xxxx.sh start"
                StopProgram = "/home/xxxx/HA/xxxx.sh stop"
                MonitorProcesses = { "xxxx.exe COMMS" }
                UseSUDash = 1
                )

        Application REDIS (
                StartProgram = "/home/xxxx/HA/redis-start.sh"
                StopProgram = "/home/xxxx/HA/redis-stop.sh"
                MonitorProcesses = { "/usr/bin/redis-server 192.168.2.12:8999" }
                )

        IP REDIS-VIPVIP (
                Device = eno16780032
                Address = "192.168.2.12"
                NetMask = "255.255.248.0"
                )

        IP VIP (
                Device = eno16780032
                Address = "192.168.2.20"
                NetMask = "255.255.248.0"
                )

        LVMLogicalVolume LV (
                LogicalVolume = xxxx-VG-VOL1
                VolumeGroup = xxxx-VG
                )

        LVMVolumeGroup VG (
                VolumeGroup = xxxx-VG
                StartVolumes = 1
                ScanDevices = 1
                )

        Mount mount_lv (
                MountPoint = "/home/xxxx"
                BlockDevice = "/dev/mapper/xxxx--VG-xxxx--VG--VOL1"
                FSType = ext4
                FsckOpt = "-y"
                )

        xxxx requires REDIS
        LV requires VG
        REDIS requires REDIS-VIPVIP
        REDIS requires mount_lv
        VIP requires xxxx
        mount_lv requires LV


        // resource dependency tree
        //
        //      group SG
        //      {
        //      IP VIP
        //          {
        //          Application xxxx
        //              {
        //              Application REDIS
        //                  {
        //                  Mount mount_lv
        //                      {
        //                      LVMLogicalVolume LV
        //                          {
        //                          LVMVolumeGroup VG
        //                          }
        //                      }
        //                  IP REDIS-VIPVIP
        //                  }
        //              }
        //          }
        //      }

frankgfan
Moderator

Hello,

Can you please run the commands below and post the output?

1. hastatus -sum

2. gabconfig -a

3. hares -display | egrep -i "fail|err|fault"

4. vxdisk -o alldgs list              

5. vxprint -ht | egrep -i "disable|err|fail"

6. haclus -display | grep -i vers

7. rpm -aq | egrep -i "vxc|llt|gab|vx"

Please run commands 4, 5 and 7 on both nodes.

Marianne
Moderator

@Zahid_Haseeb

You need to check how your custom agent is configured. 
How is it monitored? How does the agent offline or clean the resource? Or detect offline state?

Have a look here:

2018/09/10 09:52:59 VCS ERROR V-16-2-13077 (COM1-XXXXX) Agent is unable to offline resource(REDIS). Administrative intervention may be required.
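
The V-16-2-13064 line just before it says REDIS was still up after offline completed. One possible cause is a stop script that returns before the process has actually exited. The thread does not show redis-stop.sh, so the following is only a minimal sketch of a stop helper that waits for the process to disappear and escalates to SIGKILL. The function names `is_alive` and `stop_and_verify` and the timeout values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for a StopProgram-style script (not from the thread).

# is_alive PID: succeed only if PID exists and is not a zombie.
is_alive() {
    kill -0 "$1" 2>/dev/null || return 1
    # On Linux a zombie still answers kill -0; treat it as exited.
    if grep -q '^State:[[:space:]]*Z' "/proc/$1/status" 2>/dev/null; then
        return 1
    fi
    return 0
}

# stop_and_verify PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 10), then send SIGKILL. Returns 0 only once the process is gone.
stop_and_verify() {
    local pid=$1 timeout=${2:-10} waited=0
    kill -TERM "$pid" 2>/dev/null
    while is_alive "$pid"; do
        if [ "$waited" -ge "$timeout" ]; then
            # Graceful stop did not finish in time; force it.
            kill -KILL "$pid" 2>/dev/null
            sleep 1
            if is_alive "$pid"; then
                return 1
            fi
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A stop script could call it as `stop_and_verify "$(pgrep -f '/usr/bin/redis-server 192.168.2.12:8999')" 10`, assuming exactly one process matches the MonitorProcesses string from main.cf. Returning only after the process is gone means the monitor that runs after offline should no longer find it.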