Resource faults issue
Hi Team,
We frequently get resource fault alerts on our AIX servers, but when we check, everything appears to be running fine.
Also, when we check the logs, the alerts show "(Not initiated by VCS)".
Our main concern is to troubleshoot why we get these alerts, and if we do get them, why the cluster resources are not shown as faulted.
Below are the logs:
-- SYSTEM STATE
-- System              State         Frozen

A  xxxibm012           RUNNING       0
A  xxxibm014           RUNNING       0

-- GROUP STATE
-- Group             System      Probed  AutoDisabled  State

B  ClusterService    xxxibm012   Y       N             ONLINE
B  ClusterService    xxxibm014   Y       N             OFFLINE
B  DB_INSIGHT_STAGE  xxxibm012   Y       N             ONLINE
B  DB_INSIGHT_STAGE  xxxibm014   Y       N             OFFLINE
=============================================================
2015/04/21 10:14:53 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2015/04/21 12:57:32 VCS WARNING V-16-10011-5611 (clnibm014) NIC:csgnic:monitor:Second PingTest failed for Virtual Interface en4. Resource is OFFLINE
2015/04/21 12:57:32 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg from localhost
2015/04/21 12:57:33 VCS ERROR V-16-1-54031 Resource csgnic (Owner: Unspecified, Group: ClusterService) is FAULTED on sys clnibm014
2015/04/21 12:57:33 VCS INFO V-16-6-0 (clnibm014) resfault:(resfault) Invoked with arg0=clnibm014, arg1=csgnic, arg2=ONLINE
2015/04/21 12:57:49 VCS INFO V-16-6-15002 (clnibm014) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/resfault clnibm014 csgnic ONLINE successfully
2015/04/21 12:58:18 VCS ERROR V-16-1-54031 Resource proxy_DB_INSPRD (Owner: Unspecified, Group: DB_INSIGHT_STAGE) is FAULTED on sys clnibm014
2015/04/21 12:58:18 VCS INFO V-16-6-0 (clnibm014) resfault:(resfault) Invoked with arg0=clnibm014, arg1=proxy_DB_INSPRD, arg2=ONLINE
2015/04/21 12:58:29 VCS INFO V-16-6-15002 (clnibm014) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/resfault clnibm014 proxy_DB_INSPRD ONLINE successfully
2015/04/21 12:58:33 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Primary test to confirm Online status succeeded. from localhost
2015/04/21 12:58:34 VCS INFO V-16-1-10299 Resource csgnic (Owner: Unspecified, Group: ClusterService) is online on clnibm014 (Not initiated by VCS)
2015/04/21 12:58:34 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group ClusterService on all nodes
2015/04/21 12:58:34 VCS NOTICE V-16-1-51034 Failover group ClusterService is already active. Ignoring Restart
2015/04/21 12:59:18 VCS INFO V-16-1-10299 Resource proxy_DB_INSPRD (Owner: Unspecified, Group: DB_INSIGHT_STAGE) is online on clnibm014 (Not initiated by VCS)
2015/04/21 12:59:18 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group DB_INSIGHT_STAGE on all nodes
2015/04/21 12:59:18 VCS NOTICE V-16-1-51034 Failover group DB_INSIGHT_STAGE is already active. Ignoring Restart
2015/04/21 12:59:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Primary test to confirm Online status succeeded. from localhost
2015/04/21 13:18:53 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Relying on secondary test to confirm Online status. from localhost
2015/04/21 13:19:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Primary test to confirm Online status succeeded. from localhost
2015/04/21 13:44:49 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Relying on secondary test to confirm Online status. from localhost
2015/04/21 13:45:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Primary test to confirm Online status succeeded. from localhost
2015/04/21 14:14:54 VCS INFO V-16-1-53504 VCS Engine Alive message!!
2015/04/21 16:48:59 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Relying on secondary test to confirm Online status. from localhost
2015/04/21 16:49:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic ConfidenceMsg Primary test to confirm Online status succeeded.
Hi allaboutunix,
From the log snippet you provided, it is your NIC resource (csgnic) in the ClusterService group that is faulting from time to time, and that is what causes problems for the proxy resources that point to it.
NIC resources are persistent resources: they reflect the state of the NIC hardware they are configured to monitor. For persistent resources, when the resource later probes as online, the fault is cleared automatically and the resource state is shown as ONLINE.
The Proxy resource mirrors the state of another resource that it is configured to monitor. In this case, I would guess that proxy_DB_INSPRD is configured to mirror the csgnic resource, so its faults are also cleared automatically when csgnic probes as online.
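You can confirm which resource the proxy points at by checking the Proxy agent's TargetResName attribute, something like:

# show the resource whose state proxy_DB_INSPRD mirrors
hares -value proxy_DB_INSPRD TargetResName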
You can check the NIC_A.log file on the clnibm014 node for issues around 2015/04/21 12:57:32. This might give you a better idea of why the csgnic resource is having trouble at that time.
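Something like this should pull out the relevant window (the VCS agent logs normally live under /var/VRTSvcs/log; adjust the path if yours differs):

# NIC agent messages around the time of the fault
grep "2015/04/21 12:5" /var/VRTSvcs/log/NIC_A.log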
This also seems to be a temporary issue, and one that resolves itself in a short amount of time. Possibly system load related?
Anyway, you can try increasing the ToleranceLimit attribute on the csgnic resource to a value of, say, 1 or 2 (the default is 0) to tolerate a brief period of incorrect probe results from the NIC resource. But this just hides these events and is not really a way to fix the underlying issue.
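If you want to try that, the commands would look something like this (the value 2 is just an example):

# open the cluster configuration for writes
haconf -makerw
# allow up to 2 consecutive bad monitor cycles before declaring a fault
hares -modify csgnic ToleranceLimit 2
# save the configuration and make it read-only again
haconf -dump -makero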
Thank you,
Wally
Agree with Wally that the NIC seems to be the issue and that setting ToleranceLimit may help, but you should also make sure you have set NetworkHosts on your NIC resource. Setting this attribute is best practice for any interface, but it is required for virtual interfaces on AIX, which you seem to be using - see the extract from the bundled agents guide, and the example commands after it:
NetworkHosts
List of hosts on the same network that are pinged to determine if the
network connection is alive. Enter the IP address of the host, instead
of the host name, to prevent the monitor from timing out. DNS lookup
causes the ping to hang. If more than one network host is listed, the
monitor returns ONLINE if at least one of the hosts is reachable.
If you do not specify network hosts, the monitor tests the NIC by
sending pings to the broadcast address on the NIC.
For a virtual device, you must configure the NetworkHosts attribute.
Symantec recommends configuring more than one host to take care
of the NetworkHost itself failing.
Type and dimension: string-vector
Example: { "166.96.15.22", "166.97.1.2" }
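To set it, something along these lines (the IPs here reuse the guide's example addresses; substitute reliably pingable hosts on the NIC's own subnet, such as the default gateway):

# open the cluster configuration for writes
haconf -makerw
# ping these hosts during monitor instead of the broadcast address
hares -modify csgnic NetworkHosts 166.96.15.22 166.97.1.2
# save the configuration and make it read-only again
haconf -dump -makero

Mike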
Hi Wally/Mike,
Is there any way to fix it permanently?
We have checked and didn't find any logs from 2015; all are older ones.
"We have checked and didn't find any logs from 2015; all are older ones."
This means that there was probably a real issue at the hardware level that got fixed in the meantime.
Remember that VCS reports faults and issues; it does not cause them.
Troubleshooting of NIC issues must be done at the OS and hardware level; a few starting points are sketched below.
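On AIX, for example, you could start with something like this (en4 is the interface named in your logs; the gateway IP is a placeholder):

# check the AIX error report for hardware/driver errors around the fault times
errpt | head
# link status, speed/duplex, and error counters for the NIC
entstat -d en4
# interface state plus packet and error counts
netstat -in
# confirm reachability of a host on the same subnet (e.g. the default gateway)
ping -c 5 <gateway-ip>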