Resource faults issue

Question

Hi Team,

Many times we face resource faults alerts on our AIX server,but when we check it shows all are running fine.&nbsp;

Also, when checked logs it shows not (initiated by VCS) alerts.

Our main concern is to troubleshoot, why we get these alerts and if we get these alerts then why cluster resource is not showing faulty.

Below are the logs,

&nbsp;

-- SYSTEM STATE

-- System&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; State&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Frozen

&nbsp;

A &nbsp;xxxibm012&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RUNNING&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0

A &nbsp;xxxibm014&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RUNNING&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0

&nbsp;

-- GROUP STATE

-- Group&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; System&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Probed&nbsp;&nbsp;&nbsp;&nbsp; AutoDisabled&nbsp;&nbsp;&nbsp; State

&nbsp;

B&nbsp; ClusterService &nbsp;xxxibm012&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Y&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; N&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ONLINE

B&nbsp; ClusterService &nbsp;xxxibm014&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Y&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; N&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; OFFLINE

B&nbsp; DB_INSIGHT_STAGE xxxibm012&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Y&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; N&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ONLINE

B&nbsp; DB_INSIGHT_STAGE xxxibm014&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Y&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;N&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; OFFLINE

=============================================================

2015/04/21 10:14:53 VCS INFO V-16-1-53504 VCS Engine Alive message!!
	2015/04/21 12:57:32 VCS WARNING V-16-10011-5611 (clnibm014) NIC:csgnic:monitor:Second PingTest failed for Virtual Interface en4. Resource is OFFLINE
	2015/04/21 12:57:32 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp; &nbsp;from localhost
	2015/04/21 12:57:33 VCS ERROR V-16-1-54031 Resource csgnic (Owner: Unspecified, Group: ClusterService) is FAULTED on sys clnibm014
	2015/04/21 12:57:33 VCS INFO V-16-6-0 (clnibm014) resfault:(resfault) Invoked with arg0=clnibm014, arg1=csgnic, arg2=ONLINE
	2015/04/21 12:57:49 VCS INFO V-16-6-15002 (clnibm014) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/resfault clnibm014 csgnic ONLINE &nbsp;successfully
	2015/04/21 12:58:18 VCS ERROR V-16-1-54031 Resource proxy_DB_INSPRD (Owner: Unspecified, Group: DB_INSIGHT_STAGE) is FAULTED on sys clnibm014
	2015/04/21 12:58:18 VCS INFO V-16-6-0 (clnibm014) resfault:(resfault) Invoked with arg0=clnibm014, arg1=proxy_DB_INSPRD, arg2=ONLINE
	2015/04/21 12:58:29 VCS INFO V-16-6-15002 (clnibm014) hatrigger:hatrigger executed /opt/VRTSvcs/bin/triggers/resfault clnibm014 proxy_DB_INSPRD ONLINE &nbsp;successfully
	2015/04/21 12:58:33 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Primary test to confirm Online status succeeded. &nbsp;from localhost
	2015/04/21 12:58:34 VCS INFO V-16-1-10299 Resource csgnic (Owner: Unspecified, Group: ClusterService) is online on clnibm014 (Not initiated by VCS)
	2015/04/21 12:58:34 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group ClusterService on all nodes
	2015/04/21 12:58:34 VCS NOTICE V-16-1-51034 Failover group ClusterService is already active. Ignoring Restart
	2015/04/21 12:59:18 VCS INFO V-16-1-10299 Resource proxy_DB_INSPRD (Owner: Unspecified, Group: DB_INSIGHT_STAGE) is online on clnibm014 (Not initiated by VCS)
	2015/04/21 12:59:18 VCS NOTICE V-16-1-10233 Clearing Restart attribute for group DB_INSIGHT_STAGE on all nodes
	2015/04/21 12:59:18 VCS NOTICE V-16-1-51034 Failover group DB_INSIGHT_STAGE is already active. Ignoring Restart
	2015/04/21 12:59:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Primary test to confirm Online status succeeded. &nbsp;from localhost
	2015/04/21 13:18:53 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Relying on secondary test to confirm Online status. &nbsp;from localhost
	2015/04/21 13:19:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Primary test to confirm Online status succeeded. &nbsp;from localhost
	2015/04/21 13:44:49 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Relying on secondary test to confirm Online status. &nbsp;from localhost
	2015/04/21 13:45:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Primary test to confirm Online status succeeded. &nbsp;from localhost
	2015/04/21 14:14:54 VCS INFO V-16-1-53504 VCS Engine Alive message!!
	2015/04/21 16:48:59 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Relying on secondary test to confirm Online status. &nbsp;from localhost
	2015/04/21 16:49:34 VCS INFO V-16-1-50135 User root fired command: hares -modify csgnic &nbsp;ConfidenceMsg &nbsp;Primary test to confirm Online status succeeded.&nbsp;

wally_heim · Accepted Answer

Hi allaboutunix,

From the log snippet that you provided it is you NIC resource (csgnic) in the ClusterService group that is going faulted from time to time that is causing problems for the proxy resources that point to it.

NIC resources are persistent resources and show the state of the NIC hardware that they are configured to monitor.&nbsp; If the NIC resource probes as online for persistent resources the fault is cleared automatically and the resource state is shown as online.

The proxy resource mirrors the state of another resource that it is configured to monitor.&nbsp; In this case, I would guess that&nbsp;proxy_DB_INSPRD is configured to mirror the csgnic resource state.&nbsp; So in this case, its faults would also be automatically cleared when the csgnic resource probes as online.

You can check the NIC_A.log file on the clnibm014 node for issues around 2015/04/21 12:57:32.&nbsp; This might give you a better idea as to why the csgnic resource is having issues at that time.

This also seems to be a temporary issue and one that resources itself in a short amount of time.&nbsp; Possibly system load related??

Anyway, you can try increasing the ToleranceLimit attribute on the csgnic to a value of say 1 or 2 (default is zero) to try and tolerate a breif period of incorrect probing state of the nic resource.&nbsp; But this is just a way to hide these events from happening and not really a way to fix the underlying issue.

Thank you,

Wally

mikebounds · Answer

Did you check you have set NetworkHosts attribribute&nbsp;- you can check by running:

hares -display csgnic  | grep NetworkHosts

&nbsp;

Mike

allaboutunix · Answer

&nbsp;hares -display csgnic |grep -i NetworkHosts
	csgnic &nbsp; &nbsp; &nbsp; ArgListValues &nbsp; &nbsp; &nbsp; &nbsp; clnibm012 &nbsp;Device &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; en4 &nbsp; &nbsp; Protocol &nbsp; &nbsp; &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; ipv4 &nbsp; &nbsp;PingOptimize &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; 1 &nbsp; &nbsp; &nbsp; NetworkHosts &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; 132.189.91.254 &nbsp;NetworkType &nbsp; &nbsp; 1 &nbsp; &nbsp; &nbsp; ""
	csgnic &nbsp; &nbsp; &nbsp; ArgListValues &nbsp; &nbsp; &nbsp; &nbsp; clnibm014 &nbsp;Device &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; en4 &nbsp; &nbsp; Protocol &nbsp; &nbsp; &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; ipv4 &nbsp; &nbsp;PingOptimize &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; 1 &nbsp; &nbsp; &nbsp; NetworkHosts &nbsp; &nbsp;1 &nbsp; &nbsp; &nbsp; 132.189.91.254 &nbsp;NetworkType &nbsp; &nbsp; 1 &nbsp; &nbsp; &nbsp; ""
	csgnic &nbsp; &nbsp; &nbsp; NetworkHosts &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;global &nbsp; &nbsp; 132.189.91.254

===================================================

&nbsp;

One question,

&nbsp;

How do we check that&nbsp;&nbsp;proxy_DB_INSPRD is configured to mirror the csgnic resource state.

Kindly let me know

wally_heim · Answer

Hi allaboutunix,

You can check the TargetResName attribute of the proxy resource to see what other resource it is set to mirror.&nbsp;

It is recommended that you have more than 1 entry in the NetworkHosts list for the NIC resource.&nbsp;

Thank you,

Wally

mikebounds · Answer

I agree with Wally. &nbsp;What is happening is that if VCS cannot ping the one address in the NetworkHosts attrribute, then it fails the csgnic resource and then probably&nbsp;proxy_DB_INSPRD is proxied to csgnic so this fails too.&nbsp;&nbsp;So to resolve you should add one&nbsp;or more additional&nbsp;hosts to NetworkHosts if you have another Device/host that is normally up on the same subnet&nbsp; Then if VCS cannot ping the host in the list, then it pings the next in the list (2nd is only pinged if first fails). &nbsp;In addition you can set ToleranceLimit as Wally has already said as&nbsp;132.189.91.254 was down for 1 to 2 minutes so setting&nbsp;ToleranceLimit to 1 or 2 would also help to stop this failing again.

Mike

sacheen_birhade · Answer

NIC resource is a persistance resource. NIC agent wont bring NIC resource online and offline. It only monitors the resource.&nbsp;

To bring NIC resource online, we manually need to bring NIC resource online outside of VCS using OS ifconfig to plumb NIC and&nbsp;assign&nbsp;base IP to NIC. Then this base IP is used to ping your specifed NetworkHosts in the monitor to detect the "accurate" status of your NIC. Remember that NetworkHost must be in the same subnet as that of NetworkHosts. If you have not specifed NetworkHost, response from your network is taken under consideration.

With above theory, please check &nbsp;whether you have assigned base IP to your NIC. If specifed then check your NetworkHosts are pingable through same NIC. If no, specify another NetworkHosts. If the base IP to NIC is not assogned then assign it. After correcting your configuration, probe your resource.

I hope this could help you to resolve your problem.

&nbsp;

Forum Discussion

Resource faults issue

9 Replies

Related Content

Listener resource remain faulted

Resource Fault question

Regarding resource grp faults

Re: need to know the VCS resource status if intentional recycle occur for DB

Need help in Nic resource fault

Recent Discussions

Configure two Mount type resources of nfs FStype attribute using the same share

order

key registration and reservation

Verifying that primary and dr clusters replication is synced

vcs can create logical nic