Listener resource remains faulted

Laszlo_Budai
Level 3

Hello,

 

We are doing some failure tests for a customer. We have VCS 6.2 running on Solaris 10, with an Oracle database and the listener associated with it.

We are simulating different kinds of failures, one of which is killing the listener. In this case the cluster detects that the listener has died and fails the service over to the other node. However, the listener resource remains in the FAULTED state on the original node, and the group it belongs to is in the OFFLINE FAULTED state there. If something then goes wrong on the second node, the service will not fail back to the original node until we manually run hagrp -clear.

Is there anything we can do to fix this, i.e. to have the clear done automatically?

Here are some lines from the log:

2015/03/30 17:26:10 VCS ERROR V-16-2-13067 (node2p) Agent is calling clean for resource(ora_listener-res) because the resource became OFFLINE unexpectedly, on its own.
2015/03/30 17:26:11 VCS INFO V-16-2-13068 (node2p) Resource(ora_listener-res) - clean completed successfully.
2015/03/30 17:26:11 VCS INFO V-16-1-10307 Resource ora_listener-res (Owner: Unspecified, Group: oracle_rg) is offline on node2p (Not initiated by VCS)

These lines say that clean for the resource completed successfully, but the resource is still faulted.

However, if I run hares -clear manually, the fault goes away:

20150330-173628:root@node1p:~# hares -state ora_listener-res
#Resource        Attribute             System     Value
ora_listener-res State                 node1p    ONLINE
ora_listener-res State                 node2p    FAULTED
20150330-173636:root@node1p:~# hares -clear ora_listener-res
20150330-173653:root@node1p:~# hares -state ora_listener-res
#Resource        Attribute             System     Value
ora_listener-res State                 node1p    ONLINE
ora_listener-res State                 node2p    OFFLINE
20150330-173655:root@node1p:~#
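
For repeated failure tests, the manual check-and-clear in the transcript above could be scripted. The following is only a minimal sketch and not part of the original post: the function name `faulted_systems` is hypothetical, it assumes `hares -state` output in exactly the four-column format shown above, and it prints the matching `hares -clear` commands for review instead of running them.

```shell
#!/bin/sh
# faulted_systems (hypothetical helper): read `hares -state <resource>`
# output on stdin and print one "hares -clear <resource> -sys <system>"
# line per system on which the resource is FAULTED.
# Nothing is executed; the printed commands are for review only.
faulted_systems() {
    awk '
        /^#/ { next }                        # skip the "#Resource ..." header
        $2 == "State" && $4 ~ /FAULTED/ {    # columns: resource, attribute, system, value
            print "hares -clear " $1 " -sys " $3
        }
    '
}
```

Typical use would be `hares -state ora_listener-res | faulted_systems`. Printing rather than executing keeps the decision to clear a fault with the administrator.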

 

5 REPLIES

starflyfly
Level 6
Employee Accredited Certified
Hi, it seems VCS can't do this automatically; it's by design. In a real production system, if the listener faults for some reason, it fails over to another node. The system admin needs to find the cause and clear the fault so that the listener can fail back later. If VCS cleared the fault automatically, the listener might fault again if the error condition were still there. So, by design, VCS does not clear the fault automatically. Regards

mikebounds
Level 6
Partner Accredited

As starflyfly says, this is by design to prevent "ping-ponging": if an issue on node1 causes the listener to fail, the group fails over to node2, and the listener then fails on node2 as well, then if node1 were not in a faulted state the group would fail back to node1 and continue to ping-pong between the servers. You could use a Preonline script to run "hagrp -clear", so that when the group comes up on node2 it clears all faults (on node1), but I would not recommend this because you can then get ping-ponging. You should clear faults manually: doing so indicates that you have fixed the issue that caused the resource to fault, and so the system is now valid to fail back to.

For the listener resource I would recommend setting RestartLimit so that the process is restarted locally first. That way, if the process just happens to die and there is nothing wrong with the system, VCS restarts the listener without having to fail the whole group over. You can set RestartLimit to 1 or more to restart that many times, for example:

hatype -modify Netlsnr RestartLimit 2
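
A rough sketch of how that command might be wrapped, assuming the cluster configuration is currently read-only and has to be opened and saved around the change. The `run` helper, the `DRY_RUN` variable, and both function names are hypothetical. The per-resource variant assumes `hares -override` is available in your VCS version, so check the `hares` manual page before relying on it. Note that the `hatype` form changes RestartLimit for every resource of type Netlsnr.

```shell
#!/bin/sh
# DRY_RUN defaults to 1: commands are printed, not executed.
# Set DRY_RUN=0 only after reviewing the printed commands.
DRY_RUN="${DRY_RUN:-1}"

# run (hypothetical helper): print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set RestartLimit for all resources of type Netlsnr.
# $1 = restart limit
set_type_restart_limit() {
    run haconf -makerw || return 1            # open configuration read-write
    run hatype -modify Netlsnr RestartLimit "$1"
    rc=$?
    run haconf -dump -makero                  # save and close, even on failure
    return "$rc"
}

# Set RestartLimit for one resource only (assumes hares -override exists).
# $1 = resource name, $2 = restart limit
set_resource_restart_limit() {
    run haconf -makerw || return 1
    run hares -override "$1" RestartLimit &&
        run hares -modify "$1" RestartLimit "$2"
    rc=$?
    run haconf -dump -makero
    return "$rc"
}
```

For example, `set_type_restart_limit 2` prints the three commands in order, and `set_resource_restart_limit ora_listener-res 2` prints the per-resource equivalent for the listener from this thread.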

 

Mike

 

Marianne
Moderator
Moderator
Partner    VIP    Accredited Certified

As per above excellent post - I would not want VCS to automatically clear faults.

It is up to the Administrator to troubleshoot the problem, fix it, and then let VCS know 'all is well' by clearing the fault.

Laszlo_Budai
Level 3

Dear all,

 

thank you for your messages.

I was confused by the log entry that says:

2015/03/30 17:26:11 VCS INFO V-16-2-13068 (node2p) Resource(ora_listener-res) - clean completed successfully.

I was expecting the resource to be OK if the clear was successful. What would be the point of this message if the clear is not executed?

Kind regards,

Laszlo

Gaurav_S
Moderator
Moderator
   VIP    Certified

Hi,

The above message indicates that the "clean" script completed successfully, which means that whatever was defined as the clean action was executed and returned a completed status code without errors.

Whether that action has cleared the fault or not is determined by the "monitor" script: once clean has executed, a monitor runs afterwards to determine the status of the resource.

 

G
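
When running failure tests like the ones in this thread, it may help to pull the clean events out of the engine log so that each simulated failure can be matched to a system and resource. This is a minimal sketch that assumes the log line format quoted in the original post. The function name `unexpected_offline_events` is hypothetical, and it only matches message ID V-16-2-13067.

```shell
#!/bin/sh
# unexpected_offline_events (hypothetical helper): read VCS engine log lines
# on stdin and print "<date> <time> <system> <resource>" for each
# V-16-2-13067 message (agent calling clean because a resource became
# OFFLINE unexpectedly). All other lines are ignored.
unexpected_offline_events() {
    sed -n 's|^\([0-9/]* [0-9:]*\) VCS ERROR V-16-2-13067 (\([^)]*\)) .*resource(\([^)]*\)).*|\1 \2 \3|p'
}
```

It would typically be fed from the engine log, for example `unexpected_offline_events < /var/VRTSvcs/log/engine_A.log`. That path is the usual default and is an assumption here, so adjust it for your installation.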