
IO FENCING: 2/3 CPS working and network problem

joagmv
Level 4

Hello,

We have set up I/O fencing with 3 CP servers (at 3 different sites). Right now, because of a hardware problem, one CPS is down, so only 2 CPS are working.

Next weekend there will be a 5-minute network cut-off during which all the nodes (all clusters are 2-node, spread across 2 sites) will not see each other. I tested the cut-off in a development environment and both nodes panicked and rebooted, because each node could reach only one CPS.

Is there any way to avoid using I/O fencing during this 5-minute cut-off?


mikebounds
Level 6
Partner Accredited

To disable fencing:

Run "hastop -all -force"

Remove UseFence = SCSI3 from main.cf

Run "hastart" on the node where you changed main.cf

Run "hastart" on the remaining node (the config is read from the first node, so there is no need to change main.cf on all nodes).

I would test this on your development system first, and if it doesn't work, add a step to stop I/O fencing after running "hastop -all -force" (you can use the RC stop script to stop fencing).
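A rough sketch of the main.cf edit in the steps above. The sample main.cf snippet and cluster name are illustrative; on a real node the file lives at /etc/VRTSvcs/conf/config/main.cf, and this runs between "hastop -all -force" and "hastart":

```shell
# Sketch of the "remove UseFence" step, run on a copy of a minimal
# main.cf snippet for illustration (democlus is a made-up cluster name).
main_cf=$(mktemp)
cat > "$main_cf" <<'EOF'
cluster democlus (
        UserNames = { admin = password }
        UseFence = SCSI3
        )
EOF
# Delete the UseFence line; sed keeps a .bak backup of the original
sed -i.bak '/UseFence[[:space:]]*=[[:space:]]*SCSI3/d' "$main_cf"
grep -c UseFence "$main_cf" || true   # prints 0: attribute removed
```

Keeping the .bak copy makes it easy to restore fencing after the outage by putting the attribute back and restarting VCS the same way.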

Mike

joagmv
Level 4

Hello mike,

this is not viable, as there are 40-50 clusters and we want to avoid stopping the applications (they are all very critical).

I am looking for an online way to do this (if one exists).

mikebounds
Level 6
Partner Accredited

"hastop -all -force" does not stop the applications - it leaves them running and just stops the VCS daemons. If you have 40-50 clusters, then I would strongly recommend fixing the CP server, as an unplanned complete network outage could leave you with a lot of servers panicking.

Mike

joagmv
Level 4

I have been testing some approaches... these are the results:

1. Easy way: increase the LLT peerinact timeout so the LLT links are never marked DOWN. The service groups would be frozen.

2. On the node which is not running any application (let's say the passive node), stop HAD and vxfen before the network outage. Persistent freezing can be applied just to make sure nothing strange happens.

Mike, what do you think about these approaches? We have many clusters and applications, so we want to find an easy and fast way (and a safe one, of course).
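A sketch of option 1, assuming a 5-minute outage plus a safety margin (the margin value and the restore value are illustrative; lltconfig -T peerinact, hagrp -freeze and haconf are the standard LLT/VCS commands, and peerinact is expressed in hundredths of a second):

```shell
# Compute a peerinact value large enough to cover the planned outage.
# peerinact is in hundredths of a second; the default is typically 1600 (16 s).
OUTAGE_SECONDS=300      # the planned 5-minute cut-off
MARGIN_SECONDS=60       # illustrative safety margin in case it runs long
PEERINACT=$(( (OUTAGE_SECONDS + MARGIN_SECONDS) * 100 ))
echo "$PEERINACT"       # prints 36000
# On each node, before the outage (run as root):
#   lltconfig -T peerinact:$PEERINACT
#   haconf -makerw
#   hagrp -freeze <group> -persistent    # repeat per service group
#   haconf -dump -makero
# After the network is back, restore the default:
#   lltconfig -T peerinact:1600
```

As Mike notes below in the thread, the margin matters: if the outage overruns the new peerinact, LLT will still declare the peer down, so test the maximum value you are comfortable with first.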

mikebounds
Level 6
Partner Accredited

These methods are OK - I think the first one is better: if VCS is down, then you have no control during the network outage if you need to fail over a service in that period. If you have tested changing the LLT timeout and it works, then I think this is easier than changing UseFence, as you can't change the UseFence attribute with the haclus command - you have to stop and start VCS.

I guess the only danger in changing the LLT timeout is the outage running longer than expected, so you may want to test the maximum value you can set it to.

Mike

arangari
Level 5

A few more points:

1. Changing peerinact will ensure that no node detects that the other node has gone out of membership.

2. However, any operation needing cluster-wide communication, whether broadcast or unicast, will hang - for example, any non-read VCS command. If you are using CVM/CFS, then depending on what the applications are doing, you may see hangs.

3. Freezing the SGs before the network downtime will not make much difference during the outage itself, as any state change in the applications during that time is not even processed by the VCS policy: after the agent detects the state change, it is processed by the local VCS policy only once the broadcast message is received back. However, it is still important to freeze the SGs, because these messages will be queued, and after the network is reconnected they will be received and processed by each node. So if a fault in any application during this time is to be ignored, you will want to freeze the SGs.

I would recommend making sure you identify the resources which may depend directly or indirectly on GAB membership, and confirm that the planned network outage can be handled by those resources.
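The freeze step in point 3 could be scripted roughly like this across all groups. The group/system names below are a stand-in for real `hagrp -list` output, whose exact format (one "group system" pair per line is assumed here) you should verify on your VCS version:

```shell
# Derive the unique service-group names from `hagrp -list`-style output.
# The sample list is a stand-in; on a real cluster use: list=$(hagrp -list)
list='appgrp node1
appgrp node2
dbgrp node1
dbgrp node2'
groups=$(printf '%s\n' "$list" | awk '{print $1}' | sort -u)
echo "$groups"    # prints appgrp and dbgrp, one per line
# On a real cluster, freeze each group persistently before the outage:
#   haconf -makerw
#   for g in $groups; do hagrp -freeze "$g" -persistent; done
#   haconf -dump -makero
# ...and unfreeze afterwards with: hagrp -unfreeze "$g" -persistent
```

The -persistent flag keeps the freeze across a VCS restart, which is what you want if HAD is stopped on the passive node during the outage.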