
IO FENCING: 2/3 CPS working and network problem

joagmv
Level 4

Hello,

We have set up I/O fencing with 3 CP servers (at 3 different sites). Right now, because of a hardware problem, one CPS is down, so only 2 CPS are working.

Next weekend there will be a 5-minute network cut-off during which all the nodes (all clusters are 2-node, spread across 2 sites) will not see each other. I tested the cut-off in a development environment and both nodes panicked and rebooted, because each node could reach only one CPS.

Is there any way to avoid using I/O fencing during this 5-minute cut-off?


mikebounds
Level 6
Partner Accredited

To disable fencing:

Run "hastop -all -force"

Remove UseFence = SCSI3 from main.cf

Run "hastart" on the node where you changed main.cf

Run "hastart" on the remaining node (the config is read from the first node, so there is no need to change main.cf on all nodes).

I would test this on your development system first, and if it doesn't work, add a step to stop I/O fencing after running "hastop -all -force" (you can use the RC stop script to stop fencing).
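A rough sketch of the main.cf edit in the steps above. The sample main.cf snippet and cluster name are illustrative; on a real node the file lives at /etc/VRTSvcs/conf/config/main.cf, and this runs between "hastop -all -force" and "hastart":

```shell
# Sketch of the "remove UseFence" step, run on a copy of a minimal
# main.cf snippet for illustration (democlus is a made-up cluster name).
main_cf=$(mktemp)
cat > "$main_cf" <<'EOF'
cluster democlus (
        UserNames = { admin = password }
        UseFence = SCSI3
        )
EOF
# Delete the UseFence line; sed keeps a .bak backup of the original
sed -i.bak '/UseFence[[:space:]]*=[[:space:]]*SCSI3/d' "$main_cf"
grep -c UseFence "$main_cf" || true   # prints 0: attribute removed
```

Keeping the .bak copy makes it easy to restore fencing after the outage by putting the attribute back and restarting VCS the same way.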

Mike

joagmv
Level 4

Hello mike,

this is not viable, as there are 40-50 clusters and we want to avoid stopping the applications (they are all very critical).

I am looking for an online way to do this (if one exists).

mikebounds
Level 6
Partner Accredited

"hastop -all -force" does not stop the applications - it leaves them running and just stops the VCS daemons. If you have 40-50 clusters, then I would strongly recommend fixing the CP server, as an unplanned complete network outage could leave you with a lot of servers panicking.

Mike

joagmv
Level 4

I have been testing some approaches... these are the results:

1. Easy way: increase the LLT peerinact timeout so the LLT links are never marked DOWN. The service groups would be frozen.

2. On the node which is not running any application (let's say the passive node), stop HAD and vxfen before the network outage. Persistent freezing can be applied just to make sure nothing strange happens.

Mike, what do you think about these approaches? We have many clusters and applications, so we want to find an easy and fast way (and a safe one, of course).
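A sketch of option 1, assuming a 5-minute outage plus a safety margin (the margin value and the restore value are illustrative; lltconfig -T peerinact, hagrp -freeze and haconf are the standard LLT/VCS commands, and peerinact is expressed in hundredths of a second):

```shell
# Compute a peerinact value large enough to cover the planned outage.
# peerinact is in hundredths of a second; the default is typically 1600 (16 s).
OUTAGE_SECONDS=300      # the planned 5-minute cut-off
MARGIN_SECONDS=60       # illustrative safety margin in case it runs long
PEERINACT=$(( (OUTAGE_SECONDS + MARGIN_SECONDS) * 100 ))
echo "$PEERINACT"       # prints 36000
# On each node, before the outage (run as root):
#   lltconfig -T peerinact:$PEERINACT
#   haconf -makerw
#   hagrp -freeze <group> -persistent    # repeat per service group
#   haconf -dump -makero
# After the network is back, restore the default:
#   lltconfig -T peerinact:1600
```

As Mike notes below in the thread, the margin matters: if the outage overruns the new peerinact, LLT will still declare the peer down, so test the maximum value you are comfortable with first.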

mikebounds
Level 6
Partner Accredited

These methods are OK - I think the first one is better: if VCS is down, then you have no control during the network outage if you need to fail over a service in that period. If you have tested changing the LLT timeout and it works, then I think this is easier than changing UseFence, as you can't change the UseFence attribute with the haclus command - you have to stop and start VCS.

I guess the only danger in changing the LLT timeout is the outage running longer than expected, so you may want to test the maximum value you can set it to.

Mike

arangari
Level 5

A few more points:

1. Changing peerinact will ensure that no node detects that the other node has gone out of membership.

2. However, any operation needing cluster-wide communication, whether broadcast or unicast, will hang - for example, any non-read VCS command. If you are using CVM/CFS, then depending on what the applications are doing, you may see hangs.

3. Freezing the SGs before the network downtime will not make much difference during the outage itself, as any state change in the applications during that time is not even processed by the VCS policy: after the agent detects the state change, it is processed by the local VCS policy only once the broadcast message is received back. However, it is still important to freeze the SGs, because these messages will be queued, and after the network is reconnected they will be received and processed by each node. So if a fault in any application during this time is to be ignored, you will want to freeze the SGs.

I would recommend making sure you identify the resources which may depend directly or indirectly on GAB membership, and confirm that the planned network outage can be handled by those resources.
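The freeze step in point 3 could be scripted roughly like this across all groups. The group/system names below are a stand-in for real `hagrp -list` output, whose exact format (one "group system" pair per line is assumed here) you should verify on your VCS version:

```shell
# Derive the unique service-group names from `hagrp -list`-style output.
# The sample list is a stand-in; on a real cluster use: list=$(hagrp -list)
list='appgrp node1
appgrp node2
dbgrp node1
dbgrp node2'
groups=$(printf '%s\n' "$list" | awk '{print $1}' | sort -u)
echo "$groups"    # prints appgrp and dbgrp, one per line
# On a real cluster, freeze each group persistently before the outage:
#   haconf -makerw
#   for g in $groups; do hagrp -freeze "$g" -persistent; done
#   haconf -dump -makero
# ...and unfreeze afterwards with: hagrp -unfreeze "$g" -persistent
```

The -persistent flag keeps the freeze across a VCS restart, which is what you want if HAD is stopped on the passive node during the outage.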