Solved: Getting VCS to tolerate short network "blips"

jjsmithct · ‎11-09-2011

Hello all -

This is my first time posting here, and I'm new to Symantec Connect.

I also do not have a lot of working knowledge of VCS - just pretty high level. I'm looking for an answer for a customer and am hoping someone here can help as it would take a long time for me to find the right information.

The customer (CU) runs VCS on Solaris for SAP and Oracle. They experienced a "very short" network interruption, on the order of seconds to a full minute, and as it was described to me, "service groups were bounced all over the nodes."

They are looking for a way to have VCS tolerate an outage like this for 5 minutes before taking any action.

From a few folks I talked to, there is a way to do this but no one has been able to articulate exactly how. So a few questions might be: are there multiple ways of doing this? Can this be done globally for the cluster or is it something configured app by app?

Any help appreciated since the timeframe is short and after Friday the network goes into lock down.

If there is additional information needed to answer the question I can furnish that.

Thanks much,

Jeff

joseph_dangelo · ‎11-09-2011

Jeff,

You have come to the right place :o) The agent framework within VCS can easily be configured to react however your customer chooses. The simplest fix would be to set the NIC and IP resources to non-critical. This is not ideal, however, because your customer would then not be protected if there were an actual NIC/IP issue.

Your customer would be far better off configuring the ToleranceLimit attribute on the agent type in question. This allows a certain number of monitor intervals to pass before the resource is declared faulted.

By default, VCS monitors a resource once every 60 seconds after it has been successfully brought online (although framework enhancements were incorporated into the product as of 5.1 SP1). Should VCS detect a failure, it declares the resource faulted. However, if you change the tolerance limit from, say, 0 to 1, the agent allows one additional monitor cycle before declaring the resource down. Please note that this is a cluster-wide setting, meaning all instances of the agent type in question will behave identically.
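
To make the effect of the tolerance limit concrete, here is a minimal Python sketch of the behavior described above. It is a simplified model, not the agent framework's actual logic: the function name is hypothetical, and it assumes the resource is faulted once the monitor has reported offline more than ToleranceLimit times in a row, with an online report resetting the count.

```python
from typing import Iterable, Optional


def first_faulted_cycle(monitor_online: Iterable[bool],
                        tolerance_limit: int) -> Optional[int]:
    """Return the 1-based monitor cycle at which the resource would be
    declared faulted, or None if it never is.

    Simplified model (an assumption, not VCS code): the resource is faulted
    once the monitor has reported offline more than `tolerance_limit` times
    in a row, and any online report resets the count.
    """
    consecutive_offline = 0
    for cycle, online in enumerate(monitor_online, start=1):
        if online:
            consecutive_offline = 0
            continue
        consecutive_offline += 1
        if consecutive_offline > tolerance_limit:
            return cycle
    return None


# A blip that spans a single monitor cycle (cycle 2 reports offline).
blip = [True, False, True, True]
print(first_faulted_cycle(blip, tolerance_limit=0))  # 2: faulted immediately
print(first_faulted_cycle(blip, tolerance_limit=1))  # None: the blip is ridden out
```

With a limit of 0 the single offline report faults the resource, while a limit of 1 lets the same blip pass. A failure that lasts two cycles would still fault the resource at the second offline report.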

There are even more granular means to control VCS behavior; however, I do believe this will satisfy your customer's need. That being said, you can also look into the ConfInterval and RestartLimit attributes.

It is important to note that this can in some cases delay the acknowledgment that a resource has truly faulted.

Hope this helps,

Joe D

TonyGriffiths · ‎11-10-2011

Hi

Is the network issue affecting just the public service network (the one clients use), or is it also affecting the VCS private interconnects (heartbeats, etc.)?

cheers

tony

jjsmithct · ‎11-10-2011

Hi Joe -

This helps much and thank you much. I think the Tolerance Limit attribute is what we're looking for here at least as a first defense.

This is what they have in types.cf in one environment:

type IPMultiNICB (
    static int MonitorInterval = 30
    static int OnlineRetryLimit = 1
    static int ToleranceLimit = 1

I'm pretty sure it is 5.1 - I'd have to verify if it's SP1, but probably.

Does it look to you like they changed the defaults? One would think the MonitorInterval and the ToleranceLimit work together, so if they want 5 minutes, would values of 60 and 5 do that?

I appreciate this much - this is the answer this CU needs...

Regards,

Jeff

jjsmithct · ‎11-10-2011

Hi Tony,

I cannot be certain of the answer, as I was given only a very high-level description of the issue. On top of that, the environment is hosted and the incident happened at the SP.

I will have further talks with them and attempt to get clearer information.

I'll post in this thread how it works out.

Thanks and regards,

Jeff

joseph_dangelo · ‎11-10-2011

Jeff,

The default ToleranceLimit for IPMultiNICB is 1. However, the monitor interval is set to 30 seconds. I would try increasing that to 60 and testing the agent behavior during the "network blip."

Would it be possible to see the entire types and main.cf?

Joe D

jjsmithct · ‎11-10-2011

Hi Joe,

I actually recommended they leave the monitor interval at 30 seconds and increase the ToleranceLimit to 10. This should give them the 5 minute grace period they asked for. I recommended of course they try it in QA, but have not heard back from them yet.
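
The arithmetic behind that recommendation can be sketched in a few lines of Python. This is a rough model under stated assumptions, not a statement of how the agent framework schedules monitors: the function name is hypothetical, monitors are assumed to run at exactly MonitorInterval seconds apart, and a resource is assumed to fault after ToleranceLimit + 1 consecutive offline reports.

```python
from typing import Tuple


def tolerated_outage_seconds(monitor_interval: int,
                             tolerance_limit: int) -> Tuple[int, int]:
    """Return (surely_tolerated, surely_faulted) bounds in seconds.

    Assumed model: the resource faults after `tolerance_limit + 1`
    consecutive offline monitor reports, spaced `monitor_interval` apart.
    An outage shorter than the first bound can never cover that many
    monitor runs; an outage longer than the second bound always does.
    In between, the outcome depends on where the monitor runs fall.
    """
    surely_tolerated = monitor_interval * tolerance_limit
    surely_faulted = monitor_interval * (tolerance_limit + 1)
    return surely_tolerated, surely_faulted


# (MonitorInterval, ToleranceLimit) pairs mentioned in this thread.
settings = {
    "current types.cf (30, 1)": (30, 1),
    "proposed (60, 5)": (60, 5),
    "recommended (30, 10)": (30, 10),
}

for label, (interval, limit) in settings.items():
    low, high = tolerated_outage_seconds(interval, limit)
    print(f"{label}: tolerated below {low}s, faulted above {high}s")
```

Under these assumptions, both (60, 5) and (30, 10) ride out outages of up to 300 seconds, while the current (30, 1) values only cover about 30 seconds. The model ignores monitor run time and timeouts, so the figures are approximate and worth confirming in QA.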

I'm attaching the types.cf and main.cf from one of their environments. It should be fairly representative of all the environments (4 of them).

Thanks for your help,

Jeff

joseph_dangelo · ‎11-10-2011

Jeff,

I will take a look at the main.cf in the simulator and let you know if I see anything that could also be "tweaked." However, it sounds like you've got a good bead on what they'll need.

Thanks,

Joe D

Gaurav_S · ‎11-11-2011

Tony raises a crucial point. At this stage I believe the network blips are limited to the public network, in which case you can follow the suggestions above.

You can refer to the VCS Admin Guide (previously called the VCS User's Guide). It has lots of information on how to manage faults and how to tweak the tunables for your environment. You can find VCS-related docs here:

https://sort.symantec.com/documents

Coming to Tony's point: if you see network blips on the private interconnects (LLT) as well, the impact could be severe. It is also important to know whether you use the I/O fencing feature of VCS.

Gaurav
