
Timeout/keepalive parameters for Storage Foundation and Oracle RAC

gsroman
Not applicable

Hello:

 

I have a 2-node RAC cluster: Solaris 10, Oracle 11g RAC, and SF 5.0 MP1.

 

Could you please tell me which parameters to set, and how to configure them, to avoid the 10-minute wait between one node failing and the other node noticing and moving resources and sessions? I have some references from Oracle saying that Oracle waits for the SF cluster to acknowledge the abort, and that it has a 10-minute timeout after which it makes the acknowledgment by itself. So the RAC performs the node eviction only 10 minutes later. This means users wait 10 minutes before they get a response.

 

I was told by the previous DB admin that there are some parameters to avoid that behavior, but he does not have any documentation about these parameters.

 

Any help will be really appreciated.

 

2 REPLIES

The_Dude1
Level 3

Hello,

Some words of wisdom from our SF Oracle RAC expert in Support, Ed Yu:

 

By default, with SFRAC VCS and Oracle 10g installed on the system, both VCS (the vendor clusterware) and RAC CRS (the RAC clusterware) will try to fight for control of the cluster during an outage.

Since VCS is more closely integrated with the OS and various system-related resources, Oracle RAC, by design, defaults the CSSD timeout to 600 seconds when it detects vendor clusterware during installation. This means that CRS will wait for VCS (the vendor clusterware) to stabilize the cluster before taking action, and the action CRS takes is node eviction.

Under most circumstances VCS, being as sensitive as it is, will take the corresponding actions quickly (GAB panic, resource clean, retry online, etc.). However, there are exceptions where CRS is in trouble for whatever reason but VCS sees everything running in a normal state; only then will the node eviction be delayed.

It is not recommended that customers reduce the timeout value in init.cssd, because a shorter timeout will result in VCS and CRS fighting to stabilize the cluster, introducing resource contention and exposure to possible "split brain" scenarios. Here is a short example: while VCS is trying to offline resources or is in the process of bringing down the cluster, Oracle also tries to take out the node. Because of this contention, they end up hanging on missing resources that each has taken out from under the other, and the offline/panic/eviction may take a lot longer than usual.

If a customer really wants a shorter timeout value, they will need to get approval from Oracle, but do not set the timeout value to 60, because they may end up with a lot of unwanted node evictions and service interruptions.
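
Before considering any change, it can help to confirm which timeout is actually in effect. The following is a minimal, read-only sketch. It assumes `crsctl` from the CRS home is on the PATH and that a bash- or ksh-style shell is used; the function name `show_misscount` is made up for illustration. It only reads the CSS misscount (reported in seconds) and does not change anything.

```shell
# Hypothetical helper: print the CSS misscount currently in effect.
# Assumption: crsctl (from the CRS home's bin directory) is on PATH.
show_misscount() {
    if command -v crsctl >/dev/null 2>&1; then
        # Read-only query; the value is reported in seconds.
        crsctl get css misscount
    else
        echo "crsctl not found on PATH"
    fi
}

show_misscount
```

If the value reported is 600, that would be consistent with the vendor-clusterware default described above.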

 

/ed 

 

 

Hope this helps.

 

-TD-

chethan_hublika
Not applicable
Employee
If the product is Storage Foundation for Oracle RAC, then there should not be a 10-minute delay before CRS realizes that the other node is dead. This should not happen if things are configured correctly. Was the Veritas skgxn library copied as /opt/ORCLcluster/lib/libskgxn2.so before installing Oracle Clusterware/CRS? Did Oracle Clusterware populate the cluster nodes automatically?
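
The two questions above can be checked roughly as follows. This is only a sketch, assuming a bash- or ksh-style shell and that `olsnodes` from the CRS home is on the PATH; the function names `check_skgxn_lib` and `list_cluster_nodes` are hypothetical. It confirms that a file exists at the path mentioned above, but not that the file is the Veritas library or that it was in place before Clusterware was installed.

```shell
# Hypothetical helper: report whether the skgxn library file exists.
# The default path is the one named in the reply above; pass another path as $1 to override.
check_skgxn_lib() {
    lib="${1:-/opt/ORCLcluster/lib/libskgxn2.so}"
    if [ -f "$lib" ]; then
        echo "present: $lib"
    else
        echo "missing: $lib"
    fi
}

# Hypothetical helper: list the nodes Oracle Clusterware knows about.
# Assumption: olsnodes (from the CRS home's bin directory) is on PATH.
list_cluster_nodes() {
    if command -v olsnodes >/dev/null 2>&1; then
        # -n prints each node name together with its node number.
        olsnodes -n
    else
        echo "olsnodes not found on PATH"
    fi
}

check_skgxn_lib
list_cluster_nodes
```

Run it on each node; both node names should appear in the `olsnodes` output on a correctly populated 2-node cluster.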