Reducing the cost of high availability with larger clusters

Eric_Hennessey1 · ‎03-05-2009

Hello from the Storage Foundation/High Availability Technical Product Management team, and welcome to the new Symantec Connect! In this first VCS blog entry under the new format, we'll take a look at reducing the cost of HA by building larger, more "intelligent" clusters.

When most people think of clustering, they think of a 2-node, active/passive cluster in which a mission-critical application runs on one system (the active node) while a stand-by system (passive node) is ready to take over if something fails on the active node. This approach works fine in an environment with one or maybe two mission-critical applications, but consider the costs of this approach in an environment with 10 or 15 or 20 mission-critical applications.

To avoid the costs of excessive hardware sparing, an "N+1" approach is often implemented in which "N" represents the number of active nodes and "+1" represents a single spare (idle) server. In a very large cluster, one might see an N+2 or N+3 configuration in which there is more than one spare server.
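To put rough numbers on that, here is a minimal sketch that compares server counts for dedicated active/passive pairs against an N+1 (or N+2) layout. The function names are made up for illustration, and the sketch assumes one application per active node.

```python
def servers_for_pairs(apps: int) -> int:
    """Each application gets its own active node plus a dedicated standby."""
    return 2 * apps


def servers_for_n_plus_k(apps: int, spares: int = 1) -> int:
    """One active node per application, plus a shared pool of spare nodes."""
    return apps + spares


for apps in (2, 10, 20):
    pairs = servers_for_pairs(apps)
    shared = servers_for_n_plus_k(apps)
    print(f"{apps} apps: {pairs} servers as pairs, {shared} servers as N+1")
```

With 20 applications, that is 40 servers as pairs versus 21 in an N+1 cluster, which is where the savings come from.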

Traditional clustering solutions typically use an ordered list of servers within a cluster to select a failover target when something fails on an active node. The cluster will simply select the next available system in the list as the failover target for the failed application. This is all well and good in the case of a single failure but becomes unmanageable in the case of multiple or cascading failures. To effectively implement an N+1 configuration, cluster technology with a little more intelligence is necessary.
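A toy model makes the problem easier to see. The sketch below is not taken from any particular cluster product; the node names, the shared `order` list, and the `fail_node` helper are all assumptions made for illustration. It simply moves each application from a failed node to the first live node in an ordered list.

```python
def fail_node(node, placement, order, down):
    """Mark `node` as down and move its applications to the first live node in `order`."""
    down.add(node)
    for app, host in list(placement.items()):
        if host == node:
            # The list knows nothing about load: the first live entry always wins.
            placement[app] = next((n for n in order if n not in down), None)


# Hypothetical 3+1 cluster: three active nodes and one idle spare.
order = ["spare", "node1", "node2", "node3"]
placement = {"app1": "node1", "app2": "node2", "app3": "node3"}
down = set()

fail_node("node1", placement, order, down)  # first failure: app1 lands on the spare
fail_node("node2", placement, order, down)  # second failure: app2 lands there too
print(placement)
```

After the second failure, both app1 and app2 sit on the former spare, because the list has no way to know that the spare is no longer idle.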

SGWM, or Service Group Workload Management, is a unique feature among cluster products. Coincidentally, it’s also a feature of Veritas Cluster Server.

VCS has three failover policies to choose from:

Priority. With this mechanism, specific servers can be given a higher or lower priority based on the needs of an application. Servers with a higher priority will be selected as a failover target before servers with a lower priority.
Round Robin. Veritas Cluster Server will select the server running the fewest service groups and use that as the failover target server. This failover policy is seldom used and is suitable only for very specific environments.
Load. Veritas Cluster Server will select the next failover target based on a predefined load of the service group and predefined server capacity. This is the policy that implements Service Group Workload Management.
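As a rough illustration of how the first two policies differ, here is a small sketch. It is a toy model, not VCS code: the `Server` class, the `rank` field (where a lower number means a higher priority, a convention chosen for this sketch), and both function names are assumptions. The Load policy is left out here because it depends on capacity and load values.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    rank: int  # lower rank = higher priority (convention chosen for this sketch)
    online_groups: list = field(default_factory=list)
    up: bool = True


def pick_priority(servers):
    """Priority: choose the live server with the highest priority (lowest rank)."""
    live = [s for s in servers if s.up]
    return min(live, key=lambda s: s.rank, default=None)


def pick_round_robin(servers):
    """Round Robin: choose the live server running the fewest service groups."""
    live = [s for s in servers if s.up]
    return min(live, key=lambda s: len(s.online_groups), default=None)


servers = [
    Server("sysA", 0, ["db1", "db2"]),
    Server("sysB", 1, ["web1"]),
    Server("sysC", 2, [], up=False),  # idle but down, so never a candidate
]
print(pick_priority(servers).name)     # sysA
print(pick_round_robin(servers).name)  # sysB
```

Priority picks sysA even though it is already the busiest server, while Round Robin picks sysB because it counts service groups without regard to how heavy each one is.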

SGWM provides an advanced capability to dynamically choose the next failover target based on user, application, or system requirements. Most cluster products use some form of an ordered list of servers to determine failover behavior. That’s OK if you’re clairvoyant and always know the failure sequence of your environment, but for those of us who are challenged in that way, SGWM offers an intelligent solution.

Veritas Cluster Server doesn't require every server in the cluster to be provisioned identically. You can assign each server a capacity value that reflects its capabilities and assign each service group (application) a load value. Veritas Cluster Server will then attempt to stack up applications within the cluster in a way that keeps the load on each server at a reasonable level, and it takes the remaining available capacity of each server into consideration when it has to make a failover decision.
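The arithmetic behind that can be sketched in a few lines. This is a toy model under stated assumptions, not VCS code: the server names, the capacity and load numbers, and the rule "pick the live server with the most remaining capacity" are my reading of the description above, and all function names are hypothetical.

```python
def available_capacity(server, capacity, group_load, placement):
    """Capacity left on `server` after subtracting the load of groups placed on it."""
    used = sum(group_load[g] for g, host in placement.items() if host == server)
    return capacity[server] - used


def pick_target(capacity, group_load, placement, down):
    """Choose the live server with the most remaining capacity, or None if none is up."""
    live = [s for s in capacity if s not in down]
    if not live:
        return None
    return max(live, key=lambda s: available_capacity(s, capacity, group_load, placement))


def fail_server(server, capacity, group_load, placement, down):
    """Fail over each group on `server` in turn, re-evaluating capacity every time."""
    down.add(server)
    for group in [g for g, host in placement.items() if host == server]:
        placement[group] = pick_target(capacity, group_load, placement, down)


# Hypothetical cluster with unequal servers (units are arbitrary).
capacity = {"big1": 200, "big2": 200, "small1": 100}
group_load = {"erp": 90, "crm": 50, "mail": 60, "web": 40}
placement = {"erp": "big1", "crm": "big1", "mail": "big2", "web": "small1"}
down = set()

fail_server("big1", capacity, group_load, placement, down)
print(placement)
```

When big1 fails, erp goes to big2 (140 units free versus 60 on small1). That leaves big2 with only 50 free, so crm goes to small1 instead. No ordered list had to anticipate that sequence.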

With this failover policy, the systems must have enough compute and memory resources, as well as storage and network bandwidth, to run the applications that may fail over to them. In addition, no service group may conflict with any other service group it could share a server with. If there are incompatible applications, SGWM lets you set system limits that allow only a certain number of service groups to run on a given server, or predefine a prerequisite condition that a server must meet.

Limits and prerequisites are user-definable settings that let you restrict the number or types of applications that run on a server. For example, I might have a cluster with eight servers and six or seven Oracle instances, and to maintain performance, I want to be sure that I never stack more than two or three Oracle instances on a single server. With SGWM's limits and prerequisites feature, I can implement that as a global policy.
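One way to picture the Oracle example is as slot counting. The sketch below is an illustration of the idea only, not VCS syntax: the `oracle_slots` key, the dictionary layout, and the `can_host` function are all assumptions. Each server advertises a limit, each service group declares what it needs, and a server qualifies as a target only if enough of the limit is still unused.

```python
def can_host(server_limits, online_prereqs, new_prereqs):
    """Return True if the server has enough unused limit for every prerequisite."""
    for resource, needed in new_prereqs.items():
        in_use = sum(p.get(resource, 0) for p in online_prereqs)
        if server_limits.get(resource, 0) - in_use < needed:
            return False
    return True


# Hypothetical policy: at most two Oracle instances per server.
limits = {"oracle_slots": 2}
oracle_group = {"oracle_slots": 1}
web_group = {}  # no prerequisites, so the Oracle limit never blocks it

print(can_host(limits, [oracle_group], oracle_group))                # True: one slot left
print(can_host(limits, [oracle_group, oracle_group], oracle_group))  # False: both slots used
print(can_host(limits, [oracle_group, oracle_group], web_group))     # True
```

A check like this would run before any load comparison, so a server that is full of Oracle instances is ruled out no matter how much capacity it has left.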

SGWM is the key to cost-effective high availability in the Veritas Cluster Server world.
