High Availability for converged infrastructure in ...

Kshitij_Korde · 10-31-2014

Data Center Architecture

In a converged-infrastructure platform such as Cisco UCS (Unified Computing System), every server, whether blade or rack-mount, is provisioned through a Service Profile. Just as a SIM card holds the identity of a mobile phone, a Service Profile holds the identity of a server in the data center. Each Service Profile maps to a single server and defines its network and storage characteristics. Because it contains the MAC addresses (for each vNIC), WWN addresses (for each vHBA), UUID, server boot policies and so on, no server in the UCS domain can become functional without one.

One can disassociate a Service Profile from its server and associate it with another. Once the Service Profile is associated with (attached to) the new server, that server's burned-in settings, such as UUID and MAC addresses, are overwritten with the Service Profile configuration. Thus, without any change to the physical configuration of the target server, a Service Profile can easily migrate from one bare-metal server to another. This provides a mechanism for quickly replacing faulted servers with available bare-metal servers, that is, servers not associated with any Service Profile.
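As a toy illustration of the paragraph above, the sketch below models how a server's effective identity follows whichever Service Profile is associated with it. The class and field names (`ServiceProfile`, `Server`, `effective_uuid`, `migrate`) and all values are hypothetical and are not part of UCS Manager; the real association is carried out by UCS Manager itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServiceProfile:
    name: str
    uuid: str
    macs: Tuple[str, ...]  # one MAC address per vNIC


@dataclass
class Server:
    slot: str
    burned_in_uuid: str
    profile: Optional[ServiceProfile] = None

    @property
    def effective_uuid(self) -> str:
        # An associated profile overrides the burned-in identity.
        return self.profile.uuid if self.profile else self.burned_in_uuid


def migrate(profile: ServiceProfile, source: Server, target: Server) -> None:
    """Disassociate the profile from source and associate it with target."""
    if source.profile != profile:
        raise ValueError("profile is not associated with the source server")
    if target.profile is not None:
        raise ValueError("target server already has a Service Profile")
    source.profile = None
    target.profile = profile


# Illustrative values only.
web = ServiceProfile("web-01", "uuid-from-profile", ("00:25:B5:00:00:01",))
faulted = Server("chassis-1/blade-1", "uuid-hw-1", profile=web)
spare = Server("chassis-1/blade-2", "uuid-hw-2")
migrate(web, faulted, spare)
```

After the call, `spare` presents the identity defined in the profile, while `faulted` falls back to its burned-in UUID.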

From a High Availability perspective, we can think of a Service Profile as a critical application that should always be up and running. But monitoring the health of Service Profiles and providing HA for them differs from the traditional health monitoring that we do for enterprise applications.

How is monitoring the health of a Service Profile different from monitoring the health of an enterprise application?

In the simplest terms, a Service Profile is just a file (like a SIM card) that can be applied to any server in the UCS domain. So the Service Profile is not really an application.
The Service Profile is not stored on the associated server. It is kept by the UCS Manager, which runs on a fabric interconnect inside the same UCS domain. When the Service Profile is deployed on an available server, the UCS Manager automatically configures the server to match the configuration specified in the Service Profile. Because the UCS Manager mediates all communication inside the UCS domain, any decision to move a Service Profile from one server to another has to go through the UCS Manager. Monitoring is therefore essentially remote: the health of the Service Profile applied to the server is observed through the UCS Manager.
Not every Service Profile can be applied to every available server, because servers differ in hardware characteristics such as the number of NICs, HBAs, CPUs, cores and the amount of memory. Thus, choosing an appropriate failover target out of potentially 160 servers (a UCS domain can scale up to 160 servers) involves comparing these characteristics.
Since a server can be physically added to or removed from the UCS domain at any time, the target has to be chosen dynamically.
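The comparison described above can be sketched as a filter followed by a ranking. This is a minimal sketch under assumptions: the `Hardware` fields, the "smallest surplus" ranking and the server names are all hypothetical and are not taken from the Symantec solution, whose actual selection criteria are not described here. The free-server inventory is passed in on every call, so a server added to or removed from the domain is picked up the next time a target is chosen.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Hardware:
    nics: int
    hbas: int
    cores: int
    memory_gb: int


def satisfies(candidate: Hardware, required: Hardware) -> bool:
    """True if the candidate has at least the required resources."""
    return (candidate.nics >= required.nics
            and candidate.hbas >= required.hbas
            and candidate.cores >= required.cores
            and candidate.memory_gb >= required.memory_gb)


def surplus(candidate: Hardware, required: Hardware) -> int:
    # Crude score (an assumption): unused cores plus unused memory.
    return ((candidate.cores - required.cores)
            + (candidate.memory_gb - required.memory_gb))


def pick_target(required: Hardware,
                free_servers: Dict[str, Hardware]) -> Optional[str]:
    """Return the fitting free server with the least surplus, or None."""
    fitting = {name: hw for name, hw in free_servers.items()
               if satisfies(hw, required)}
    if not fitting:
        return None
    # Ties are broken by name so the result is deterministic.
    return min(fitting, key=lambda name: (surplus(fitting[name], required), name))


required = Hardware(nics=2, hbas=2, cores=16, memory_gb=128)
free = {
    "blade-3": Hardware(nics=2, hbas=2, cores=8, memory_gb=64),    # too small
    "blade-4": Hardware(nics=4, hbas=2, cores=32, memory_gb=256),  # fits, oversized
    "blade-5": Hardware(nics=2, hbas=2, cores=16, memory_gb=128),  # exact fit
}
target = pick_target(required, free)
```

Preferring the tightest fit keeps larger servers free for more demanding profiles; other policies, such as preferring the same chassis, would be equally plausible.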

How to achieve Service Profile failover?

Cisco UCS Manager exposes XML APIs to manage the entities inside the UCS domain. Every object, be it a chassis, blade server, adapter, NIC or Service Profile, can be modified (within the realms of possibility). The Symantec solution for Cisco UCS uses these APIs to monitor the health of Service Profiles and performs a policy-based action (failover) whenever a hardware fault is identified on the server or Service Profile.
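To give a feel for the XML API, the sketch below builds a login request and a class query and then parses fault records out of a response, using only the Python standard library. The method names (`aaaLogin`, `configResolveClass`), the class name `faultInst` and the attribute names are given as I understand the UCS Manager XML API and should be checked against Cisco's reference before use. The sample response, including the fault code `F9999`, is made up for illustration, and no network call is made.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def login_request(user: str, password: str) -> bytes:
    """Build the body of a login call; the reply carries a session cookie."""
    return ET.tostring(ET.Element("aaaLogin", inName=user, inPassword=password))


def resolve_class_request(cookie: str, class_id: str) -> bytes:
    """Build a query for all objects of one class, e.g. faults."""
    element = ET.Element("configResolveClass", cookie=cookie,
                         classId=class_id, inHierarchical="false")
    return ET.tostring(element)


def parse_faults(response_xml: str) -> List[Dict[str, str]]:
    """Extract the attributes of every faultInst element in a response."""
    root = ET.fromstring(response_xml)
    return [dict(fault.attrib) for fault in root.iter("faultInst")]


# Illustrative response only; F9999 is not a real fault code.
SAMPLE_RESPONSE = """
<configResolveClass cookie="example-cookie" response="yes" classId="faultInst">
  <outConfigs>
    <faultInst code="F9999" severity="critical"
               dn="sys/chassis-1/blade-1/fault-F9999"/>
  </outConfigs>
</configResolveClass>
"""

faults = parse_faults(SAMPLE_RESPONSE)
```

In a real client the request bodies would be sent over HTTP(S) to the UCS Manager, and the cookie from the login reply would be passed to every later call.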

Model of the Symantec HA solution inside a UCS data center

The Symantec UCSHA solution interacts with the UCS Manager using XML APIs and does the following:

Queries the list of associated Service Profiles and the list of available (free) servers.
Queries the list of faults on the attached (mapped) servers. The faults to watch for can be specified by fault code and criticality in a UCSHA configuration file.
After a fault is detected on a mapped server, UCSHA determines the optimum available server to use as the target blade for failover.
UCSHA instructs UCS Manager to migrate the Service Profile that encountered the fault to the target blade.
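The four steps above can be sketched as a single planning function. This is a hypothetical outline, not the UCSHA implementation: the data shapes, the function names and the fault code `F9999` are assumptions made for illustration, and the severity names follow the levels UCS Manager reports as I understand them. The function only produces a migration plan; issuing the actual requests to UCS Manager is left out.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Assumed ordering of fault severities, lowest to highest.
SEVERITY_RANK = {"cleared": 0, "info": 1, "condition": 2, "warning": 3,
                 "minor": 4, "major": 5, "critical": 6}


def is_actionable(fault: Dict[str, str], watched_codes: Set[str],
                  min_severity: str) -> bool:
    """True if the fault matches a configured code and is severe enough."""
    rank = SEVERITY_RANK.get(fault["severity"], -1)
    return fault["code"] in watched_codes and rank >= SEVERITY_RANK[min_severity]


def plan_failovers(
    associations: Dict[str, str],        # profile name -> server name
    faults: List[Dict[str, str]],        # each has server, code, severity
    free_servers: List[str],
    watched_codes: Set[str],
    min_severity: str,
    choose_target: Callable[[str, List[str]], Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Return (profile, faulted server, target server) for each failover."""
    faulted = {f["server"] for f in faults
               if is_actionable(f, watched_codes, min_severity)}
    remaining = list(free_servers)
    plan = []
    for profile, server in sorted(associations.items()):
        if server not in faulted:
            continue
        target = choose_target(profile, remaining)
        if target is None:
            continue  # no suitable free server; leave the profile in place
        remaining.remove(target)  # a target can take only one profile
        plan.append((profile, server, target))
    return plan


associations = {"web-01": "blade-1", "db-01": "blade-2"}
faults = [
    {"server": "blade-1", "code": "F9999", "severity": "critical"},
    {"server": "blade-2", "code": "F9999", "severity": "warning"},
]
plan = plan_failovers(associations, faults, ["blade-7", "blade-8"],
                      {"F9999"}, "major",
                      lambda profile, free: free[0] if free else None)
```

Here only `web-01` is moved, because the fault on `blade-2` is below the configured severity threshold. Passing `choose_target` in as a parameter keeps the selection policy separate from fault detection.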

Venkata_Reddy_C · ‎11-03-2014

The solution can be obtained from https://sort.symantec.com/agents.

High Availability for converged infrastructure in data centres