When evaluating a solution for high availability and disaster recovery, one should start with a few basic criteria. Here are four.
Principle #1: An HA/DR solution should be simple.
In the middle of the chaos of a disaster or outage, complexity is not your friend. That ream of paper describing every needed step may represent a thorough review and understanding of your business at a given point in time, but during an actual disaster it becomes a hindrance rather than a help.
A complex solution is likely to be error-prone. Even if the steps described are accurate and current, people under pressure tend to make mistakes. The more complex the solution, the more likely an error becomes, and the more likely that error is a severe one that compounds or prolongs the outage. Yes, all the steps are necessary, but executing the disaster plan should not amount to “read this document.” A better solution encapsulates the complexity, reducing execution of the plan to a single click on an icon.
Finally, the person at the controls when something must be done may not be your most skilled employee, yet their presence at that moment may make them your most valuable one. A disaster recovery solution should be simple enough for almost anyone to use.
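To make the "single click" idea concrete, here is a minimal sketch of a recovery plan captured as code rather than as a document. It is an illustration only: the `Step` class, the `run_failover` function, and the step names are hypothetical and do not come from any particular product. The steps run in order, and the run stops at the first failure so that an error cannot compound.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One recovery step (hypothetical structure for illustration)."""
    name: str
    action: Callable[[], None]  # performs the step; raises on failure


def run_failover(steps: List[Step]) -> List[str]:
    """Run every step in order, stopping at the first failure.

    Returns a log so the operator can see exactly where the run stopped.
    """
    log = []
    for step in steps:
        try:
            step.action()
            log.append(f"OK: {step.name}")
        except Exception as exc:
            log.append(f"FAILED: {step.name}: {exc}")
            break  # do not continue past a failed step
    return log


if __name__ == "__main__":
    # Placeholder actions; real ones would call your own tooling.
    plan = [
        Step("pause replication", lambda: None),
        Step("start standby servers", lambda: None),
        Step("redirect traffic", lambda: None),
    ]
    for line in run_failover(plan):
        print(line)
```

The operator triggers one function; the knowledge of what to do, and in what order, lives in the plan itself.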
Principle #2: An HA/DR solution should be easily testable, without disrupting the primary or secondary environments.
An untested environment is likely to fail in an actual disaster.
One reason HA/DR plans go untested is that testing them is disruptive to the business: it often requires shutting down the production environment. In addition, if data replication is used, it too may be shut off during the test, which creates a vulnerability to data loss if an actual disaster were to occur at that moment.
Another reason plans go untested is that the results may reveal configuration errors in the replicated or secondary environments. Such findings should be welcomed: they show what needs repair in the configuration, and they may expose errors in change management procedures.
Configuration drift can easily go unnoticed until it is too late. The disaster recovery document is likely to be out of date as soon as it is completed. Even if it is not, configuration changes are made as time passes, and the more complex the solution, the more likely those changes go unaccounted for in the disaster recovery plan. Careful and rigorous change management is usually the solution, but only if it is followed 100% of the time by everyone maintaining the environment, and frequently that is not the case.
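A routine, automated comparison can catch drift that change management misses. The sketch below assumes each site's configuration can be exported as a flat dictionary of settings; the `find_drift` function and the setting names are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a setting absent on one side


def find_drift(
    primary: Dict[str, Any], secondary: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return the settings whose values differ between the two sites.

    Each entry maps a setting name to (primary_value, secondary_value).
    """
    drift = {}
    for key in sorted(set(primary) | set(secondary)):
        p = primary.get(key, MISSING)
        s = secondary.get(key, MISSING)
        if p != s:
            drift[key] = (p, s)
    return drift


if __name__ == "__main__":
    # Example values are made up for the demonstration.
    primary = {"db_version": "14.2", "max_connections": 200, "tls": True}
    secondary = {"db_version": "14.1", "max_connections": 200}
    for name, (p, s) in find_drift(primary, secondary).items():
        print(f"{name}: primary={p} secondary={s}")
```

Run on a schedule, a check like this turns drift from a surprise during a disaster into a routine report.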
Principle #3: An HA/DR solution should be a standard solution across all of your IT infrastructure.
Most solutions for HA/DR work on only one operating system or virtualization environment, creating islands of HA/DR that must be implemented and managed separately. Using a separate solution for each environment not only makes the overall solution unnecessarily complex (violating Principle #1), but also gives up any ability to treat your IT infrastructure as an integrated whole. These days most multi-tiered applications span multiple platforms, whether virtual or physical. Native tools restrict visibility and control to the individual platform.
It is, of course, possible to standardize by using a single vendor. However, a homogeneous solution built on only one vendor constrains your ability to be cost-efficient and agile.
A solution that supports only one brand of virtualization, hosts, storage, or applications is likely to cost significantly more. Organizations that build fresh environments often start out with one vendor for each major component. This may improve simplicity, but it does not improve cost. Having multiple vendors puts you in a better position to negotiate price. Even if you stay with one vendor, the ability to migrate easily between vendors can provide the needed leverage.
Finally, even if an organization starts out with a homogeneous environment, it frequently becomes heterogeneous over time, particularly through acquisitions or mergers. Heterogeneity is likely to be in your future. Plan on it now.
Many vendors’ solution for high availability and disaster recovery is to have customers purchase, and then migrate to, a new platform or environment, which is an expensive and disruptive proposition. Here’s a better idea:
Principle #4: Use the infrastructure you’ve got to get the disaster recovery solution you want.
Customers have already invested a tremendous amount of capital in their IT environment, and yet to solve a fairly common problem they are asked to spend a tremendous amount more, often replacing all or most of their infrastructure. Worse, they must then go through the effort and expense of migrating to the new environment. If they could use their existing infrastructure, they could avoid the enormous additional cost of a wholesale hardware migration.
In summary, an HA/DR solution should be:
1. Simple,
2. Testable, without impacting production,