Kimberley (Level 6, Partner)
EXCERPT FROM SYMANTEC WHITE PAPER

Businesses today are spending millions of dollars to develop and maintain disaster recovery (DR) infrastructures intended to ensure business continuity. But despite such huge investments of time and resources, most IT professionals are still not completely confident in their ability to recover in an emergency. With industry analysts citing DR failure rates of at least 60 percent, there’s good reason to be concerned.

Realists that they are, most IT managers understand that the complexity and scale of today’s infrastructures, high change rates, the number of stakeholders tied to the change management process, and DR testing costs make recovery exceedingly difficult even in the best of circumstances. But the limitations of traditional DR testing are putting IT organizations at an even greater disadvantage. At a time when businesses are under more pressure than ever to ensure continuity and minimize data loss, IT organizations have no way to accurately measure whether their DR plans will actually work when they need them.

This paper explores the reasons why periodic DR testing and manual auditing are not enough to ensure DR readiness. It takes a closer look at the challenges of traditional DR testing and explains how and why most tests will miss the serious data protection gaps and recovery vulnerabilities lurking in most environments.

In addition, the paper examines how automated DR testing and monitoring, a new approach to DR management, is helping companies around the world make up for the shortcomings of traditional DR testing. These solutions provide companies with the ability to reduce the cost and operational disruptions caused by traditional testing methods while delivering a consistent, up-to-date view of the environment. Automation enables vulnerabilities to be detected and resolved immediately to ensure the highest level of DR readiness and business continuity.
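To make the continuous-checking idea concrete, here is a deliberately minimal sketch of one such automated check: comparing a production storage inventory against the replicas present at the DR site. The volume names and the notion of pulling inventories from storage-management APIs are assumptions for illustration, not details from the paper.

```python
def find_replication_gaps(production_volumes, replicated_volumes):
    """Report production volumes that have no replica at the DR site --
    the kind of gap a periodic test can miss for months."""
    return sorted(set(production_volumes) - set(replicated_volumes))

# Hypothetical inventories, in practice pulled from storage-management APIs:
production = {"db01-data", "db01-logs", "app02-config", "app03-data"}
dr_site    = {"db01-data", "app02-config", "app03-data"}

gaps = find_replication_gaps(production, dr_site)
if gaps:
    print("DR gap detected, unprotected volumes:", gaps)  # -> ['db01-logs']
```

Because a check like this is cheap, it can run daily against the live configuration rather than waiting for the next scheduled test.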



The failure of disaster recovery testing

The theory

A DR test should evaluate how well business operations can be transferred to a remote facility to get the organization back online within a specified recovery time objective (RTO) and recovery point objective (RPO).
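To make the two objectives concrete, here is a minimal sketch, with invented timestamps and targets, of how a single test run can be scored against an RTO (how quickly service is restored) and an RPO (how much committed data may be lost):

```python
from datetime import datetime, timedelta

# Hypothetical targets for one business service (not from the paper):
RTO = timedelta(hours=4)      # service must be back online within 4 hours
RPO = timedelta(minutes=15)   # no more than 15 minutes of data may be lost

def evaluate_dr_test(disaster_at, service_restored_at, last_replicated_write_at):
    """Score a single DR test run against the RTO/RPO targets."""
    downtime = service_restored_at - disaster_at          # actual recovery time
    data_loss = disaster_at - last_replicated_write_at    # actual data-loss window
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= RTO,
        "rpo_met": data_loss <= RPO,
    }

# Example: outage at 09:00, service restored at 12:30, last replicated
# write landed at the DR site at 08:50.
result = evaluate_dr_test(
    disaster_at=datetime(2010, 4, 20, 9, 0),
    service_restored_at=datetime(2010, 4, 20, 12, 30),
    last_replicated_write_at=datetime(2010, 4, 20, 8, 50),
)
print(result)  # downtime 3:30 (RTO met), data loss 0:10 (RPO met)
```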

A good DR test requires considerable advance planning, along with a sizable investment in time and resources. Large numbers of people in the IT organization need to be involved. Network and storage resource mappings must be reconfigured not just once but twice, first for the test and then again to restore normal operations. And to simulate a real disaster – which is the only way to truly determine how well the DR strategy works – mission-critical applications or the whole production environment must be taken down during the test, a step which most businesses are loath to take. When a test doesn’t work, the team must locate and fix the problems and then repeat the process.


The reality

DR tests are difficult, costly and complicated. Most companies run lean IT organizations that just don’t have the time or resources to execute full, by-the-book DR tests. Plus, simulating a disaster can be dangerous: upon completion of a test, IT professionals often hold their breath, hoping that production will be easily resumed. With such concerns and limitations, it’s no wonder the scope of DR tests is minimized. Shortcuts include:
  • Testing just a few key portions of the infrastructure, rather than testing the full DR environment. Companies may, for example, test only a few business services and postpone the rest to a future test.
  • Keeping storage/database/application management servers and/or domain/name servers or file servers online while performing the test.
  • Conducting orderly system shutdowns to protect production systems, rather than simulating the abrupt cessation of operations that would occur in a disaster.
  • Testing failover servers but not applications.
  • Testing applications but not simulating the actual load the application must bear following a full site recovery.
  • Neglecting to test for dependencies, data inconsistencies, and mapping errors that may exist between SAN devices and hosts, or any of the other errors that can cause a recovery to fail. This is important because most applications operate within a federated architecture that includes complex interrelationships between databases, applications, middleware, flat files, and so forth. To ensure successful recovery and data consistency, a DR test should confirm that all components in the federated architecture can be recovered, or restarted, to the same point in time while preserving write-order fidelity (see the sketch after this list). However, most businesses do not do this.
In the end, they have test results that are at best incomplete and at worst worthless.
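The point-in-time requirement in the last bullet can be stated precisely: every component's replica must reflect the same cut-off moment. Below is a minimal sketch, with invented component names and timestamps, of the consistency check an automated tool might perform across a federated application:

```python
from datetime import datetime, timedelta

# Hypothetical recovery-point metadata: the timestamp of the last write
# contained in each component's DR copy (names invented for illustration).
replica_points = {
    "orders-db":        "2010-04-20T09:14:55",
    "orders-app-files": "2010-04-20T09:14:55",
    "message-queue":    "2010-04-20T09:12:03",  # lags the other components
}

def recovery_point_skew(points: dict) -> timedelta:
    """How far apart the components' recovery points are. A federated
    application can only be restarted cleanly if this skew is (near) zero,
    i.e. every component rolls back to the same point in time."""
    times = [datetime.fromisoformat(t) for t in points.values()]
    return max(times) - min(times)

skew = recovery_point_skew(replica_points)
if skew > timedelta(0):
    print(f"Inconsistent recovery points (skew {skew}):", replica_points)
```

A real tool would also have to verify write-order fidelity within each replication stream; this sketch only checks that the streams agree on a common recovery point.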

READ THE COMPLETE WHITE PAPER, ATTACHED.