Do you believe your DR will work in case of a disaster?
As a Disaster Recovery and High-Availability Consultant, I've participated in several disaster recovery exercises for different customers in different industries, most of them large financial institutions, which are heavily regulated here in Brazil and, I believe, all around the world.
Regulated or not, everyone should test their DR plan from time to time to ensure it's ready and will work in case of real need, whether your DR plan for a specific system/application is based on backup/restore, local clustering, a cold backup site (data replicated to a remote site), or a sophisticated metro/geo clustering technology.
Exercising is the only way to ensure that your DR plan is likely to work when needed! Really?
DR exercises have their own associated risks, are usually very costly, involve a lot of people, and are time-consuming too. Because of that, they cannot be performed every day, week, or month by most companies. Normally, a DR exercise is scheduled annually, semiannually, or quarterly at most.
Meanwhile, lots of different things happen across the whole infrastructure, on both active and passive systems: the OS and applications are updated, configurations are adjusted, hardware is upgraded or changed, storage areas are allocated or reallocated, network and storage switches receive new or changed configurations, and so on. The safest thing to do would be to redo the DR tests after every single small change made to the environment, for all affected or likely affected systems and applications. That's unfeasible, impossible!
So that's where regular DR exercises come in: to TRY to ensure that everything that happened in between has not affected the ability to activate the DR infrastructure when needed, whether the DR is automated or not. If something doesn't work as expected during the DR exercise, it's fixed right away.
I've seen lots of different situations during complex DR exercises. I don't really remember a quarterly exercise that happened exactly as expected, where every single thing went fine (I'm not saying it never happened, I just don't remember it). Some examples:
- Restores not working because backups were being done the wrong way or the tape/media was damaged;
- Primary systems not being able to take the application or the OS offline gracefully;
- Passive nodes on local clusters unable to start the application for many different reasons;
- Replicated data missing or corrupted on the DR site, so the application didn't start;
- Customers forcing the application to start on passive nodes and corrupting data because part of the volume was missing (see the sketch after this list);
- Applications left running in the DR environment for months because they could not fail back to the primary datacenter and the problem could not be fixed during the DR exercise window;
- And other things too.
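To illustrate the missing-volume case above: the sketch below is my own simplified example (not part of any vendor tool), showing the kind of pre-failover sanity check that refuses to start an application on the passive node if any expected volume is absent. The mount points are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal pre-failover sanity check: refuse to start the application on a
passive/DR node unless every volume it depends on is actually mounted.
All mount points below are hypothetical examples."""

import os
import sys

# Mount points the application is expected to need on the DR node.
# In a real environment this list would be derived from the primary's
# configuration rather than hard-coded.
EXPECTED_MOUNTS = [
    "/oradata/prod01",
    "/oradata/prod02",
    "/oralogs/prod01",
]

def missing_mounts(expected):
    """Return the subset of expected mount points that are not mounted."""
    return [m for m in expected if not os.path.ismount(m)]

if __name__ == "__main__":
    missing = missing_mounts(EXPECTED_MOUNTS)
    if missing:
        print("ABORT: refusing to start the application, missing volumes:")
        for m in missing:
            print("  -", m)
        sys.exit(1)
    print("All expected volumes present; safe to proceed with the failover.")
```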
No matter which hardware or application vendor you are using, almost anything can go wrong during a DR exercise, causing frustration, downtime, high costs and, in the worst case, data loss.
Getting back to the statement I made in the third paragraph, "exercising is the only way to ensure that your DR plan is likely to work when needed": that is unfortunately true, but Symantec has a tool that can, unobtrusively and with no impact on the environment, automatically validate thousands of different conditions that can cause a DR strategy or a DR exercise to fail or fall short of the expected results. By remotely scanning everything from storage arrays to hosts to specific applications, automatically discovering DR-specific configurations and comparing them against a list of more than 5,000 known gaps (a list that is constantly growing), it can report on almost any condition that can lead a DR to fail. It also helps educate the different administration teams by showing, for each trouble ticket, the gap/issue found, why it is wrong or misconfigured, the impacted systems/applications and business services, where the gap sits in the structure (including a detailed graphical view), how to fix it, why to fix it, and the expected results.
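As a rough illustration of the idea only (this is not how DRA is implemented), the sketch below compares configuration snapshots discovered from a primary and a DR node and flags anything that differs. DRA does this kind of comparison automatically, at scale, against its catalog of known gaps; all values here are made up.

```python
#!/usr/bin/env python3
"""Toy illustration of configuration-gap detection between a primary and a
DR node. A tool like DRA works at a much larger scale (5,000+ gap signatures
across storage, hosts and applications); this only shows the basic idea of
diffing discovered configurations. All values are invented."""

# In practice these snapshots would be discovered remotely from each host,
# array and application; here they are hard-coded for illustration.
primary = {
    "kernel.shmmax": "68719476736",
    "lun_count": 24,
    "multipath_policy": "round-robin",
    "oracle_version": "11.2.0.4",
}
dr_site = {
    "kernel.shmmax": "17179869184",   # smaller than primary: app may not start
    "lun_count": 22,                  # two LUNs never mapped to the DR host
    "multipath_policy": "round-robin",
    "oracle_version": "11.2.0.4",
}

def find_gaps(primary_cfg, dr_cfg):
    """Return (key, primary_value, dr_value) tuples for every difference."""
    gaps = []
    for key, value in primary_cfg.items():
        if dr_cfg.get(key) != value:
            gaps.append((key, value, dr_cfg.get(key)))
    return gaps

for key, prod_val, dr_val in find_gaps(primary, dr_site):
    print(f"GAP: {key}: primary={prod_val!r} dr={dr_val!r}")
```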
Symantec Disaster Recovery Advisor (DRA) is a powerful tool that works for Symantec and non-Symantec DR infrastructures. It shifts your company's maturity from a reactive state, where you wait for issues to appear and then fix them, to a proactive state, where you are alerted to the hidden issues in the infrastructure that can impact the ability to execute DR successfully. It doesn't matter whether it is a local cluster failover, a replicated environment (by storage or by application, like Oracle DataGuard), or a complex metro/geo cluster, synchronous or asynchronous: DRA understands them all.
It's now on version 6.2, so it has a lot of market experience and maturity. The first time I saw DRA in action, and really understood the business value of the tool, was with a customer that had recently failed a DR exercise for a critical financial application. It was the second time in a row that the semiannual DR test for that specific application had failed, despite the huge investment to keep that application highly available. Gartner says that 80% of mission-critical downtime is caused by people and process, and that was exactly the case. This environment was not protected by Symantec high-availability tools, but that actually wouldn't have mattered: judging by the configuration issues found during the DR exercise, most of them environmental (storage, network, OS kernel parameters), it would have failed with Symantec tools as well. (Compared to native cluster and replication tools, VCS performs additional checks to ensure it is configured correctly and can control the application, but it doesn't go that deep into validating the infrastructure as a whole.)
The point is, the customer ran the semiannual DR exercise, found a lot of different issues, and corrected them during the DR exercise weekend. Two weeks later, a Symantec partner ran an assessment with DRA in that same environment and found dozens of hidden issues, some probably pre-existing before the DR test, and some introduced afterwards during scheduled changes to that environment and to other applications that share the same network, SAN, and storage infrastructure.
For me, that is the value of DRA for any customer, big or small, regulated or not. If there is any investment in high-availability or disaster recovery capabilities, the customer obviously expects it to work when needed. As I said before, the safest thing to do if you have a critical application is to redo the DR test for every single little change to the environment, even if the change happened in another application and you merely suspect that your critical application might somehow be affected.
But since that not only sounds impossible but actually is, DRA is a unique tool, with no competition on the market, that helps you get the most from your investments in HA and DR, from cheap and simple local clusters (even with native tools), through storage/application replication, to expensive and complex geo clusters. It will help you be successful not only in your DR tests, but also when a real disaster happens.