A colleague tipped me off to this blog post at the Netflix Tech Blog that I found both amusing and interesting. Some months back, I was talking about disaster recovery testing with a customer, and he used the phrase "chaos monkey" to describe all the things that can go wrong in the data center in the course of testing a DR plan. Little did I know that there was actually a chunk of code named Chaos Monkey written to make things go wrong on purpose.
The folks at Netflix developed Chaos Monkey to randomly disable instances of their production (yikes!) application to test the service's resilience against such failures. The engineers will occasionally turn Chaos Monkey loose in their environment to make its mischief while they monitor service availability.
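To make the idea concrete, here is a toy sketch of the same pattern: knock out instances at random and check after each one whether the service would still be up. This is not Netflix's code. The function names, the `app-N` instance names, and the `min_healthy` threshold are all my own assumptions for illustration, and the "fleet" is just a dictionary in memory rather than anything running in a real environment.

```python
import random


def disable_random_instance(instances, rng):
    """Pick one running instance at random and mark it as disabled.

    `instances` maps a (hypothetical) instance name to True if it is up.
    Returns the name of the disabled instance, or None if nothing was running.
    """
    running = [name for name, up in instances.items() if up]
    if not running:
        return None
    victim = rng.choice(running)
    instances[victim] = False
    return victim


def service_available(instances, min_healthy=1):
    """Assumed health rule: the service is up while enough instances remain."""
    return sum(instances.values()) >= min_healthy


def run_chaos_round(instances, kills, rng, min_healthy=1):
    """Disable up to `kills` instances, recording availability after each one."""
    log = []
    for _ in range(kills):
        victim = disable_random_instance(instances, rng)
        if victim is None:
            break  # nothing left to disable
        log.append((victim, service_available(instances, min_healthy)))
    return log


if __name__ == "__main__":
    # A made-up fleet of five instances that needs at least two to serve traffic.
    fleet = {f"app-{i}": True for i in range(5)}
    rng = random.Random(42)  # seeded so a run can be repeated
    for victim, ok in run_chaos_round(fleet, kills=3, rng=rng, min_healthy=2):
        print(f"disabled {victim}: service {'up' if ok else 'DOWN'}")
```

In a real environment the "disable" step would act on actual infrastructure and the availability check would come from monitoring, which is exactly the part the engineers watch while the monkey is loose.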
Since the development of Chaos Monkey, engineers at Netflix have developed other members of what they wryly call their "Simian Army", detailed in this post at their Tech Blog. It all makes very worthwhile reading for those involved in IT service disaster recovery and business continuity. The measures taken by Netflix are very impressive, indeed.
While Netflix engineers take what some might consider an extreme approach to resilience testing, too many people I speak with do little or no testing of their DR plan. When they need to respond to an actual data center-wide outage, they are left with critical business services unavailable.
Disaster recovery testing may not be as fun as a barrel full of monkeys, but it's a whole lot more fun than explaining why your services aren't running two days after a disaster.