Time Machine: The Future of IT Resilience


If you are a CIO responsible for an enterprise data center, your strategy needs to include planning for a resilient data center environment, especially with the movement to next-generation hybrid architectures. Historically, the IT community has looked at data center reliability through the lens of preventive defense, often measured through redundancy parameters such as 2N or 2N+1. However, as the definition of the data center expands beyond internally managed hardware and software to include modular platforms and cloud services, simple redundancy calculations become only one factor in defining resilience. Organizations must also continually monitor ongoing changes to legislation, relevant security standards, and other regulations. Such requirements are generally established to ensure the resilience of the organization's information assets, or of information assets it holds on behalf of others in the course of its business. Compliance requirements may also be industry- and/or location-specific, with key sectors such as banking and finance, telecommunications, and utilities subject to their own regulations.
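As a rough illustration of why redundancy figures alone tell only part of the story, here is a minimal sketch that compares the theoretical availability of N, 2N, and 2N+1-style configurations. It assumes a simplified model of independent failures and a hypothetical per-path availability figure, not any particular vendor's methodology.

```python
# Minimal sketch: theoretical availability of parallel-redundant delivery paths.
# Assumes independent failures and a single, hypothetical per-path availability
# figure -- a deliberately simplified model, not a full reliability study.

def parallel_availability(component_availability: float, paths: int) -> float:
    """Availability of `paths` independent, fully redundant paths."""
    return 1.0 - (1.0 - component_availability) ** paths

a = 0.999  # assumed availability of a single path (hypothetical figure)

for label, paths in [
    ("N (no redundancy)", 1),
    ("2N", 2),
    ("2N+1, approximated here as three paths", 3),
]:
    print(f"{label}: {parallel_availability(a, paths):.7f}")

# Note: these numbers say nothing about common-mode failures, software faults,
# or dependencies on external cloud services -- which is precisely why
# redundancy math is only one factor in overall resilience.
```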

A number of new practices need to be implemented to ensure the resilience of key IT systems and business services. These include fully automated, continuous monitoring of critical services, as well as designing systems with persistent storage that can withstand disruptions to entire systems. IT organizations should also look to augment their portfolios with new tools that can boost resilience, and adopt novel best practices for testing how hardened their systems are. For example, Netflix has earned recognition for its novel use of resilience tools to test the company's ability to survive failures and operating abnormalities. The company's Simian Army, a set of services ("monkeys"), unleashes failures on Netflix's systems to test how adaptive the environment actually is. The data center community needs to challenge itself to find similar means of testing adaptability in modern hybrid architectures if it is to rise to the challenge of ultra-reliability.
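To make the idea concrete, here is a toy sketch of chaos-style failure injection in the spirit of the Simian Army. It is not Netflix's implementation; the instance names, kill hook, and health check below are hypothetical placeholders.

```python
# Toy illustration of chaos-style failure injection: pick a victim at random,
# inject a failure, then verify the environment heals itself. Everything named
# here is a placeholder for your own orchestration and monitoring hooks.

import random
import time

SERVICE_INSTANCES = ["web-01", "web-02", "api-01", "api-02", "cache-01"]  # hypothetical

def kill_instance(name: str) -> None:
    # Placeholder: a real setup would call your orchestrator or cloud API,
    # ideally in a non-production or blast-radius-limited environment.
    print(f"Injecting failure: terminating {name}")

def system_healthy() -> bool:
    # Placeholder health check: synthetic transactions, load balancer status, etc.
    return True

def chaos_round() -> None:
    victim = random.choice(SERVICE_INSTANCES)
    kill_instance(victim)
    time.sleep(5)  # allow failover or auto-healing to react
    if system_healthy():
        print(f"Survived loss of {victim}")
    else:
        print(f"Resilience gap exposed by losing {victim}")

if __name__ == "__main__":
    chaos_round()
```

Run on a regular cadence, this kind of exercise turns "we think we can survive a failure" into evidence gathered under controlled conditions.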

With data center infrastructure management (DCIM) tools becoming more sophisticated, and the need for integrations across multiple ecosystems increasing, the availability of robust data to inform ongoing decision-making in the data center is a must-have for IT resilience. Resilient data center architecture is no longer just about the building and infrastructure. It is about designing the right architecture and including tools that can help reduce the probability of failure and ensure better predictability of system availability.
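As one small example of what robust data informing decision-making can look like, the sketch below turns made-up incident records, the sort of data a DCIM or monitoring integration might supply, into a crude MTBF/MTTR-based availability estimate. The figures and the data shape are assumptions for illustration only.

```python
# Minimal sketch: turning monitoring/DCIM-style incident data into a crude
# availability estimate via the classic MTBF / (MTBF + MTTR) relationship.
# The incident records below are invented; a real integration would pull them
# from the monitoring or DCIM platform's own interfaces.

incidents = [
    # (hours_between_failures, hours_to_repair) -- hypothetical observations
    (720.0, 1.5),
    (1100.0, 0.5),
    (950.0, 3.0),
]

mtbf = sum(up for up, _ in incidents) / len(incidents)
mttr = sum(down for _, down in incidents) / len(incidents)
availability = mtbf / (mtbf + mttr)

print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.1f} h, "
      f"estimated availability: {availability:.5f}")
```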

1 Comment

Availability becomes a security vulnerability when it can be broken, as in DoS attacks; thus, we need to think bigger than just servers and storage. New ways of testing certainly provide great validation and threat probing, but I would say that we in IT must also think differently than we always have. For instance, the common methodology for Disaster Recovery testing has been to intensively test various points in the infrastructure periodically (annually being most prevalent), with those infrastructure components chosen well in advance. The reality is that this leads to months of preparation and a test validated for success. The methodology is not flawed, but the execution is: announcing what to test far in advance is the flaw. How much advance warning do we get for a tornado, hurricane, earthquake, power outage, gas explosion, etc.? Do not misunderstand my point. The hours of preparation are needed and prudent as part of the daily operational management of the infrastructure. It is simply that pre-announcement skews the results of the test at the most basic level. Secondly, the business-as-usual (BAU) approach to DR testing uses the best and most knowledgeable resources to perform the test. Again, this is often not the case in a real disaster; situations arise where the best resources are unavailable for family or personal injury reasons. The most flawed component of the BAU DR testing approach, however, is the 'engineered' success. While every organization certainly needs to validate that its processes are current and proper, it cannot do so in a manner that masks the failures. Those failures are where an organization learns of poor processes, broken resource trees, and knowledge gaps before having to face them during an actual event.
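To make the point concrete, purely as a sketch with hypothetical component names, the drill target and timing could be chosen at short notice rather than months in advance, for example:

```python
# Toy sketch of an unannounced DR drill: choose what to fail over and when,
# with only a few hours of warning. Component names and notice windows are
# hypothetical; the actual drill procedure is whatever your runbooks define.

import random
from datetime import datetime, timedelta

COMPONENTS = ["primary-db", "core-switch-A", "storage-array-2", "dns-cluster", "payment-api"]

def schedule_surprise_drill(max_notice_hours: int = 4) -> dict:
    """Pick a DR test target and start time with minimal advance notice."""
    target = random.choice(COMPONENTS)
    start = datetime.now() + timedelta(hours=random.randint(1, max_notice_hours))
    return {"target": target, "start": start.isoformat(timespec="minutes")}

drill = schedule_surprise_drill()
print(f"DR drill: fail over {drill['target']} at {drill['start']}")
# The drill should then be run by whoever is on call at that moment, not the
# hand-picked experts, so the test reflects realistic staffing during a disaster.
```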

I advocate that the most novel approach to testing is to change our thinking and execute in a way that lets us experience success while pushing the failures out into the open. The ability to do this testing without production interruption on a daily, weekly, monthly, or other regular cadence is paramount to this new way of thinking about availability testing. It also extends down to the single points of failure that are no longer wholly owned by a single organization in this age of cloud, IoT, and multi-point information exchange.