Resiliency Theater – You May Not Really Be Prepared for an Outage

Richard Long

We spend time preparing for major data center or facility outages. We perform a risk analysis and write plans; we put in technologies to keep the business running and perform various tests. We report that we are ready. We feel confident our business can continue to run. But much of that could be considered what I call “Resiliency Theater,” because those activities neither prevent nor address the most common and most probable events that may impact the organization.

Two very recent events demonstrate the concept of Resiliency Theater quite effectively.

  1. United Airlines issued a ground stop on Saturday due to an IT issue that impacted a communications system used to transmit information needed to take off. According to a CNN report: “Everything has a redundancy, but it’s slow,” said one of the sources. “Putting in flight plans by hand. Not having times automatically recorded or sending weights and such. The ground stop is a way to be incredibly cautious.” On paper, it looked like United was prepared with redundancy and workarounds, but the reality was something different.
  2. According to a USA Today report, a former employee of a for-profit university is alleged to have changed a password on a Google email account used for storing course work. The account had a single administrator listed – the former employee – who refused to provide the password. Because of this, Google could not verify the validity of the access request and access was unavailable for a significant amount of time. This single point of failure for an important email system could have been easily avoided.
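The single-administrator gap in the second example is the kind of operational risk that a simple inventory check can surface. Below is a minimal Python sketch, assuming you already keep a list of shared systems and who administers each one. The `Account` class, the `single_admin_accounts` function, the two-administrator minimum, and every name in the sample inventory are hypothetical and for illustration only; none of it comes from the reports cited above.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """A shared system and the people who can administer it (hypothetical model)."""
    name: str
    admins: list = field(default_factory=list)


def single_admin_accounts(accounts, minimum_admins=2):
    """Return names of accounts with fewer distinct administrators than the minimum."""
    return [a.name for a in accounts if len(set(a.admins)) < minimum_admins]


# Hypothetical inventory; the systems and people are made up for this example.
inventory = [
    Account("coursework-mail", ["former.employee"]),
    Account("payroll", ["alice", "bob"]),
    Account("door-badges", []),
]

flagged = single_admin_accounts(inventory)
print(flagged)
```

Anything the check flags is a candidate entry for the risk assessment, whether or not the operational team that owns the system treats it as a business continuity concern.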

These types of outages can be defined as operational risks rather than business continuity risks. In much of our work with clients, we encourage planning and preparing for critical localized outages in addition to planning and preparing for full facility or data center outages. Many outages, whether local or full facility, are self-inflicted (as the two examples above demonstrate). Often these issues are not incorporated in the business continuity area as they are owned and managed by the operational teams.

Here are several ideas on how to incorporate what may be considered “non-business continuity” resiliency issues into your program.

  1. Document operational risks and gaps as part of the risk assessment.
  2. Determine the impact of these operational risks to overall business continuity or recovery capability.
  3. As recovery strategies are implemented, include discussions on how the high-risk or high-impact operational issues could be mitigated using BC-related solutions.
  4. Include operational issues or localized outages as part of your mock disaster or tabletop exercises.
  5. Add a section to your documentation or reporting that contains a copy of, or a reference to, information on overall availability and potential outages.

We spend resources preparing for disaster-type outages and feel prepared, but then a major localized outage occurs that has significant business impact. It may be time to expand our thinking about the full spectrum of what business continuity entails, and to look for resiliency theater in our own programs. Any outage should be thought of as a business continuity issue. The two examples above were significant business outages in which a core function of the organization stopped. Those were disasters, even if they were not caused by Mother Nature, fire, or another “typical” BC event. Would your BC organization have been part of the solution or resolution? If the answer is no, a better question may be: why not?
