The fascinating new discipline known as Chaos Engineering (CE) has the potential to bring big changes to business continuity.
A couple of weeks ago, I wrote a post called “What Is Chaos Engineering and Why Should I Care?” that gave a brief introduction to CE. I also mentioned that Chaos Engineering has the potential to shake up the fields of both Business Continuity (BC) and Information Technology/Disaster Recovery (IT/DR).
In today’s post, I want to give a quick refresher on the basics of CE. Then I’ll sketch out some of the ways Chaos Engineering might one day be used by organizations like yours to strengthen their ability to recover from disruptions to their business processes. Such improvements would make the company safer and stronger overall.
CHAOS ENGINEERING IN A NUTSHELL
Chaos Engineering originated at Netflix in 2011 and has since spread to other tech companies such as Google and Amazon. It now looks poised to make an impact at non-technology firms.
The core idea is that as part of your testing strategy, you should deliberately cause problems in the production environment (rather than in a test environment), because this is the best method of determining and improving resiliency. This method lets you see how the system deals with the disruption, identify vulnerabilities and fix them. The ultimate goal is to minimize the impact of disruptions on the end-user experience.
According to Principles of Chaos, the Chaos Engineering community’s home website, the “advanced principles” for doing Chaos Engineering include: build a hypothesis around steady-state behavior, run experiments in production, automate experiments to run continuously, and minimize the blast radius (that is, make sure the fallout from experiments is minimized and contained).
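For readers who like to see an idea in code, here is a minimal Python sketch of how those principles could fit together in a single experiment. It is an illustration under stated assumptions, not how Netflix or any particular tool does it: the function run_experiment, the FakeOrderDesk class, the success-rate metric, and the thresholds are all hypothetical names and numbers invented for this example. Scheduling the experiment to run continuously is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis_held: bool      # did the metric stay at or above the steady-state floor?
    aborted: bool              # was the experiment skipped or stopped early?
    observations: List[float]  # baseline reading followed by readings under disruption


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    steady_min: float,
    abort_below: float,
    samples: int = 5,
) -> ExperimentResult:
    """Run one experiment against a single steady-state metric (hypothetical helper)."""
    # Steady-state hypothesis: the metric stays at or above steady_min.
    baseline = measure()
    if baseline < steady_min:
        # The system is already unhealthy, so do not add to the trouble.
        return ExperimentResult(False, True, [baseline])

    observations = [baseline]
    aborted = False
    inject()  # introduce the disruption
    try:
        for _ in range(samples):
            value = measure()
            observations.append(value)
            # Blast-radius guard: stop as soon as the damage gets too large.
            if value < abort_below:
                aborted = True
                break
    finally:
        rollback()  # always undo the disruption, even if measuring fails

    held = not aborted and all(v >= steady_min for v in observations[1:])
    return ExperimentResult(held, aborted, observations)


class FakeOrderDesk:
    """Stand-in for a real business process (assumption, for illustration only)."""

    def __init__(self) -> None:
        self.phones_up = True

    def success_rate(self) -> float:
        # Made-up numbers: orders succeed less often when the phones are down.
        return 0.99 if self.phones_up else 0.93

    def cut_phones(self) -> None:
        self.phones_up = False

    def restore_phones(self) -> None:
        self.phones_up = True


desk = FakeOrderDesk()
result = run_experiment(
    measure=desk.success_rate,
    inject=desk.cut_phones,
    rollback=desk.restore_phones,
    steady_min=0.95,
    abort_below=0.80,
)
print(result)
```

The two thresholds do different jobs. steady_min encodes the hypothesis you are testing, while abort_below is the safety limit that ends the experiment before customers feel too much pain. In this made-up run the hypothesis does not hold, which is exactly the kind of finding an experiment is meant to surface.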
PIE IN THE SKY?
It’s a little scary to think about intentionally throwing wrenches into your production environment. But Chaos Engineering recognizes an important fact: unless you test in production, you never really know how resilient you are.
By being highly cautious in their testing, organizations can leave a lot of room to deceive themselves about how recoverable their systems are.
In my experience, many companies test themselves with the unstated goal of not making the test too hard, so that they get good results and everyone feels good about how things are going. Real problems, if (or rather when) they occur, are not likely to be that considerate.
To many BC managers and corporate executives today, Chaos Engineering might seem pie-in-the-sky. But in my opinion, it’s the most realistic and hard-headed approach of all.
CHAOS ENGINEERING AND BCM
At MHA Consulting, we write our recovery plans in an event-neutral fashion with four key scenarios in mind:
- Loss of Facility/Region
- Loss of Technology
- Loss of Resources
- Loss of Critical Third-Party Vendors
These scenarios suggest areas where an organization could productively insert a little chaos into its real-time production efforts.
What if, at your organization, your BCM team sprang the following scenarios on your colleagues without warning, in the middle of their normal activities:
- Unexpectedly closing off the space where a department resides, requiring immediate relocation.
- Removing critical business unit leaders during peak production times.
- Directing 40 percent of staff not to come in for a given day or shift, or to only work from home.
- Only using recovery team alternates to manage a response and recovery, while having the primary people sit on the bench.
- Shutting off the phone system unexpectedly.
- Shutting down one or more critical systems.
- Introducing a systems disruption that requires data synchronization.
- Running the department or entire building using the alternate worksite strategy for a couple of days.
- Requiring the business unit to use manual workaround processes for a shift.
- Requiring the use of an alternate supplier.
How would your organization handle having these various wrenches thrown into the works? Would the company shrug it off? Muddle through? Would business come to a screeching halt?
You would certainly learn a lot about your organization’s resiliency by running the above “chaos experiments.” And unlike in a conventional, test-environment exercise, the insights you would gain would be tough, honest, and authentic.
The experiments would teach you a lot about where your vulnerabilities are, as well as your strengths. This information would be an invaluable guide to improving your response for the next time, when the problem might be caused not by a CE experiment but by a real emergency.
Would running such experiments in production cause problems? Would orders be delayed or messed up? Would customer inquiries go unanswered for a period of time? It’s possible. But there are ways to mitigate this kind of problem. And it’s one of the principles of chaos engineering that by putting yourself (and your customers) through modest inconveniences now, you can reduce the chances of experiencing major problems later, during a real event.
THE FUTURE OF BUSINESS RESILIENCE
Obviously, you would need to ask management for permission to purposely insert chaos into a healthy production environment. I suspect that for the time being, most managers will say, No, thanks.
But trends from Silicon Valley have a way of spreading throughout the country. I think that eventually, mainstream organizations will be taking a closer look at chaos engineering as a way of validating their recovery plans. In my opinion, chaos engineering is the future of business resilience. It’s the one true way of finding out if you can recover for real.
FOR FURTHER READING
For more information on Chaos Engineering and other hot topics in BC and IT/DR, check out these recent posts from BCMMETRICS: