Today’s post is the third in our three-part series on how chaos engineering might one day help us get better at business continuity and disaster recovery.
In the first post, “What Is Chaos Engineering and Why Should I Care?,” I gave a general introduction to CE. In the second, “Chaos Engineering and Business Continuity,” I talked about how CE could help companies test and strengthen the resiliency of their business processes.
To wrap things up, I’m going to talk about ways CE could potentially help organizations strengthen their Information Technology/Disaster Recovery (IT/DR) environments.
Before we get started discussing chaos engineering and disaster recovery, I’ll give a thumbnail sketch of CE, for the benefit of those who are just joining us.
THUMBNAIL SKETCH OF CHAOS ENGINEERING
Chaos engineering is a cutting-edge approach to system resiliency that originated at Netflix in 2011. It has been embraced by other tech firms such as Google and Amazon.
Chaos engineering is based on the idea that the best way to learn the strengths and weaknesses of a complex system is to go into the production environment and intentionally break things. Then you can see how the system deals with disruption and you can identify and fix your vulnerabilities.
The thinking behind the movement is set forth on a website called Principles of Chaos.
The goal of chaos engineering is to highlight systems’ vulnerabilities by performing experiments on them. By identifying hidden problems before they cause an outage, you can address systemic weaknesses and make production systems resilient and fault-tolerant.
The idea of intentionally throwing wrenches in the production environment is enough to give most managers ulcers. But the fact is, unless you test in the production environment, you never really know how recoverable your systems are.
In my experience, many organizations are so cautious in their testing, they never get a true picture of how resilient their environments are. Sometimes, instead of performing testing to find where our systems are weak, we do it to fool ourselves into thinking they are strong.
At the moment, most BC managers and corporate executives at mainstream companies are likely to consider chaos engineering to be out-and-out wacky. (“Why should I throw a grenade in a functioning production environment that our customers depend on?”) Personally, I think CE is the most hard-headed and practical approach of all. It’s the only way to know if you can really recover.
CHAOS ENGINEERING MEETS IT/DISASTER RECOVERY
You know that you have single points of failure and other vulnerabilities in your IT systems. If you’re like many people I’ve worked with, worrying about them can even keep you up at night. So why wait until there’s a problem to find where your weaknesses are? Why not shake your systems up right now, intentionally, to find the weak spots before they fail for real? This is what chaos engineering does.
By intentionally triggering failures in a controlled manner, you can gain confidence that your production systems will withstand unplanned disruptions before those disruptions actually occur.
CHAOS ENGINEERING VS. REGULAR TESTING
Chaos engineering differs from regular testing in several ways. Regular testing is done during build/compile activities, so it doesn’t test for different configurations, unexpected behaviors, or factors beyond your control. Routine testing also doesn’t account for people: it does nothing to train and prepare them for the failures they will be responsible for fixing live, in the middle of the night.
Chaos engineering is a different kettle of fish. It’s about breaking your production systems. How could this be done in the IT environment? Here are a few ideas:
- Terminate the primary network connections. This could help you see if your backup network communications take over seamlessly with no system impacts.
- Reboot or halt the host operating system(s). This would allow you to test such things as how your system reacts when losing one or more cluster machines.
- Change the host’s system time. This could be done to test your system’s capability to adjust to daylight saving time and other time-related events.
- Break data replication. This could be used to validate that you receive notification of failures as well as that you can resynchronize data for critical applications.
- Shut down one or more processes. This could be used to simulate application or dependency failures.
- Fail over systems to their backup site. This could be used to validate the high-availability configuration and ensure that your system(s) can fail over seamlessly and continue operations.
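To make the ideas above a little more concrete, here is a minimal Python sketch of how one such experiment might be structured: check that the system is healthy, inject a failure, check again, and always roll back. It is an illustration only, not a real chaos tool. The names SimulatedCluster, Experiment, run_experiment, and halt_node are hypothetical, and the "cluster" is a toy in-memory stand-in so that nothing real gets broken.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SimulatedCluster:
    """Toy stand-in for a production cluster: node name -> is the node up?"""
    nodes: Dict[str, bool] = field(
        default_factory=lambda: {"primary": True, "backup": True}
    )

    def serving(self) -> bool:
        # Assumption: the service is healthy while at least one node is up.
        return any(self.nodes.values())


@dataclass
class Experiment:
    """One chaos experiment (hypothetical structure, for illustration)."""
    name: str
    steady_state: Callable[[], bool]  # returns True when the system is healthy
    inject: Callable[[], None]        # introduces the failure
    rollback: Callable[[], None]      # undoes the failure


def run_experiment(exp: Experiment) -> str:
    # Never start an experiment on a system that is already unhealthy.
    if not exp.steady_state():
        return "aborted"
    exp.inject()
    try:
        survived = exp.steady_state()
    finally:
        # Always restore the system, even if the health check raises.
        exp.rollback()
    return "passed" if survived else "failed"


def halt_node(cluster: SimulatedCluster, node: str) -> Experiment:
    """Build an experiment that simulates halting a single cluster machine."""
    return Experiment(
        name=f"halt {node}",
        steady_state=cluster.serving,
        inject=lambda: cluster.nodes.__setitem__(node, False),
        rollback=lambda: cluster.nodes.__setitem__(node, True),
    )


if __name__ == "__main__":
    redundant = SimulatedCluster()
    print(run_experiment(halt_node(redundant, "primary")))

    single = SimulatedCluster(nodes={"primary": True})
    print(run_experiment(halt_node(single, "primary")))
```

In this toy model, halting the primary node of the two-node cluster leaves the backup serving, so the experiment passes. The same experiment against the one-node cluster fails, which is exactly the kind of single point of failure you want to find on your own schedule. In a real environment, the inject and rollback steps would call whatever tooling your organization uses, and the health check would look at real monitoring data.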
As DR planners, we typically exercise in highly controlled environments with the scenario of catastrophic loss of the entire computing environment. In real life, we are more likely to face a component-based failure than the loss of the entire data center.
By combining chaos engineering and disaster recovery to find the weak links in your production environments, you take the essential first step toward building a resilient, fault-tolerant environment.
For the time being, few companies are likely to let their IT/DR teams go tramping around in the production environment breaking things. But I believe that mainstream organizations will eventually take a closer look at chaos engineering and disaster recovery.
There is no better way to identify the flaws in an IT environment so that your team can get them fixed—before they break your system at 3:00 in the morning when no one is looking and the health of your business is on the line.