The Benefits of Stressing Out: Why You Should Stress Test Your Recovery Plans

In everyday life, stress is usually regarded in a negative light, but in business continuity management, stress testing your recovery plans can play a very positive role in improving an organization’s resilience. Only by conducting realistic stress tests as part of a testing strategy can companies make sure their recovery plans are functional—and learn how to fix them if they are not. 

 

Optimistic Assumptions and Unrealistic Tests

Most planners do a good job of writing recovery plans; however, two things commonly hold their plans back from being as good as they should be. 

One is that their plans are often sprinkled with overly optimistic assumptions about how quickly and easily recovery tasks can be accomplished.  

The other is that when companies test their plans, they commonly take so many special precautions to avoid impacting production that the tests are not realistic. (For more on this tendency, see MHA Consulting CEO Michael Herrera’s post “Overdoing It: People Who Overplan Their Mock Disaster Exercises.”) Sometimes organizations narrow the scope of their tests because of time or resource constraints, which has a similar effect.

Today’s post will explain why stress testing is valuable and share some ways it can be done effectively without risking an impact on production.

The Value of Stress Testing

The only way to make sure your recovery plans are capable of seeing you through a real-life event is by subjecting them to an equivalent level of stress ahead of time.  

It’s one thing to have a plan and another to know from testing that it actually works.  

No sensible person would assemble a baby crib and put an infant in it without first shaking, testing, and pressing on the various parts to make sure the crib is strong enough to safely support the child. No one should entrust their organization to a recovery plan without performing a similar test of its capabilities. 

Common Problems of Untested Recovery Plans

Most recovery plans are premised on the feasibility of certain workarounds and time estimates. Are these assumptions realistic? The only way to know for sure is to try the plans out under realistic conditions and see. Our experience at MHA Consulting is that many of the assumptions planners make are overly optimistic. 

Among the most common problems we see are:  

  • Critical portions of a workaround fail due to insufficient capability 
  • A planned workaround is impossible because a piece is missing 
  • Actions take much longer than anticipated 
  • People lack access to the resources needed to complete prescribed tasks 
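
One way to surface the "actions take much longer than anticipated" problem is to record how long each task actually takes during a test and compare it with the estimate in the plan. The following is a minimal sketch of that comparison, not a prescribed method. The task names, durations, and the 25 percent tolerance are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    planned_minutes: float  # estimate written in the recovery plan
    actual_minutes: float   # duration observed during the stress test


def find_overruns(results, tolerance=0.25):
    """Return (task name, overrun ratio) for tasks that ran over plan.

    A task is flagged when its observed time exceeds the planned time
    by more than `tolerance` (0.25 means 25 percent over the estimate).
    """
    overruns = []
    for r in results:
        ratio = (r.actual_minutes - r.planned_minutes) / r.planned_minutes
        if ratio > tolerance:
            overruns.append((r.name, round(ratio, 2)))
    return overruns


# Hypothetical results from a single test; all values are illustrative.
results = [
    TaskResult("Notify crisis team", 15, 18),
    TaskResult("Restore order database", 60, 150),
    TaskResult("Switch to manual invoicing", 30, 75),
]

overruns = find_overruns(results)
for name, ratio in overruns:
    print(f"{name}: {ratio:.0%} over the planned time")
```

Tasks flagged this way are candidates for a revised estimate or a different workaround.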

Conducting Stress Tests and Avoiding Production Impacts

No planner can be blamed for wanting to make sure their tests do not impact production. The best response to this concern is not to forgo realistic testing but to find ways to test realistically that do not pose a risk to production. It is possible.

Many recovery plan tasks are completely separate from production systems and can be rigorously tested with zero chance of impacting those systems. Examples include data gathering, analysis, running reports, and communications.  

For instance, you can have someone on the communications team draft a response to a mock event in real time and get it approved without having the slightest impact on production. In carrying out these activities, you might find that they are more difficult and time-consuming than you originally thought.

A Graduated Approach to Stress Testing

Even with IT/DR, it is possible to stress your plan without impacting operations. It takes careful planning and some specialized knowledge, but it can be done. 

The best way to proceed is with a graduated approach. Start by testing portions of individual components, then move on to entire components. Pick times when the activity level is low. Only gradually should you work up to a full production failover.

Organizations that are just starting down the BCM path should not be overambitious in their IT/DR testing. But with a couple of years of experience, and with an appropriate testing strategy and road map in place, every organization should be able to work up to a full failover.

Tabletop exercises can simulate a full recovery, confirm that the plan makes sense logically, and identify gaps in the recovery. After conducting a number of tabletops, move on to testing parts of the system.

As an example of a graduated approach to testing a manual workaround for a business continuity process, you might have a portion of a department do the workaround for half a day. Afterward, you can assess whether the workaround functions as the plan anticipates. Then gradually increase the number of people performing the workaround or the number of departments that participate.

As an IT/DR example, you might find a time that’s minimally impactful and fail over a single app or a small set of apps, then fail them back.

For IT, the ultimate goal should be to recover the whole environment. It might not be possible to do this in a single test, but every part of the system should be scheduled for a recovery test. All the parts of the system are there for a reason, so all of them should be recoverable.
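
The graduated approach described in this section can be thought of as an ordered list of test stages, where you attempt a stage only after every earlier stage has passed. The sketch below illustrates that idea under stated assumptions: the stage names and their ordering are one possible reading of the steps above, and `next_stage` is a hypothetical helper, not part of any BCM tool.

```python
# Assumed ordering of test stages, from least to most ambitious.
STAGES = [
    "tabletop exercise",
    "partial component test",
    "single-application failover and failback",
    "multi-application failover and failback",
    "full production failover",
]


def next_stage(passed):
    """Return the earliest stage not yet passed, or None if all have passed.

    Because stages are checked in order, a later stage that was somehow
    passed early does not let you skip an earlier one.
    """
    for stage in STAGES:
        if stage not in passed:
            return stage
    return None


# Example: an organization that has completed only its tabletop exercises.
upcoming = next_stage({"tabletop exercise"})
print(f"Next test to schedule: {upcoming}")
```

Keeping the list explicit makes it easy to place each stage on the testing road map and to see which parts of the environment have not yet been exercised.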

Leveraging Stress Tests to Improve Recovery Plans

Stress testing is not an end in itself. Neither is identifying invalid assumptions in recovery plans. The point of conducting stress tests is to identify gaps that can then be closed by adjusting recovery plans, changing workarounds, or similar measures.  

The goal is to have recovery plans that are in tune with reality and that have been proven to function as advertised under real-world conditions. Realistic stress tests are a means to achieving this important end.

The Benefits of Stressing Out

No one likes stress in everyday life, but on the job, the smart BCM professional knows that it pays to stress out their recovery plans.  

Many recovery plans are weakened by overly optimistic assumptions and unrealistic tests. Conducting realistic stress tests is the best way to see if our assumptions are valid and the recovery plans work as they are supposed to. With careful planning, it is possible to conduct realistic stress tests without impacting production.  

The final step, if any gaps are revealed, is to adjust our plans to close them. This will increase the plans’ functionality, enhancing the resilience of the organization and better protecting its stakeholders. 

About
Richard Long
Richard Long is one of MHA’s practice team leaders for Technology and Disaster Recovery related engagements. He has been responsible for the successful execution of MHA business continuity and disaster recovery engagements in industries such as Energy & Utilities, Government Services, Healthcare, Insurance, Risk Management, Travel & Entertainment, Consumer Products, and Education. Prior to joining MHA, Richard held Senior IT Director positions at PetSmart (NASDAQ: PETM) and Avnet, Inc. (NYSE: AVT) and has been a senior leader across all disciplines of IT. He has successfully led international and domestic disaster recovery, technology assessment, crisis management and risk mitigation engagements.