Disaster Recovery Strategy Execution, or Will It Really Work?

Richard Long

Do you know how to actually execute a recovery using your defined disaster recovery strategy, or will your team have to figure it out? We’ve discussed developing a disaster recovery strategy at length, but what happens when it’s time to execute your strategy?

In his poem “To a Mouse,” Robert Burns provides a well-known and insightful thought: “the best-laid plans of mice and men often go awry.” We’ve seen how true this can be when we must perform an actual recovery that doesn’t go as smoothly as we might have hoped, even with all of our planning and document development.

Here are some ideas on providing training and validation of the execution of your DR strategy and plans.

  1. Exercise, Exercise, Exercise. We have discussed this in previous blogs, but there is no better way to provide training and validation than exercises.
    • Tabletop: Run through the steps and actions – including communication. This will verify basic dependencies and order of execution. A good basic validation.
    • Technology Tests: Testing and validation for the individual technologies. These can be done multiple times, and can be limited as needed. Do the technologies function as anticipated? For example, you may test storage replication, backup, and restoration; or virtual server failover using a single server or application. A technology test is a good validation that the steps that you would use in multiple environments are correct and functional.
    • DR Exercises: Multiple application recovery using most or all of the recovery technologies, simulating an actual outage. These types of exercises help to provide training, as well as confidence that the disaster recovery strategy will provide a functional recovery. They also surface issues that were not anticipated or known during the planning and implementation of your strategy. You should perform these tests annually. The scope should always include the most critical applications. Include less critical applications on a rotating basis so that each one is eventually recovered in an exercise, and so that interfaces and dependencies are fully tested.
    • Non-IT Personnel Testing: If you are not including non-IT personnel in testing, you are missing a significant resource. Non-IT personnel can help identify issues and gaps in execution or strategy. They will not know or care about assumptions made by IT; they will use the environments as they normally do. Because more detailed testing often occurs during this type of exercise, it can also identify missing components in the disaster recovery strategy.
    • Performance Testing: Recover the environment during testing even if the scope does not include all applications. This provides understanding of actual timing, resource constraints (both human and technology), and missing components.
    • Testing Third-Party and Hosted Environments: In DR testing, consider failing over the hosted environments and ensure you can still function with non-hosted environments. Include the hosted environments in the scope of your DR exercises. It can be more complex to ensure there are no production impacts during testing, but it is important to validate the assumption that no changes are needed. More importantly, this testing ensures you have identified configuration requirements. Good examples are license keys that are tied to hardware or to encryption.
  2. What about a real event?
    • Take it slow. The beginning of the recovery can be chaotic. People often just start performing tasks without validating dependencies or understanding the state of the environment.
    • Be very disciplined in your actions. Monitor the people performing tasks. Ensure they are using the documentation, and communicating any issues encountered before remediating or changing anything. Your Change Management during an event is critical. You must be able to understand what has been done in case issues arise after a turnover for production use.
    • Watch the clock. When issues arise, it is easy to allow significant time to elapse during troubleshooting. “5 more minutes” usually means at least 30 minutes. Identify time milestones in advance so that you know when to obtain additional assistance. Often our people want to solve the problem themselves, when getting help from vendors or others can decrease the time to resolution.
    • Even if it is not documented, perform validation before turnover to the next team. A few minutes of validating the state of the environment before turning it over can save significant time in troubleshooting or delays.
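
One way to make the "watch the clock" advice concrete is to write the time milestones down before an event, so nobody has to decide mid-incident when to ask for help. The following is a minimal sketch in Python, not a prescribed tool. The thresholds, the action wording, and the names Milestone, MILESTONES, and due_actions are all hypothetical and should be replaced with values from your own plan and vendor agreements.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Milestone:
    after: timedelta  # elapsed troubleshooting time that triggers this step
    action: str       # what the recovery lead should do at that point


# Hypothetical thresholds (assumptions for illustration only).
MILESTONES = [
    Milestone(timedelta(minutes=15), "Report status to the recovery lead"),
    Milestone(timedelta(minutes=30), "Bring in a second engineer"),
    Milestone(timedelta(minutes=60), "Open a vendor support case"),
]


def due_actions(elapsed, milestones=MILESTONES):
    """Return every action whose time milestone has already passed."""
    ordered = sorted(milestones, key=lambda m: m.after)
    return [m.action for m in ordered if elapsed >= m.after]


if __name__ == "__main__":
    # Example: an issue has been under investigation for 35 minutes.
    for action in due_actions(timedelta(minutes=35)):
        print(action)
```

Checking this list at each status update turns "5 more minutes" into a visible decision: either the issue is resolved, or the next milestone's action is taken.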

Until an event actually happens, you cannot be 100% sure your disaster recovery strategy will work. But by rigorous testing and disciplined actions, you increase your likelihood of success and efficient performance. Remember, “Hope is not a strategy.”
