You Still Need to Drill: IT/DR Testing Is as Important as Ever

Richard Long

The COVID-19 pandemic has caused many organizations to focus on operations and suspend their IT/Disaster Recovery testing programs. In today’s post, we’ll explain why it’s important to keep up with your IT/DR testing, remind you of the eight steps to an effective IT/DR exercise, and look at what has and has not changed with IT/DR testing as a result of the pandemic.

Pressing Pause on IT/DR Testing

The need earlier this year for businesses to move large numbers of employees to remote work to reduce the spread of COVID-19 led most business continuity management (BCM) offices to devote their entire focus to keeping their organizations operating. Matters deemed less urgent were put on the back burner.

This was understandable given the unique circumstances that existed at the time; however, it’s not sustainable as a long-term strategy.

Many organizations suspended their IT/Disaster Recovery (IT/DR) testing programs during that time. Those programs should be restarted, if they haven’t been already.

Why IT/DR Testing Needs to Be Resumed

Nothing about the presence of COVID-19 has made IT/DR testing less important than it was previously. If anything, the coronavirus has made such testing more important.

Just as in the past, it is critical that organizations conduct regular disaster recovery tests of their information technology systems.

IT/DR tests are the only way of verifying that the organization is capable of recovering its systems in the event of a disruption. They reveal what issues are likely to come up during system recovery. They demonstrate capability and identify gaps.

In some industries, IT/DR tests might not simply be a good idea; they might be required. This is often the case, for example, with companies that must satisfy FDA requirements or meet SOX reporting mandates.

For all of these reasons, it is important for your team to implement and carry out a quality IT/DR testing program even during the ongoing COVID pandemic.

Eight Steps to an Effective IT/DR Exercise

If you are like most of the BCM professionals I’ve been talking with lately, IT/DR testing has not exactly been top of mind for you over the last eight months.

With that in mind, I thought it might be worthwhile to set down the eight steps to an effective IT/DR exercise. Consider this a refresher course on the essential BCM activity of conducting well-designed IT/DR drills and implementing a thoughtful testing program.

Here are the eight things you must do to design an effective IT/DR exercise:

1. Define the reason for the test. Why are you conducting the exercise? What do you hope to find out? There are many good reasons for conducting an IT/DR exercise, and one test can serve multiple purposes. Here are a few things you can accomplish by performing an IT/DR test:

  • Verifying technical capability, such as whether you can bring up your servers or do a data backup.
  • Verifying application functionality, such as whether you can run a certain function.
  • Training staff (including secondary and tertiary people) on procedures and ensuring they have the needed skill sets.
  • Verifying processes and procedures.
  • Satisfying an audit requirement.
  • Performing maintenance on the DR environment.

2. Define the type of exercise. Which kind of IT/DR exercise will you be conducting? Here are the main options:

  • Tabletop. Review your documentation to make sure things make sense. Can help find major gaps and dependency issues.
  • Smoke test. Bring up certain technologies such as servers or an application environment. Log in but stop short of performing extensive functionality testing. Also called a stand-up test.
  • Application verification in isolation. Test a minimal number of applications without looking at adjacent technologies.
  • Multiple applications with integration testing. Look at multiple applications and verify they can function in concert with integrated technologies.
  • Functional (non-IT) testing. Conduct one of the above technology tests while also looking at the integration of technology and business processes.

3. Define the scope of the exercise. Exercises can vary widely in scope. Some aim to verify that you can bring up a single application (such as your inventory app or financial app). Others look at multiple apps or at the integrations between apps, whether internal or external. Some exercises might cover the full lifecycle of a business process, such as the order and inventory lifecycle, taking an item from order through delivery, receipt, and invoicing.

4. Identify the participants. Based on the scope, identify which resources (servers, applications, etc.) and support personnel you need to conduct the exercise.

5. Make sure the test environment is prepared. In many cases, a test environment must be set up specifically for the test. It might be necessary to set up a special network, for example. The test environment ensures the exercise does not impact the production environment.

6. Simulate the production environment as it would be in a real event. Make the test environment mirror, as closely as possible, the way the environment would be set up during an actual recovery.

7. Determine the level of notification for the exercise. How much notification you provide depends on your goals and priorities. The less notification you give, the more the exercise will resemble a real disruption and the more accurate a picture it will provide of your current recovery capability. The more notification you provide, the more people will have the opportunity to fix problems before the test happens. These fixes might result in long-term improvements in resiliency. Organizations with mature IT/DR programs should be capable of performing very well in unannounced exercises. There are three basic levels of notification:

  • The exercise is scheduled and publicized, and the staff prepare heavily, pre-checking the environment, documentation, dependencies, and so on.
  • The exercise is scheduled and publicized, but the staff make no special preparations to the environment, documentation, or other aspects of the recovery plan.
  • The exercise is not publicized. It is sprung on the participants without advance notice.

8. Determine the exercise logistics. Decide whether the exercise will follow the company's regular crisis communication plan and procedures or use a pre-planned special organization. The special organization might use a defined bridge line, email list, schedule of status calls, and so on.
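Of the exercise types in step 2, the smoke test is the easiest to partly automate. The Python sketch below covers only its first half, checking that recovered servers accept TCP connections; it does not log in or exercise any application functionality. The `Endpoint` type, the function names, and the host name in the usage note are hypothetical and not part of any particular DR tool.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A recovered service to probe (hypothetical structure)."""
    name: str
    host: str
    port: int


def is_reachable(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened.

    `timeout` bounds each connection attempt; name resolution is not covered by it.
    """
    try:
        # create_connection resolves the host and connects; the socket closes on exit.
        with socket.create_connection((endpoint.host, endpoint.port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, and name-resolution failures all count as "down".
        return False


def smoke_test(endpoints: list[Endpoint], timeout: float = 3.0) -> dict[str, bool]:
    """Probe each endpoint once and map its name to whether it answered."""
    return {ep.name: is_reachable(ep, timeout) for ep in endpoints}
```

For example, `smoke_test([Endpoint("inventory-app", "dr-inventory.example.internal", 443)])` reports `True` for that (hypothetical) recovery-site host only if it accepted a connection. An open port shows the server came up, not that the application behind it works, so the manual log-in check is still needed.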

Those are the eight main steps you need to take to design and conduct an effective IT/Disaster Recovery exercise.
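One way to make sure none of the eight decisions is skipped is to record them in a single structured plan. The Python sketch below is a hypothetical illustration, not an MHA template: the class, field, and enum names are assumptions, and so is the rule that tabletop exercises skip steps 5 and 6 because they review documentation rather than a live environment.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExerciseType(Enum):
    # Step 2: the main exercise types.
    TABLETOP = "tabletop"
    SMOKE_TEST = "smoke test"
    APP_IN_ISOLATION = "application verification in isolation"
    MULTI_APP_INTEGRATION = "multiple applications with integration testing"
    FUNCTIONAL = "functional (non-IT) testing"


class Notification(Enum):
    # Step 7: the three notification levels, from most to least advance notice.
    ANNOUNCED_WITH_PREPARATION = 1
    ANNOUNCED_NO_PREPARATION = 2
    UNANNOUNCED = 3


@dataclass
class ExercisePlan:
    purposes: list[str]               # Step 1: why the exercise is being run
    exercise_type: ExerciseType       # Step 2
    scope: list[str]                  # Step 3: applications or business processes
    participants: list[str]           # Step 4: resources and support personnel
    test_env_ready: bool              # Step 5: isolated test environment prepared
    mirrors_real_event: bool          # Step 6: set up as it would be in a real event
    notification: Notification        # Step 7
    uses_standard_crisis_comms: bool  # Step 8: False means a special organization
    special_comms: list[str] = field(default_factory=list)  # e.g. bridge line, email list

    def gaps(self) -> list[str]:
        """Return the design steps that still look incomplete (empty list if none)."""
        problems = []
        if not self.purposes:
            problems.append("step 1: no reason for the test defined")
        if not self.scope:
            problems.append("step 3: scope is empty")
        if not self.participants:
            problems.append("step 4: no participants identified")
        # Assumption: a tabletop review needs no live environment, so skip steps 5 and 6.
        if self.exercise_type is not ExerciseType.TABLETOP:
            if not self.test_env_ready:
                problems.append("step 5: test environment not prepared")
            if not self.mirrors_real_event:
                problems.append("step 6: environment does not mirror a real event")
        if not self.uses_standard_crisis_comms and not self.special_comms:
            problems.append("step 8: special organization chosen but no channels defined")
        return problems
```

Calling `gaps()` on a plan with no stated purpose or scope, for example, returns entries for steps 1 and 3, giving the exercise owner a quick checklist before sign-off.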

Other Things to Think About

There are a number of other things to think about in conducting IT/DR drills and in designing and implementing a comprehensive testing program:

  1. Your testing program should incorporate a variety of exercises. Don’t do the same drill all of the time.
  2. Conduct the exercise in the same manner as the recovery would actually occur. This might involve having all the participants working remotely, or a hybrid model in which some employees are on-site and others off-site.
  3. Make the exercise as close to reality as possible. Avoid making special arrangements whose only justification is making the exercise easier or more likely to succeed.
  4. Most exercises should be simulations whose outcome has no impact on production. Exercises that fail production over to an alternate site should be preceded by multiple verifications of recovery capability.
  5. Exercises should elicit the appropriate stresses and reactions. The drill should not feel like a party. If participants have the idea they can simply hit the reset button if things don’t go well, the exercise will not achieve the intended result.
  6. IT/DR exercises should not be viewed as projects. You should think of demonstrating and verifying recovery capability as an ongoing part of your operational activities.
  7. Don’t drill the same people over and over. Your exercises should also put the secondary and tertiary staff to the test.

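Points 1 and 7 can be combined in a simple rotation so that both the exercise type and the staff tier leading each exercise change over time. In this hypothetical Python sketch, the list of exercise types is walked in order and the tier assignment shifts by one each time the list wraps, so over as many rounds as there are tiers, every exercise type is led once by every tier. The function name and the tier labels are illustrative assumptions.

```python
from itertools import islice
from typing import Iterator


def rotation_schedule(
    exercise_types: list[str], tiers: list[str], count: int
) -> list[tuple[str, str]]:
    """Return (exercise type, lead staff tier) pairs for the next `count` exercises.

    Exercise types are taken in order. Each time the type list wraps around,
    the tier assignment shifts by one, so over len(tiers) rounds every type
    is led once by every tier.
    """
    if not exercise_types or not tiers:
        raise ValueError("need at least one exercise type and one staff tier")

    def pairs() -> Iterator[tuple[str, str]]:
        round_no = 0
        while True:
            for j, ex_type in enumerate(exercise_types):
                # Offset the tier by the round number so pairings change each round.
                yield ex_type, tiers[(j + round_no) % len(tiers)]
            round_no += 1

    return list(islice(pairs(), count))
```

With the types "tabletop", "smoke test", and "integration" and the tiers "primary", "secondary", and "tertiary", the first six exercises pair up as tabletop/primary, smoke test/secondary, integration/tertiary, then tabletop/secondary, smoke test/tertiary, integration/primary, so no tier is tied permanently to one kind of drill.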
Keeping these additional considerations in mind will help you in scaling a single exercise up into a full-fledged, well-rounded IT/DR testing program.

The Impact of COVID-19 on Conducting IT/DR Tests

The biggest impact of COVID-19 on IT/DR testing has been to distract companies from doing the needed testing.

The tests themselves can be conducted in essentially the same way as before.

This is owing to how IT/DR exercises have evolved in recent years. Fifteen years ago, people had to go to the recovery site to conduct IT/DR tests. Fortunately, this is no longer necessary.

Until recently, many organizations continued to bring staff together for recovery exercises because doing so made communication easy. However, most people have become more comfortable with collaboration tools over the past year, and these work fine for IT/DR exercises. In some cases, participants may need multiple devices to stay connected, for example, joining a web meeting on their phone while performing recovery activities on their computer. Address this need in the planning stages of the exercise.

Getting Your Testing Program Back on Track

The great demands of the early stages of the COVID-19 pandemic caused many companies to press pause on their IT/DR testing programs. It’s past time for organizations to resume conducting IT/DR exercises. Such testing remains the only way of verifying that a company is capable of recovering its systems in the event of a disruption.

By following the eight steps to an effective disaster recovery exercise and weighing the additional testing considerations, you can get your company's IT/DR testing program back on track and ready for the next global pandemic or any other disaster that fate might send your way.


About
Richard Long is one of MHA’s practice team leaders for Technology and Disaster Recovery related engagements. He has been responsible for the successful execution of MHA business continuity and disaster recovery engagements in industries such as Energy & Utilities, Government Services, Healthcare, Insurance, Risk Management, Travel & Entertainment, Consumer Products, and Education. Prior to joining MHA, Richard held Senior IT Director positions at PetSmart (NASDAQ: PETM) and Avnet, Inc. (NYSE: AVT) and has been a senior leader across all disciplines of IT. He has successfully led international and domestic disaster recovery, technology assessment, crisis management and risk mitigation engagements.