In our Business Continuity and Disaster Recovery planning, we spend much of our time assessing, documenting, and developing strategies for when an event may occur. This is all to prepare for or prevent an outage. What is the point of all these preparations? When disaster strikes, you want to get back to normal as quickly as possible. To do that, it is important to work through the three phases of disaster recovery: assessment, restoration, and recovery.
The first phase, assessment, is often the most underused part of restoring services, or it is not performed at all. Before you carry out any restoration activities, you should assess the situation, risks, and impacts of the event, even if it seems clear that you should restore IT services at your alternate site. Consider the following points in your assessment:
- What is the current state of business functions?
- How many of the IT services are impacted? Which service RTOs are impacted?
- For services that are not affected, what are the risks to them; is there a potential for outage?
- How long is the estimated outage? What actions are required to restore services at the primary data center?
- What are the current impacts to business functions; are workarounds functional?
- What are the potential processing impacts at the alternate site? Will there be performance impacts, capacity constraints, etc.?
- Determine if a restoration at the alternate site is necessary. It may not make sense to perform a recovery at the alternate site, depending on when services can be restored at the primary location. Remember, as soon as production activities occur at the alternate site, you have created another outage event: the eventual return to the primary location.
- Make a formal decision and communicate the next steps – either wait, work on primary location restoration, or alternate site relocation.
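The decision at the end of the assessment can be sketched in code. The following is only an illustration under stated assumptions, not a prescribed method: the names `ServiceStatus` and `recommend_action`, the time estimates, and the rule of comparing the primary restoration estimate against the alternate site restoration plus the later return outage are all hypothetical. A real decision would also weigh risks, workarounds, and capacity at the alternate site.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceStatus:
    """Hypothetical record for one IT service during the assessment."""
    name: str
    rto: timedelta   # recovery time objective for the service
    impacted: bool   # is the service currently down or degraded?


def recommend_action(
    services: list[ServiceStatus],
    primary_restore_estimate: timedelta,
    alternate_restore_estimate: timedelta,
    failback_outage_estimate: timedelta,
) -> str:
    """Return 'wait', 'restore_primary', or 'relocate_alternate'.

    Assumed rule of thumb, for illustration only.
    """
    impacted = [s for s in services if s.impacted]
    if not impacted:
        return "wait"

    tightest_rto = min(s.rto for s in impacted)
    # If the primary site can be back within the tightest RTO, stay put.
    if primary_restore_estimate <= tightest_rto:
        return "restore_primary"

    # Relocating creates a second outage later (the return to primary),
    # so count that outage as part of the cost of relocating.
    relocation_cost = alternate_restore_estimate + failback_outage_estimate
    if relocation_cost < primary_restore_estimate:
        return "relocate_alternate"
    return "restore_primary"


services = [
    ServiceStatus("billing", rto=timedelta(hours=4), impacted=True),
    ServiceStatus("wiki", rto=timedelta(hours=48), impacted=False),
]
decision = recommend_action(
    services,
    primary_restore_estimate=timedelta(hours=24),
    alternate_restore_estimate=timedelta(hours=3),
    failback_outage_estimate=timedelta(hours=2),
)
print(decision)
```

Whatever rule you use, the value is in writing it down before the event so that the formal decision is made against agreed criteria rather than under pressure.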
The second phase, restoration, is generally assumed to mean restoration at the alternate site. Once you have decided on this action, you should be able to rely on the understanding of the process and management gained through previous DR exercises. Before beginning the restoration tasks, consider the following:
- Orient the restoration teams on logistics and on the expectations for following the plans and reporting issues. Your goal is to ensure an organized and disciplined execution of tasks. Without a short orientation or reminder of execution expectations, chaos can occur. Remind the teams to perform the tasks based on the plan, not from memory.
- Ensure that you have an overall coordinator for the restoration who will actively ask for updates. This may be as simple as following up with the team leads or managers to verify that tasks are on track.
- Track issues and troubleshooting time. Identify specific time limits and milestones. Without tracking time for problem-solving, issue resolution or the search for alternative solutions can linger. “Five more minutes” really means thirty, and without active time management, issue resolution may impede effective and efficient recovery.
- Define and deliver regular restoration updates, including issue status, to both IT and non-IT departments, and continue them even after the three phases of disaster recovery are complete.
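The time tracking described above can be as simple as a list of open issues with a time limit on each. The sketch below is one minimal way to do it, assuming a hypothetical `Issue` record and an arbitrary 30-minute default time box; neither comes from a specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Issue:
    """Hypothetical issue record kept by the restoration coordinator."""
    summary: str
    opened_at: datetime
    time_box: timedelta = timedelta(minutes=30)  # assumed default budget
    resolved: bool = False

    def is_overdue(self, now: datetime) -> bool:
        # Overdue means still open and past its troubleshooting budget.
        return not self.resolved and now - self.opened_at > self.time_box


def issues_to_escalate(issues: list[Issue], now: datetime) -> list[str]:
    """Summaries of open issues that have exceeded their time box."""
    return [issue.summary for issue in issues if issue.is_overdue(now)]


start = datetime(2024, 1, 1, 9, 0)
issues = [
    Issue("DNS failover not propagating", opened_at=start),
    Issue("DB replica lag", opened_at=start, time_box=timedelta(hours=1)),
    Issue("VPN concentrator down", opened_at=start, resolved=True),
]
overdue = issues_to_escalate(issues, now=start + timedelta(minutes=45))
print(overdue)
```

The coordinator can run a check like this at each status update and use the result to decide which issues need escalation or an alternative solution.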
You may think of this as part of restoration, but it is different. Consider the following as part of the Recovery phase of disaster recovery:
- Identify application or specific process changes that you need to address; for example, think about interfaces with third parties.
- Identify potential points of data loss. What data will you need to recreate, or what integration will you need to address? This is an area we often assume is “OK,” when in fact data protection synchronization is frequently off. Do all the integrations and interfaces self-heal, or is there a need for manual intervention?
- Be ready to adjust processing activities if you find performance or capacity issues.
- Do you need to make changes due to dependency issues? Consider integration of systems with different RTOs. You may have critical systems up and running, but upstream or downstream environments may still be unavailable.
- Perform functional validation both at the IT and business level prior to turnover. It is much easier to identify and correct issues prior to production turnover rather than several hours into data and transaction flow.
- Ensure that backups at the alternate location are running and functional. You do not want to lose all the work performed during an event once the environments are back in production. If you need to restore again, you will need to have that backup ready.
Does your current recovery plan include these three phases of disaster recovery? Review your plan to identify any needed updates and improvements. Be sure to test your updates, and run the necessary training exercises and simulations. Finally, rather than building your plans around exercise scenarios, always ensure that they are based on how you would perform in an actual event.
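The dependency point above, where critical systems are up while upstream or downstream environments are still unavailable, can be checked ahead of time. This is a minimal sketch assuming two hypothetical inputs: a mapping of service names to RTOs in hours, and a mapping of each service to the services it depends on. The service names are made up for illustration.

```python
def find_rto_gaps(
    rto_hours: dict[str, float],
    depends_on: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (service, dependency) pairs where the dependency's RTO
    is longer than the RTO of the service that relies on it."""
    gaps = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if rto_hours[dependency] > rto_hours[service]:
                gaps.append((service, dependency))
    return gaps


# Hypothetical example data.
rto_hours = {"orders": 4, "inventory": 24, "auth": 2}
depends_on = {
    "orders": ["inventory", "auth"],
    "inventory": ["auth"],
}
gaps = find_rto_gaps(rto_hours, depends_on)
print(gaps)
```

Each pair in the result is a place where a recovered service may sit idle or need a workaround until its dependency comes back.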