The rise in the use of third-party computing services has given many companies a false sense of security regarding the recoverability of their IT systems. In today’s post we’ll look at why organizations still need to be adept at IT disaster recovery (IT/DR) and describe the four phases of restoring IT services after an outage.
Related on MHA Consulting: Learning to Talk to Your IT/DR Colleagues
Knowing How to Recover Is Still Important
With the migration toward cloud computing and Software as a Service (SaaS), many organizations have grown complacent about IT/DR. Their thinking is that the burden of recovering is now on their IT services vendors. However, most organizations retain some level of on-premises IT capability, and if this can’t be recovered, it might make the vendors’ efforts moot. For many if not most organizations, the need to be able to recover is as great as ever, while the layering of multiple environments has made the job more complicated.
For these reasons, it’s important that IT departments (and business continuity professionals) make sure their organizations are capable of restoring their IT services after an outage. There are four main phases involved in doing this. Let’s look at them one by one.
Phase 1: Preparation
Technically, preparation is not a phase of disaster recovery since it happens before the outage. Practically, however, it might be the most important phase. If you haven’t prepared properly, recovering might be impossible. This is also the area where a lot of people fall short. The following are some things to keep in mind regarding the preparation phase.
- Prioritize your services and technologies so you know which to restore first. The usual way of doing this is by conducting a business impact analysis (BIA).
- Identify which services and technologies your mission-critical services depend on; these will also need to be restored quickly. (Common examples include authentication, access, middleware, and network services.)
- A critical part of preparation is DR exercises and testing, to make sure people know what to do and that everything works.
- Testing needs to go beyond doing the same tests over and over. (This is another area where people often fall short.) Test across all the different technologies.
- Don’t assume your IT services partners have everything covered. Make sure they have all their steps in place and that your recovery is integrated with theirs.
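The first two points above (prioritizing services and accounting for what they depend on) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names, the RTO values in hours, and the dependency map are all hypothetical, and the rule that a dependency inherits the tightest RTO of anything relying on it is one reasonable reading of "these will also need to be restored quickly," not a prescribed method.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical BIA output: recovery time objective (hours) per service.
RTO_HOURS = {"order-entry": 4, "reporting": 48, "auth": 24,
             "network": 24, "middleware": 72}

# Hypothetical dependency map: service -> services it needs to be running.
DEPENDS_ON = {
    "order-entry": {"auth", "middleware"},
    "reporting": {"auth"},
    "auth": {"network"},
    "middleware": {"network"},
    "network": set(),
}

def effective_rtos(rto, depends_on):
    """Tighten each dependency's RTO to match the most urgent service that needs it."""
    effective = dict(rto)
    # static_order() lists dependencies first, so walking it in reverse visits
    # each service after everything that relies on it has been processed.
    for service in reversed(list(TopologicalSorter(depends_on).static_order())):
        for dep in depends_on.get(service, ()):
            effective[dep] = min(effective[dep], effective[service])
    return effective

def restore_order(rto, depends_on):
    """Order services so dependencies come first, most urgent first among those ready."""
    effective = effective_rtos(rto, depends_on)
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    ready, order = [], []
    while sorter.is_active():
        ready.extend(sorter.get_ready())
        ready.sort(key=lambda s: (effective[s], s))
        nxt = ready.pop(0)
        order.append(nxt)
        sorter.done(nxt)
    return order

print(restore_order(RTO_HOURS, DEPENDS_ON))
# ['network', 'auth', 'middleware', 'order-entry', 'reporting']
```

Note how middleware, which looked low priority on its own (72 hours), moves up because a 4-hour service cannot run without it.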
Phase 2: Assessment
Now we come to the steps to take after the outage occurs. This is almost always a stressful and confusing period. It’s also the phase where the advice “Don’t just do something, stand there!” applies. That’s because the very first thing to do is figure out what happened and trace the contours of the impact. Here are the main things to consider in this phase:
- Before you carry out any restoration activities, conduct an assessment of the situation, risks, and impacts of the event.
- Identify the current state of your business functions. Find out which IT services have been impacted and which recovery time objectives (RTOs) are at risk.
- For unaffected services, investigate the risks they face. Is there a chance they might be impacted as the event develops?
- Estimate how long the outage will last.
- Identify the actions needed to restore services at the primary data center.
- Determine the functionality of any workarounds needed.
- Identify the potential processing impacts at the alternate site. Will there be performance impacts or capacity constraints?
- Determine whether a restoration at the alternate site is necessary. Depending on when services can be restored at the primary location, it might not make sense to perform a recovery at the alternate site.
- Company and IT leaders should make a formal decision to wait, to work on restoring the primary location, or to relocate to the alternate site. They should then communicate the next steps to the people involved in the recovery effort.
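The comparison behind that decision can be written down as a small rule of thumb. The sketch below is an assumption-laden illustration, not a decision procedure from any standard: the field names, the idea of comparing estimates against the tightest impacted RTO, and the example numbers are all hypothetical. Its output is an input to the formal decision, not a substitute for it.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    primary_restore_hours: float     # estimated time until services are back at the primary site
    alternate_recovery_hours: float  # estimated time to recover at the alternate site
    tightest_rto_hours: float        # tightest RTO among the impacted services
    primary_needs_action: bool       # False if the outage should clear on its own

def recommend(a: Assessment) -> str:
    """Suggest one of the three options named above, based on the estimates."""
    stay = "restore primary" if a.primary_needs_action else "wait"
    # If the primary site will be back within the tightest RTO, relocating adds risk for no gain.
    if a.primary_restore_hours <= a.tightest_rto_hours:
        return stay
    # The RTO will be missed at the primary site; relocate only if that is actually faster.
    if a.alternate_recovery_hours < a.primary_restore_hours:
        return "relocate to alternate site"
    return stay

print(recommend(Assessment(36, 8, 4, True)))   # relocate to alternate site
print(recommend(Assessment(2, 6, 4, False)))   # wait
```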
Phase 3: Restoration
If a decision is made to relocate to the alternate processing site, then your recovery effort enters the restoration phase. This is where the exercises you conducted in Phase 1 pay off. Consider the following as you embark on restoring your systems:
- Review the logistics and expectations of following your recovery plans with the restoration team. Explain the process for reporting issues. (Without a short orientation, chaos can occur.)
- Remind the teams to perform tasks based on the plan, not their memory.
- Have an overall coordinator for the restoration who actively asks for updates, verifying that tasks are on track.
- Track issues and the time spent troubleshooting. Without such tracking, issue resolution is likely to drag on, impeding efficient recovery.
- Define a schedule for regular restoration updates and keep to it, including issuing status updates to all involved departments.
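Tracking issues and the time spent on them does not require special tooling. As a minimal sketch, assuming a hypothetical 30-minute escalation threshold and made-up issue descriptions, a coordinator could keep a simple log and ask it which open issues have dragged on too long:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Issue:
    description: str
    opened: datetime
    resolved: Optional[datetime] = None

    def time_spent(self, now: datetime) -> timedelta:
        """Time spent so far: up to resolution if resolved, otherwise up to now."""
        return (self.resolved or now) - self.opened

def overdue(issues, now, limit=timedelta(minutes=30)):
    """Return open issues that have consumed more than the (hypothetical) limit."""
    return [i for i in issues if i.resolved is None and i.time_spent(now) > limit]

now = datetime(2024, 1, 1, 12, 0)
log = [
    Issue("Database restore fails checksum", datetime(2024, 1, 1, 11, 0)),
    Issue("VPN certificate expired", datetime(2024, 1, 1, 11, 50)),
    Issue("DNS record missing", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 20)),
]
for issue in overdue(log, now):
    print(issue.description, issue.time_spent(now))
# Database restore fails checksum 1:00:00
```

The same log doubles as raw material for the regular status updates and for the after-action review.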
Phase 4: Post-Restoration
Many regard these next items as a subset of restoration, but they differ sufficiently to merit treatment as a separate phase. Here are the main things to consider in this phase:
- Identify the application or process changes that need to be made (for example, think about interfaces with third-party vendors).
- Identify potential points of data loss. (What data will you need to recreate?) This is an area we often assume is “OK”; however, data protection synchronization is frequently found to be off, invalidating the assumption.
- Determine whether integrations and interfaces self-heal or need manual intervention.
- Be ready to adjust processing activities if necessitated by performance or capacity issues.
- Consider whether changes need to be made as a result of dependency issues. (Think about the integration of systems with different RTOs.) You might have critical systems up and running, but upstream or downstream environments might still be unavailable.
- Prior to turnover, perform functional validation at both the IT level and the business level. (It is much easier to identify and correct issues prior to production turnover than several hours into data and transaction flow.)
- Ensure that backups at the alternate location are running and functional. (You do not want to lose all the work performed during an event once the environments are productive again.)
- Start planning for the move back to the primary location. The transition will amount to another DR event, but this shift will be planned and controlled.
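For the data loss item in the list above, one way to replace the "it's probably OK" assumption with a number is to compare each system's last successful synchronization time with the moment the outage began. The sketch below assumes you can obtain those timestamps from your replication or backup tooling; the system names and times are hypothetical.

```python
from datetime import datetime, timedelta

def data_loss_windows(last_sync, outage_start):
    """Per system, the span of work done before the outage that never reached the alternate site."""
    return {
        name: max(outage_start - synced, timedelta(0))
        for name, synced in last_sync.items()
    }

outage_start = datetime(2024, 1, 1, 12, 0)
last_sync = {  # hypothetical last successful replication per system
    "erp": datetime(2024, 1, 1, 11, 45),
    "crm": datetime(2024, 1, 1, 12, 0),
}
for name, gap in data_loss_windows(last_sync, outage_start).items():
    print(name, gap)
# erp 0:15:00
# crm 0:00:00
```

Any non-zero window identifies data the business may need to recreate, and systems with different windows may be out of step with one another after restoration.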
Ensuring the Ability to Recover
The increasing reliance of many companies on third-party computing services should not lead them to underestimate the importance of IT disaster recovery. The complexity of modern IT environments demands a thorough understanding of the four essential phases of IT/DR: Preparation, Assessment, Restoration, and Post-Restoration.
Proper preparation, including service prioritization and testing, lays the foundation for effective recovery efforts. The assessment phase is crucial for understanding the scope of the outage and making informed decisions. During restoration and post-restoration, clear communication, diligent tracking, and meticulous attention to detail make for a smoother transition back to normal operations. By mastering these phases, IT departments and BC professionals can keep their systems recoverable even as environments grow more layered and the challenges of recovery more complex.
- Who Does What: The Most Critical Job Roles in IT Disaster Recovery
- Hit or Myth: 5 Common Misconceptions About IT Disaster Recovery
- BCM Basics: Modern IT/DR Strategies
- The Cloud Is Not a Magic Kingdom: Misconceptions About Cloud-Based IT/DR
- Learning to Talk to Your IT/DR Colleagues
- For Want of a Nail: The Importance of Meticulous Execution in BC and IT/DR
- You Still Need to Drill: IT/DR Testing Is as Important as Ever