Many organizations lack a clear, recognized understanding of when the metaphorical switch will be flipped to start the recovery time objective (RTO) countdown timer. There are two options, either of which can work provided the organization takes a few key considerations into account.
Related on MHA Consulting: All About RTOs: What They Are and Why You Have To Get Them Right
A Common Source of Confusion
Many organizations are unclear about when the countdown for their RTOs begins.
An RTO is a time window within which, in the event of an outage, a critical business process or application needs to be returned to a fully productive state in order to prevent an unacceptable level of harm to the organization (as previously determined by a business impact analysis).
Note that the problem under discussion is mostly an issue with highly critical processes that have very short RTOs, such as four hours or less. (This discussion also pertains to outages resulting from major events, not day-to-day availability issues.)
Typically, some people will assume the countdown for the RTO begins at the time of the outage. Others will operate on the understanding it doesn’t begin until a disaster or recovery event is declared.
One consequence of this confusion is worry and frustration among people who incorrectly think the organization is at risk of missing or has missed an RTO.
A more fundamental problem arises when a lack of clarity about the company’s preferred approach leads to RTOs that don’t allow for the decision-making time senior management needs. (More on this below.)
Outside the Scope of This Discussion
As stated previously, this discussion mostly pertains to highly critical processes and apps with short RTOs.
However, within that group there is a subset of processes and apps, usually very small, that are so critical they can never be down (or where missing the RTO by even a few minutes would cause significant harm). These stand apart from the current discussion because they should already be architected to be in a high availability state.
This post is about functions that have fairly short RTOs but do not require immediate recovery.
Starting the Countdown at Formal Declaration
Most organizations choose to have their RTO countdown begin at the time a recovery or disaster is formally declared. Such a declaration can be made within minutes or take over an hour.
This can be considered the standard approach.
A lengthy delay might occur before the recovery is declared because crisis teams and management need time to investigate the outage and decide whether it’s worthwhile to perform a recovery, which is a demanding and expensive undertaking.
Starting the Countdown at Event Time
The other possible approach is to have the RTO countdown start automatically at the time of the outage.
Organizations that use this method will still need time to analyze the outage and decide whether to mount a full recovery.
However, with this approach, the time consumed by investigation and decision-making eats up part of the RTO window, leaving that much less time to recover the app or process.
This is a less common approach, but some organizations might have good reasons for doing it this way.
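To make the difference between the two approaches concrete, here is a minimal Python sketch of the arithmetic. The function names, the four-hour RTO, and the 45-minute decision time are all illustrative assumptions, not figures from any particular organization.

```python
from datetime import datetime, timedelta


def recovery_deadline(outage_start, declared_at, rto, countdown_starts):
    """Return the time by which the process or app should be recovered.

    countdown_starts is "declaration" (the standard approach) or
    "event" (the countdown starts at the time of the outage).
    """
    if countdown_starts == "declaration":
        return declared_at + rto
    if countdown_starts == "event":
        return outage_start + rto
    raise ValueError("countdown_starts must be 'declaration' or 'event'")


def time_left_for_recovery(outage_start, declared_at, rto, countdown_starts):
    """Return how much of the RTO window remains once recovery is declared."""
    deadline = recovery_deadline(outage_start, declared_at, rto, countdown_starts)
    return deadline - declared_at


# Hypothetical scenario: outage at 9:00, recovery declared 45 minutes later.
outage = datetime(2024, 1, 15, 9, 0)
declared = outage + timedelta(minutes=45)
rto = timedelta(hours=4)

for approach in ("declaration", "event"):
    deadline = recovery_deadline(outage, declared, rto, approach)
    remaining = time_left_for_recovery(outage, declared, rto, approach)
    print(approach, deadline.strftime("%H:%M"), remaining)
```

In this scenario, starting the countdown at declaration leaves the full four hours for recovery work, while starting it at event time leaves three hours and fifteen minutes, because the 45 minutes spent on investigation and decision-making have already come out of the window.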
Achieving Success with Either Approach
Both methods can work provided the organization takes the following points into account:
- In both methods, leadership should try to conduct their assessment and make a decision about recovery as quickly as possible.
- The preferred timeframe for deciding whether to do a full recovery should be included as part of the BIA and recovery plan development and stated in the crisis management plans.
- Organizations that opt to start the RTO countdown at event time need to budget time for analysis and decision-making in their RTOs.
- RTOs are guidelines, not hard deadlines. (Any process or app where the cost of going slightly over the RTO would be severe should be moved to a shorter RTO category to ensure it will not miss the required recovery.)
- Whatever approach the organization decides on, the decision must be clear and widely communicated and understood throughout the organization.
- The chosen approach should be incorporated in the BIA and factored into the setting of RTOs.
By taking these items into account, an organization can achieve success no matter when it decides to start the metaphorical countdown timer for its RTOs.
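The point about budgeting decision-making time into the RTO can be checked with simple arithmetic when RTOs are being set. The sketch below is one hypothetical way to do that check; the function name, parameters, and durations are assumptions chosen for illustration only.

```python
from datetime import timedelta


def rto_is_realistic(rto, decision_budget, estimated_recovery, countdown_starts):
    """Check whether an RTO leaves enough room for the work it must cover.

    With an event-time countdown, the RTO has to cover both the
    decision-making budget and the recovery work itself. With a
    declaration-time countdown, it only has to cover the recovery work.
    """
    required = estimated_recovery
    if countdown_starts == "event":
        required = required + decision_budget
    return required <= rto


# Hypothetical figures: a 4-hour RTO, up to 1 hour to decide,
# and an estimated 3.5 hours of recovery work.
rto = timedelta(hours=4)
decision_budget = timedelta(hours=1)
estimated_recovery = timedelta(hours=3, minutes=30)

print(rto_is_realistic(rto, decision_budget, estimated_recovery, "declaration"))
print(rto_is_realistic(rto, decision_budget, estimated_recovery, "event"))
```

With these example numbers the RTO holds up if the countdown starts at declaration but not if it starts at event time, which suggests either a shorter decision timeframe or a faster recovery strategy would be needed under the second approach.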
The Importance of Clear Communication
At many organizations, confusion reigns regarding when, in the event of an outage, the RTO countdown timer begins. This confusion can have consequences ranging from unnecessary turmoil to unrealistic and badly missed RTOs.
There are two possible approaches to deciding when to initiate the RTO countdown. The standard approach is for the timer to start when a recovery is formally called or approved. The other possibility, having the countdown start automatically at the time of the event, might work for some organizations.
Either approach can work provided a handful of key considerations are taken into account. These include referencing the chosen approach when setting the individual RTOs and making sure it is clearly communicated throughout the organization.
For more information on RTO countdowns and other topics in BC and IT/disaster recovery, check out these recent posts:
- All About RTOs: What They Are and Why You Have To Get Them Right
- RTO and RPO: Making It Simple
- After the BIA: Save Time and Money by Fine-Tuning Your Application RTOs
- Fine by Me: The Proposed $1 Million Fine of Colonial Pipeline
- BIA Blunders: 6 Common Mistakes Organizations Make When Conducting BIAs