Aligning ITIL processes to your DR plan leads to more efficient and effective use of IT infrastructure. Inadequate planning is a risk to the business, and it is often overlooked until it is too late – when a crisis event such as a major outage or a security or other breach results in the loss of supporting IT systems.
About the ITIL Framework
Many organizations strive to become ITIL compliant or to use ITIL as an IT process framework. ITIL is exactly that – a framework for IT processes and services. It provides best practices, key performance indicators (KPIs), and benchmarks for measuring IT service development, performance, and quality. It is not my intent to use this blog to describe ITIL in detail; a standard search will turn up multiple resources that can be used to learn about ITIL. Complete implementation of ITIL can be time-consuming, and a program unto itself. We recommend using ITIL as a framework, to the level that makes sense in your organization. Use of the basic concepts will provide tremendous value without overshadowing other business-critical functions and projects.
The goal of using ITIL is to ensure that your program and implementation follow best practices, and to promote efficiency and functional capability.
We will map the appropriate ITIL processes to the IT Service Continuity Management (ITSCM) components listed below. In general, the ITIL processes associated with IT DR are: SD 220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199.
IT Service Continuity Management (ITSCM)
ITSCM is a subset of overall Business Continuity Management (BCM) focused on planning for the restoration of IT-based services and technologies. Given the complexity and the need for emergency management during an event, ITSCM is often run as a program in its own right, with oversight by the BCM program. The components of ITSCM mirror those of overall BCM:
Initiation (SD 188.8.131.52)
The development of policies, scope, resource allocation, and program organization (management, crisis teams, etc.).
Requirements and Strategy
To develop the strategy, it is critical to have business impact analysis (BIA) information on recovery time objectives, recovery point objectives, and the technologies used by business functions. (SD 184.108.40.206)
Resiliency needs and service levels should be identified for day-to-day operations, and also to ensure that IT functions will meet operational needs once they are restored. Remember that recovery is not just about restoring systems; it is about ensuring they are available, functional, and performing adequately. (SD 220.127.116.11, 18.104.22.168)
Risks associated with the environment, recovery, and operations must be identified to ensure they are mitigated appropriately in the recovery strategy and plan.
With the above information, you can now develop the IT Disaster Recovery Strategy.
Implement the Strategy
This is the implementation of the technologies, plans, and processes identified in the strategy. Sometimes organizations move to this phase of the program before performing the requirements and strategy definition. Without a good understanding of the needs, the strategy and recovery details may be over- or under-architected, or not cost effective, which in turn leads to a non-functional recovery capability.
Maintenance or management of the program
This includes training of both IT and non-IT personnel on roles and responsibilities. It calls for testing the plans and recovery processes to ensure functional capability. It also requires continued improvement and modification as production environments are implemented or changed. One of the most common gaps in ITSCM programs is the lack of integration with the change management process; changes in production are implemented and corresponding changes are not integrated in the recovery environment or processes. (SD 22.214.171.124, SD 126.96.36.199, 188.8.131.52)
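The change management gap described above can be made visible with a simple reconciliation. The following is a minimal sketch, not a prescribed ITIL procedure: it assumes each change record notes when the change was applied in production and, if applicable, when the matching change was applied to the recovery environment. The `ChangeRecord` fields, the change IDs, and the system names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ChangeRecord:
    """One production change (hypothetical record layout)."""
    change_id: str
    system: str
    production_applied: date
    recovery_applied: Optional[date] = None  # None: recovery side not yet updated


def recovery_gaps(changes):
    """Group, by system, the changes made in production that have no
    corresponding change recorded for the recovery environment."""
    gaps = {}
    for change in changes:
        if change.recovery_applied is None:
            gaps.setdefault(change.system, []).append(change.change_id)
    return gaps


# Illustrative data only.
changes = [
    ChangeRecord("CHG-1001", "ERP", date(2024, 3, 4), date(2024, 3, 6)),
    ChangeRecord("CHG-1002", "ERP", date(2024, 4, 10)),
    ChangeRecord("CHG-1003", "File server", date(2024, 4, 22)),
]

for system, change_ids in recovery_gaps(changes).items():
    print(f"{system}: recovery environment missing {', '.join(change_ids)}")
```

In practice this information would come from your change management tool; the point is only that every production change should be checked for a matching recovery-side change.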
Performing a BIA (SD 184.108.40.206)
The Business Impact Analysis uses financial and non-financial impacts to determine the overall impact each business process has on the organization, and therefore its Recovery Time Objective (RTO) – the time in which the process needs to be recovered. Based on the business processes, the application RTOs can then be determined. An application's RTO will be the shortest RTO among the business processes that depend on that application. During the BIA the Recovery Point Objective (RPO) – or acceptable data loss – is also determined. See this blog for more information on determining RTOs and RPOs for your processes.
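The rule that an application inherits the shortest RTO of the processes depending on it can be sketched in a few lines. This is an illustration under stated assumptions: the process names, applications, and RTO values below are hypothetical, and a real BIA would supply them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    rto_hours: float         # RTO assigned to the process by the BIA
    applications: frozenset  # applications the process depends on


def application_rtos(processes):
    """Return {application: rto_hours}, where each application takes the
    shortest RTO among the business processes that depend on it."""
    rtos = {}
    for process in processes:
        for app in process.applications:
            current = rtos.get(app, float("inf"))
            rtos[app] = min(current, process.rto_hours)
    return rtos


# Hypothetical BIA results, for illustration only.
processes = [
    BusinessProcess("Order entry", 4, frozenset({"ERP", "CRM"})),
    BusinessProcess("Payroll", 72, frozenset({"ERP", "HRIS"})),
    BusinessProcess("Customer support", 8, frozenset({"CRM"})),
]

rtos = application_rtos(processes)
for app in sorted(rtos):
    print(f"{app}: RTO {rtos[app]} hours")
```

With this sample data, ERP and CRM both take the 4-hour RTO of order entry even though payroll and customer support could tolerate longer outages, while HRIS keeps the 72-hour payroll RTO.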
Analyzing Risk (SD 220.127.116.11)
With the recovery requirements understood through an appropriate BIA, knowledge of the risks to the organization and to the technology environment provides the rest of the information needed to develop and implement an appropriate strategy. Consider the single points of failure in the technology and processing environment, the human capabilities and constraints in both day-to-day operations and during an emergency event, and the likely scenarios that may cause technology outages. The categories to analyze include technology, people, natural events, and security (physical and data). While data security receives much attention today, do not forget that many actual outage events are caused by human actions, whether deliberate or inadvertent.
Implementing a Recovery Plan (SD 18.104.22.168, 22.214.171.124)
Implementing the recovery technology means deciding how to carry out the strategy: use of alternate sites with hardware available, cloud-based solutions, data protection (backups, replication), server recovery (physical, virtual, IaaS), and network resiliency.
The recovery plan may (and probably should) encompass multiple documents – technical recovery plans, emergency management plans, and contact lists. These documents should be reviewed and updated on a regular basis. Testing should use both tabletop exercises or walkthroughs and actual recovery events to ensure the procedures and tasks actually work.
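Keeping several plan documents current is easier when review dates are tracked in one place. The sketch below flags documents whose last review is older than a chosen interval. The 180-day interval, the document names, and the dates are assumptions for illustration; use whatever review cycle your program defines.

```python
from datetime import date, timedelta

# Assumed review policy; not an ITIL-mandated value.
REVIEW_INTERVAL = timedelta(days=180)


def overdue_documents(last_reviewed, today, interval=REVIEW_INTERVAL):
    """Return the names of documents whose last review is older than
    `interval`, sorted alphabetically."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > interval
    )


# Hypothetical review log.
last_reviewed = {
    "Technical recovery plan": date(2023, 12, 1),
    "Emergency management plan": date(2024, 6, 1),
    "Contact list": date(2023, 11, 30),
}

for name in overdue_documents(last_reviewed, today=date(2024, 7, 1)):
    print(f"Review overdue: {name}")
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check repeatable when it is run as part of a test or audit.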
Don’t forget post-recovery operational needs. Once processing is available again, the recovery environment must be supported and maintained as a production environment, because it is now a production environment. It may be temporary, but it could be operational for an extended period. At some point, migration back to the primary production environment will occur. All the same processes and protections will need to be in place in some capacity to ensure migration back is possible – think backups, monitoring, troubleshooting, maintenance, patching, and development. The same level of security protections and monitoring should (and probably must) be in place as well.
The ability to efficiently maintain and implement IT operational initiatives such as ITSCM will provide both increased value to an organization and the resiliency that today’s lean IT organizations need to support and implement the technologies required by business strategies. Changing processes to meet the ITIL best practices can improve service delivery and provide a mature and functional continuity capability. ITIL provides a good framework to organize and measure your IT DR program. Use it as a tool, but be flexible to ensure the implementation framework meets your business and IT needs.