A recent Harvard Business Review article in the December 2013 edition, “The Hidden Benefits of Keeping Teams Intact,” discussed the benefits of keeping together teams whose members are familiar with each other. The article argues that team familiarity raises performance, leads to fewer mistakes, encourages better decision making, and so on.
So how does this apply to us? In our role in BCM, we deal with a number of different teams, including Fire Life Safety, Crisis Management, Business and IT Recovery Teams, etc. Maintaining familiarity and consistency across team members is difficult as existing team members leave and new members arrive.
In my experience, the article is right: the performance of Crisis Management Teams who have worked together for a number of years, or who at least have some familiarity, is much higher than that of teams without familiarity or long-term working relationships. So what data substantiates this theory?
- Defense – Special ops teams such as the Navy SEALs are kept intact over many years.
- Aviation – NASA found that fatigued but familiar crews made about half as many errors as rested but unfamiliar teams.
- Surgery – A study of surgeons who worked across multiple hospitals found performance varied perhaps because of their varying levels of familiarity with the OR teams.
In our consulting firm, we have a high degree of familiarity, as the majority of us have worked together for over 10 years. This familiarity has led us to a high level of performance because we are well versed in each other’s strengths, weaknesses and areas of expertise.
So, how do we make this work? We can’t keep team members forever; however, we can work with teams to build some level of familiarity, which is better than none at all. Hold short training and awareness sessions, short 30-minute mock disaster exercises, etc.
Does having a BCM program compliant with industry best practices, standards and guidelines equate to recoverability? I do not believe it always does. Being compliant, in my opinion, ensures the best underlying infrastructure has been assembled, implemented and integrated to maximize program efforts and the potential for success in a disruption. It does not mean, however, that you will recover without a hitch or difficulty in all situations.
Let’s use the athlete analogy. Being Tiger Woods doesn’t mean you will win 100% of the golf tournaments you play. Because of his talent, preparation and work ethic, it does mean he will win more than a good share of those he plays in, and so it goes for being compliant. Working to be compliant is like building the best possible athlete to compete, but you will not always dominate; too many variables, like the people factor, events we never saw coming or just plain bad luck, can derail us.
So, working towards having a high level of compliance with industry best practices, standards and guidelines is the right thing to do. I liken the industry best practices, standards and guidelines to a fitness program for your organization. Some organizations get on it but quit because they get tired, lose interest or don’t want to do it on a routine basis. Others work through the soreness, the daily grind and the sweat to build a BCM program that is strong, resilient and ready for any disruption that comes its way.
Get your BCM program on a workout routine today!
Some tests only involve two people while others can include an entire department. All tests require preparation time. This is necessary to coordinate schedules of people, exercise control rooms, and equipment. At a minimum, every plan should be tested annually. Plans to test should include business processes, IT systems, work area recovery, pandemic, and more. The following is a typical testing schedule and what to include:
- Inspect Command Center sites for availability and to ensure their network and telecommunication connections are live.
- Data Backups
  - Verify that data backups are readable.
  - Ensure that every disk in the data center and on key personal computers is included in the backups.
  - Inspect safe and secure transportation of media to off-site storage.
  - Inspect how the off-site storage facility handles and secures the media.
- All business process owners verify that their employee recall lists are current.
- Issue updated versions of plans.
- Conduct an IT simulation at the recovery site.
- Conduct a work area recovery simulation at the recovery site.
- Conduct a pandemic table-top exercise.
- Conduct an executive recovery plan exercise with all simulations.
- Review Business Continuity Plans of key vendors.
- All managers submit a signed report that their recovery plans are up to date.
- Practice a data backup recall from the secured storage area to the hot site.
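The rule that every plan should be tested at least annually is easy to check mechanically. The sketch below is illustrative only: the plan names, the dates and the `overdue_plans` helper are hypothetical, and it assumes "annually" means within the last 365 days.

```python
from datetime import date, timedelta

# Hypothetical records: plan name -> date of its last completed test.
last_tested = {
    "IT systems": date(2013, 3, 15),
    "Work area recovery": date(2012, 11, 2),
    "Pandemic": date(2013, 9, 30),
}

def overdue_plans(last_tested, today, max_age_days=365):
    """Return the names of plans whose last test is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, tested in last_tested.items() if tested < cutoff)

print(overdue_plans(last_tested, today=date(2013, 12, 31)))
# prints ['Work area recovery']
```

A report like this could feed the schedule above, flagging which plans need a test slot in the next cycle.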
MHA Consulting CEO Michael Herrera discusses the Business Continuity Management (BCM) trends that he and his team have experienced across their global customer base in 2013:
- Business Continuity staffing in most organizations is not increasing. Many organizations continue to either staff minimally or use outside consultants to augment the program. Business units are having to take more accountability for their plans and use the continuity staff as Subject Matter Experts (SMEs). MHA continues to heavily augment or serve as the BCM or Disaster Recovery Office for a good number of its clients.
- Business Continuity Management (BCM) is the new Business Continuity Planning (BCP). The majority of organizations are renaming their enterprise continuity programs to Business Continuity Management.
- Enterprise Risk Management (ERM) is integrating BCM into its process and utilizing the information gathered through BIAs and Threat & Risk Assessments to support identification of risks and exposures; a good sign.
- The Business Impact Analysis (BIA) study remains the foundational component that drives the development of the BCM program. However, senior management is continually looking for us to refine the BIA process, shorten business unit participation time in the studies and ensure the process is rigorous enough to clearly identify the most critical activities and dependencies. A common weakness in most BIA studies is not having management sign off on the results, which affects alignment discussions between IT and the business.
- Recovery Time Objectives (RTOs) continued to get shorter and shorter (e.g., no downtime, 1 hour, 4 hours, etc.) in many of the companies we worked with in 2013. The influx of complex technology, automated workflows and customer demands for uptime requires business activities and their dependent systems and applications to be recovered in timeframes that mandate “real time” recovery strategies that can be activated immediately. Few companies can support that at all levels, which causes gaps between the RTOs and the Recovery Time Actuals (RTAs).
- The new norm for tolerance for data loss or Recovery Point Objectives (RPOs) across critical business activities is zero or near zero in many companies due to the use of complex technology and automated workflows that virtually eliminate manual workarounds. However, in many cases, senior management continues to believe they don’t need the data backup technology to meet the RPOs because they believe they can work manually for a period of time. We also find cases where IT cannot afford the technology to provide the short RPOs and/or the business has no idea what their RPOs are currently or what they should be.
- Business and IT RTO/RPO Alignment – Alignment remains a critical gap across a majority of companies, whether they are small, medium or large. Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) continue to be driven by Information Technology (IT) rather than by the needs of the business.
- Emergency Notification Systems – The use of ENS is becoming widespread. However, organizations routinely struggle with the processes to notify associates effectively and efficiently, with getting good contact information from associates and with holding tests on a regular basis. Also, ENS is only good if we have electricity for our technology.
- Big Data – We have heard a lot about “Big Data”: the monster-sized data warehouses that drive today’s businesses. In the old days, data warehouses had low recovery priorities; however, Big Data is now driving mission-critical applications requiring short RTOs and RPOs, a huge challenge for Information Technology.
- Companies continue to struggle with Recovery Strategies, particularly for the business units of the organization. Yes, work at home will work, but only for a limited time, and Information Security concerns are limiting its use. Information Technology strategies are making it easier and easier to recover the critical systems and applications. The problem that remains is how the business will get to that data based on its strategy. It is our opinion that in today’s complex business environments recovery strategies for RTOs of 72 hours need to be fully in place before an event occurs.
- Our most mature clients (financial, utilities) are holding live Recovery Exercises. They shut down production operations and migrate production work to their alternate sites (data center and business) for a day to validate their plans and strategies. Other clients are building in resiliency through diversity of operations, which permits them to transfer workloads across their network. Sadly, recovery exercises at many organizations are limited to desktop plan reviews, a minimal examination of true recovery capability.
- Customer Audits are filling the inbox of the BCM Office and lowering staff productivity. The sheer number and diversity of questions require management to spend hours completing these audits and reviewing them with the customer. We strongly recommend that our clients build a Customer Audit process to streamline the work, ensure consistency in responses, minimize the opportunity for unauthorized information to be disclosed and take less time.
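The gap between RTOs and RTAs noted in the trends above can be made concrete with a simple comparison. This is only a sketch: the activity names, the hour figures and the `rto_gaps` helper are made up for illustration, and it assumes each RTA was measured during a recovery exercise.

```python
# Hypothetical figures, in hours: activity -> (RTO, RTA measured in an exercise).
recovery_times = {
    "Payments processing": (1, 6),
    "Customer call center": (4, 3),
    "Payroll": (72, 96),
}

def rto_gaps(recovery_times):
    """Return {activity: hours by which the RTA exceeds the RTO} for missed RTOs."""
    return {
        name: rta - rto
        for name, (rto, rta) in recovery_times.items()
        if rta > rto
    }

print(rto_gaps(recovery_times))
# prints {'Payments processing': 5, 'Payroll': 24}
```

Activities that appear in the result are the ones where the business and IT need an alignment discussion: either the strategy has to get faster or the RTO has to be revisited.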
Overall, 2013 was a good year for BCM. Companies are continuing to recognize the need for BCM in their environments. I was reminded by our Director of Operations that BCM is still a relatively new field and we are still figuring out how to make it a refined, streamlined process.
Happy New Year to You from MHA Consulting
Exercise and testing can consist of talking through recovery actions or physically recovering things. Testing can be discussion-based or operations-based. There are several different kinds of testing, each categorized by the complexity of its set-up and the number of participants needed.
- Standalone Testing – the person who authored the plan reviews it with someone who has a similar technical background (e.g., manager, backup support, etc.). It is useful for catching omissions in the plan and can also provide insight into the process for the backup support person.
- Integrated System Testing – occurs when all components of an IT system are recovered from scratch. This type of testing can reveal many of the interfaces between IT systems required to recover a specific IT function.
- Table-Top Exercises – these simulate a disaster, but the response to it is conducted in a conference room. A disaster scenario is provided and participants work through the problem. This is similar to walk-through testing, except that the team responds to an incident scenario.
- Simulation Exercises – these take a table-top exercise one step further and include the actual recovery site and equipment. A simulation is the closest that a company can come to experiencing (and learning from) a real disaster. Simulations provide numerous dimensions that most recovery plan tests never explore. They are time consuming and expensive to conduct.
The Emergency Operations Center (EOC) should be located as close to the problem site as is safe. If you were aware of where and when a disaster would strike, you would take steps to prevent it. Therefore, unless you’re the cause of the problem, you don’t know where it will be. When establishing an EOC, evaluate possible sites based on a few criteria. Because very few companies can afford to leave a fully equipped room sitting idle until needed, most companies convert an existing facility to an EOC when needed. Often, with a bit of rearranging and some additions, a room that is already wired for data and equipped with computers can turn into an EOC.
A typical center is between 500 and 2000 square feet and should have a large closet to hold supplies for set-up. It should also be close to a building exit. It must be easily accessible by road and have ready access to delivery services, food service, and hotels. Other things to keep in mind when setting up an EOC are the power source and the telephone company. Both should be serviced by different companies than the central office. This way, your primary EOC can become a backup EOC if you have another facility in a nearby city or town.
A few options for an EOC are a personal computer training room, a large conference room with wiring, or a hotel wired for PC training that has sufficient outbound telecommunications capacity.
A note on using a backup EOC to control recovery operations: expect to relocate closer to the disaster site within 48 hours, as it will quickly become unwieldy to control operations from a distance. However, for the first few hours, even a remote facility will be extremely valuable.
A disaster scenario is a hypothetical incident that gives participants a problem to work through. The scenario may describe any disruption to the normal flow of business. When selecting a scenario, be sure to make it one that is realistic as well as broad enough to include several teams to test intergroup communications. Also, make sure the final solution is achievable. The following is a list of potential testing scenarios.
- Natural Disasters
  - Hurricane/heavy winds and rain
- Civil Crises
  - Labor strike
  - Workplace violence
  - Serious supplier disruption
  - Terrorist target neighbor (judiciary, military, federal, or diplomatic buildings)
  - Limited or no property access
- Location Threats
  - Nearby major highway, railway, pipeline
  - Hazardous neighbor
  - Offices above 12th floor (limit of fire ladders)
  - Major political event
- Network/Information Security Issues
  - Computer virus
  - Hackers stealing data
  - Data communication failure
- Data Operations Threats
  - Roof collapse
  - Broken water pipe in room above data center
  - Fire in data center
  - Critical IT equipment failure
  - Telecommunications failure
  - Power failure
  - Service provider failure
Writing a Disaster Recovery Plan is only half the challenge. The other half – the real challenge – is to test it. Testing requires time, equipment, resources, and expertise to run. Gathering all of this in one place is difficult; however, it is essential to knowing the plan will work. A tested plan has a much higher probability of succeeding. Here are some of the many benefits of testing a recovery plan:
- Demonstrating that a plan works
- Validating plan assumptions
- Identifying unknown contingencies
- Verifying resource availability
- Training team members for their recovery roles
- Determining the actual length of recovery time and the ability to achieve the desired company RTO
The Disaster Containment Manager is in charge of making tough decisions, setting the recovery effort objectives, directing staff toward priorities, and keeping the Recovery Team focused. They are also the primary contact with public emergency services at a disaster site. The Disaster Containment Manager’s responsibilities include:
- Declaring that a disaster exists and identifying which outside assistance is required including the need to activate an off-site data center.
- Coordinating with any emergency services onsite to gain access to the site ASAP.
- Making an initial damage assessment and beginning planning for emergency containment.
- Selecting a site for the Emergency Operations Center by determining if the primary site is suitable or if a backup site must be activated.
- Activating the Disaster Recovery Teams, assigning people to either Business Continuity or Business Recovery efforts.
- Personally ensuring that adequate personnel safeguards are in place.
- Assigning staff to maintain a 24-hour schedule for containment and recovery.
- Maintaining the official status of the recovery for executive management.
- Coordinating incoming material with the materials receiving staff.
- Coordinating the use of skilled trades, such as contract labor, electricians, welders, and millwrights, with facility engineering management.
- Assessing personnel strengths and weaknesses in terms of knowledge, skill, and performance to balance labor expertise and staffing.
- Watching for signs of excessive stress and fatigue.
- Identifying “at-risk” employees, such as those deeply affected by traumatic stress.
- Designating a backup person to assume the Disaster Containment Manager’s role while they are resting or not on the disaster site.