Hanging by a Thread: Protecting Yourself from Single Points of Failure

Richard Long

Many organizations are exposed, to some extent, to single points of failure: resources that have no redundancy and whose loss could have a significant impact.

In fact, this is a surprisingly widespread problem which could leave many if not most organizations hanging by a thread, whether they know it or not.

In today’s blog, I’ll sketch out some of the main issues surrounding SPOFs and share some tips for protecting yourself against their impact.

 

 

SINGLE POINTS OF FAILURE IN BRIEF

A single point of failure (SPOF) is a person, facility, piece of equipment, application, or other resource for which there is no redundancy in place. If such a resource goes down, any system or process of which it is an essential part will come to a halt.

Here are a few examples of common SPOFs:

  • Single pieces of equipment, such as network devices or servers, whose failure may affect one or more applications or processing functions
  • Expensive pieces of equipment where only one is needed for processing, such as a custom stamping machine
  • Manufacturing locations for specific products which cannot be made anywhere else
  • An individual with special knowledge or the only person performing a process

Such situations pose obvious problems from the point of view of business continuity. If the resource goes down, so does your operation, or at least that part of it which is dependent on the resource.

 

RUNNING THE GAMUT

So what can be done about SPOFs? It depends. First, you have to identify and understand them. SPOFs run the gamut. Some exist only because of an oversight and, once recognized, are easy to fix. Others are well-known but prohibitively expensive to fix. In such cases, the organization might make an informed decision to simply live with the risk. (This is a risk management strategy known as risk acceptance. It’s not ideal, but it accepts the reality that sometimes our actions are constrained by costs.)

 

LIKE A PRAYER

Many people are familiar with the Serenity Prayer: “God grant me the serenity to accept the things I cannot change, courage to change the things I can, and the wisdom to know the difference.”

It would be hard to give a better summary of the best approach for dealing with single points of failure.

Sometimes you have to accept the existence of the SPOF (if, for example, it’s too expensive to duplicate the resource).

Sometimes you can make a change and eliminate the SPOF, though it might indeed take courage—as well as hard work, determination, and similar qualities.

It’s the third part of the formula—the wisdom to know the difference—that is most likely to trip companies up, in my experience.

Organizations tend to be too quick to throw in the towel and decide to simply live with the SPOF and its attendant risks. This is unfortunate because in many cases, there are steps that could be taken to reduce the company’s exposure. Even if the risk can’t be eliminated, it can often be reduced, provided the organization makes the effort.

 

____________________________________________________

 

Below I’ll go into more detail concerning how you can help your organization eliminate, reduce, or coexist with your SPOFs. The best approach is: identify it, classify it, and remediate it.

 

IDENTIFY IT

The first thing you have to do is identify the SPOF. This should be a component of both your Business Impact Analysis (BIA) and your Risk Assessment. Be aware of the possibility that you have SPOFs and really dig in and find them. It can be challenging. Sometimes people are aware of the SPOF but don’t want to share what they know because they fear it might reflect poorly on them or their department. Try to keep the focus on finding the SPOFs rather than pointing fingers.
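One way to make the search systematic is to list what each critical process depends on and flag anything that only one resource can provide. The following is a minimal sketch of that idea, not a prescribed BIA method; the inventory layout, the example process and resource names, and the function name find_spofs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory layout: process -> capability -> resources able to provide it.
Inventory = Dict[str, Dict[str, List[str]]]


def find_spofs(inventory: Inventory) -> List[Tuple[str, str, str]]:
    """Return (process, capability, resource) for every capability
    that exactly one distinct resource can provide."""
    spofs = []
    for process, capabilities in inventory.items():
        for capability, providers in capabilities.items():
            distinct = set(providers)
            if len(distinct) == 1:
                spofs.append((process, capability, next(iter(distinct))))
    return sorted(spofs)


# Illustrative data only; these names are made up.
inventory: Inventory = {
    "order fulfillment": {
        "core network switch": ["switch-a"],
        "order database": ["db-primary", "db-replica"],
    },
    "payroll": {
        "payroll run": ["sole payroll specialist"],
    },
}

for process, capability, resource in find_spofs(inventory):
    print(f"{process}: '{capability}' depends only on {resource}")
```

An inventory like this is only as good as the interviews behind it, so the output is a starting list to confirm with the business units rather than a final answer.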

 

CLASSIFY IT

After you find a SPOF, you have to classify it in terms of how easy it is to remediate. Put it in one of these three categories:

  • Can be remediated directly and easily, within a reasonable time and budget.
  • Cannot be remediated directly; however, a reliable workaround exists or could be developed.
  • Cannot be remediated, and there is nothing that can be done, within reason, to work around it.

Note: Many SPOFs can be effectively remediated; it just takes creative thinking and effort to figure out a way to do it. Don’t give up too easily!

You should also classify the SPOF in terms of the probability that it will fail (low, medium, or high) and the potential impact of that failure on the organization (low, medium, or high).
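The probability and impact ratings can be combined into a single number for ranking which SPOFs to tackle first. The sketch below assumes a 1-to-3 scale for each rating and multiplies them, which is a common risk-matrix convention but not something this post prescribes; the class name Spof, the LEVELS mapping, and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumed numeric scale for the low/medium/high ratings.
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Spof:
    name: str
    probability: str  # "low", "medium", or "high"
    impact: str       # "low", "medium", or "high"

    def score(self) -> int:
        """Priority score: probability rating times impact rating (1 to 9)."""
        return LEVELS[self.probability] * LEVELS[self.impact]


def prioritize(spofs: List[Spof]) -> List[Spof]:
    """Return the SPOFs ordered from highest score to lowest."""
    return sorted(spofs, key=lambda s: s.score(), reverse=True)


# Illustrative ratings only.
spofs = [
    Spof("custom stamping machine", "low", "high"),
    Spof("core network switch", "medium", "high"),
    Spof("sole process owner", "medium", "medium"),
]

for s in prioritize(spofs):
    print(f"{s.score()}  {s.name}")
```

A score like this is only a sorting aid. Two SPOFs with the same score can still differ in how easy they are to remediate, which is why the three remediation categories above are tracked separately.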

 

REMEDIATE IT

After you identify and classify the single point of failure, you can start remediating it. Here’s how:

  • If the SPOF can be remediated, create a plan to do so and implement it based on priority. Obtain and install the redundant equipment. Train your staff on how to operate whatever redundancy you are putting in place. Note that in addition to installing new equipment, you might need to increase the resiliency of applications and technologies. You might also need to modify processing or application setup so that it can self-heal or self-correct.
  • If a workaround can be implemented for the SPOF, make sure to document it and to train the staff on it. Also make sure that any dependent processes, equipment, and staff are in place.

However, if the single point of failure cannot readily be eliminated or worked around, you might have to live with it. But even here, you can cushion yourselves against a potential failure of this resource. Here are some options:

  • If the SPOF is, for example, a manufacturing facility making a specialized product, you could increase stock levels and identify and prepare a third-party vendor that could ramp up and produce the item in the event of an emergency. Your increased stock levels could cover the shortfall until this vendor is able to come online.
  • If the SPOF is an individual with special knowledge or the only person performing a process, and there is no one else at the organization with the necessary knowledge or skill set, and you cannot hire or train internal staff, then consider identifying a third party who could take over in an emergency.
  • If the SPOF is a self-managed data center with insufficient redundancy in power, cooling, and other environmentals, and it does not make sense to remediate the gaps, then you could: migrate the recovery of hypercritical applications to the cloud; create a detailed listing of equipment and recovery procedures; utilize a traditional recovery standby site to provide equipment, space, and power; and/or work with the business units to create a viable workaround for critical applications.
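For the stock-level option in the first bullet above, a rough sizing rule is: buffer stock = daily demand × days until the backup vendor can deliver, plus whatever safety margin you choose. The sketch below works through that arithmetic under simplifying assumptions (steady demand, a known ramp-up time); the function name, parameters, and figures are hypothetical and not drawn from any real case.

```python
def buffer_stock_needed(daily_demand: int, ramp_up_days: int, margin_pct: int = 0) -> int:
    """Units to hold so that stock covers demand until a backup vendor
    comes online, with an optional safety margin given in whole percent.

    Integer arithmetic is used so the result is rounded up exactly.
    """
    base_units = daily_demand * ramp_up_days
    scaled = base_units * (100 + margin_pct)
    return -(-scaled // 100)  # ceiling division


# Illustrative figures: 400 units/day, vendor needs 21 days, 10% margin.
print(buffer_stock_needed(400, 21, 10))
```

Real demand is rarely steady, so a figure like this is best treated as a floor to discuss with operations rather than a target.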

 

HANG ON TO THE TOWEL

Is your organization dangling by a thread at any point? Are there single points of failure that you currently know about? Do you suspect there might be others?

The conscientious practice of business continuity says that you should identify the SPOFs at your organization, classify them, and then remediate them.

Some SPOFs you will be able to fix fairly easily, some you will be able to fix with effort, some you might be able to reduce but not eliminate, and some you might just have to learn to live with.

But don’t be too quick to throw in the towel on eliminating or reducing your exposure! With creativity and effort, most SPOF-related risks can be significantly brought down.

 

FURTHER READING

For more on this and other hot topics in Business Continuity and IT/Disaster Recovery, check out this recent post from MHA Consulting and BCMMETRICS:

Know Your Own BC Program