SPOF (Single Point of Failure) Analysis

When planning a new system, or analyzing an existing one in preparation for scaling, there are a few key things one must do.  One of those tasks is to perform a Single Point of Failure (SPOF) analysis.  SPOFs are the enemy of availability for any system.  This is an exercise done with the input of a cross-functional team of key persons from operations, business, and development.  The goal of this analysis is not to actually do the work but to identify the work that needs to be done in the context of the business goals.

This analysis moves through phases similar to those of many technology projects.  They are usually something such as:
  • Define the Goals of the Analysis
  • Design the Plan to Achieve the Goals
  • Execute the Plan
  • Analyze the Data
  • Produce Report of Results

These steps may vary a bit depending on the organization size, team, available resources, and the size of the environment.  But, in general, SPOF analysis will follow that pattern.

Some other important points to consider when thinking about SPOFs:

It can be better to have SPOF analysis done by a third party who is actually less familiar with the systems but has proper relevant experience.  Those who are very close to a system have a tendency not to see things because they are too near.

Having SPOFs does not necessarily mean that someone made a mistake.  The analysis is not a weapon.  If you treat it that way, you can be sure people will not report SPOFs when they find them.  Oftentimes we live with SPOFs on purpose because of resource limitations or opportunity costs.  If fixing an SPOF will cost a million dollars, it might be better to accept the potential downtime if something goes wrong.  Whether this is or is not true in any given situation is a complex business question much more than it is a technical matter.
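To make that trade-off concrete, here is a minimal sketch in Python that compares the one-time cost of a fix with the expected yearly cost of leaving the SPOF in place.  The field names and every figure in it are hypothetical and chosen only for illustration; real inputs would come from the business, and a real decision would weigh far more than this simple arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SpofEstimate:
    """Rough, hypothetical inputs for one SPOF; all figures are illustrative."""
    name: str
    fix_cost: float                # one-time cost to remove the SPOF
    failures_per_year: float       # expected failure frequency
    hours_down_per_failure: float  # expected outage length per failure
    cost_per_hour_down: float      # business loss per hour of downtime


def expected_annual_loss(e: SpofEstimate) -> float:
    """Expected yearly downtime cost if the SPOF is left in place."""
    return e.failures_per_year * e.hours_down_per_failure * e.cost_per_hour_down


def years_to_break_even(e: SpofEstimate) -> float:
    """Years of avoided downtime needed to pay back the cost of the fix."""
    loss = expected_annual_loss(e)
    return float("inf") if loss == 0 else e.fix_cost / loss


example = SpofEstimate(
    name="single core database",   # hypothetical component
    fix_cost=1_000_000,
    failures_per_year=0.5,         # assumed: one failure every two years
    hours_down_per_failure=4,
    cost_per_hour_down=20_000,
)
print(expected_annual_loss(example))  # 40000.0
print(years_to_break_even(example))   # 25.0
```

With these assumed numbers the fix would take 25 years to pay for itself, which is the kind of result that can make living with the SPOF a defensible business choice.  Different assumptions can easily reverse the conclusion.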

The report produced by an SPOF audit should be reviewed by the entire technology and business team to determine the potential impact to the business.  The findings should then be entered into whatever passes for a backlog and a technology architectural review board so that they can be properly analyzed, ranked, and put in line to be fixed.

SPOF analysis should be done periodically throughout a product or service life cycle.  Things change every single day.  Last year’s SPOF analysis is probably no longer valid.  Comparing SPOF analyses over time can also be very enlightening, because it exposes endemic problems that consistently get swept under the rug.
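As a rough illustration of comparing analyses over time, the sketch below counts how many reports each finding appears in and lists the ones that keep coming back.  It assumes each report has been reduced to a set of SPOF identifiers; the function name, the report labels, and the identifiers are all hypothetical.

```python
from collections import Counter


def recurring_spofs(reports: dict[str, set[str]], min_reports: int = 2) -> list[str]:
    """Return SPOF identifiers that appear in at least min_reports analyses."""
    counts = Counter(item for findings in reports.values() for item in findings)
    return sorted(name for name, n in counts.items() if n >= min_reports)


# Hypothetical findings from three yearly analyses.
reports = {
    "2021": {"primary-db", "vpn-gateway", "deploy-engineer"},
    "2022": {"primary-db", "deploy-engineer", "dns-provider"},
    "2023": {"primary-db", "dns-provider"},
}
print(recurring_spofs(reports, min_reports=3))  # ['primary-db']
```

A finding that shows up in every report is a good candidate for the question of why it has never been fixed or formally accepted.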

SPOF analysis does not just apply to technology.  It also applies to business organizations.  One person who, as far as I can tell, understands this far better than most is Warren Buffett.  I think he had a clearly articulated (albeit secretive) succession plan when I was still in diapers.  Even at Berkshire Hathaway, Warren Buffett himself has made sure that he is not a single point of failure; a true visionary.