Root Cause Analysis (RCA) done right

Shahir Chundra, CC BY-SA 4.0, via Wikimedia Commons

If you work in any area that offers services to customers, you have probably been hit with a request for a Root Cause Analysis.

A “Root Cause Analysis” request is basically just a customer saying:

Hey, something went really wrong and I want to know what caused the issue and how do we stop it from happening again?

Lucky for us, there is a standard method to conduct these RCAs. It’s nothing that hasn’t been done before.

What is an RCA

As so often, Wikipedia already has the answer:

RCA can be decomposed into four steps:

  • Identify and describe the problem clearly.
  • Establish a timeline from the normal situation until the problem occurs.
  • Distinguish between the root cause and other causal factors (e.g. using event correlation).
  • Establish a causal graph between the root cause and the problem.

Overall structure

Your Root Cause Analysis (RCA) document should include at least the following:

  • Executive Summary
  • Problem Summary
  • User / Business Impact
  • Technical Analysis and Root Cause
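If you want to keep every RCA in the same shape, the section list above can be captured in a small template. This is only a minimal sketch: the class `RcaReport`, its field names, and the sample texts are made up for illustration and are not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class RcaReport:
    """Hypothetical container for the four sections listed above."""
    executive_summary: str
    problem_summary: str
    impact: str
    technical_analysis: str

    def sections(self):
        # Section titles paired with their text, in the order of the list above.
        return [
            ("Executive Summary", self.executive_summary),
            ("Problem Summary", self.problem_summary),
            ("User / Business Impact", self.impact),
            ("Technical Analysis and Root Cause", self.technical_analysis),
        ]

    def missing_sections(self):
        # Titles of sections that are still empty or whitespace-only.
        return [title for title, body in self.sections() if not body.strip()]

    def render(self):
        # Plain-text document: each title followed by its body.
        return "\n\n".join(f"{title}\n{body}" for title, body in self.sections())


# Made-up draft with the impact section still unwritten.
draft = RcaReport(
    executive_summary="The website was unreachable.",
    problem_summary="The web server shut down.",
    impact="",
    technical_analysis="Still under investigation.",
)
print(draft.missing_sections())
```

Checking `missing_sections()` before sending is a cheap way to avoid handing the customer a half-filled document.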

Identify the problem: asking why

The main challenge is to really get to the root cause. That means conducting an investigation and following the trail of evidence all the way. A website may have been unreachable because the server shut down, but what caused the shutdown?

Root Cause Analysis is best conducted by asking “Why did the problem occur?” five times, as in this example from Wikipedia:

  1. Why? — The battery is dead. (First why)
  2. Why? — The alternator is not functioning. (Second why)
  3. Why? — The alternator belt has broken. (Third why)
  4. Why? — The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
  5. Why? — The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

Although I don’t like such examples, you can see that you need to follow the white rabbit down the hole. Why don’t I like such examples? Easy: they feed the tendency to call things human error or misconfiguration. So please do us all a favor and don’t stop when you find a human doing something wrong. Take a minute and check whether there could be one more why, just for the sake of decency. Maybe:

  • did the documentation ever say that the alternator belt needs to be replaced from time to time?
  • was the car at a mileage where you would expect the belt to break?

Stuff like that. It is rather easy and common to blame the user or customer for misconfiguration, but try not to go that route. If you do end up with a user-introduced issue, at least check whether there is anything they could have done differently, and whether your documentation and guidance would actually have helped them do the right thing.
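The "one more why" rule can be turned into a simple check on the chain of answers. This is a rough sketch, not a standard tool: the function name `needs_another_why` and the list of blame phrases are assumptions chosen for illustration, and a real review still needs a human to judge the answers.

```python
# Phrases that suggest the chain stopped at blame instead of a cause.
# This word list is an illustrative assumption, not an official one.
BLAME_MARKERS = ("human error", "misconfiguration", "user error")


def needs_another_why(answers, minimum=5):
    """Return True if the chain is too short or ends on a blame-style answer."""
    if len(answers) < minimum:
        return True
    last = answers[-1].lower()
    return any(marker in last for marker in BLAME_MARKERS)


# The five answers from the Wikipedia example quoted above.
whys = [
    "The battery is dead.",
    "The alternator is not functioning.",
    "The alternator belt has broken.",
    "The alternator belt was well beyond its useful service life and not replaced.",
    "The vehicle was not maintained according to the recommended service schedule.",
]

print(needs_another_why(whys[:3]))  # chain stops after three answers
print(needs_another_why(whys[:4] + ["Human error by the owner."]))  # ends on blame
```

Both calls print `True`: the first chain is too short, and the second ends on "human error". A chain that passes this check is not automatically complete; the questions above about documentation and mileage show that a sixth why can still be worth asking.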

Workarounds and solutions

Once we have found the root cause, we should be able to provide the most important parts of the RCA: the “workaround” and the “solution”. This applies even to misconfiguration, when the customer broke something on their own. The RCA should tell them to back up their configuration, use etckeeper*, or establish or update their change management process. *(a Linux tool that keeps everything under “/etc” in a version control repository such as Git, so you can restore previous versions of config files)

Keep in mind that you are the expert on your product, so if you can give a hint on how to make things better, that’s the information the customer needs to get.

  1. Provide a workaround
    Something that helps directly. In the IT world this can be anything from restarting a server to adding a firewall rule that blocks the funny packets from killing the network stack. It just has to ease the pain instantly.
  2. Provide a solution
    Now that you know that something did happen, what do we do about it so it never happens again?


An RCA should show the following

  1. You can effectively investigate your own software, infrastructure, systems, cars, or whatever else
  2. You can provide guidance and a response in a well-formatted way
  3. You can do all of this in a standardized and timely manner
  4. You are able to communicate issues and provide help for your customer

So it comes down to trust. RCAs are maybe the easiest way to show customers that they can trust you.

That’s it for today, please leave a 👏 and be excellent to each other.