How to Perform an Effective Root Cause Analysis at MishiPay

September 24, 2021

Why find a Root Cause?

A Root Cause Analysis provides us with the information needed so that we take the right corrective actions to prevent problems from reoccurring.

Imagine the lights go out in your house and you discover that the fuse protecting the circuit has blown. Upon investigation, you find that the current being drawn was too high. The current was too high because a junction box was full of water. The junction box was full of water because you had a roof leak.

A Root Cause Analysis has established the reason why, along with a course of action (repair the roof) that will prevent further trouble. Alternatively, you could just replace the fuse and hope for the best. You then risk the event reoccurring or, worse, an electrical fire.

How do you find the Root Cause? The Five Whys…

A simple way to establish the Root Cause is to use the ‘Five Whys’. This is an iterative process which keeps asking for the cause until you find the Root Cause.

Often a Root Cause is a process issue, a lack of training, or a failure of policy which has allowed the issue to occur.

Q. Why doesn’t your car start?
A. The battery is dead.
Q. Why is the battery dead?
A. The alternator is not working.
Q. Why is the alternator not working?
A. The alternator drive belt had broken.
Q. Why had the alternator drive belt broken?
A. The alternator belt was well beyond its useful service life and not replaced.
Q. Why was the alternator belt well beyond its useful life and not replaced?
A. The car has not been maintained according to the recommended service schedule.

As you will note, the final answer has changed from a technical issue to a matter of process. This is actionable – we can do something about this to prevent further occurrences.
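As a rough illustration, the chain above can be written down as data and checked mechanically. This is only a sketch: the `Why` record, the `kind` labels ("technical" and "process") and the `root_cause` helper are hypothetical names invented for this example, not part of any MishiPay tooling.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    kind: str  # hypothetical label: "technical" or "process"


def root_cause(chain):
    """Return the final answer, but only once it is a process-level cause."""
    if not chain:
        raise ValueError("the chain is empty: ask the first why")
    last = chain[-1]
    if last.kind != "process":
        raise ValueError("keep asking why: the last answer is still technical")
    return last.answer


# The car example from the text, with each answer labelled by hand.
car_chain = [
    Why("Why doesn't your car start?",
        "The battery is dead.", "technical"),
    Why("Why is the battery dead?",
        "The alternator is not working.", "technical"),
    Why("Why is the alternator not working?",
        "The alternator drive belt had broken.", "technical"),
    Why("Why had the alternator drive belt broken?",
        "The alternator belt was well beyond its useful service life "
        "and not replaced.", "technical"),
    Why("Why was the alternator belt well beyond its useful life "
        "and not replaced?",
        "The car has not been maintained according to the recommended "
        "service schedule.", "process"),
]

print(root_cause(car_chain))
```

Stopping after the third answer would raise an error here, which mirrors the point of the exercise: a broken drive belt is a fact, not yet something we can change in how we work.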

How do you know when you’ve reached the ‘true’ Root Cause?

In general a ‘true’ Root Cause is not a technical matter – it is a matter of policy, process, training, general behaviours, etc. These are factors which we can all influence.

Beware of falling into the trap of ‘glib’ root causes – these should be avoided because, whilst they may be true to some extent, they are invariably out of our control and over-simplify the complex issues companies have to deal with day to day.

Examples of responses to be avoided are:

A. There were not enough developers.
A. There was insufficient investment.
A. There was too little time to do a proper job.
A. Insufficient resource was made available.

Invariably, everything instead distills down to a process lacking in some way (either there was no process, or there was one and it wasn’t being followed). These are factors we can influence. Your roof didn’t leak because you spent too little money on it. It leaked because you weren’t carrying out basic checks after a period of heavy rainfall, when any roof might leak.

Can there be more than one Root Cause?

Most complex failures are the result of multiple Root Causes. This is because complex systems are generally built to have multiple layers of defence to prevent disaster, downtime, failure, etc.

Study any plane crash and you’ll discover that there were often multiple opportunities for the disaster to have been averted. If only…

A popular industry model is the ‘Swiss Cheese’ model, in which every one of our defensive layers has imperfections. If all of these imperfections ever align, we are at risk of a disaster.

An extra inspection, audit, cross-check, or process step carried out could remove some of the holes from one or more layers, thereby neutralizing the threat. We often call these controls.
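The effect of a control can be sketched with some simple arithmetic. The example below assumes, purely for illustration, that each layer misses a problem independently of the others and that we can put a number on how often it does so. Real layers are rarely independent, and the miss rates used here are made up. `breach_probability` is a hypothetical helper name.

```python
import math


def breach_probability(miss_rates):
    """Chance that a problem slips through every layer.

    Assumes the layers fail independently, so the individual
    miss rates can simply be multiplied together.
    """
    for rate in miss_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"miss rate must be between 0 and 1, got {rate}")
    return math.prod(miss_rates)


# Made-up miss rates for three layers, e.g. code review, testing, monitoring.
before = breach_probability([0.1, 0.2, 0.5])

# An extra cross-check halves the holes in the last layer only.
after = breach_probability([0.1, 0.2, 0.25])

print(f"before: {before:.4f}, after: {after:.4f}")
```

Under these assumptions, halving the holes in any single layer halves the overall risk, which is why a cheap control on an early layer can be worth as much as an expensive fix on the last one.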

Where do you start looking for a Root Cause?

It is often tempting to start the search from the issues you have already uncovered. For instance, if you’ve already started analysing network traffic, database performance or system logs, you probably expect to be halfway to finding the cause.

Whilst this may be true, the danger of this approach is that you miss other important factors which contributed to the failure. Following the example of the Swiss Cheese model, this is like focusing on just the holes in the last piece of cheese rather than on all the layers.

One of the reasons why this can be a mistake is that fixing those other layers can often bring very quick wins. A small change in rota or process further up could significantly reduce risk whilst longer term technical changes are being put into place.

Q. Why did the plane crash?

↓

A. The wings fell off.

Q. Why did the wings fall off?

↓

A. The bolts cracked.

Q. Why did the bolts crack?

↓

A. The wrong bolt was used.

Q. Why was the wrong bolt used?

↓

A. The bolts were mislabeled.

Q. Why were the bolts mislabeled?

A. There was insufficient workshop lighting.

Q. Why was there insufficient workshop lighting?

↓

A. We failed to audit the supplier.

A. We failed to audit our workshop processes.

A good starting point is always the business impact: uptime, revenue, margin, customer feedback, etc. In the above example we asked why the plane crashed in the first place, rather than fixating on our bolt supplier. In so doing we identified a failing on our side and a course of action which might have prevented a disaster: simple workshop checks were routinely not carried out due to poor lighting.

Gathering information

The key to an effective Root Cause Analysis is data – lots of it – and generally from as many sources as possible.

The more people feeding into an RCA, the faster you will be able to pull things together. Be aware, however, that information from various sources will often be biased (as will your thinking) – and the aim here is to normalize that information.

Your RCA must be based upon metrics – not anecdotes. Yes it was slow – how slow? Yes there were thousands of errors – how many thousand exactly? You should also be prepared to push back whenever metrics and anecdotes differ – this in itself may be another disaster waiting to happen.
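As a small illustration of turning "it was slow" and "there were lots of errors" into numbers, the sketch below summarises a handful of request records. The records, the field layout (latency in milliseconds, HTTP status) and the `summarise` helper are all invented for this example; in practice the data would come from your own logs or dashboards.

```python
from statistics import median

# Made-up request samples: (latency in milliseconds, HTTP status code).
samples = [
    (120, 200),
    (135, 200),
    (4100, 500),
    (150, 200),
    (3900, 500),
    (140, 200),
]


def summarise(records):
    """Reduce raw request records to a few figures an RCA can quote."""
    latencies = [latency for latency, _ in records]
    errors = sum(1 for _, status in records if status >= 500)
    return {
        "requests": len(records),
        "errors": errors,
        "error_rate": errors / len(records),
        "median_ms": median(latencies),
        "max_ms": max(latencies),
    }


summary = summarise(samples)
print(summary)
```

Quoting the median alongside the worst case matters here: the typical request in this sample was fine, while a third of them failed after roughly four seconds, which is a very different story from "the system was slow".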

Check, double-check, and triple-check timelines. Pinning down times, especially across timezones, can be hard. CSEs and shoppers will be talking relative to local time, dashboards could be in just about anything, and developers will often be thinking in GMT or IST, unless they are looking at system logs, when they will be locked into UTC…
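One way to take the guesswork out of a timeline is to convert every report to UTC before ordering the events. The sketch below uses Python's standard `zoneinfo` module (Python 3.9+). The three reports, their times and the `to_utc` helper are hypothetical and exist only to show the idea.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_time, tz_name):
    """Interpret an ISO timestamp in the given timezone and convert it to UTC."""
    naive = datetime.fromisoformat(local_time)
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


# Made-up reports: (source, time as reported, timezone the reporter was using).
reports = [
    ("shopper complaint", "2021-09-24T14:05", "Europe/London"),
    ("developer note", "2021-09-24T18:40", "Asia/Kolkata"),
    ("system log", "2021-09-24T13:07", "UTC"),
]

# Sort by the UTC instant, not by the wall-clock time each person quoted.
timeline = sorted((to_utc(when, tz), source) for source, when, tz in reports)

for when, source in timeline:
    print(when.strftime("%Y-%m-%d %H:%M UTC"), source)
```

Read at face value, the developer note looks like it came hours after everything else. Once converted, all three events fall within five minutes of each other, with the shopper complaint first.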

Group meetings can be a time-efficient way to gather data – however individuals may not contribute, or collective bias may come into play. Reinforce that the RCA process is not about blame, but learning lessons. What went well? What could have gone better?

Experience helps… but can also get in the way. The more someone knows about a subject, the more likely they are to fall into the ‘favourite-cause-itis’ trap.

Complete the Correction of Error Report

Once you have created the RCA, the Correction of Error Report and its follow-up actions become the most critical part of your investigation. The changes and actions highlighted here are what will prevent a repeat of the problem.

Some important questions your Correction of Error Report should ask include:

Did we pick this up through our internal reporting / dashboards – or have to wait for a customer to report this?
What route did the original issue take through our support processes before it reached the desk of a developer?
Were the drill books in place to triage and fix this issue – or were they missing?
What did we do which had a positive impact? What had a negative or no impact? Do we understand why this was?
Where did we get lucky?
How could we have cut the blast radius in half?

I hope that this has been a useful tour of the root cause analysis – feel free to reach out to me if you have any questions.