Failure Modes and Effects Analysis, or FMEA for short, is widely used across many industries, often in the design phase of new equipment but also to troubleshoot poorly performing equipment. In this article I will give a detailed overview of FMEAs: their origin, when to use them and how to conduct one. I've also included an easy-to-use FMEA template. This is a long and detailed post, but I will always bring it back to our main focus: how to use FMEAs to improve plant reliability.
What is an FMEA?
A Failure Modes and Effects Analysis (FMEA) is often one of the first steps you would undertake to analyse and improve the reliability of a system or piece of equipment.
During an FMEA you break the selected equipment down into systems, subsystems, assemblies and components and determine how these could fail.
You analyse why the failure would happen and what the consequence would be.
The analysis is completed by assigning preventive or corrective actions to improve reliability.
An FMEA analysis helps you to identify how a piece of equipment might fail. You do this based on experience with similar types of equipment. Or in some cases purely on the basis of sound engineering logic.
FMEAs are widely used in the development phase of a product, but they are also used to analyse failures of existing equipment already in operation. In that case the FMEA is often used to review and optimise the preventive maintenance program.
What is the difference between FMEA and FMECA?
An FMECA (Failure Modes, Effects and Criticality Analysis) is an extended FMEA that includes a risk assessment to prioritise the failure modes with the biggest impact.
These failure modes are then reviewed for possible mitigations to reduce the risk.
One method of prioritisation that is often used in FMECA is the Risk Priority Number (RPN). We’ll talk about that later in more detail.
It’s important to realise that an FMEA or FMECA is really an exercise in engineering analysis.
As such, an FMEA must be done in a structured process with the participation of the right subject matter experts. You simply can’t do an FMEA sitting at your desk on your own. Ok, strictly speaking you could, but it would be a waste of your time.
What makes up an FMEA
The main elements of an FMEA are:
- The potential failure mode that describes how the item fails to perform as intended;
- The cause(s) of the potential failure mode;
- The effect of the failure, either on the system the item is part of or on the people using it.
And in the case of a Failure Modes, Effects and Criticality Analysis (FMECA) this is expanded to include a risk assessment of the potential failure modes that have been identified:
- An assessment of the risks associated with the potential failure modes and their effects;
- A prioritised list of corrective actions to address the failure modes and effects with the highest risk.
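One way to picture these elements is as the fields of a single worksheet row. The sketch below is only an illustration in Python, not a layout prescribed by any standard; the class name `FmeaEntry`, its field names and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FmeaEntry:
    """One row of a hypothetical FMEA worksheet (all names are illustrative)."""

    item: str            # the component or function being analysed
    failure_mode: str    # how the item fails to perform as intended
    causes: List[str]    # one or more causes of the failure mode
    effect: str          # effect on the system or on the people using it
    # FMECA extension: risk assessment and corrective actions
    risk_score: Optional[int] = None
    corrective_actions: List[str] = field(default_factory=list)

    def has_criticality(self) -> bool:
        """True once the FMECA part (risk score and actions) is filled in."""
        return self.risk_score is not None and len(self.corrective_actions) > 0


entry = FmeaEntry(
    item="Pump bearing",
    failure_mode="Bearing seized",
    causes=["Lack of lubrication"],
    effect="Pump stops and flow is lost",
)
print(entry.has_criticality())  # False: only the FMEA part is filled in
```

Keeping the FMECA fields optional mirrors the idea that an FMECA is an FMEA with a risk assessment added on top.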
For the rest of this article, I am going to use FMEA and FMECA interchangeably.
I know that is strictly speaking not correct, but it makes the article so much easier to read.
Where did FMEA come from?
Before we delve deeper into FMEAs and how to use them, let’s have a quick look at the origin of the FMEA.
Failure Modes and Effects Analysis (FMEA) was developed by the American military in the late 1940s to investigate problems with malfunctioning munitions. In response to those problems they developed a structured process to eliminate all potential root causes. 1
This was one of the first highly structured, systematic approaches to failure analysis. It was documented in MIL-P-1629 “Procedures for Performing a Failure Mode, Effects and Criticality Analysis” dated November 9, 1949.
The methodology worked well and was later adopted by the nuclear and aerospace industries, including NASA. Apparently, NASA has gone as far as crediting the success of the moon landings to the use of FMEAs.
From there onwards, the often-quoted MIL-STD-1629 was developed by the US Navy.
Who uses FMEA?
The FMEA found its way into the private sector, initially through car manufacturer Ford in the 1970s. Failure Modes and Effects Analysis is now well established in the automotive industry, the energy sector and many other manufacturing industries.
In fact, even white goods manufacturers are nowadays using FMEA analysis during their design process.
And the FMEA has become a cornerstone in most quality management systems.
FMEA Standards and Guidelines
There are several guidelines and standards that describe how to conduct a failure mode and effects analysis. The standards cover the FMEA process and the FMEA template to use, and provide detailed technical guidance.
Currently, the main standards for FMEAs are:
- IEC 60812 Analysis techniques for system reliability – Procedure for failure mode and effects analysis (FMEA). This International Standard describes Failure Mode and Effects Analysis (FMEA) and Failure Mode, Effects and Criticality Analysis (FMECA) and provides guidance on how to apply both approaches. Available via the IEC website.
- Society of Automotive Engineers (SAE) Standard J1739, Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA), Revision 1, January 2009, available via the SAE website for USD 78 per copy.
- SAE Aerospace Recommended Practice ARP5580: Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications, 2001. This standard is available for USD 78 at the SAE website.
- FMEA-4 Potential Failure Mode and Effects Analysis, 2008, Automotive Industry Action Group (AIAG), which is a technical equivalent of SAE J1739 written specifically for the automotive industry. The standard is available via the AIAG website at a cost of USD 150 to USD 225 for non-members.
- Another frequently quoted standard for FMEAs is MIL-STD-1629A: Procedures for Performing a Failure Mode Effects and Criticality Analysis published by the U.S. Department of Defense in 1984. However, this standard was cancelled by the U.S. Department of Defense in 1998.
In the many articles on the internet both SAE J1739 and MIL-STD-1629A are usually quoted as the main standards for FMEAs. But if you are working in a processing industry like Oil & Gas, Mining, Chemicals, Pharmaceuticals etc. I strongly believe you would be better off using the SAE Aerospace Recommended Practice ARP5580 or the IEC 60812 standard.
Different Types of FMEAs
There are several approaches to conducting FMEAs. You might hear people talk about a system FMEA, a design FMEA (DFMEA) or a process FMEA (PFMEA).
Each is slightly different and uses different worksheets and templates, but they always come back to the same key questions:
- how can this fail?
- why would it fail?
- and what would be the consequence?
If you have read some of my other articles, you’ll know that I like to keep things simple so that we can actually put them to use. When it comes to FMEAs in the world of maintenance and reliability I prefer to think of FMEAs as either:
- A functional FMEA, like you would do during an RCM analysis, or
- A hardware based FMEA where you go by component.
The functional FMEA is, in my mind, most suitable when you are still early in the lifecycle of your equipment. So when it is still on the drawing board and you may not have a full design, the best approach is to use a functional FMEA.
When you have your equipment already in operation, potentially already for many years, the hardware FMEA can often be the easiest route to go down. You don’t have to worry about writing accurate function statements. Instead you break your equipment down into system, subsystem, components etc. to the level that is useful. And you determine the failure modes from there.
Because hardware FMEAs don’t determine functions they don’t naturally differentiate between design capacity versus what you actually need from your equipment.
And that’s one of the risks associated with hardware FMEAs.
If you don’t accurately determine the function of the equipment and its subsystems you may very well end up preventing failure modes that don’t really matter. Without an accurate function statement you may not be able to accurately assess the impact of a failure on the equipment and the plant as a whole. And you may end up with preventive maintenance that is simply not worth the effort. We’ve all seen that before.
What is DFMEA?
Quite simply a DFMEA is a Design Failure Modes and Effects Analysis, in other words, an FMEA that is executed during the design phase of a project. It is a tool to design for reliability.
What is PFMEA or Process FMEA?
A PFMEA or Process Failure Modes and Effects Analysis is an FMEA that is conducted on a process, i.e. it analyses how a process may fail and what the consequence of that failure might be, and then identifies potential corrective actions.
Why do an FMEA?
Before we delve into the steps of conducting an FMEA and have a look at an FMEA template, let’s be clear about why we would be conducting an FMEA.
As Carl S. Carlson notes in his book Effective FMEAs 2 the main objective of an FMEA is to improve the design of a system, subsystem, component or even a process.
But in his book, Carlson also points to a number of other reasons why you would want to do an FMEA, for example
- to identify and prevent safety hazards;
- to minimise loss of product performance or performance degradation;
- to improve test and verification plans (in the case of System or Design FMEAs);
- to improve Process Control Plans (in the case of Process FMEAs);
- to consider changes to the product design or manufacturing process;
- to identify significant product or process characteristics;
- to develop Preventive Maintenance plans for in-service machinery and equipment;
- to develop online diagnostic techniques.
When to do an FMEA analysis?
Remember the quote “You Can’t Maintain Your Way to Reliability” 3 that we talked about in the 9 Principles of Modern Maintenance?
Well, keeping that principle in mind we can immediately see a use for FMEAs in improving reliability. These would be:
- Improve reliability of our equipment during design
- Improve our Preventive Maintenance programs for equipment that is already in service
And as we talked about earlier, during the design phase you would want to start with a functional FMEA early in the design process.
Start too late in the design process and you will struggle to influence the design. You’ll end up with a design that has inherent reliability issues, and these defects will either haunt your plant for the rest of its life or cost you time and money to remove once the plant is operational.
It’s much easier and cheaper to prevent these defects from ever occurring by tackling them early in the lifecycle. You do that through a functional FMEA during the design phase.
As you progress during the design you may want to migrate your functional FMEA towards a hardware FMEA to make sure you cover all (important) failure modes.
Many of us have simply inherited the plant we have. And we have inherited all the defects and reliability problems with it. So what to do?
We can use Root Cause Analysis to systematically go after our Bad Actors. We can ensure that when something fails, we fix it and improve it such that it won’t fail again.
But can we get a bit more proactive?
The answer is yes. By using FMEAs on installed equipment that is already operational we can pre-empt failure. We identify the credible failure modes and determine the best method to address them. That could be an optimised PM program or maybe a change to the equipment.
What’s a Failure Mode?
We’ve talked a lot about failure modes, but that’s one of those words in our industry that can cause a lot of confusion and misalignment.
Across industry people mean slightly different things when they talk about a failure mode. Especially when talking about failure modes and failure mechanisms.
So, what is a failure mode? And what is a failure mechanism?
You can google it, read Wikipedia and come up with a host of different definitions. One definition I found for a Failure Mode was “the specific condition causing a functional failure often best described by the condition after failure.” 4
Luckily in my trusted RCM bible by John Moubray I find a more succinct definition of a failure mode:
“Any event that causes a functional failure”
That sounds almost too simple, right?
But it’s not. It’s as simple as it needs to be. Start applying that definition and you’ll soon see the value of its simplicity.
In his book, Moubray expands further on failure modes and concludes they are often best described as a verb + noun statement that describes the physical state of the item. For example, a fractured axle or a deformed axle, both of which are separate failure modes.
An important thing to keep in mind is that the description of the physical state of the item should be as accurate and meaningful as possible.
That means you need to try and avoid verbs like ‘fails’, ‘breaks’ or ‘malfunctions’. They give little or no indication what happened and are not an accurate description of the physical state of the item. As an example, a ‘broken axle’ could be a ‘fractured axle’ or a ‘deformed axle’. As we saw earlier these are two distinct failure modes. And each would require different mitigations.
What’s a Failure Mechanism?
The Failure Mechanism is then the “defect which is the underlying cause or sequence of causes that lead to a failure mode.”
Or in other words, a failure mechanism is really a failure cause.
A failure mechanism states why the failure mode occurred. A single failure mode may have multiple failure mechanisms (or causes).
An easy way to check if your failure mode and failure cause align and make sense is simply adding the statement “due to” between your failure mode and failure cause (failure mechanism).
As an example, if the failure mode is “bearing seized” and the failure mechanism is “lack of lubrication”, the statement becomes “bearing seized due to lack of lubrication”.
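The “due to” check is easy to apply to a whole list of causes at once. The snippet below is a minimal sketch of that idea; the function name `due_to_statements` and the second cause in the example are made up for illustration.

```python
from typing import List


def due_to_statements(failure_mode: str, causes: List[str]) -> List[str]:
    """Join a failure mode with each cause so the pairing can be read aloud."""
    return [f"{failure_mode} due to {cause}" for cause in causes]


statements = due_to_statements(
    "bearing seized",
    ["lack of lubrication", "contaminated lubricant"],  # second cause is illustrative
)
for statement in statements:
    print(statement)
```

If a generated statement reads awkwardly, that is usually a hint that the failure mode or the cause needs rewording.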
The Constant Dilemma: Failure Mechanism or Failure Mode
Determining to what level of detail you need to go to with your failure modes and failure mechanisms can be a bit of an art.
And that’s because what you would consider a failure cause at a system level could be deemed a failure mode when you go down to subsystem or even component level. In his book RCM II John Moubray gives a good example of this for a pump set, refer to figure 4.7 in Chapter 4 Failure Modes and Effects Analysis (FMEA).
The diagram below summarises this quite nicely: 5
When read together, the failure mode and failure cause statements should however contain enough detail for it to be possible to select an appropriate failure management strategy, but not so much detail that excessive amounts of time are wasted on the analysis process itself. 6
Before You Start – Who You Need For an FMEA
As with many parts of a design process and similar to conducting a Root Cause Analysis, the success of your FMEA depends on who you have involved. It depends on who you have in the room when you’re doing your analysis.
So make sure you pull together a cross-functional team that includes the various engineering disciplines, safety, maintenance, operations but also less obvious disciplines like contracting and procurement.
Just like there are several approaches to FMEAs there are several FMEA templates that support each different approach.
However, the various FMEA templates are very similar. And that makes sense as they all aim to achieve very similar goals.
I’ve created a simple FMEA template that you can download.
How to Conduct an FMEA Analysis
There is not a single, correct method for conducting an FMEA. And the various standards listed earlier in this article provide good guidance if you can get your hands on one of them.
Below is an outline of how you would go about conducting an FMEA. It is based on the process outlined in IEC 60812 with some simplifications here and there.
Step #1 – Plan and Prepare the FMEA
Before you start your FMEA you need to make sure you set yourself up for success. You need to map out the various steps of an FMEA and prepare for them. That means you need to:
- Be clear on why you’re doing this FMEA. What’s the objective?
- Collate all the necessary supporting documentation like design details (drawings, diagrams), operating and maintenance manuals, operational context etc.
- Be clear on who you need involved in the FMEA to get to a high-quality outcome. And make sure you get the best possible facilitator in front of that group. A facilitator can make or break the success of an FMEA.
- Document all this in a very brief Terms of Reference and get that agreed with the key stakeholders.
Step #2 – Define and Scope the FMEA
This is another preparation step, but it’s so critical that I wanted to show it separately from the more general ‘plan and prepare’ step.
This step is all about being very clear on what’s in scope for the FMEA. So before you jump into the depths of your FMEA analysis make sure you reflect on the following and document this ahead of the actual FMEA workshop:
Where are the boundaries of what you’ll analyse?
How deep will you go?
Ideally at this stage you would break down the item you’re planning to analyse into a set of component block diagrams or a set of detailed design drawings. These should:
- show the breakdown of the system into major subsystems and highlight the functional relationships;
- highlight interfaces, as experience shows that a lot of failures occur on the interfaces and too often these interfaces are not addressed during an FMEA because people focus too much on components;
- show the different operating modes (this may require different diagrams);
- show redundancies and other mitigations against system failure;
- have all subsystems, components, inputs and outputs clearly labelled as you’ll need to refer to this in the FMEA.
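If it helps to see the breakdown as data rather than as a drawing, it can be sketched as a simple tree. The example below is an assumption-laden illustration: the pump set, its subsystems and the helper name `leaf_paths` are invented here, and a real breakdown would come from your own block diagrams.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical breakdown of a pump set; an empty dict marks the lowest level analysed.
breakdown: Dict[str, dict] = {
    "Pump set": {
        "Pump": {"Impeller": {}, "Bearing": {}, "Mechanical seal": {}},
        "Motor": {"Stator": {}, "Bearing": {}},
        "Coupling": {},
    }
}


def leaf_paths(node: Dict[str, dict], prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the full path to every lowest-level item in the breakdown."""
    for name, children in node.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


# A full path gives every item a clear, unique label, even when two
# subsystems both contain a component called "Bearing".
labels = [" / ".join(path) for path in leaf_paths(breakdown)]
for label in labels:
    print(label)
```

Stopping the tree at the level that is useful for the analysis is a judgement call; the code simply treats whatever you leave empty as the lowest level.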
Before delving into the potential failure modes you need to get clarity on the functions, requirements and specifications that apply to the system you are about to analyse.
At the start of the FMEA workshop you should go over this with all participants in the room to make sure everybody is fully aligned on what system is being analysed, to what level of detail, and what the required functions and specifications are.
Step #3 – Identify Failure Modes
Now that you are clear on the system you’re analysing and have all the background documentation at hand you should be ready to start identifying potential failure modes.
Using the system, subsystem and component breakdown you prepared earlier, identify all the potential failure modes for each component.
And here some of the best advice that comes from the most experienced FMEA facilitators is that you first work through your FMEA template column by column. That way you can make sure you are happy with the breakdown from system level down to component level and the associated failure modes before you delve into the detailed failure mode analysis and look at failure causes and failure effects. This is especially important when you do a functional FMEA like you do in RCM.
When it comes to determining failure modes, the IEC 60812 standard for FMEAs suggests you consider the following as part of the identification process:
- the use of the system;
- the particular system element involved;
- the mode of operation;
- the pertinent operational specifications;
- the time constraints;
- the environmental stresses;
- the operational stresses.
The IEC standard also clearly states that you need to ensure that failure modes are not omitted because of a lack of data. Instead you should keep them in your analysis and document what needs to be done to progress the analysis of these failure modes.
A good practice is to give each of the failure modes in your FMEA a unique code. This failure mode code helps with referencing within the FMEA and it helps with summarising your analysis.
More importantly you want to bring this failure mode code into your CMMS. This will allow you to track whether this specific failure mode is occurring in your operation. And that allows you to determine the effectiveness of your FMEA and start a continuous improvement loop.
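As a rough sketch of how such codes could be used, the example below assigns sequential codes and counts how often each one shows up in a list of work orders. The code format `FM-001`, the function name `assign_codes` and the work order records are all hypothetical; a real CMMS export will look different.

```python
from collections import Counter
from typing import Dict, List


def assign_codes(failure_modes: List[str], prefix: str = "FM") -> Dict[str, str]:
    """Give each failure mode a unique, sequential code such as FM-001."""
    return {f"{prefix}-{n:03d}": mode for n, mode in enumerate(failure_modes, start=1)}


codes = assign_codes(["Bearing seized", "Axle fractured", "Axle deformed"])

# Hypothetical export of closed work orders from a CMMS.
work_orders = [
    {"order": "WO-1001", "failure_code": "FM-001"},
    {"order": "WO-1002", "failure_code": "FM-001"},
    {"order": "WO-1003", "failure_code": "FM-003"},
    {"order": "WO-1004", "failure_code": None},  # no code recorded
]

# Count only work orders that carry a code known to the FMEA.
observed = Counter(wo["failure_code"] for wo in work_orders if wo["failure_code"] in codes)
not_yet_seen = [code for code in codes if observed[code] == 0]
print(observed["FM-001"], not_yet_seen)
```

Failure modes that occur often, and work orders without any code, are both useful inputs for the next review of the FMEA.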
Once you have all failure modes identified you start analysing each failure mode one at a time, as shown in the FMEA process map earlier.
Step #4 – Identify Failure Effects
A first step in the failure mode analysis is determining the failure effect. A failure effect is the consequence of the failure mode in terms of the operation, function or status of the equipment you’re analysing. A failure effect may be caused by one or more failure modes of one or more components.
In your analysis document the failure effect clearly and comprehensively. Be specific on how the failure mode impacts the operation, function or status of the equipment.
Make sure you consider whether the failure effect is ‘local’, i.e. it only impacts the system or equipment that you’re analysing, or whether the effect has a wider impact.
Does it have a safety impact on the user of the system?
Does it result in an impact on the full plant?
And make sure that you check your logic by re-reading what you’ve documented from failure mode to failure cause and failure effect. Is your narrative clear, coherent, complete and logical? If not, make sure you fix it at this stage.
The failure effect is usually captured in your FMEA template in two columns, one for the ‘Local Effect’ and a separate column for the ‘System Effect’. The use of two columns is recommended as it forces you to distinctly evaluate the effect at both levels.
Once you have this complete you need to make sure you capture the Severity of the failure effect in the FMEA template. The Severity is usually captured using a scale of 1 to 10, with higher numbers being more severe. It is this Severity that is used in the calculation of a Risk Priority Number (RPN). We’ll look at the use of the RPN towards the end of the article in more detail.
Step #5 – Identify Failure Causes
For each potential failure mode, you need to identify and describe the corresponding failure cause. Because a failure mode may have more than one failure cause, you should try to focus on the dominant failure causes.
When it comes to capturing failure causes it is important that you keep in mind the failure effect of the failure mode. A very severe failure effect may require you to really spend a lot more time on documenting the failure causes than you would for a failure mode with marginal impact.
In practice you start to get some iteration here between the description of the failure mode, the failure cause and the failure effect.
And that’s perfectly ok, just make sure you put your effort where it is justified to do so. Failure modes with limited effects should not be described and analysed in too much detail (to begin with).
Once you’ve landed on the failure cause you need to capture the likelihood or Occurrence in the FMEA analysis sheet. Just like the Severity this is usually done on a scale of 1 to 10 (with 1 being very unlikely).
Step #6 – Identify Controls
With the failure mode, failure cause and failure effect clearly documented you need to look at what existing controls you have to prevent the failure mode or at least mitigate the effect. This could be built-in redundancy or procedural controls like inspecting and testing.
Some FMEA templates will simply have a single column for ‘Controls’, whereas if you follow IEC 60812 you would distinguish between controls that act as detection methods and you would have a separate column for what are called ‘failure compensating provisions’. These types of controls are design features that prevent or mitigate the failure effect.
Using two columns here would be beneficial as it forces you to formally evaluate both aspects distinctly during the FMEA.
Once you have all detection methods and failure compensating provisions accurately described you would capture what is called the Detection in your FMEA worksheet. Detection is a measure of the likelihood that a failure is detected, using a scale of 1 to 10, with 1 being almost certain and 10 being absolutely uncertain.
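Because the three ratings use the same 1 to 10 range but Detection runs in the opposite direction to what people often expect, it can help to validate them in one place. The sketch below assumes whole-number ratings; the class name `Ratings` is invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ratings:
    severity: int    # 1 = least severe, 10 = most severe
    occurrence: int  # 1 = very unlikely, higher = more likely
    detection: int   # 1 = almost certain to be detected, 10 = absolutely uncertain

    def __post_init__(self) -> None:
        # Reject anything that is not a whole number on the 1 to 10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(f"{name} must be a whole number from 1 to 10, got {value!r}")


ok = Ratings(severity=7, occurrence=3, detection=4)

try:
    Ratings(severity=7, occurrence=0, detection=4)
except ValueError as error:
    print(error)
```

Note that a high Detection number is bad news: it means the failure is unlikely to be noticed before it has an effect.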
Step #7 – Assess & Prioritise Risks
Once you have all this analysed for the failure mode under consideration you will look at the risk associated with this failure mode.
And in an FMECA you would now determine the Risk Priority Number (RPN) for this failure mode as follows:
Risk Priority Number = Severity x Occurrence x Detection
With each factor rated from 1 to 10, the RPN ranges from 1 to 1000, which gives you a quantitative way of assessing the risks in the FMEA analysis. It is important to realise that the RPN is not a continuous scale from 1 to 1000; there are only 120 possible RPN outcomes.
Using the RPN, determine where you need to put your effort and which failure modes need to be mitigated first.
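The calculation and ranking can be sketched in a few lines. The failure modes and ratings below are made up purely to show the mechanics, and the names `rpn` and `worksheet` are assumptions for this example.

```python
from typing import List, Tuple


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number = Severity x Occurrence x Detection."""
    return severity * occurrence * detection


# (failure mode, severity, occurrence, detection); all ratings are made up.
worksheet: List[Tuple[str, int, int, int]] = [
    ("Bearing seized", 7, 4, 3),
    ("Axle fractured", 9, 2, 6),
    ("Axle deformed", 5, 3, 2),
]

# Highest RPN first.
ranked = sorted(worksheet, key=lambda row: rpn(*row[1:]), reverse=True)
for mode, s, o, d in ranked:
    print(f"{mode}: RPN {rpn(s, o, d)}")
```

With these numbers the order is axle fractured (108), bearing seized (84) and axle deformed (30).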
Problems with the FMEA RPN
The RPN is intuitive and many people find it easy to use and helpful in the prioritisation process. But there are definitely some risks associated with the use of the RPN.
Even the IEC 60812 standard highlights the commonly quoted deficiencies with the use of RPN. I won’t go into all of them as some are a bit academic in my view, but the most important problems with the use of RPN are:
Firstly, all three factors are weighted equally in the Risk Priority Number. This means that high severity events with low likelihoods and high detection rates could be overlooked if you took a simplistic view and just prioritised on the RPN score.
In processing industries like Oil & Gas, Petrochemical, Pharmaceutical, Mining etc. these types of events could lead to Process Safety incidents with multiple fatalities. As such, these types of events must be analysed in great detail. And most companies therefore set additional rules around the use of the RPN requiring additional analysis, mitigation and review of high severity risks.
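A rule like that can be as simple as flagging every failure mode above a severity threshold, regardless of its RPN. The sketch below illustrates the idea only; the threshold of 9, the failure modes and the ratings are assumptions and not taken from any standard or company procedure.

```python
from typing import List, Tuple

SEVERITY_REVIEW_THRESHOLD = 9  # hypothetical company rule, not taken from a standard

# (failure mode, severity, occurrence, detection); all ratings are made up.
worksheet: List[Tuple[str, int, int, int]] = [
    ("Seal leak to atmosphere", 10, 1, 2),
    ("Bearing seized", 6, 5, 4),
]


def needs_detailed_review(severity: int) -> bool:
    """Flag high-severity failure modes no matter how low their RPN is."""
    return severity >= SEVERITY_REVIEW_THRESHOLD


flagged = [mode for mode, s, o, d in worksheet if needs_detailed_review(s)]

for mode, s, o, d in worksheet:
    note = "review in detail" if needs_detailed_review(s) else "rank by RPN"
    print(f"{mode}: RPN {s * o * d}, {note}")
```

Here the seal leak has an RPN of only 20 against 120 for the seized bearing, yet it is the one that gets flagged for detailed review.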
Secondly, there are several issues around the scale of the Risk Priority Number. As I already mentioned, it is not a continuous scale from 1 to 1000; there are only 120 possible outcomes. These outcomes can be very sensitive to small changes: an increase in the Detection rating from 3 to 4 has a much bigger impact when the severity and occurrence are high than when they are low.
S x O x D = 9 x 9 x 3 = 243
S x O x D = 9 x 9 x 4 = 324
S x O x D = 3 x 4 x 3 = 36
S x O x D = 3 x 4 x 4 = 48
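The four calculations above, and the number of distinct RPN values, can be checked with a short script. This is just a verification sketch assuming whole-number ratings from 1 to 10 for each factor.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number = Severity x Occurrence x Detection."""
    return severity * occurrence * detection


# Same one-step change in Detection (3 to 4), very different jump in RPN.
jump_when_high = rpn(9, 9, 4) - rpn(9, 9, 3)  # 324 - 243 = 81
jump_when_low = rpn(3, 4, 4) - rpn(3, 4, 3)   # 48 - 36 = 12

# Every product of three ratings between 1 and 10, duplicates removed.
distinct_values = {
    rpn(s, o, d)
    for s in range(1, 11)
    for o in range(1, 11)
    for d in range(1, 11)
}

print(jump_when_high, jump_when_low, len(distinct_values))
```

The set of distinct values is where the figure of 120 possible outcomes comes from: many different rating combinations multiply out to the same RPN.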
The other issue associated with the scale is that it is not linear, which means that the difference in the RPN may appear negligible when in fact it really isn’t:
S x O x D = 6 x 4 x 2 = 48
S x O x D = 6 x 5 x 2 = 60
The RPN here has only gone up by 25%, when in fact the Occurrence rating has changed from 4 to 5, which in many Occurrence scales used in industry actually means the event is twice as likely.
So… what does this all mean?
In simple words: be careful with the Risk Priority Number. Use it, but use it wisely, and never ever allow someone to take a purely numeric approach to it.
Step #8 – Recommend & Take Action
Once you have your prioritised list of failure modes complete with failure causes and failure effects, you need to determine the actions you need to take to reduce the risk profile.
These actions could be a redesign of certain aspects, adding built-in test functionality, or adding inspection or testing procedures to your maintenance regime.
As always with actions, it is important that they are clear, assigned to a specific owner and given a due date. You then need to follow up to make sure the actions are indeed closed out.
Once you have closed out an action you would typically show the mitigated Severity, Occurrence and Detection rates in the FMEA complete with the mitigated Risk Priority Number.
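To make the before and after comparison concrete, here is a small sketch that records an action with an owner and due date and compares the initial and mitigated RPN. The class name `Action`, the action itself and all ratings are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass
class Action:
    description: str
    owner: str       # a specific person or role, not a department
    due: date
    closed: bool = False


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number = Severity x Occurrence x Detection."""
    return severity * occurrence * detection


action = Action(
    description="Add monthly bearing inspection",
    owner="Reliability engineer",
    due=date(2030, 1, 31),  # placeholder date
    closed=True,
)

initial: Tuple[int, int, int] = (7, 4, 6)    # made-up ratings before the action
mitigated: Tuple[int, int, int] = (7, 2, 3)  # made-up ratings after close-out

reduction = rpn(*initial) - rpn(*mitigated)
print(rpn(*initial), rpn(*mitigated), reduction)
```

Keeping both sets of ratings in the worksheet shows what each action was expected to achieve, which makes later reviews of the FMEA much easier.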
Step #9 – Document the FMEA
As you progress through the FMEA process you need to document it. Don’t wait until everything is done. Instead:
- document the planned FMEA once you have Steps 1 and 2 complete;
- document the FMEA analysis once you have all failure modes analysed and the recommended actions have been agreed;
- finalise the FMEA report once all recommended actions have been completed and the FMEA template has been fully updated.