Maintenance

FMEA for wastewater plants: how to prioritise critical assets

Severity, occurrence, and detection scoring produces a Risk Priority Number that sorts the PM programme. Worked examples on blowers, clarifiers, and UV.

UtilityRadar Team

May 9, 2026

FMEA ranks wastewater asset failures by severity, occurrence, and detection. The output is a prioritised maintenance plan grounded in risk, not in habit. A plant with 4,000 distinct asset items cannot maintain everything at the same intensity. FMEA is how you decide what gets the attention.

This guide walks through Failure Mode and Effects Analysis specifically for wastewater treatment plants, with worked examples on aeration blowers, secondary clarifiers, and disinfection systems. If you are building a PM programme from scratch, or reviewing an existing one, FMEA is the mechanism that turns a spreadsheet of assets into a defensible reliability strategy.

What FMEA actually is

FMEA is a structured method for identifying how an asset can fail, what the consequences are, and how likely and detectable those failures are. Each identified failure mode gets three scores: severity (how bad is the consequence), occurrence (how often does it happen), and detection (how likely are we to catch it before it turns into a real event). The product of the three scores is a Risk Priority Number (RPN), and RPN sorts the population.

The method originated in US military reliability work in the late 1940s, was taken up by aerospace in the 1960s and the automotive industry in the 1970s, and is now standardised across many industries by IEC 60812, the international standard on failure modes and effects analysis. Its water utility application is documented in Water Research Foundation reliability research over the last two decades.

Scoring severity, occurrence, and detection

Each dimension is scored 1 to 10, calibrated to the wastewater context.

Score	Severity	Occurrence	Detection
1 to 2	Negligible operational impact	Extremely unlikely (once in 20 years)	Very obvious, immediate detection
3 to 4	Minor operational impact	Rare (once in 5 to 10 years)	Detected in routine round
5 to 6	Notable disruption, no permit exposure	Moderate (once in 1 to 5 years)	Detected in scheduled inspection
7 to 8	Permit exposure, service disruption	Frequent (once a year)	Detected only via alarm
9 to 10	Public health or major regulatory event	Very frequent (multiple per year)	Only detected after impact

Calibration matters. Different teams working on the same assets can produce very different scores if the scale is not anchored. Spend the first FMEA session calibrating on a handful of well understood failure modes so the scoring is comparable across the rest of the plant.

The Risk Priority Number

RPN is Severity multiplied by Occurrence multiplied by Detection. The theoretical maximum is 1000 (10 x 10 x 10) and the practical maximum in a well maintained wastewater plant is around 500 to 700. Any RPN above 200 warrants immediate mitigation planning.
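The arithmetic can be sketched in a few lines of Python. The function name `rpn` and the range check are illustrative assumptions, not part of any standard tool.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the Risk Priority Number for one failure mode."""
    scores = (("severity", severity), ("occurrence", occurrence), ("detection", detection))
    for name, score in scores:
        # Each dimension is scored on the 1 to 10 scale used in this guide.
        if not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {score!r}")
    return severity * occurrence * detection


print(rpn(10, 10, 10))  # theoretical maximum
print(rpn(7, 4, 6))     # severity 7, occurrence 4, detection 6
```

Rejecting out-of-range scores early keeps a typing error in the spreadsheet from quietly distorting the ranking.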

Key insight. RPN is a directional tool, not an absolute one. A failure mode with RPN 180 is not automatically safer than one at RPN 220. Use RPN to sort, then apply judgement on the highest ranked items. Some failure modes with modest RPN carry consequences that make them worth more attention than the pure score suggests.

Worked example: aeration blower FMEA

Consider a centrifugal aeration blower at a secondary treatment plant. The FMEA session might produce the following failure modes.

Failure mode	Sev	Occ	Det	RPN	Mitigation
Bearing failure	7	4	6	168	Vibration monitoring, 8k hour PM
Motor overload trip	5	3	3	45	VFD overload settings verified quarterly
Air filter clog	4	6	4	96	Filter DP monitoring, PM at 90 days
Coupling wear	6	3	7	126	Alignment check every 6 months
Impeller imbalance	8	2	5	80	Vibration monitoring
Discharge valve seat wear	6	4	8	192	Add valve leak test to annual PM
VFD failure	8	2	4	64	Spare VFD in stock, temperature monitoring

The two highest RPN items (discharge valve seat wear at 192, bearing failure at 168) get the most attention. Discharge valve wear was not previously on the PM programme; adding an annual leak test closes the detection gap. Bearing failure was on the programme; the vibration threshold gets tightened and the PM interval reduced.
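The ranking above can be reproduced with a short script. This is a minimal sketch: the scores are copied from the blower table, while the variable names and the tuple layout are assumptions made for illustration.

```python
# (failure mode, severity, occurrence, detection), copied from the blower table
blower_modes = [
    ("Bearing failure", 7, 4, 6),
    ("Motor overload trip", 5, 3, 3),
    ("Air filter clog", 4, 6, 4),
    ("Coupling wear", 6, 3, 7),
    ("Impeller imbalance", 8, 2, 5),
    ("Discharge valve seat wear", 6, 4, 8),
    ("VFD failure", 8, 2, 4),
]

# Compute the RPN for each mode and rank from highest to lowest.
ranked = sorted(
    ((name, sev * occ * det) for name, sev, occ, det in blower_modes),
    key=lambda row: row[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{score:>4}  {name}")
```

Keeping the three raw scores next to the RPN, rather than storing only the product, makes it easy to see which dimension is driving a high rank.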

Worked example: secondary clarifier FMEA

Failure mode	Sev	Occ	Det	RPN	Mitigation
Sludge blanket washout	9	3	4	108	Continuous blanket monitor with alarm
Scraper drive gear failure	7	2	6	84	Vibration monitoring, annual gear inspection
Weir short circuiting	6	5	7	210	Weir level survey every 12 months
Rake arm bearing	5	3	6	90	Grease every 90 days, replace every 5 years
Return sludge pump loss	8	3	5	120	Standby pump exercise weekly

Weir short circuiting shows up as the highest RPN, driven by the high detection score of 7, which reflects poor detectability. This is a common finding: hydraulic short circuiting is invisible to routine inspection unless the operator is specifically looking for it. A short annual weir level survey closes the detection gap. If it brings detection down from 7 to 2 or 3, RPN falls from 210 to between 60 and 90, a reduction of roughly a factor of 3.
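The size of that reduction depends on where the detection score lands once the survey is in place. The sketch below holds severity and occurrence at their table values and tries two hypothetical post-mitigation detection scores; the real figure would come from the re-scoring session.

```python
severity, occurrence = 6, 5   # weir short circuiting, from the clarifier table
detection_before = 7

rpn_before = severity * occurrence * detection_before

# Hypothetical detection scores after the annual weir level survey is added.
rescored = {d: severity * occurrence * d for d in (3, 2)}

for detection_after, rpn_after in rescored.items():
    ratio = rpn_before / rpn_after
    print(f"detection {detection_before} -> {detection_after}: "
          f"RPN {rpn_before} -> {rpn_after} ({ratio:.1f}x lower)")
```

Because RPN is a plain product, a mitigation that changes only detection scales the RPN in direct proportion to the change in that one score.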

Worked example: UV disinfection FMEA

Failure mode	Sev	Occ	Det	RPN	Mitigation
Lamp end of life	7	7	3	147	Runtime tracked, replace at 12k hours
Sleeve fouling	8	6	5	240	UV transmittance monitoring, sleeve wipe every 30 days
Ballast failure	6	3	4	72	Spare ballasts in stock, temperature monitoring
Sensor drift	9	4	7	252	Reference sensor calibration every 90 days
Reactor bypass valve failure	9	2	6	108	Annual valve test, redundant position feedback

Sensor drift stands out at RPN 252, with severity 9 (pathogen breakthrough risk) and detection 7 (only visible when the reference calibration catches it). This is the highest priority mitigation and drives the calibration schedule. Sleeve fouling at RPN 240 also exceeds 200 and is next in line.

The FMEA session in practice

Assemble the right team

An FMEA session needs an operations lead, a maintenance lead, a reliability engineer, and someone with process expertise. Sessions can run half a day for a single asset or up to two full days for a complete process area. Facilitator neutrality matters; RPN scores can be swayed by whoever speaks loudest if the facilitator does not enforce discipline.

Identify failure modes systematically

Work through each subsystem of the asset. Do not skip modes that seem unlikely. The value of FMEA is in surfacing the low probability high consequence modes that never appear in the routine inspection.

Score each dimension separately

Score severity first (it does not depend on asset condition), then occurrence (from the historical failure rate), then detection (how the team would catch the failure before impact). Discussing severity and occurrence together conflates them and produces inconsistent scores.

Compute RPN and sort

Once scored, RPN sorts the failure modes. Anything over 200 gets a mitigation plan. Anything between 100 and 200 gets a periodic review. Below 100 is monitored but not actively worked on.
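The three bands can be expressed as a small lookup. This is a sketch under one assumption: the text does not say which band an RPN of exactly 100 or 200 falls in, so both are placed in periodic review here. The function name `triage` is illustrative.

```python
def triage(rpn: int) -> str:
    """Map an RPN to the action band described in the text."""
    if rpn > 200:
        return "mitigation plan"
    if rpn >= 100:          # assumption: 100 and 200 both count as periodic review
        return "periodic review"
    return "monitor"


# RPNs copied from the UV disinfection table
uv_modes = {
    "Lamp end of life": 147,
    "Sleeve fouling": 240,
    "Ballast failure": 72,
    "Sensor drift": 252,
    "Reactor bypass valve failure": 108,
}

for name, score in sorted(uv_modes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:>4}  {triage(score):<16} {name}")
```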

Assign mitigations

Each mitigation should have an owner, a due date, and a mechanism to verify it is in place. Mitigations that never get implemented are worse than no FMEA at all, because they create a false sense of coverage.
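One way to keep mitigations from silently lapsing is to record the owner, due date, and verification status as data and query for anything overdue. The sketch below is illustrative only: the `Mitigation` class, the owners, and the dates are hypothetical, and a real CMMS would hold these fields in its own schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Mitigation:
    failure_mode: str
    action: str
    owner: str
    due: date
    verified: bool = False  # set to True once someone has confirmed it is in place


def overdue(mitigations: List[Mitigation], today: date) -> List[Mitigation]:
    """Return mitigations past their due date that nobody has verified."""
    return [m for m in mitigations if not m.verified and m.due < today]


# Hypothetical plan entries; actions are taken from the worked examples.
plan = [
    Mitigation("Discharge valve seat wear", "Add valve leak test to annual PM",
               "Maintenance lead", date(2026, 6, 30)),
    Mitigation("Weir short circuiting", "Weir level survey every 12 months",
               "Operations lead", date(2026, 9, 30), verified=True),
]

for m in overdue(plan, today=date(2026, 8, 1)):
    print(f"OVERDUE: {m.failure_mode} ({m.owner}, due {m.due.isoformat()})")
```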

Re score after mitigation

Once the mitigation is in place, re score detection or occurrence and recalculate RPN. This tells you whether the mitigation actually reduced risk or just added work.

Common trap. Teams score every failure mode at severity 8 or 9. The result is a spreadsheet where every RPN is high and nothing sorts. The severity calibration table above is the antidote. Force the team to distinguish between operationally annoying failures and true safety or permit failures.
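A quick check for this trap is to look at how the severity scores are distributed before computing any RPN. The sketch below flags a score sheet when more than half the failure modes sit at severity 8 or above. Both thresholds are assumptions chosen for illustration, and the "inflated" list is invented data.

```python
from typing import Iterable


def severity_is_compressed(severities: Iterable[int],
                           high: int = 8,
                           max_share: float = 0.5) -> bool:
    """Return True when too many failure modes are scored at or above `high`."""
    scores = list(severities)
    if not scores:
        return False
    share_high = sum(1 for s in scores if s >= high) / len(scores)
    return share_high > max_share


blower = [7, 5, 4, 6, 8, 6, 8]     # severity column from the blower table
inflated = [8, 9, 8, 9, 9, 8, 7]   # invented example of an uncalibrated session

print(severity_is_compressed(blower))
print(severity_is_compressed(inflated))
```

A flagged sheet is a prompt to go back to the calibration table, not a verdict that the scores are wrong.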

Integrating FMEA output with the CMMS

FMEA is not a one time exercise. The value comes from feeding the mitigation plan into the CMMS PM library and revisiting the RPN scores when new failure data accumulates.

Each mitigation becomes one or more PM templates in the CMMS.
The FMEA RPN becomes an attribute on the asset in the CMMS.
Failure events reported in the CMMS feed back to the FMEA occurrence score.
New failure modes discovered in operation get added to the FMEA library.
FMEA is re run every 3 to 5 years for critical assets, on major upgrades, and after any significant failure event.
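The feedback from failure events to the occurrence score can be sketched as a lookup against the calibration table. The cut points below are assumptions: the table anchors each band to a typical frequency but leaves gaps between bands, so a team would need to agree on its own boundaries. The function name and the choice to return `None` when no failures are recorded are also illustrative.

```python
from typing import Optional, Tuple


def occurrence_band(failures: int, years_observed: float) -> Optional[Tuple[int, int]]:
    """Suggest an occurrence score band from recorded failure events."""
    if failures < 0 or years_observed <= 0:
        raise ValueError("need failures >= 0 and years_observed > 0")
    if failures == 0:
        return None  # no recorded events: keep the workshop score
    interval = years_observed / failures  # mean years between failures
    if interval < 1:
        return (9, 10)   # multiple per year
    if interval < 2:
        return (7, 8)    # roughly once a year (assumed cut point)
    if interval <= 5:
        return (5, 6)    # once in 1 to 5 years
    if interval <= 10:
        return (3, 4)    # once in 5 to 10 years
    return (1, 2)        # rarer than once in 10 years (assumed cut point)


print(occurrence_band(failures=3, years_observed=2))
print(occurrence_band(failures=1, years_observed=8))
```

The function returns a band rather than a single score, leaving the final choice within the band to the review team.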

FMEA vs HAZOP

HAZOP (Hazard and Operability Study) is a related methodology often used in process plants for safety focused analysis. HAZOP is more structured around process deviations from design intent (high flow, low temperature, no flow), while FMEA is more equipment focused. Wastewater plants that carry significant chemical handling (chlorine, sodium hypochlorite, sulphuric acid) often run HAZOP alongside FMEA to cover both equipment failure and process safety. Choose FMEA for equipment reliability, HAZOP for process safety, and both when both dimensions matter.

FMEA vs Reliability Centered Maintenance

FMEA is one tool; Reliability Centered Maintenance (RCM) is the broader methodology it fits inside. RCM adds functional analysis (what does each asset actually do), consequence classification (safety, environmental, operational, hidden), and maintenance strategy selection (predictive, preventive, run to failure, redesign).

For most wastewater utilities, a well disciplined FMEA is sufficient to build a defensible PM programme. Full RCM is worth the additional investment at large utilities with complex process trains and stringent regulatory profiles.

Workshop facilitation techniques

An FMEA session runs on the quality of facilitation. Effective facilitators enforce three disciplines: separate identification from scoring, enforce the calibration table, and log dissent explicitly. Separating identification from scoring means the team first lists every failure mode without discussing likelihood or consequence, then goes back and scores each. This prevents anchoring on the first failure discussed. Enforcing the calibration table means every score has to reference the anchor conditions, not the facilitator's gut feel. Logging dissent means preserving alternative viewpoints in the record even when the group settles on a majority position. The IEC 60812 standard includes recommended facilitation practices in its normative annex.

Updating the FMEA over time

Every incident should trigger a review of the FMEA: was this failure mode in it? If not, why not, and what should be added? Was the scoring accurate given the actual failure occurrence? What mitigation might have caught this earlier? Utilities that treat FMEA as a living document typically show measurable RPN reduction over 3 to 5 years, whereas utilities that treat it as a one time deliverable see FMEA quality drift as personnel change.

FMEA programme metrics

Coverage: percent of critical assets with a current FMEA.
Mitigation closure rate: percent of high RPN mitigations implemented by the committed date.
RPN trend: are RPNs decreasing over time as mitigations take effect?
Failure event correlation: are unplanned failures occurring on high RPN modes, or on modes the FMEA did not anticipate?
Re score cadence: FMEAs updated on schedule.
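The first three metrics reduce to simple ratios and comparisons. The sketch below uses invented counts and RPN snapshots purely to show the calculation; the names are assumptions, and real figures would come from the asset register and the CMMS.

```python
from typing import List


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when the denominator is zero."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def rpn_is_trending_down(totals_by_review: List[int]) -> bool:
    """True when each review's total RPN is lower than the one before it."""
    return all(later < earlier
               for earlier, later in zip(totals_by_review, totals_by_review[1:]))


# Invented example figures
coverage = percent(part=90, whole=120)       # critical assets with a current FMEA
closure_rate = percent(part=20, whole=25)    # high RPN mitigations closed on time
trend_ok = rpn_is_trending_down([1450, 1210, 980])

print(f"Coverage: {coverage}%")
print(f"Mitigation closure rate: {closure_rate}%")
print(f"RPN trending down: {trend_ok}")
```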

Frequently asked questions

Does every asset need an FMEA?

No. Focus on critical assets (safety, environmental, operational). Non critical assets can use time based PM without a full FMEA.

How long does an FMEA take?

Half a day for a single asset, and up to two full days for a complete process area, on the initial pass. Refreshes are shorter.

Can we outsource the FMEA?

External facilitation helps but internal ownership must remain. Consultants cannot maintain the FMEA library over the long term.

How often should we re run FMEA?

Every 3 to 5 years for critical assets, immediately after any major upgrade or significant failure event.

Do we need special software?

No. Spreadsheets work well for the initial pass. Mature CMMS platforms often include an FMEA module that ties directly to the PM library.

What if the team cannot agree on scores?

That is often where the most valuable discussion happens. Force the disagreement into the calibration table and resolve on a specific example.

Should FMEA drive spare parts inventory?

Yes. Critical spares for high RPN failure modes are the first priority for stocking.

Does FMEA work for civil assets like tanks and pipes?

Yes, but the scoring calibration is different. Consequence tends toward slow onset (corrosion, subsidence) rather than sudden failure.

How does FMEA relate to capital planning?

Sustained high RPN despite mitigation suggests the asset should be replaced. FMEA is one input to the capital replacement queue.

Can FMEA prevent a failure we have never seen?

That is exactly what it is designed to do. The severity dimension surfaces low occurrence high consequence modes that would otherwise stay invisible until the day they happen.

Summary

FMEA turns an asset register into a prioritised risk map. Severity, occurrence, and detection produce a Risk Priority Number that sorts the population. High RPN failure modes get mitigation. The output feeds directly into the CMMS PM library, becoming the basis of a defensible reliability programme grounded in evidence rather than habit. For wastewater utilities under regulatory pressure and constrained by staffing, FMEA is the single highest leverage exercise in the reliability toolkit.
