Maintenance

RCA for wastewater plants: a root-cause investigation playbook

Five Whys, Ishikawa fishbone, and FRAM all work for wastewater. The choice depends on incident complexity. Here is when and how to use each method.

UtilityRadar Team

May 9, 2026

Root cause analysis turns a one-off incident into a permanent fix. Five Whys, Ishikawa fishbone, and FRAM all work for wastewater. The choice depends on how clean the causal chain is and how far it crosses departmental lines. Done well, RCA closes the loop between an incident and the next FMEA update.

This guide walks through the three most common RCA methods at wastewater plants, shows worked examples on real failure types, and explains how to integrate RCA output back into the CMMS so the same problem does not recur. If your utility has ever fixed the same failure three times in five years, RCA is the missing discipline.

Why RCA matters at wastewater plants

Water and wastewater utilities operate under constant regulatory scrutiny. When something goes wrong, whether that is a permit exceedance, a spill, or a customer service incident, the regulator wants to see three things: what happened, why it happened, and what will stop it happening again. Root cause analysis is how the third question gets answered defensibly.

Beyond regulatory pressure, RCA is the mechanism that stops recurring failures. Every experienced operator has seen the same pump seal fail four times in three years because the fix was replacement without diagnosis. RCA is what turns that pattern into a permanent fix on the fifth occurrence.

The three methods worth knowing

Method	When to use	Effort
Five Whys	Clear causal chain, single failure	30 to 60 minutes
Ishikawa fishbone	Multiple potential contributing factors	Half-day session
FRAM (functional resonance)	Complex socio-technical events, cross-departmental	Multi-day study

Most incidents at a wastewater plant are well served by Five Whys or fishbone. FRAM is reserved for major incidents where the failure emerged from the interaction of multiple systems rather than a single failed component. The IEC 62740 root cause analysis standard catalogues RCA techniques and their appropriate use cases.
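
The selection logic in the table can be sketched as a small triage function. This is a minimal illustration, not a prescribed rule: the `Incident` fields and the thresholds are assumptions made for the example, and a real utility would tune them to its own incident categories.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident summary used only for this sketch."""
    contributing_factors: int    # number of plausible contributing factors
    departments_involved: int    # operations, maintenance, laboratory, planning, IT
    major_event: bool            # large spill, fatality, extended shutdown
    human_factors: bool = False  # human or organisational factors suspected


def choose_rca_method(incident: Incident) -> str:
    """Map incident traits to a method, loosely following the table above."""
    # Major events that emerge from interacting systems go to FRAM.
    if incident.major_event and incident.departments_involved > 1:
        return "FRAM"
    # Several candidate causes, human factors, or cross-departmental scope.
    if (incident.contributing_factors > 1
            or incident.human_factors
            or incident.departments_involved > 1):
        return "Ishikawa fishbone"
    # Clear causal chain, single failure.
    return "Five Whys"


seal_leak = Incident(contributing_factors=1, departments_involved=1, major_event=False)
ammonia = Incident(contributing_factors=3, departments_involved=2, major_event=False)
print(choose_rca_method(seal_leak))  # Five Whys
print(choose_rca_method(ammonia))    # Ishikawa fishbone
```

The point of encoding the choice is consistency: two supervisors triaging the same incident should land on the same method.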

The Five Whys method

Five Whys asks "why did this happen" five times, each time going deeper into the causal chain. The output is a series of stacked causes ending at a root cause that a specific action can address.

Worked example: pump seal failure

Why did the pump fail? The mechanical seal leaked and the pump was locked out for safety.
Why did the seal leak? The seal faces were damaged.
Why were the seal faces damaged? Dry-running events destroyed the lubrication film.
Why did the pump run dry? The wet well level probe failed to trip the low-level cut-off.
Why did the probe fail to trip? The probe was coated in grease and gave a false level reading.

The root cause is probe fouling. The fix is not seal replacement (which addresses the symptom) but a probe cleaning PM on a 30-day schedule (which addresses the cause). Once dry-running events stop, the seal replacement PM interval can be extended by 40 to 60 percent.
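
For teams that record Five Whys sessions in a structured form rather than free text, the chain above could be captured roughly as follows. The `Why` class and the two helper functions are hypothetical names for this sketch. Treating the last answer as the root cause is a simplification that only holds when the team has genuinely reached an actionable cause.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


# The pump seal example from the text, recorded in order.
chain = [
    Why("Why did the pump fail?", "The mechanical seal leaked and the pump was locked out."),
    Why("Why did the seal leak?", "The seal faces were damaged."),
    Why("Why were the seal faces damaged?", "Dry running destroyed the lubrication film."),
    Why("Why did the pump run dry?", "The level probe failed to trip the low-level cut-off."),
    Why("Why did the probe fail to trip?", "The probe was coated in grease and read falsely."),
]


def root_cause(whys: list[Why]) -> str:
    """Return the deepest answer in the chain, treated here as the root cause."""
    if not whys:
        raise ValueError("a Five Whys chain needs at least one step")
    return whys[-1].answer


def render(whys: list[Why]) -> str:
    """Format the chain as numbered lines, e.g. for attaching to a work order."""
    return "\n".join(
        f"{i}. {w.question} {w.answer}" for i, w in enumerate(whys, start=1)
    )


print(render(chain))
print("Root cause:", root_cause(chain))
```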

When Five Whys works

Single obvious failure with a clear immediate cause.
Small team available (2 to 4 people).
Short time window between event and analysis.
Well understood equipment failure modes.

When Five Whys does not work

Multiple simultaneous contributing factors.
Failure crosses departmental lines (operations, maintenance, planning).
Human factors or organisational culture involved.
Novel or unfamiliar failure mode.

The Ishikawa fishbone method

Fishbone diagrams categorise potential causes into six branches: methods, machines, materials, measurements, environment, and people. Each branch is populated with candidate causes, and the team then investigates each one to determine which actually contributed.

Worked example: recurrent effluent ammonia exceedance

Branch	Candidate causes
Methods	MLSS control target too low; nitrification not fully established
Machines	Aeration blower capacity insufficient; DO probe fouled
Materials	Influent industrial slug loading; sludge age too young
Measurements	Grab sampling missing peak; laboratory method noise
Environment	Cold weather reducing nitrification rate; storm dilution
People	Operator experience with cold weather nitrification; shift handover gap

The team then investigates each branch and tests each candidate. The final answer might be a combination: MLSS control target too low AND cold weather reducing nitrification AND shift handover gap. Fix each contributing cause, and the exceedance stops.
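
A fishbone session produces a list of candidates that each need a verdict. One way to track that, sketched here with hypothetical class names and status labels, is to record every candidate against its branch and mark it as confirmed or ruled out as evidence comes in. The verdicts below mirror the illustrative outcome in the text. The ruled-out blower is an assumption added for the example.

```python
from dataclasses import dataclass, field

BRANCHES = ("Methods", "Machines", "Materials", "Measurements", "Environment", "People")


@dataclass
class Candidate:
    branch: str
    description: str
    status: str = "untested"  # "untested", "confirmed", or "ruled out"


@dataclass
class Fishbone:
    problem: str
    candidates: list[Candidate] = field(default_factory=list)

    def add(self, branch: str, description: str) -> Candidate:
        """Register a candidate cause under one of the six branches."""
        if branch not in BRANCHES:
            raise ValueError(f"unknown branch: {branch}")
        candidate = Candidate(branch, description)
        self.candidates.append(candidate)
        return candidate

    def confirmed(self) -> list[str]:
        return [c.description for c in self.candidates if c.status == "confirmed"]

    def open_items(self) -> list[str]:
        return [c.description for c in self.candidates if c.status == "untested"]


fb = Fishbone("Recurrent effluent ammonia exceedance")
mlss = fb.add("Methods", "MLSS control target too low")
blower = fb.add("Machines", "Aeration blower capacity insufficient")
cold = fb.add("Environment", "Cold weather reducing nitrification rate")
handover = fb.add("People", "Shift handover gap")

for c in (mlss, cold, handover):
    c.status = "confirmed"
blower.status = "ruled out"

print(fb.confirmed())
print(fb.open_items())  # an empty list means every candidate has a verdict
```

An RCA should not close while `open_items()` still returns anything, since an untested candidate is an unverified assumption.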

FRAM for major incidents

Functional Resonance Analysis Method (FRAM) treats failure as emergent from the interaction of normally functioning systems rather than from a single failed component. It is the appropriate method for major incidents like large spills, worker fatalities, or extended plant shutdowns where the cause emerged from a chain of interactions rather than a single asset failure.

FRAM sessions typically take 2 to 5 days, involve external facilitators, and produce structured system maps that identify where the interactions produced the failure. The output feeds long-term programme change, not just an asset PM update.

Integrating RCA into the incident workflow

Stage	Activity	CMMS action
Detect	Alarm, permit exceedance, service disruption	Incident work order auto-created
Notify	Regulator, management, on-call team	Notification checklist attached
Contain	Immediate operational response	Response work orders logged
Investigate	RCA session	RCA report attached to incident
Correct	Corrective and preventive actions	New PM templates created
Verify	Post-fix effectiveness check	Verification work order at 30 days
Close	Incident record signed off	FMEA updated with lessons
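
The workflow in the table is effectively a gated sequence: an incident should not move on until the CMMS action for its current stage is on record. A minimal sketch of that gate is shown below. The `Stage` enum and `advance` function are hypothetical and do not correspond to any particular CMMS product's API.

```python
from enum import Enum


class Stage(Enum):
    DETECT = "Detect"
    NOTIFY = "Notify"
    CONTAIN = "Contain"
    INVESTIGATE = "Investigate"
    CORRECT = "Correct"
    VERIFY = "Verify"
    CLOSE = "Close"


# CMMS action expected at each stage, taken from the table above.
CMMS_ACTION = {
    Stage.DETECT: "Incident work order auto-created",
    Stage.NOTIFY: "Notification checklist attached",
    Stage.CONTAIN: "Response work orders logged",
    Stage.INVESTIGATE: "RCA report attached to incident",
    Stage.CORRECT: "New PM templates created",
    Stage.VERIFY: "Verification work order at 30 days",
    Stage.CLOSE: "FMEA updated with lessons",
}

ORDER = list(Stage)  # Enum members iterate in definition order


def advance(current: Stage, completed: set[Stage]) -> Stage:
    """Move to the next stage only if the current stage's CMMS action is logged."""
    if current not in completed:
        raise RuntimeError(f"{current.value}: '{CMMS_ACTION[current]}' not recorded yet")
    if current is Stage.CLOSE:
        return current
    return ORDER[ORDER.index(current) + 1]


done = {Stage.DETECT, Stage.NOTIFY}
print(advance(Stage.DETECT, done).value)  # Notify
```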

Key insight. The RCA is only worth doing if the output changes something. Every RCA should produce at least one specific action: a new PM template, a modified operating procedure, a training update, an asset upgrade, or an FMEA revision. RCAs that end with "operator to be careful" have failed.

Who should facilitate

Simple Five Whys sessions can be facilitated by the maintenance manager or a senior operator. Fishbone sessions benefit from an independent facilitator who does not have a stake in any specific cause. FRAM requires a trained specialist, either an internal reliability engineer with FRAM training or an external consultant.

The most important facilitation discipline is neutrality on the outcome. Facilitators who have a hypothesis at the start of the session steer the team toward it, and the RCA becomes a confirmation exercise rather than an investigation.

Common RCA pitfalls

Common trap. RCA that stops at "human error" is an unfinished RCA. Human error is a symptom, not a cause. Why did the operator make that decision? What information were they missing? What incentive structures pushed them toward the wrong choice? These are the real causes.

Stopping at symptoms, not causes.
Blaming individuals instead of investigating systems.
Selecting root causes based on ease of fix rather than actual causation.
Skipping verification of proposed causes.
Failing to close the loop back to CMMS and FMEA.
Not tracking effectiveness of corrective actions.
Repeating the same causes across multiple RCAs without a programme-level response.

RCA documentation standards

A defensible RCA report includes seven consistent elements: incident description with timeline, contributing factors identified, root cause identified with reasoning, corrective actions with owners and dates, preventive actions to stop recurrence, effectiveness verification plan, and formal sign-off. Regulators and auditors focus on the reasoning chain from event to root cause. Reports that state a root cause without showing the investigative reasoning do not hold up under scrutiny. Utilities that maintain a consistent report template across incidents build institutional memory and simplify audit response.
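
A consistent template is easy to enforce with a completeness check before sign-off. The sketch below assumes reports are held as simple dictionaries. The key names, the asset tag, and the date are all hypothetical, chosen only to mirror the seven elements listed above.

```python
REQUIRED_ELEMENTS = (
    "incident_description",  # with timeline
    "contributing_factors",
    "root_cause",            # with reasoning
    "corrective_actions",    # with owners and dates
    "preventive_actions",
    "verification_plan",
    "sign_off",
)


def missing_elements(report: dict) -> list[str]:
    """Return the required elements that are absent or empty in a report."""
    return [key for key in REQUIRED_ELEMENTS if not report.get(key)]


draft = {
    "incident_description": "Pump P-101 seal leak; timeline attached.",
    "contributing_factors": ["Dry running", "Fouled level probe"],
    "root_cause": "Probe fouling; see Five Whys chain.",
    "corrective_actions": [
        {"action": "Replace seal", "owner": "Maintenance", "due": "2026-05-20"}
    ],
    "preventive_actions": [],
    "verification_plan": "",
}
print(missing_elements(draft))
# ['preventive_actions', 'verification_plan', 'sign_off']
```

A check like this only confirms that each section exists. It says nothing about whether the reasoning inside it is sound, which still needs a human reviewer.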

Near miss investigation

Some utilities extend RCA discipline to near misses: events where an incident nearly happened but did not. Near-miss investigation catches systemic issues before they produce an actual event. The challenge is definition and reporting culture. If reporting a near miss is treated as a career-limiting event, near misses do not get reported. Utilities with mature safety cultures report near misses freely and treat them as free learning opportunities. OSHA's safety programme guidance, like that of equivalent regulators elsewhere, encourages near-miss reporting as a leading indicator of safety culture health.

When RCA crosses departmental lines

Many wastewater incidents involve operations, maintenance, laboratory, planning, and IT. Cross-departmental RCAs require executive sponsorship or the participants revert to defending their department. A clear "no-blame" ground rule, ideally chartered by the executive sponsor, is what makes cross-departmental RCA productive.

Regulatory RCA expectations

Under enforcement, most regulators expect a formal RCA report within 30 days of a serious incident. The EPA NPDES compliance guidance covers the US expectation. UK utilities operate under the Environment Agency reporting framework, and EU utilities under national environment agency requirements. Even absent an enforcement action, a documented RCA discipline demonstrates that the utility takes causes seriously.

Communicating RCA findings

RCA findings only produce change when communicated effectively. Internal communication should target three audiences: field crews (what changes in daily practice), managers (what changes in the programme), and executives (what the risk landscape looks like now). External communication may include regulator briefings, board reports, and public disclosures for major incidents. Each audience needs a different level of technical detail and a different framing. Utilities that produce a single RCA report and distribute it to all audiences typically see poor traction; effective programmes tailor communications to the reader.

Training the team

Basic RCA competence can be trained in a two-day workshop with practical case studies. Advanced facilitation (FRAM and system safety analysis) requires more specialised training. Maintaining a small internal pool of RCA-competent staff (typically 4 to 8 for a mid-sized utility) supports the incident workflow without external dependency.

RCA programme metrics

Metric	Target
RCA completion rate	Over 95 percent of qualifying incidents
Time from incident to RCA start	Under 5 business days
Time from RCA to corrective action closed	Under 90 days
Repeat cause rate	Under 15 percent
Corrective action effectiveness verification	90-day post-fix check
FMEA update rate from RCA	Over 80 percent of RCAs update FMEA
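
Several of these metrics can be computed directly from incident records. The sketch below uses a hypothetical `IncidentRecord` shape and sample data. The definition of repeat cause rate used here (share of completed RCAs whose root cause had already appeared in an earlier RCA) is an assumption, since the table does not define it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class IncidentRecord:
    occurred: date
    rca_completed: bool
    root_cause: Optional[str] = None
    fmea_updated: bool = False


def completion_rate(records: list[IncidentRecord]) -> float:
    """Percent of qualifying incidents with a completed RCA."""
    return 100 * sum(r.rca_completed for r in records) / len(records)


def repeat_cause_rate(records: list[IncidentRecord]) -> float:
    """Percent of completed RCAs whose root cause was already seen earlier."""
    seen, repeats, total = set(), 0, 0
    for r in sorted(records, key=lambda rec: rec.occurred):
        if not r.rca_completed or r.root_cause is None:
            continue
        total += 1
        if r.root_cause in seen:
            repeats += 1
        seen.add(r.root_cause)
    return 100 * repeats / total if total else 0.0


def fmea_update_rate(records: list[IncidentRecord]) -> float:
    """Percent of completed RCAs that led to an FMEA update."""
    done = [r for r in records if r.rca_completed]
    return 100 * sum(r.fmea_updated for r in done) / len(done) if done else 0.0


records = [
    IncidentRecord(date(2026, 1, 10), True, "probe fouling", fmea_updated=True),
    IncidentRecord(date(2026, 2, 3), True, "handover gap", fmea_updated=True),
    IncidentRecord(date(2026, 3, 15), True, "probe fouling"),
    IncidentRecord(date(2026, 4, 1), False),
]
print(f"Completion: {completion_rate(records):.1f}%")      # 75.0%
print(f"Repeat cause: {repeat_cause_rate(records):.1f}%")  # 33.3%
print(f"FMEA update: {fmea_update_rate(records):.1f}%")    # 66.7%
```

In this sample all three figures miss the targets in the table, which is the kind of signal that should trigger a programme-level review.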

RCA tooling

Tooling for RCA at wastewater utilities ranges from paper templates to dedicated RCA modules integrated with the CMMS. Effective spreadsheet templates handle most Five Whys and light fishbone sessions. Dedicated tools and methods such as Sologic, TapRooT, and those from Reliability Center add rigour for larger utilities running multiple RCAs per month. The single most important criterion for tool selection is whether the output can flow back into the CMMS work order library so the identified corrective and preventive actions become executable. Standalone tools that produce a nice report but do not connect to operational systems tend to fade after the initial adoption period.

Scaling RCA to smaller utilities

Small utilities cannot afford full FRAM investigations, but they can absolutely run Five Whys and light fishbone on their qualifying incidents. The critical discipline is not the method choice but closing the loop back to the CMMS PM library and the FMEA.

Frequently asked questions

Which incidents deserve a formal RCA?

Permit exceedances, spills, worker safety events, extended outages, and any incident that has occurred previously (recurrence signals a missed root cause).

How long should an RCA take?

Five Whys: 30 to 60 minutes. Fishbone: half a day. FRAM: 2 to 5 days.

What if the team disagrees on the root cause?

Test each hypothesis. The correct root cause is the one whose fix actually stops the recurrence.

Can we outsource RCA?

Facilitation yes; ownership no. Consultants can facilitate, but the internal team must own the analysis and the corrective actions.

What if we cannot identify a root cause?

Document what you did find. Sometimes the causal chain is genuinely uncertain. That is honest and better than fabricating a cause.

How do we prevent RCA from becoming blame?

Executive sponsorship of a no-blame culture. Focus on systems and processes, not individuals.

Should we share RCA reports across utilities?

Yes where possible. Peer utility RCA sharing (via industry associations) accelerates collective learning.

Does the CMMS need a special RCA module?

Not necessarily. Attaching the RCA report to the incident work order and creating corrective PMs is sufficient for most utilities.

What if the same root cause keeps recurring?

Escalate to programme level. Failures that keep recurring from the same cause usually signal a design or systemic issue that individual PM changes cannot fix.

How does RCA feed FMEA?

Each RCA output should update the FMEA occurrence and detection scores for the affected asset class. See FMEA for wastewater plants.
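
As a rough illustration of that update, the sketch below uses the conventional risk priority number (severity × occurrence × detection), with each score assumed to be on a 1 to 10 scale. The `FmeaEntry` class, the `apply_rca` helper, and the scores are hypothetical. The scoring scheme is a common FMEA convention rather than something this article specifies.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FmeaEntry:
    asset_class: str
    failure_mode: str
    severity: int    # 1 (negligible) to 10 (severe)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost certain to detect) to 10 (unlikely to detect)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def apply_rca(entry: FmeaEntry,
              occurrence: Optional[int] = None,
              detection: Optional[int] = None) -> FmeaEntry:
    """Return a new entry with occurrence and/or detection revised after an RCA."""
    return replace(
        entry,
        occurrence=entry.occurrence if occurrence is None else occurrence,
        detection=entry.detection if detection is None else detection,
    )


before = FmeaEntry("Wet well pump", "Mechanical seal failure",
                   severity=6, occurrence=4, detection=5)
# The RCA showed the failure is more frequent and harder to detect than scored.
after = apply_rca(before, occurrence=7, detection=6)
print(before.rpn, after.rpn)  # 120 252
```

Once the corrective PM has been verified as effective, the same helper can record the lower occurrence score, so the FMEA reflects both the lesson and the fix.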

Summary

Root cause analysis is the discipline that turns incidents into permanent fixes. Five Whys handles simple causal chains. Fishbone handles multi-factor investigations. FRAM handles complex socio-technical events. The choice depends on the incident complexity and organisational scope. Regardless of method, the output must feed back into the CMMS PM library and the FMEA scoring, or the RCA is a paper exercise. Utilities that build a rigorous RCA discipline can see recurrent failure rates fall by 60 to 80 percent within two years.

