RCA for wastewater plants: a root-cause investigation playbook

UtilityRadar Team · May 9, 2026 · 8 min read

Root cause analysis turns a one-off incident into a permanent fix. Five Whys, Ishikawa fishbone, and FRAM all work for wastewater — the choice depends on how clean the causal chain is and how far it crosses departmental lines. Done well, RCA closes the loop between an incident and the next FMEA update.

What RCA actually is

Root cause analysis is a structured walk backward from a known failure to the underlying conditions that allowed it. The principle is that visible symptoms (a tripped pump, a permit exceedance, a flooded basement) are almost never the actual cause. Stop the investigation at the symptom and the same failure recurs within months.

A useful RCA produces three outputs: a causal chain from symptom back to root, a set of corrective actions ordered by where in the chain they intervene, and an update to the existing risk picture — usually one or more rows added or revised in the plant's FMEA worksheet.

RCA is not blame allocation. The fastest way to kill an RCA programme is to use it to discipline an operator. Once the operating crew learns that being honest in the post-incident interview costs them, the analyses get shallow and the same failures keep coming back.

When to do RCA

Not every incident justifies a formal RCA. The pragmatic triggers are:

  • Reportable events — permit exceedance, bypass, spill, lost-time injury. The regulator will ask for the analysis; better to have it done before they ask.
  • Repeat failures — the same pump trip three times in six weeks, the same alarm at every wet weather event. Repeat is the diagnostic; RCA finds the actual reason.
  • High-cost incidents — anything that triggered overtime, emergency hire, or compliance reporting beyond the routine.
  • Audit and inspection findings — an external party flagging a gap deserves a structured response, not an immediate procedural patch.
  • High-RPN failures from the FMEA that actually occurred — these are pre-flagged as serious; when they fire, the response should be commensurate.

💡 The 24-hour rule: Start the RCA within 24 hours of the incident, while operator memory is fresh and the SCADA trends are still in the active history. Waiting a week halves the quality of the evidence. Waiting a month means you are reconstructing rather than investigating.
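
The triggers above can be sketched as a simple screening check. This is a minimal illustration, not a standard: the `Incident` fields, the repeat threshold of 3 (borrowed from the "three times in six weeks" example), and the RPN threshold of 200 are all assumptions to replace with the plant's own definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    reportable: bool = False         # permit exceedance, bypass, spill, lost-time injury
    repeats_in_window: int = 1       # occurrences of the same failure, including this one
    high_cost: bool = False          # overtime, emergency hire, non-routine reporting
    audit_finding: bool = False      # raised by an external audit or inspection
    fmea_rpn: Optional[int] = None   # RPN of the matching FMEA row, if one exists


def rca_triggers(incident: Incident,
                 repeat_threshold: int = 3,      # assumed value
                 rpn_threshold: int = 200) -> List[str]:  # assumed value
    """Return every trigger the incident meets; an empty list means no formal RCA."""
    triggers = []
    if incident.reportable:
        triggers.append("reportable event")
    if incident.repeats_in_window >= repeat_threshold:
        triggers.append("repeat failure")
    if incident.high_cost:
        triggers.append("high-cost incident")
    if incident.audit_finding:
        triggers.append("audit or inspection finding")
    if incident.fmea_rpn is not None and incident.fmea_rpn >= rpn_threshold:
        triggers.append("high-RPN failure occurred")
    return triggers


# Third trip of the same pump, FMEA row below the assumed RPN threshold.
pump_trip = Incident(repeats_in_window=3, fmea_rpn=120)
print(rca_triggers(pump_trip))
```

Returning the list of matched triggers, rather than a bare yes/no, keeps the reason for opening the RCA on record.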

Five Whys

The simplest method, and the right one for most technical failures with a clean causal chain. Ask "why" of the failure, then "why" of the answer, and so on, typically four to six times, until the chain reaches a condition you can actually fix.

A worked example:

  • Incident: Lift-station pump 2 tripped at 02:14, triggering a bypass to the storm drain.
  • Why? Motor over-current trip on the VFD.
  • Why? Impeller jammed with a rag mass.
  • Why? Wet well screening was offline for cleaning, no standby in service.
  • Why? The standby grinder has been waiting on a part for nine weeks.
  • Why? The part was ordered as a spot purchase; not on the min/max list because the FMEA marked the grinder non-critical.

The root is not the rag, not the trip, not even the missing part. It is an FMEA classification error whose consequences have been visible for roughly two months, the nine weeks the grinder has waited on its part. The corrective actions are one row change in the FMEA, one min/max update in the parts inventory, and one SOP change for the screening cleaning rotation. The pump itself does not need work.
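
One way to record the worked example is as an ordered chain in which each entry answers "why?" of the one before it. The sketch below is illustrative only; the `split_chain` helper is a hypothetical name, and the chain text is condensed from the example above.

```python
# Each entry answers "why?" of the entry before it.
chain = [
    "Lift-station pump 2 tripped at 02:14, triggering a bypass",
    "Motor over-current trip on the VFD",
    "Impeller jammed with a rag mass",
    "Wet well screening offline for cleaning, no standby in service",
    "Standby grinder waiting on a part for nine weeks",
    "Part was a spot purchase; FMEA marked the grinder non-critical",
]


def split_chain(chain):
    """Split a Five Whys chain into (symptom, intermediate causes, root)."""
    if len(chain) < 2:
        raise ValueError("a chain needs at least a symptom and one cause")
    return chain[0], chain[1:-1], chain[-1]


symptom, intermediates, root = split_chain(chain)
print(f"Symptom: {symptom}")
for depth, cause in enumerate(intermediates, start=1):
    print(f"  why #{depth}: {cause}")
print(f"Root (why #{len(chain) - 1}): {root}")
```

Keeping the intermediate causes separate from the root makes it easier to see where in the chain each corrective action intervenes.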

Ishikawa fishbone

When the causal chain is messy or multi-factor, Five Whys runs out before it reaches anything actionable. The Ishikawa diagram, a fishbone with one major bone per cause category, works better. The standard six categories, adapted for water utilities, are:

  • Method — SOPs, work instructions, run books
  • Machine — the asset itself and its condition
  • Material — influent characteristics, chemicals, parts
  • Manpower — staffing, training, fatigue, handover
  • Measurement — instruments, sampling, data quality
  • Environment — weather, temperature, network state, season

The diagram surfaces multiple contributing causes and their interactions. A storm-driven bypass typically has bones in Environment (rainfall intensity), Machine (storage capacity), Method (operator decision rules), and Measurement (rain gauge upstream coverage). The action plan addresses all four; fixing only the Machine bone is the classic incomplete RCA.
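
The "incomplete RCA" check can be sketched as a comparison between the bones that carry causes and the bones that carry actions. This is a minimal illustration: `uncovered_bones` is a hypothetical helper, and the single action shown is invented for the example.

```python
CATEGORIES = ("Method", "Machine", "Material", "Manpower", "Measurement", "Environment")


def uncovered_bones(causes, actions):
    """Return categories that have contributing causes but no planned action."""
    unknown = (set(causes) | set(actions)) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [c for c in CATEGORIES if causes.get(c) and not actions.get(c)]


# Storm-driven bypass from the text: four bones carry causes.
causes = {
    "Environment": ["rainfall intensity"],
    "Machine": ["storage capacity"],
    "Method": ["operator decision rules"],
    "Measurement": ["rain gauge upstream coverage"],
}
# Hypothetical action plan that only addresses the Machine bone.
actions = {"Machine": ["add storage volume"]}

print(uncovered_bones(causes, actions))
```

For this input the three bones left without an action are Method, Measurement, and Environment, which is exactly the gap the paragraph above warns about.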

FRAM and STAMP for systemic issues

Some failures cross departments and reflect organisational drift rather than a specific equipment failure. A pattern of late discharge monitoring reports is rarely a calibration issue; it is usually a chain involving the lab, the planner, the duty manager, and IT. Five Whys produces a fight; the fishbone produces six bones with no clear priority.

FRAM (Functional Resonance Analysis Method) and STAMP (Systems-Theoretic Accident Model and Processes) are designed for this case. Both treat the plant as a network of functions that interact through expected couplings; they look for couplings that are not what the procedures assume. The output is harder to read than a Five Whys, but it produces durable interventions where the simpler methods produce cosmetic ones.

Use FRAM/STAMP for the second or third recurrence of the same systemic issue. They are slow and analyst-intensive; reserve them for problems that have already resisted the simpler methods.
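
The selection logic across the three methods can be condensed into a rough rule of thumb. The sketch below is one reading of the guidance in this guide, not a formal decision procedure; the function name, its parameters, and the cut-off of two recurrences are assumptions.

```python
def pick_rca_method(clean_chain: bool,
                    crosses_departments: bool,
                    recurrences: int = 0) -> str:
    """Suggest an RCA method from three coarse judgments about the incident.

    recurrences: how many times the same issue has already come back
    after an earlier analysis (assumed cut-off: 2 or more).
    """
    # Systemic, cross-department issues that keep returning justify the slow methods.
    if crosses_departments and recurrences >= 2:
        return "FRAM/STAMP"
    # A clean, single-thread causal chain is Five Whys territory.
    if clean_chain:
        return "Five Whys"
    # Messy or multi-factor chains go to the fishbone.
    return "Ishikawa fishbone"


print(pick_rca_method(clean_chain=True, crosses_departments=False))
print(pick_rca_method(clean_chain=False, crosses_departments=False))
print(pick_rca_method(clean_chain=False, crosses_departments=True, recurrences=2))
```

The inputs are judgment calls, so the output is best treated as a starting suggestion rather than a verdict.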

Closing the loop

An RCA without follow-through is a meeting. Every analysis should produce three categories of action, each with an owner and a due date entered into the CMMS:

  • Corrective — fix the immediate physical or procedural condition. The grinder part order, the SOP update, the calibration re-verification.
  • Preventive — ensure the same chain cannot fire again. New PMs, new condition monitoring, new alarm rules. This is where the RCA links most directly into the preventive/predictive mix.
  • Systemic — address the upstream condition that allowed the failure mode to be unaddressed. Usually a process, training, or organisational change. The hardest to assign, the most valuable when it lands.
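
Before the actions go into the CMMS, a quick completeness check can confirm that all three categories are represented and that every action has an owner and a due date. The sketch below assumes a hypothetical `Action` record; it does not reflect any particular CMMS schema, and the owner and date are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

ACTION_CATEGORIES = ("corrective", "preventive", "systemic")


@dataclass
class Action:
    """Hypothetical RCA action record; not tied to any real CMMS."""
    category: str
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def follow_through_gaps(actions: List[Action]) -> List[str]:
    """List missing categories and actions without an owner or due date."""
    gaps = []
    present = {a.category for a in actions}
    for category in ACTION_CATEGORIES:
        if category not in present:
            gaps.append(f"no {category} action")
    for a in actions:
        if a.owner is None or a.due is None:
            gaps.append(f"'{a.description}' lacks an owner or due date")
    return gaps


# Illustrative plan: owner and date are made up for the example.
plan = [
    Action("corrective", "Order grinder part", "Stores lead", date(2026, 5, 20)),
    Action("preventive", "Add grinder part to min/max list"),
]
for gap in follow_through_gaps(plan):
    print(gap)
```

Here the check reports a missing systemic action and one preventive action with no owner or due date.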

The RCA also feeds back into the FMEA. At minimum, increase the Occurrence score for the failure mode, raise the Detection score if the existing controls missed it (on the usual scale, a higher Detection score means the failure is harder to catch), and recompute the RPN. If the new RPN crosses an action threshold, raise the new work orders the day the RCA closes. This loop is where the long-term CMMS effectiveness numbers actually come from.
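
As a worked illustration of the recompute step, the sketch below uses the conventional RPN = Severity × Occurrence × Detection on 1-10 scales, where a higher Detection score means the failure is harder to detect. The scores and the action threshold of 150 are hypothetical; use the scales and thresholds from the plant's own FMEA procedure.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number on the conventional 1-10 scales."""
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


ACTION_THRESHOLD = 150  # assumed value, for illustration only

# Hypothetical FMEA row before the incident.
before = rpn(severity=7, occurrence=3, detection=4)

# After the RCA: the failure occurred (Occurrence up) and the existing
# controls missed it (Detection up, meaning harder to detect).
after = rpn(severity=7, occurrence=5, detection=6)

print(f"RPN before: {before}, after: {after}")
if before < ACTION_THRESHOLD <= after:
    print("Threshold crossed: raise work orders when the RCA closes")
```

With these example scores the RPN moves from 84 to 210, crossing the assumed threshold.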

Common pitfalls

The four most common ways an RCA programme drifts off the rails:

  • Stopping at "operator error" — if the chain ends with a human action, ask one more why. Operators rarely choose to fail; the question is what conditions made the wrong action the easy one.
  • Blame culture — if the post-incident interview feels like a discipline meeting, the next one will be unproductive. Separate the RCA conversation from any HR process, in time and in room.
  • No follow-through — the RCA produces a 12-page report and 14 actions; six months later, three are done. Track action closure rate as a KPI; fewer than 90% closed within the agreed window is a programme-level issue.
  • No FMEA update — the RCA finds a failure mode the FMEA missed, but nobody updates the worksheet. The next FMEA review starts from the same blind spot.

⚠ The diagnostic question: If the same plant has had three RCAs ending in "operator error" or "communications breakdown" in the last year, the RCA process itself needs an RCA. Those endings almost always mean the analysis stopped one or two whys short of the real cause.
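
The action closure KPI from the pitfalls list can be computed from nothing more than due and closed dates. This is a minimal sketch: the dictionary layout is an assumption, "closed within the agreed window" is read here as closed on or before the due date, and the dates are invented to mirror the 14-actions, 3-done example.

```python
from datetime import date

TARGET = 0.90  # the 90% closure target named in the pitfalls list


def closure_rate(actions, as_of):
    """Share of actions due by `as_of` that were closed on or before their due date.

    Returns None when no action has fallen due yet.
    """
    due = [a for a in actions if a["due"] <= as_of]
    if not due:
        return None
    on_time = [a for a in due
               if a["closed"] is not None and a["closed"] <= a["due"]]
    return len(on_time) / len(due)


# Illustrative data: 14 actions, 3 closed on time, 11 still open.
due_date = date(2026, 3, 1)
actions = (
    [{"due": due_date, "closed": date(2026, 2, 15)} for _ in range(3)]
    + [{"due": due_date, "closed": None} for _ in range(11)]
)

rate = closure_rate(actions, as_of=date(2026, 6, 1))
if rate is None:
    print("No actions due yet")
else:
    print(f"Closure rate: {rate:.0%}")
    if rate < TARGET:
        print("Below target: treat as a programme-level issue")
```

Counting only actions that have already fallen due avoids penalising a programme for work that is not yet late.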