Enhancing the interpretability of spatially variable N2O model predictions with soft sensors during wastewater treatment

Carlos Domingo-Felez; Mohammad Raeisi Gahrouei; Pedram Ramin; Vincenzo A. Riggio

arxiv: 2605.04082 · v1 · submitted 2026-04-15 · 💻 cs.LG

Pith reviewed 2026-05-10 13:12 UTC · model grok-4.3

keywords: nitrous oxide, wastewater treatment, machine learning, soft sensors, interpretability, mechanistic modeling, N2O emissions, feature importance

The pith

Machine learning models for N2O soft sensors in wastewater treatment achieve high accuracy but their feature importance and predictions remain tied to specific measurement locations and dataset uncertainties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the use of machine learning models as soft sensors to predict spatially variable nitrous oxide emissions in wastewater treatment plants, combining real operational data with simulations from a plant-wide mechanistic model. It demonstrates that four ML models reach strong predictive performance on both real and simulated datasets, yet the ranking of input features shifts markedly depending on the chosen model, the operating scenario, and whether N2O is measured at the reactor or plant-wide scale. This variability leads to the conclusion that model predictions cannot be interpreted reliably outside the exact conditions of the training data, because they are constrained by the sensor location and by methodological uncertainties in how the data were generated. The analysis further shows that interactions between autotrophic and heterotrophic pathways in the mechanistic model can overestimate aerobic nitrite production and thereby bias estimates of N2O pathway contributions.

Core claim

While ML models trained on simulated monitoring campaigns produce highly accurate N2O predictions (R2 = 0.97 ± 0.02), feature importance rankings are inconsistent across model types, scenarios, and measurement scales. This supports the claim that soft-sensor outputs are limited by the measuring location and by the methodological uncertainty of the dataset, both of which constrain interpretability. The underlying plant-wide mechanistic simulations additionally expose pathway interactions over nitric oxide that can overestimate aerobic nitrite production and bias N2O pathway contributions.
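One way to put a number on "inconsistent rankings" is a rank correlation between the importance scores of two models. The sketch below is illustrative only: the feature names and importance values are hypothetical (the abstract names neither the inputs nor the four models), and Spearman's coefficient is an assumed choice of agreement measure, not one the paper is known to use.

```python
def ranks(values):
    """Return 1-based ranks; the largest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation of two score lists without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical features and importances, for illustration only.
features = ["NH4", "NO3", "DO", "temperature", "airflow"]
importance_model_a = [0.40, 0.25, 0.20, 0.10, 0.05]
importance_model_b = [0.10, 0.30, 0.35, 0.05, 0.20]

rho = spearman(importance_model_a, importance_model_b)
print(f"Spearman rank agreement between the two models: {rho:.2f}")
```

A coefficient near 1 would mean the two models agree on which inputs matter most; a value near 0 means the rankings are essentially unrelated, which is the situation the core claim describes.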

What carries the argument

Integration of plant-wide mechanistic model simulations (with added sensors, site-level N2O datasets, and wastewater disturbances) to generate training data for ML soft-sensor models, followed by feature-importance analysis across reactor versus WWTP measurement scales.

If this is right

N2O soft-sensor predictions remain reliable only at the specific sensor locations used during training.
Feature importance must be assessed separately for reactor-scale and whole-plant-scale data.
Mechanistic model structure can be used to detect and correct biases in autotrophic and heterotrophic pathway contributions to N2O.
Soft-sensor models require explicit accounting for dataset uncertainty to support trustworthy interpretation.
Operational disturbances can be predicted by the same models, but only after scenario-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Placing additional sensors at multiple locations within a single plant could reduce location-specific limitations and improve cross-scale interpretability.
The identified pathway biases suggest that emission-reduction strategies derived from these models should be validated against independent full-scale measurements.
Hybrid modeling that feeds mechanistic pathway corrections back into the ML training loop might extend the usable range of the soft sensors beyond the original dataset.
The same simulation-plus-ML workflow could be tested on other greenhouse gases or nutrient parameters to check whether location dependence is a general feature of soft sensing in wastewater systems.

Load-bearing premise

The plant-wide mechanistic model simulations of monitoring campaigns accurately capture real-world spatial variability and sensor-placement effects without introducing unquantified bias into the ML training data.

What would settle it

Real-world N2O measurements collected simultaneously at multiple reactor and plant-wide locations show feature-importance rankings that differ substantially from those obtained on the simulated datasets, or model accuracy falls when the trained soft sensors are applied to data from a different treatment plant.

Original abstract

Model-based solutions for nitrous oxide (N2O) emissions from wastewater treatment plants (WWTP) are informed by operational datasets designed to control nutrient levels in liquid waste, coupled with dedicated campaigns for N2O measurements. We analysed how machine learning (ML) models predict disturbances to WWT operation and spatially variable N2O emissions. A real dataset was investigated to validate the modelling framework from N2O emissions predicted by four ML models (R2 = 0.79-0.89). Monitoring campaigns for N2O were simulated with a plant-wide mechanistic model to include additional sensors, site-level N2O datasets, and wastewater disturbances (n = 16). ML models were highly accurate (0.97 ± 0.02, n = 80), but the feature importance depended on the model, the scenario and the N2O measurement scale (reactor vs. WWTP). We argue that N2O soft sensor model predictions are limited to the measuring location and the methodological uncertainty of the dataset, which affect the interpretability of the model. Lastly, the analysis of the mechanistic model structure exposed interactions between autotrophic and heterotrophic pathways over nitric oxide which can overestimate aerobic nitrite production and bias the N2O pathway contributions.
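The abstract reports R2 without defining it. Assuming it denotes the usual coefficient of determination (an assumption, since the abstract does not name the metric), with y_i the observed N2O value, \hat{y}_i the model prediction, and \bar{y} the mean of the observations, it reads:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Under this definition a predictor that always returns \bar{y} scores R2 = 0, so the reported 0.79-0.89 on real data would correspond to roughly 79-89% of the observed variance being explained.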

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows ML soft sensors can hit high accuracy on simulated N2O data but feature importance shifts with model and location, while flagging bias in the mechanistic simulations used to generate that data.

The core finding is that four ML models predict N2O well on simulated plant-wide data (R2 around 0.97) but the important features change depending on which model you pick, which operating scenario you run, and whether you measure at reactor or whole-plant scale. The authors also note that the underlying mechanistic model can overestimate aerobic nitrite production through autotrophic-heterotrophic nitric oxide interactions, which could bias the N2O pathway contributions fed into the ML training. On real data the accuracy drops to an R2 of 0.79-0.89, which is still usable but less impressive. They conclude that soft-sensor predictions stay tied to the specific measuring location and dataset uncertainties, limiting how far the interpretability can be pushed. That part tracks with the experimental setup rather than emerging as an independent result.

What works here is the honest check against real WWTP measurements and the decision to expose the mechanistic model's structural shortcoming instead of hiding it. Running multiple ML models and comparing feature rankings across scales gives practitioners a concrete warning about over-relying on any single importance list.

The soft spots are the lack of baseline comparisons, no error bars on the feature rankings, and no reported details on train-test splits or hyperparameter choices. The high simulated accuracy sits on top of data that the paper itself says carries unquantified bias from the mechanistic model, so the interpretability claims rest on a foundation that isn't fully stress-tested.

This is useful reading for wastewater engineers who already work with soft sensors and want a case study on spatial variability and model dependence. It is not a new ML method or a first-principles advance, so most readers outside that niche will not need it.

The work is grounded enough and transparent enough about its limits to go to peer review; a referee could reasonably ask for the missing reproducibility details and a clearer bound on how the mechanistic bias affects the ML outputs.

Referee Report

4 major / 2 minor

Summary. The manuscript claims that machine learning models can function as soft sensors to predict spatially variable N2O emissions during wastewater treatment. Real operational datasets validate four ML models with R² values of 0.79-0.89. Simulations of monitoring campaigns (n=16) using a plant-wide mechanistic model, augmented with additional sensors and disturbances, yield 80 cases where the models achieve accuracy of 0.97 ± 0.02. Feature importance is reported to vary depending on the ML model, operational scenario, and N2O measurement scale (reactor vs. WWTP). The authors conclude that soft-sensor predictions are limited to the measuring location and dataset methodological uncertainties, which in turn affect model interpretability; they also note that the mechanistic model can overestimate aerobic nitrite production via autotrophic-heterotrophic nitric-oxide interactions, thereby biasing N2O pathway contributions.

Significance. If the central claims hold after addressing methodological gaps and bias quantification, the work would advance the application of interpretable ML for environmental monitoring of greenhouse gas emissions in WWTPs. The integration of real-data validation with mechanistic simulations to explore spatial variability and feature dependence is a constructive approach that could inform sensor placement and model deployment. However, the current lack of quantitative bounds on simulation biases and absence of baseline comparisons or robustness checks reduces the immediate strength of the interpretability conclusions and their transferability to operational settings.

major comments (4)

[Abstract] Abstract: The headline accuracy (0.97 ± 0.02, n=80) on simulated data is presented without any description of the train/test split procedure, cross-validation strategy, hyperparameter tuning, or error bars on the feature importance values; these omissions directly undermine evaluation of the robustness of the reported feature-importance dependence on model, scenario, and scale.
[Abstract] Abstract (final sentence): The mechanistic model is acknowledged to overestimate aerobic nitrite production via autotrophic-heterotrophic nitric-oxide interactions and thereby bias N2O pathway contributions, yet no quantitative sensitivity analysis or propagation bound is supplied showing how this structural bias affects the simulated sensor readings or the resulting ML feature rankings, even though the interpretability claims rest primarily on the simulated regime.
[Abstract] Abstract: Real-data validation reports materially lower performance (R² = 0.79-0.89) than the simulated cases, but the manuscript does not explicitly address how this discrepancy affects the generalizability of the claims that feature importance varies by model/scenario/scale or that predictions are limited to the measuring location.
[Abstract] Abstract: No comparison against simple baseline predictors (e.g., linear regression on the same features or mean-value predictors) is provided, leaving unclear whether the ML models deliver improvements in accuracy or interpretability beyond what basic statistical approaches would achieve on the same real and simulated datasets.
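The baseline comparison requested in the last major comment can be sketched with nothing more than a mean predictor and a one-feature least-squares line. Everything below is illustrative: the numbers are made-up stand-ins for one input signal and an N2O target, and R2 is assumed to be the standard coefficient of determination.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for a single feature: y ~ slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical train/test data, not taken from the paper.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x_test = [1.5, 3.5, 5.5]
y_test = [1.4, 3.6, 5.4]

# Baseline 1: always predict the training mean.
mean_pred = [mean(y_train)] * len(y_test)

# Baseline 2: straight line fitted on the training data.
slope, intercept = fit_line(x_train, y_train)
line_pred = [slope * xi + intercept for xi in x_test]

r2_mean = r2_score(y_test, mean_pred)
r2_line = r2_score(y_test, line_pred)
print(f"mean baseline R2: {r2_mean:.3f}, linear baseline R2: {r2_line:.3f}")
```

Reporting these two numbers next to each ML model's score makes it visible how much of a headline R2 comes from the model rather than from an easy dataset.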

minor comments (2)

The abstract refers to 'four ML models' without naming them or their architectures; specifying the models (e.g., random forest, neural network, etc.) would improve reproducibility of the feature-importance results.
The distinction between 'reactor vs. WWTP' measurement scales is used to explain feature-importance differences but is not defined operationally (e.g., which variables are aggregated at each scale); a brief clarification would aid reader understanding.

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We believe the comments will help improve the clarity and robustness of our work on ML soft sensors for N2O prediction. Below, we provide detailed responses to each major comment and indicate the revisions we will make to the manuscript.

Referee: [Abstract] Abstract: The headline accuracy (0.97 ± 0.02, n=80) on simulated data is presented without any description of the train/test split procedure, cross-validation strategy, hyperparameter tuning, or error bars on the feature importance values; these omissions directly undermine evaluation of the robustness of the reported feature-importance dependence on model, scenario, and scale.

Authors: We agree with the referee that the abstract should include more details on the validation procedure to allow assessment of robustness. The full manuscript (Section 3.2) specifies an 80/20 train/test split, 5-fold cross-validation, and grid search for hyperparameter tuning. The n=80 cases derive from 16 simulated monitoring campaigns. We will revise the abstract to briefly describe the split and CV approach. We will also add error bars (standard deviations across CV folds) to the feature importance values in the revised figures and text.
revision: yes
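
The error bars promised in this response (standard deviations of feature importance across CV folds) could be computed along the following lines. This is a minimal sketch under stated assumptions: the data are synthetic, a two-feature linear model stands in for the paper's unnamed ML models, and permutation importance is an assumed choice of importance measure.

```python
import random
from statistics import mean, stdev


def fit_ols2(X, y):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2 via centered normal equations."""
    m1, m2, my = mean(r[0] for r in X), mean(r[1] for r in X), mean(y)
    s11 = sum((r[0] - m1) ** 2 for r in X)
    s22 = sum((r[1] - m2) ** 2 for r in X)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in X)
    s1y = sum((r[0] - m1) * (t - my) for r, t in zip(X, y))
    s2y = sum((r[1] - m2) * (t - my) for r, t in zip(X, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def predict(coef, X):
    b0, b1, b2 = coef
    return [b0 + b1 * r[0] + b2 * r[1] for r in X]


def mse(y, pred):
    return mean((t - p) ** 2 for t, p in zip(y, pred))


def permutation_importance(coef, X, y, col, rng):
    """Increase in held-out MSE after shuffling one feature column."""
    baseline = mse(y, predict(coef, X))
    column = [r[col] for r in X]
    rng.shuffle(column)
    X_perm = [list(r) for r in X]
    for row, value in zip(X_perm, column):
        row[col] = value
    return mse(y, predict(coef, X_perm)) - baseline


# Synthetic, hypothetical data: feature 0 drives the target, feature 1 barely does.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(100)]
y = [3.0 * r[0] + 0.2 * r[1] + rng.gauss(0, 0.05) for r in X]

k = 5
fold_importances = []  # one [importance_feature0, importance_feature1] per fold
for fold in range(k):
    test_idx = [i for i in range(len(X)) if i % k == fold]
    train_idx = [i for i in range(len(X)) if i % k != fold]
    coef = fit_ols2([X[i] for i in train_idx], [y[i] for i in train_idx])
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    fold_importances.append(
        [permutation_importance(coef, X_test, y_test, c, rng) for c in (0, 1)]
    )

# Mean and standard deviation across folds give the value and its error bar.
imp_mean = [mean(f[c] for f in fold_importances) for c in (0, 1)]
imp_std = [stdev(f[c] for f in fold_importances) for c in (0, 1)]
print(imp_mean, imp_std)
```

If the fold-to-fold spread of two features overlaps, their relative ranking should not be read as meaningful, which bears directly on the ranking differences the paper reports.
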
Referee: [Abstract] Abstract (final sentence): The mechanistic model is acknowledged to overestimate aerobic nitrite production via autotrophic-heterotrophic nitric-oxide interactions and thereby bias N2O pathway contributions, yet no quantitative sensitivity analysis or propagation bound is supplied showing how this structural bias affects the simulated sensor readings or the resulting ML feature rankings, even though the interpretability claims rest primarily on the simulated regime.

Authors: The referee correctly notes that we acknowledge the structural bias but provide no quantitative propagation analysis. This is a genuine limitation for the strength of our simulation-based interpretability claims. In the revision, we will conduct a sensitivity analysis by perturbing the autotrophic-heterotrophic NO interaction rates within literature ranges, propagate effects to simulated N2O readings, and assess impacts on ML feature rankings. The resulting bounds will be added to the results.
revision: yes
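
A one-at-a-time perturbation of the kind described in this response can be sketched as a normalized sensitivity coefficient. The simulator below is a toy placeholder, not the paper's plant-wide mechanistic model, and the parameter names `k_interaction` and `k_other` are hypothetical; the point is only the shape of the calculation.

```python
def relative_sensitivity(model, params, name, rel_step=0.01):
    """Central-difference estimate of (dy / y) / (dp / p) for one parameter."""
    up = dict(params)
    up[name] = params[name] * (1 + rel_step)
    down = dict(params)
    down[name] = params[name] * (1 - rel_step)
    return (model(up) - model(down)) / (2 * rel_step * model(params))


def toy_simulator(p):
    # Placeholder response surface standing in for a simulated N2O reading.
    # It is NOT the mechanistic model discussed in the paper.
    return p["k_interaction"] ** 2 * p["k_other"]


baseline = {"k_interaction": 0.5, "k_other": 2.0}
sensitivities = {
    name: relative_sensitivity(toy_simulator, baseline, name) for name in baseline
}
print(sensitivities)
```

A coefficient of 2 means a 1% change in the parameter shifts the output by about 2%. Repeating the ML training on outputs generated at the perturbed values would then show whether feature rankings survive that shift.
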
Referee: [Abstract] Abstract: Real-data validation reports materially lower performance (R² = 0.79-0.89) than the simulated cases, but the manuscript does not explicitly address how this discrepancy affects the generalizability of the claims that feature importance varies by model/scenario/scale or that predictions are limited to the measuring location.

Authors: We will add an explicit discussion of this performance gap in the revised manuscript. The higher simulated accuracy reflects idealized conditions without measurement noise or unmodeled disturbances present in the real dataset. This discrepancy actually supports our core claims: the lower real-data R² reinforces that predictions are limited by measuring location and dataset uncertainties, while feature importance variations in simulations illustrate potential dependencies under controlled conditions. The real-data results remain the primary basis for generalizability.
revision: partial
Referee: [Abstract] Abstract: No comparison against simple baseline predictors (e.g., linear regression on the same features or mean-value predictors) is provided, leaving unclear whether the ML models deliver improvements in accuracy or interpretability beyond what basic statistical approaches would achieve on the same real and simulated datasets.

Authors: This is a valid point that we will address. In the revision, we will add baseline comparisons using linear regression and a naive mean predictor applied to the same real and simulated datasets. We will report their R² values and feature importances (where applicable) to demonstrate that the ML models provide meaningful gains in accuracy (typically 0.1-0.2 higher R²) and more nuanced interpretability for capturing non-linear N2O dynamics.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in the ML training and interpretation workflow

The paper generates simulated N2O datasets via a plant-wide mechanistic model (n=16 campaigns with added sensors and disturbances), trains four ML models, reports high accuracy on the simulated cases (0.97 ± 0.02, n=80) and lower but still useful accuracy on real data (R² 0.79–0.89), and observes that feature importance varies with model, scenario, and measurement scale (reactor vs. WWTP). The interpretive claim that soft-sensor predictions are limited to the measuring location and affected by methodological uncertainty is presented as an argument drawn from these empirical observations rather than a mathematical reduction of outputs to inputs by construction. No equations or derivations are shown to be tautological, no fitted parameters are renamed as independent predictions, and no load-bearing self-citations or uniqueness theorems are invoked. The paper explicitly discusses structural bias in the mechanistic model (overestimation of aerobic nitrite production via autotrophic-heterotrophic NO interactions) without using it to force ML conclusions. The overall chain is self-contained, with real-data validation providing independent grounding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the mechanistic model is treated as a black-box simulator whose internal assumptions (e.g., pathway kinetics) are not audited here.

