Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

Bogdan Raonić; Samuel Lanthaler; Siddhartha Mishra

arXiv: 2509.25080 · v3 · submitted 2025-09-29 · 💻 cs.LG

Pith reviewed 2026-05-18 12:08 UTC · model grok-4.3

classification 💻 cs.LG

keywords OOD detection · scientific machine learning · diffusion models · regression reliability · task-aware scoring · prediction error correlation · trustworthy AI

The pith

A diffusion model estimating joint likelihood of inputs and predictions yields a score that tracks error in scientific regression tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a task-aware method for spotting when data-driven models will fail on unfamiliar inputs in regression problems common to science. Rather than assessing inputs alone, the approach trains a score-based diffusion model on the combined distribution of inputs and the downstream model's outputs to produce a single reliability score. Experiments across PDE solvers, satellite imagery, and brain-tumor segmentation show this score tracks prediction error closely. The result offers a practical way to gauge trustworthiness before relying on AI outputs in fields where mistakes carry high costs.

Core claim

Estimating the joint likelihood of both an input and a regression model's prediction via a score-based diffusion model produces a reliability score that correlates strongly with actual prediction error on out-of-distribution scientific data.

What carries the argument

Score-based diffusion model trained on the joint distribution of inputs and regression predictions to compute a task-aware likelihood score.
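
The joint-scoring idea can be illustrated on a toy problem. The sketch below is not the paper's estimator: a Gaussian fitted to in-distribution pairs stands in for the score-based diffusion model, and the cubic-polynomial regressor `predict`, the target `sin`, and all sample sizes are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regressor: a cubic polynomial fitted to sin(x) on ID inputs in [-1, 1].
x_train = rng.uniform(-1.0, 1.0, size=2000)
coeffs = np.polyfit(x_train, np.sin(x_train), deg=3)

def predict(x):
    return np.polyval(coeffs, x)

def fit_gaussian(z):
    """Fit mean and covariance; a stand-in density model, NOT a diffusion model."""
    mean = z.mean(axis=0)
    cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(z.shape[1])
    return mean, cov

def log_density(z, mean, cov):
    """Gaussian log-density of each row of z."""
    diff = z - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + z.shape[1] * np.log(2.0 * np.pi))

def joint_pairs(x):
    """Stack each input with the regressor's prediction: z = (x, y_pred)."""
    return np.column_stack([x, predict(x)])

# Density of (x, y_pred) estimated from ID training pairs only.
mean, cov = fit_gaussian(joint_pairs(x_train))

x_id = rng.uniform(-1.0, 1.0, size=500)   # same range as training
x_ood = rng.uniform(2.0, 3.0, size=500)   # shifted range, unseen in training

score_id = log_density(joint_pairs(x_id), mean, cov)
score_ood = log_density(joint_pairs(x_ood), mean, cov)
err_id = np.abs(predict(x_id) - np.sin(x_id))
err_ood = np.abs(predict(x_ood) - np.sin(x_ood))

print(f"ID : mean score {score_id.mean():8.2f}, mean error {err_id.mean():.4f}")
print(f"OOD: mean score {score_ood.mean():8.2f}, mean error {err_ood.mean():.4f}")
```

In this toy setting the shifted inputs receive a much lower joint score and a much larger error. Whether the same holds for high-dimensional scientific data depends on how well the diffusion model captures the joint density, which is the premise discussed further down.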

If this is right

Unreliable predictions can be flagged before they are used in downstream scientific decisions.
The same joint-likelihood approach supplies a concrete numerical certificate for each individual forecast or segmentation.
The method applies uniformly to PDE regression, image-based regression, and segmentation tasks without task-specific redesign.
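
Flagging before use could look like the following sketch. It assumes a simple quantile rule calibrated on in-distribution validation scores; the paper's actual thresholding procedure is not reproduced here, and the function names and all score values are hypothetical.

```python
import numpy as np

def calibrate_threshold(id_scores, quantile=0.05):
    """Return the score below which only `quantile` of ID validation samples fall."""
    return float(np.quantile(np.asarray(id_scores, dtype=float), quantile))

def flag_unreliable(scores, threshold):
    """True where the reliability score is below the calibrated threshold."""
    return np.asarray(scores, dtype=float) < threshold

# Hypothetical joint log-likelihood scores on an ID validation set.
id_validation_scores = np.array([19.8, 20.4, 21.0, 18.9, 20.1, 19.5, 20.7, 21.3])
threshold = calibrate_threshold(id_validation_scores, quantile=0.05)

# Hypothetical scores for three new predictions.
new_scores = np.array([20.2, 16.1, 20.9])
flags = flag_unreliable(new_scores, threshold)
print(threshold, flags)
```

The quantile controls how many in-distribution predictions are rejected by construction, so it trades false alarms against missed failures and would need tuning per application.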

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the joint model to classification outputs could yield analogous reliability scores for non-regression scientific tasks.
Combining the likelihood score with existing ensemble or dropout-based uncertainty estimates may produce tighter error bounds.
Repeated application across successive model updates could track how trustworthiness changes as training data or architectures evolve.

Load-bearing premise

The diffusion model must faithfully capture the joint distribution of inputs and predictions so that the resulting score generalizes beyond the specific datasets tested.

What would settle it

On a new scientific dataset never seen during training or evaluation, the computed likelihood score shows no consistent correlation with measured prediction error.
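
One way to run such a check is a rank correlation between scores and errors on the new dataset. The sketch below assumes Spearman's coefficient as the measure (the paper's own metric may differ). The first two score and error pairs are the values quoted in the Figure 6 and Figure 7 captions; the other two are hypothetical.

```python
import numpy as np

def rankdata(a):
    """1-based ranks; tied values share the mean of their ranks."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    for value in np.unique(a):
        mask = a == value
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])

scores = np.array([19.78, 16.12, 21.0, 18.0])   # last two values are hypothetical
errors = np.array([0.097, 0.227, 0.05, 0.15])   # last two values are hypothetical

rho = spearman(scores, errors)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient near -1 means lower likelihood goes with higher error; a coefficient near 0 on a genuinely new dataset would be the negative result described above.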

Figures

Figures reproduced from arXiv: 2509.25080 by Bogdan Raonić, Samuel Lanthaler, Siddhartha Mishra.

**Figure 2.** Left: Navier-Stokes. L1 Error vs input-only likelihood log pθ(x) for NS-Mix. Right: CIFAR10 Image Classification. Accuracy vs Likelihood Certificate.

**Figure 3.** Performance of the trained MLP fθ on four different target functions. The figure illustrates the target functions, training samples, prediction errors, and overall model performance. Training is conducted with N+ = 100 or N+ = 200, and the results show that the performance on the + set is consistently 3 to 10 times better. For exact error values, refer to the legend in the middle figures.

**Figure 4.** Impact of varying N+ and ν on L2 errors for the regression problems from …

**Figure 5.** (Left) Ground truth values f(x) and predicted values fθ(x), (Center) Samples drawn from the trained diffusion model, (Right) Joint prior log-likelihood log p0(x, y). … positive values of x, a phenomenon known as collapse to the mean value (as described in [32]) occurs. In this region, where f has a high Lipschitz constant, fθ lacks the capacity to accurately approximate the function. The ground truth values …

**Figure 6.** Wave equation. A randomly selected ID sample (q dist.). Absolute L1 error is 0.097. The estimated log likelihood is 19.78. Parameters for this sample are K = 28 and r = −0.79. A posteriori error estimate (defined in B.4) is 0.10 ± 0.02.

**Figure 7.** Wave equation. A randomly selected OOD sample (q dist.). Absolute L1 error is 0.227. The estimated log likelihood is 16.12. Parameters for this sample are K = 31 and r = −0.85. A posteriori error estimate (defined in B.4) is 0.21 ± 0.02.

**Figure 8.** Wave equation. Likelihood–error plane illustrating in-distribution (ID) and out-of …

**Figure 9.** Wave equation. 2d histograms of the values of the parameter …

**Figure 10.** Wave equation. Left: Histogram of estimated likelihoods of the samples drawn from …

**Figure 11.** Evolution of the joint log-likelihood log p(x, ypred) versus the L1 error across training checkpoints of the diffusion model. Likelihoods are estimated using models trained for 100, 200, 250, 300, and 500 epochs. As training progresses, the joint likelihood estimates become more informative for error detection, with the final model (500 epochs) exhibiting a clear correlation between likelihood and predict…

**Figure 12.** Evolution of the median estimated joint log-likelihood on the training distribution …

**Figure 13.** Comparison of the L1 error versus estimated joint log-likelihood log p(x, ypred) at training epochs 400 (left) and 500 (right). The similarity between the two plots indicates that the model’s predictive behavior stabilizes, and the likelihood estimates remain consistent once sufficient training is achieved.

**Figure 14.** Effect of the number of training samples on the stability of classification boundaries …

**Figure 15.** Comparison of L1 errors and estimated joint log-likelihoods log p(x, ypred) for different regression architectures (ViT, UNet, C-FNO) using the same diffusion model. While low likelihoods consistently correspond to high-error samples within each model, the absolute likelihood values are not comparable across models.

**Figure 16.** Absolute L1 error versus likelihood for four experimental settings. Each plot …

**Figure 17.** Ablation Study for the certificate. Training distribution is NS-MIX.

**Figure 18.** Ablation Study for the certificate. Training distribution is NS-Sin-Moderate. Left: …

**Figure 19.** Ablation Study for the certificate. Training distribution is NS-Sines Moderate.

**Figure 20.** Randomly selected samples from the testing distributions in NS-MIX experiment, …

**Figure 21.** Randomly selected samples from the testing distributions in NS-PwC experiment, …

**Figure 22.** Humidity prediction. Left: L1 errors vs. Estimated log likelihood for different testing datasets. Right: Histogram of absolute errors for 12-hour humidity predictions. The comparison is between our trained model and a persistence forecasting baseline, which assumes no change in humidity over time. B.3 Humidity Forecast: In this experiment, we utilize MERRA-2 satellite data to forecast surface-level specifi…

**Figure 23.** Humidity prediction over different testing regions.

**Figure 24.** Fitted exponential curves of regression error as a function of the certificate (estimated …

**Figure 25.** Error fits and corresponding error–certificate histograms for the training distributions.

**Figure 26.** Evolution of the uncertainty bounds for thresholds set at the 65th, 75th, 85th, and …

**Figure 27.** CIFAR Dataset. Samples from the dataset.

**Figure 28.** CIFAR Dataset. Labels passed to the diffusion model

**Figure 29.** MNIST Image Classification. Accuracy vs. Likelihood Certificate.

**Figure 30.** MNIST Dataset. Up: Labels passed to the diffusion model. Down: Samples to be …

**Figure 31.** Left: Median estimated likelihood vs. classification accuracy for each CIFAR-10 …

**Figure 32.** Histograms of estimated likelihoods for CIFAR-10 and SVHN test samples under …

**Figure 33.** SVHN Dataset. Up: Labels passed to the diffusion model. Down: Samples to be …

**Figure 34.** Scatter plots showing the relationship between relative …

**Figure 35.** Histograms illustrating the relationship between segmentation quality and model …

**Figure 36.** An example of HGG brain samples (first row), ground truth segmentation masks …

**Figure 37.** An example of LGGx brain samples (first row), ground truth segmentation masks …

**Figure 38.** An example of HGG-T2 brain samples (first row), ground truth segmentation …

**Figure 39.** Ablation study: diffusion model trained without noise injection into non-semantic …

**Figure 40.** Brain segmentation results for the HGG L2 case. Each plot includes the corresponding …
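
The Figure 24 caption mentions exponential curves fitted to regression error as a function of the certificate. A rough sketch of such a fit follows, under stated assumptions: the functional form error ≈ a·exp(−b·score) and the log-space least-squares fit are guesses made for illustration, not the procedure of the paper's Section B.4, and the data are synthetic.

```python
import numpy as np

def fit_exponential(scores, errors):
    """Least-squares fit of log(error) = log(a) - b * score; returns (a, b)."""
    slope, intercept = np.polyfit(scores, np.log(errors), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predicted_error(score, a, b):
    """Evaluate the fitted curve a * exp(-b * score)."""
    return a * np.exp(-b * np.asarray(score, dtype=float))

# Synthetic data generated from an assumed curve with multiplicative noise.
rng = np.random.default_rng(0)
true_a, true_b = 5.0, 0.2
scores = rng.uniform(14.0, 24.0, size=500)
errors = true_a * np.exp(-true_b * scores) * np.exp(rng.normal(0.0, 0.1, size=500))

a_hat, b_hat = fit_exponential(scores, errors)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
print("predicted error at score 16:", float(predicted_error(16.0, a_hat, b_hat)))
```

Once fitted on data where ground truth is available, a curve of this kind maps a likelihood score to an expected error, which is the role an a posteriori error estimate plays in the Figure 6 and Figure 7 captions.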

Original abstract

Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at https://github.com/bogdanraonic3/OOD_Detection_ScientificML

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows that a diffusion-estimated joint likelihood over the input and the regression output correlates with prediction error on scientific datasets, but the abstract reports no baselines or statistical measures, so the practical gain remains unclear.

The main thing to know is that the authors train a score-based diffusion model on the joint distribution of the input and the regressor's prediction, then treat the resulting likelihood as a task-aware OOD score that tracks prediction error on scientific regression problems. This is a direct extension of existing diffusion OOD work, but the inclusion of the prediction itself is the incremental step that makes it specific to regression rather than generic input detection. They report the correlation on PDE grids, satellite imagery, and brain tumor segmentation, which lines up with the domains where reliable failure detection matters. Public code is a straightforward plus for anyone who wants to reproduce or adapt it. The approach is simple enough that groups already running diffusion models could add it without major new machinery.

The soft spots sit in the validation details. The abstract states a strong correlation but supplies no error bars, significance tests, or head-to-head numbers against input-only likelihoods or other regression OOD baselines. In high-dimensional scientific data the score-matching approximation can introduce systematic bias in low-density regions, and if that bias happens to align with regions where the regressor already struggles, the observed link could be partly artifactual. The central claim therefore rests on whether the full experiments control for this and show the joint score adds signal beyond simpler alternatives.

This paper is aimed at researchers building or deploying regression models in physics, imaging, or medical settings who need a concrete reliability signal rather than a theoretical guarantee. A reader already working on OOD for scientific ML would pick up a usable idea and the code, but would still need to run their own controls to judge whether the correlation survives stronger distribution shifts. I would send it to peer review so the experimental section can be checked for proper baselines and statistical reporting.

Referee Report

2 major / 2 minor

Summary. The paper proposes a task-aware OOD detection method for regression tasks in scientific AI by training a score-based diffusion model on the joint distribution of inputs x and model predictions ŷ, then using the estimated joint likelihood p(x, ŷ) as a reliability score. It reports that this score correlates with prediction error across multiple scientific datasets (PDE, satellite imagery, brain tumor segmentation) and positions the approach as a step toward verifiable certificates of trust for AI-based scientific predictions. Code is released publicly.

Significance. If the reported correlations prove robust under statistical controls and the diffusion approximation generalizes without confounding bias, the method could supply a practical, task-aware reliability signal for high-stakes scientific regression. Public code release supports reproducibility and follow-up work.

major comments (2)

[§4.3, Table 2] §4.3 and Table 2: the central claim that joint likelihood 'strongly correlates with prediction error' is presented without reported p-values, confidence intervals, or controls for multiple-testing across the 'numerous' datasets; the abstract-level summary therefore leaves the statistical support for the correlation unclear.
[§3.1–3.2] §3.1–3.2: the score-based diffusion model is trained on finite ID samples to recover p(x, ŷ); no analysis quantifies approximation error of the ODE likelihood estimator in the high-dimensional regimes of the PDE grids or segmentation maps, nor tests whether such error systematically aligns with regions of high regressor failure.
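
For orientation, the likelihood estimator at issue in this comment is usually written as follows. This is the standard probability-flow ODE formulation for score-based models with forward drift f, diffusion coefficient g, and learned score s_θ; it is assumed here that the paper follows it, and the notation z = (x, ŷ) is introduced only for this sketch.

```latex
% z = (x, \hat{y}): concatenated input and prediction (notation assumed here)
\begin{aligned}
\frac{\mathrm{d}z}{\mathrm{d}t} &= \tilde{f}_\theta(z, t)
  := f(z, t) - \tfrac{1}{2}\, g(t)^2\, s_\theta(z, t), \\
\log p_0\bigl(z(0)\bigr) &= \log p_T\bigl(z(T)\bigr)
  + \int_0^T \nabla_z \cdot \tilde{f}_\theta\bigl(z(t), t\bigr)\, \mathrm{d}t, \\
\nabla_z \cdot \tilde{f}_\theta(z, t) &= \mathbb{E}_{\epsilon}\!\left[\epsilon^\top\,
  \frac{\partial \tilde{f}_\theta}{\partial z}(z, t)\, \epsilon\right],
\quad \mathbb{E}[\epsilon] = 0,\ \operatorname{Cov}(\epsilon) = I.
\end{aligned}
```

Three separate error sources follow from this form: the mismatch between s_θ and the true score, the discretization of the time integral, and the variance of the stochastic trace estimate in the last line. The comment asks for a quantification of how these behave in high dimensions.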

minor comments (2)

[Figure 3] Figure 3 caption: axis labels and color scale for the likelihood–error scatter plots are not defined in the caption or legend, hindering direct interpretation.
[Related Work] Related-work section: the discussion of prior likelihood-based OOD detectors (e.g., those using normalizing flows or VAEs) omits direct quantitative comparison on the same scientific regression benchmarks.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of statistical rigor and methodological limitations that we have addressed through revisions and additional discussion. We respond to each major comment below.

Referee: [§4.3, Table 2] §4.3 and Table 2: the central claim that joint likelihood 'strongly correlates with prediction error' is presented without reported p-values, confidence intervals, or controls for multiple-testing across the 'numerous' datasets; the abstract-level summary therefore leaves the statistical support for the correlation unclear.

Authors: We agree that statistical measures are needed to support the correlation claims. In the revised manuscript, we have updated Table 2 and Section 4.3 to report Pearson correlation coefficients along with p-values for each dataset. We also include 95% bootstrap confidence intervals. To address multiple testing across the datasets, we apply Bonferroni correction and confirm that the key correlations remain statistically significant at the adjusted threshold. The abstract has been revised to note that the correlations are statistically supported rather than simply 'strong'.
revision: yes

Referee: [§3.1–3.2] §3.1–3.2: the score-based diffusion model is trained on finite ID samples to recover p(x, ŷ); no analysis quantifies approximation error of the ODE likelihood estimator in the high-dimensional regimes of the PDE grids or segmentation maps, nor tests whether such error systematically aligns with regions of high regressor failure.

Authors: We acknowledge this limitation in our original analysis. We have added a dedicated paragraph in Section 3.2 discussing sources of approximation error in the probability flow ODE solver, including numerical integration tolerances and finite-sample effects in high dimensions. We also include a new sensitivity analysis in the appendix examining how likelihood estimates vary with the number of ODE function evaluations on PDE data. However, a comprehensive quantification of how estimator error systematically aligns with regions of high regressor failure would require substantial new theoretical and experimental work beyond the current scope.
revision: partial
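
The statistical reporting described in the first response (Pearson coefficients, p-values, 95% bootstrap intervals, Bonferroni correction) could be assembled roughly as below. This is a sketch under assumptions: a permutation test is swapped in for whatever p-value computation the authors would use, the dataset names are placeholders, and all data are synthetic rather than taken from the paper.

```python
import numpy as np

def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def bootstrap_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = np.random.default_rng(seed)
    n = len(a)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        stats[i] = pearson_r(a[idx], b[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def permutation_pvalue(a, b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the null of no association."""
    rng = np.random.default_rng(seed)
    observed = abs(pearson_r(a, b))
    hits = sum(abs(pearson_r(a, rng.permutation(b))) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null only if p < alpha / (number of tests)."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / len(p)

# Synthetic stand-ins for three datasets; none of these numbers come from the paper.
rng = np.random.default_rng(1)
results = {}
for name in ["dataset_a", "dataset_b", "dataset_c"]:
    scores = rng.normal(20.0, 2.0, size=200)
    errors = 0.5 - 0.02 * scores + rng.normal(0.0, 0.01, size=200)
    results[name] = (pearson_r(scores, errors),
                     bootstrap_ci(scores, errors),
                     permutation_pvalue(scores, errors))

rejected = bonferroni_reject([p for _, _, p in results.values()])
for (name, (r, ci, p)), rej in zip(results.items(), rejected):
    print(f"{name}: r={r:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), p={p:.4f}, significant={rej}")
```

Pearson's r measures linear association only; if the error falls off nonlinearly with the likelihood, a rank-based coefficient may describe the relationship better.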

standing simulated objections not resolved

Full quantification of the ODE likelihood estimator's approximation error in high-dimensional regimes and explicit testing of whether this error aligns with high regressor failure regions.

Circularity Check

0 steps flagged

No significant circularity; empirical validation of joint-likelihood OOD score

full rationale

The paper proposes an empirical OOD detection technique that trains a score-based diffusion model on the joint distribution of inputs and model predictions, then reports observed correlation between the resulting likelihood and prediction error across held-out scientific datasets. No derivation chain, uniqueness theorem, or self-citation is invoked to force the central claim; the correlation is presented as an experimental outcome rather than a quantity recovered by construction from fitted parameters or prior self-referential results. The method remains falsifiable on new data distributions and does not reduce the reported reliability score to a renaming or re-fitting of the same quantities used to train the detector.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that score-based diffusion models can reliably approximate the joint density of high-dimensional scientific inputs and model outputs; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Score-based diffusion models can be trained to estimate joint likelihoods of inputs and regression predictions for OOD scoring.
Invoked in the method description as the basis for the task-aware reliability score.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model... log ℓ(y⋆, Ψφ(x⋆)) ≈ log(ε) − log p(x⋆, ypred)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Our work provides a foundational step towards building a verifiable 'certificate of trust'"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

[2]

L. Abdi, F. Caetano, A. Valiuddin, C. Viviers, H. Joudeh, and F. van der Sommen. Out-of-distribution detection in medical imaging via diffusion trajectories.arXiv preprint arXiv:2507.23411, 2025

work page arXiv 2025
[3]

Bodnar, W

C. Bodnar, W. P. Bruinsma, A. Lucic, M. Stanley, A. Allen, J. Brandstetter, P. Garvan, M. Riechert, J. A. Weyn, H. Dong, et al. A foundation model for the earth system.Nature, pages 1–8, 2025

work page 2025
[4]

Cao and Z

S. Cao and Z. Zhang. Deep hybrid models for out-of-distribution detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4733–4743, 2022

work page 2022
[5]

Chali, I

S. Chali, I. Kucher, M. Duranton, and J.-O. Klein. Improving normalizing flows with the approximate mass for out-of-distribution detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 750–758, 2023

work page 2023
[6]

Y. Ding, A. Aleksandrauskas, A. Ahmadian, J. Unger, F. Lindsten, and G. Eilertsen. Revisiting likelihood-based out-of-distribution detection by modeling representations.arXiv preprint arXiv:2504.07793, 2025

work page arXiv 2025
[7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[8]

Drummond and R

N. Drummond and R. Shearer. The open world assumption. IneSI Workshop: The Closed World of Databases meets the Open World of the Semantic Web, volume 15, page 1, 2006

work page 2006
[9]

Elsharkawy and Y

I. Elsharkawy and Y. Kahn. Contrastive Normalizing Flows for Uncertainty-Aware Param- eter Estimation.arXiv preprint arXiv:2505.08709, 2025

work page arXiv 2025
[10]

L. C. Evans.Partial differential equations, volume 19. American Mathematical Society, 2022

work page 2022
[11]

Fanelli, J

C. Fanelli, J. Giroux, and Z. Papandreou. ‘flux+ mutability’: a conditional generative approach to one-class classification and anomaly detection.Machine Learning: Science and Technology, 3(4):045012, 2022

work page 2022
[12]

MERRA-2 tavg1_2d_flx_Nx (M2T1NXFLX): 2D, 1-Hourly, Time-Averaged, Single-Level, Assimilation, Surface Flux Diagnostics, Version 5.12.4

Global Modeling and Assimilation Office (GMAO). MERRA-2 tavg1_2d_flx_Nx (M2T1NXFLX): 2D, 1-Hourly, Time-Averaged, Single-Level, Assimilation, Surface Flux Diagnostics, Version 5.12.4. Goddard Earth Sciences Data and Information Services Center 60 Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI (GES DISC), Greenbelt, MD, USA, 201...

work page 2015
[13]

Goodier and N

J. Goodier and N. D. Campbell. Likelihood-based out-of-distribution detection with denoising diffusion probabilistic models.arXiv preprint arXiv:2310.17432, 2023

work page arXiv 2023
[14]

M. S. Graham, W. H. Pinaya, P.-D. Tudosiu, P. Nachev, S. Ourselin, and J. Cardoso. De- noising diffusion models for out-of-distribution detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2948–2957, 2023

work page 2023
[15]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[18]

A. Heng, H. Soh, et al. Out-of-distribution detection with a single unconditional diffusion model.Advances in Neural Information Processing Systems, 37:43952–43974, 2024

work page 2024
[19]

Herde, B

M. Herde, B. Raonić, T. Rohner, R. Käppeli, R. Molinaro, E. de Bézenac, and S. Mishra. Poseidon: Efficient foundation models for pdes, 2024

work page 2024
[20]

Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10951–10960, 2020

work page 2020
[21]

Järve, K

J. Järve, K. K. Haavel, and M. Kull. Probability density from latent diffusion models for out-of-distribution detection.arXiv preprint arXiv:2508.15737, 2025

work page arXiv 2025
[22]

Kamkari, B

H. Kamkari, B. L. Ross, J. C. Cresswell, A. L. Caterini, R. Krishnan, and G. Loaiza- Ganem. A geometric explanation of the likelihood OOD detection paradox. InForty-first International Conference on Machine Learning, 2024

work page 2024
[23]

Karras, M

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models.Advances in Neural Information Processing Systems, 35:26565–26577, 2022

work page 2022
[24]

K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of- distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

work page 2018
[25]

S. Lee, J. Jo, and S. J. Hwang. Exploring chemical space with score-based out-of-distribution generation. InInternational Conference on Machine Learning, pages 18872–18892. PMLR, 2023. 61 Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

work page 2023
[26]

Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021

work page 2021
[27]

W. Liu, X. Wang, J. Owens, and Y. Li. Energy-based out-of-distribution detection.Advances in neural information processing systems, 33:21464–21475, 2020

work page 2020
[28]

L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators.Nature Machine Intelligence, 3(3):218–229, 2021

work page 2021
[29]

Mahmood, J

A. Mahmood, J. Oliva, and M. Styner. Multiscale score matching for out-of-distribution detection.arXiv preprint arXiv:2010.13132, 2020

work page arXiv 2010
[30]

B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats).IEEE transactions on medical imaging, 34(10):1993–2024, 2014

work page 1993
[31]

Mishra and A

S. Mishra and A. E. Townsend.Numerical Analysis meets Machine Learning. Handbook of Numerical Analysis. Springer, 2024

work page 2024
[32]

Molinaro, S

R. Molinaro, S. Lanthaler, B. Raonić, T. Rohner, V. Armegioiu, S. Simonis, D. Grund, Y. Ramic, Z. Y. Wan, F. Sha, S. Mishra, and L. Zepeda-Núñez. Generative ai for fast and accurate statistical computation of fluids, 2025

work page 2025
[33]

Nalisnick and et al

E. Nalisnick and et al. Why Normalizing Flows Fail to Detect Out-of-Distribution Data. InNeurIPS, 2020

work page 2020
[34]

Nalisnick, A

E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don’t know? InInternational Conference on Learning Representations, 2019

work page 2019
[35]

Nalisnick, A

E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Hybrid models with deep and invertible features. InInternational Conference on Machine Learning, pages 4723–4732. PMLR, 2019

work page 2019
[36]

Nalisnick, A

E. Nalisnick, A. Matsukawa, Y. W. Teh, and B. Lakshminarayanan. Detecting out-of- distribution inputs to deep generative models using typicality, 2019

work page 2019
[37]

Pfaff, M

T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning Mesh-Based Simulation with Graph Networks, June 2021. arXiv:2010.03409 [cs]

work page arXiv 2021
[38]

Pleiss, A

G. Pleiss, A. Souza, J. Kim, B. Li, and K. Q. Weinberger. Neural network out-of-distribution detection for regression tasks. 2019

work page 2019
[39]

Quarteroni and A

A. Quarteroni and A. Valli.Numerical approximation of Partial differential equations, volume 23. Springer, 1994. 62 Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

work page 1994
[40]

Raonic, R

B. Raonic, R. Molinaro, T. De Ryck, T. Rohner, F. Bartolucci, R. Alaifari, S. Mishra, and E. de Bézenac. Convolutional neural operators for robust and accurate learning of pdes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 77187–77200. Curran Associates...

[41] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. Advances in Neural Information Processing Systems, 32, 2019.

[42] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[43] W. Tang and H. Zhao. Score-based diffusion models via stochastic differential equations – a technical tutorial, 2024.

[44] J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12):5635–5662, Dec 2024.

[45] J. Zhang, Q. Fu, X. Chen, L. Du, Z. Li, G. Wang, S. Han, D. Zhang, et al. Out-of-distribution detection based on in-distribution data patterns memorization with modern Hopfield energy. In The Eleventh International Conference on Learning Representations, 2022.


