pith. sign in

arxiv: 2509.25080 · v3 · submitted 2025-09-29 · 💻 cs.LG

Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

Pith reviewed 2026-05-18 12:08 UTC · model grok-4.3

classification 💻 cs.LG
keywords OOD detectionscientific machine learningdiffusion modelsregression reliabilitytask-aware scoringprediction error correlationtrustworthy AI
0
0 comments X

The pith

A diffusion model estimating joint likelihood of inputs and predictions yields a score that tracks error in scientific regression tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a task-aware method for spotting when data-driven models will fail on unfamiliar inputs in regression problems common to science. Rather than assessing inputs alone, the approach trains a score-based diffusion model on the combined distribution of inputs and the downstream model's outputs to produce a single reliability score. Experiments across PDE solvers, satellite imagery, and brain-tumor segmentation show this score tracks prediction error closely. The result offers a practical way to gauge trustworthiness before relying on AI outputs in fields where mistakes carry high costs.

Core claim

Estimating the joint likelihood of both an input and a regression model's prediction via a score-based diffusion model produces a reliability score that correlates strongly with actual prediction error on out-of-distribution scientific data.

What carries the argument

Score-based diffusion model trained on the joint distribution of inputs and regression predictions to compute a task-aware likelihood score.

If this is right

  • Unreliable predictions can be flagged before they are used in downstream scientific decisions.
  • The same joint-likelihood approach supplies a concrete numerical certificate for each individual forecast or segmentation.
  • The method applies uniformly to PDE regression, image-based regression, and segmentation tasks without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the joint model to classification outputs could yield analogous reliability scores for non-regression scientific tasks.
  • Combining the likelihood score with existing ensemble or dropout-based uncertainty estimates may produce tighter error bounds.
  • Repeated application across successive model updates could track how trustworthiness changes as training data or architectures evolve.

Load-bearing premise

The diffusion model must faithfully capture the joint distribution of inputs and predictions so that the resulting score generalizes beyond the specific datasets tested.

What would settle it

On a new scientific dataset never seen during training or evaluation, the computed likelihood score shows no consistent correlation with measured prediction error.

Figures

Figures reproduced from arXiv: 2509.25080 by Bogdan Raoni\'c, Samuel Lanthaler, Siddhartha Mishra.

Figure 1
Figure 1. Figure 1: Illustration of the approach: (A) A regression model [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Left: Navier-Stokes. L1 Error vs input-only likelihood log pθ(x) for NS-Mix. Right: CIFAR10 Image Classification. Accuracy vs Likelihood Certificate. histogram of our 12-hour forecasts against a persistence baseline (humidity assumed constant). The model clearly outperforms persistence, with its error distribution shifted to the left. 4.4 Classification To complement the regression datasets considered befo… view at source ↗
Figure 3
Figure 3. Figure 3: Performance of the trained MLP fθ on four different target functions. The figure illustrates the target functions, training samples, prediction errors, and overall model performance. Training is conducted with N+ = 100 or N+ = 200, and the results show that the performance on the + set is consistently 3 to 10 times better. For exact error values, refer to the legend in the middle figures. 18 [PITH_FULL_IM… view at source ↗
Figure 4
Figure 4. Figure 4: Impact of varying N+ and ν on L2 errors for the regression problems from [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (Left) Ground truth values f(x) and predicted values fθ(x), (Center) Samples drawn from the trained diffusion model, (Right) Joint prior log-likelihood log p0(x, y). positive values of x, a phenomenon known as collapse to the mean value (as described in [32]) occurs. In this region, where f has a high Lipschitz constant, fθ lacks the capacity to accurately approximate the function. The ground truth values … view at source ↗
Figure 6
Figure 6. Figure 6: Wave equation. A randomly selected ID sample (q dist.). Absolute L1 error is 0.097. The estimated log likelihood is 19.78. Parameters for this samples are K = 28 and r = −0.79. A posteriori error estimate (defined in B.4) is 0.10 ± 0.02 [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Wave equation. A randomly selected OOD sample (q dist.). Absolute L1 error is 0.227. The estimated log likelihood is 16.12. Parameters for this samples are K = 31 and r = −0.85. A posteriori error estimate (defined in B.4) is 0.21 ± 0.02. 16 18 20 22 24 Joint Log Likelihood log p(x, ypred) 0.05 0.10 0.15 0.20 0.25 L1 Error 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Density 14 16 18 20 22 24 Joint Log Likelihood log p(x, … view at source ↗
Figure 8
Figure 8. Figure 8: Wave equation. Likelihood–error plane illustrating in-distribution (ID) and out-of [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Wave equation. 2d histograms of the values of the parameter [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Wave equation. Left: Histogram of estimated likelihoods of the samples drawn from [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Evolution of the joint log-likelihood log p(x, ypred) versus the L1 error across training checkpoints of the diffusion model. Likelihoods are estimated using models trained for 100, 200, 250, 300, and 500 epochs. As training progresses, the joint likelihood estimates become more informative for error detection, with the final model (500 epochs) exhibiting a clear correlation between likelihood and predict… view at source ↗
Figure 12
Figure 12. Figure 12: Evolution of the median estimated joint log-likelihood on the training distribution [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison of the L1 error versus estimated joint log-likelihood log p(x, ypred) at training epochs 400 (left) and 500 (right). The similarity between the two plots indicates that the model’s predictive behavior stabilizes, and the likelihood estimates remain consistent once sufficient training is achieved. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Effect of the number of training samples on the stability of classification boundaries [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Comparison of L1 errors and estimated joint log-likelihoods log p(x, ypred) for different regression architectures (ViT, UNet, C-FNO) using the same diffusion model. While low likelihoods consistently correspond to high-error samples within each model, the absolute likelihood values are not comparable across models. B.1.4 Regression Model Architecture Ablation In this section, we evaluate the proposed fra… view at source ↗
Figure 16
Figure 16. Figure 16: Absolute L1 error versus likelihood for four experimental settings. Each plot [PITH_FULL_IMAGE:figures/full_fig_p032_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Ablation Study for the certificate. Training distribution is NS-MIX. [PITH_FULL_IMAGE:figures/full_fig_p033_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Ablation Study for the certificate. Training distribution is NS-Sin-Moderate. Left: [PITH_FULL_IMAGE:figures/full_fig_p034_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Ablation Study for the certificate. Training distribution is NS-Sines Moderate. [PITH_FULL_IMAGE:figures/full_fig_p034_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Randomly selected samples from the testing distributions in NS-MIX experiment, [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Randomly selected samples from the testing distributions in NS-PwC experiment, [PITH_FULL_IMAGE:figures/full_fig_p036_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Humidity prediction. Left: L1 errors vs. Estimated log likelihood for different testing datasets. Right: Histogram of absolute errors for 12-hour humidity predictions. The comparison is between our trained model and a persistence forecasting baseline, which assumes no change in humidity over time. B.3 Humidity Forecast In this experiment, we utilize MERRA-2 satellite data to forecast surface-level specifi… view at source ↗
Figure 23
Figure 23. Figure 23: Humidity prediction over different testing regions. [PITH_FULL_IMAGE:figures/full_fig_p038_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Fitted exponential curves of regression error as a function of the certificate (estimated [PITH_FULL_IMAGE:figures/full_fig_p040_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Error fits and corresponding error–certificate histograms for the training distributions. [PITH_FULL_IMAGE:figures/full_fig_p041_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Evolution of the uncertainty bounds for thresholds set at the 65th, 75th, 85th, and [PITH_FULL_IMAGE:figures/full_fig_p042_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: CIFAR Dataset. Samples from the dataset. [PITH_FULL_IMAGE:figures/full_fig_p044_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: CIFAR Dataset. Labels passed to the diffusion model [PITH_FULL_IMAGE:figures/full_fig_p044_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: MNIST Image Classification. Accuracy vs. Likelihood Certificate. [PITH_FULL_IMAGE:figures/full_fig_p044_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: MNIST Dataset. Up: Labels passed to the diffusion model. Down: Samples to be [PITH_FULL_IMAGE:figures/full_fig_p045_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Left: Median estimated likelihood vs. classification accuracy for each CIFAR-10 [PITH_FULL_IMAGE:figures/full_fig_p047_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Histograms of estimated likelihoods for CIFAR-10 and SVHN test samples under [PITH_FULL_IMAGE:figures/full_fig_p047_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: SVHN Dataset. Up: Labels passed to the diffusion model. Down: Samples to be [PITH_FULL_IMAGE:figures/full_fig_p047_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: Scatter plots showing the relationship between relative [PITH_FULL_IMAGE:figures/full_fig_p050_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Histograms illustrating the relationship between segmentation quality and model [PITH_FULL_IMAGE:figures/full_fig_p050_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: An example of an HGG brain samples (first row), ground truth segmentation masks [PITH_FULL_IMAGE:figures/full_fig_p051_36.png] view at source ↗
Figure 37
Figure 37. Figure 37: An example of an LGGx brain samples (first row), ground truth segmentation masks [PITH_FULL_IMAGE:figures/full_fig_p051_37.png] view at source ↗
Figure 38
Figure 38. Figure 38: An example of an HGG-T2 brain samples (first row), ground truth segmentation [PITH_FULL_IMAGE:figures/full_fig_p052_38.png] view at source ↗
Figure 39
Figure 39. Figure 39: Ablation study: diffusion model trained without noise injection into non-semantic [PITH_FULL_IMAGE:figures/full_fig_p053_39.png] view at source ↗
Figure 40
Figure 40. Figure 40: Brain segmentation results for the HGG L2 case. Each plot includes the corresponding [PITH_FULL_IMAGE:figures/full_fig_p054_40.png] view at source ↗
read the original abstract

Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at https://github.com/bogdanraonic3/OOD_Detection_ScientificML

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a task-aware OOD detection method for regression tasks in scientific AI by training a score-based diffusion model on the joint distribution of inputs x and model predictions ŷ, then using the estimated joint likelihood p(x, ŷ) as a reliability score. It reports that this score correlates with prediction error across multiple scientific datasets (PDE, satellite imagery, brain tumor segmentation) and positions the approach as a step toward verifiable certificates of trust for AI-based scientific predictions. Code is released publicly.

Significance. If the reported correlations prove robust under statistical controls and the diffusion approximation generalizes without confounding bias, the method could supply a practical, task-aware reliability signal for high-stakes scientific regression. Public code release supports reproducibility and follow-up work.

major comments (2)
  1. [§4.3, Table 2] §4.3 and Table 2: the central claim that joint likelihood 'strongly correlates with prediction error' is presented without reported p-values, confidence intervals, or controls for multiple-testing across the 'numerous' datasets; the abstract-level summary therefore leaves the statistical support for the correlation unclear.
  2. [§3.1–3.2] §3.1–3.2: the score-based diffusion model is trained on finite ID samples to recover p(x, ŷ); no analysis quantifies approximation error of the ODE likelihood estimator in the high-dimensional regimes of the PDE grids or segmentation maps, nor tests whether such error systematically aligns with regions of high regressor failure.
minor comments (2)
  1. [Figure 3] Figure 3 caption: axis labels and color scale for the likelihood–error scatter plots are not defined in the caption or legend, hindering direct interpretation.
  2. [Related Work] Related-work section: the discussion of prior likelihood-based OOD detectors (e.g., those using normalizing flows or VAEs) omits direct quantitative comparison on the same scientific regression benchmarks.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of statistical rigor and methodological limitations that we have addressed through revisions and additional discussion. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§4.3, Table 2] §4.3 and Table 2: the central claim that joint likelihood 'strongly correlates with prediction error' is presented without reported p-values, confidence intervals, or controls for multiple-testing across the 'numerous' datasets; the abstract-level summary therefore leaves the statistical support for the correlation unclear.

    Authors: We agree that statistical measures are needed to support the correlation claims. In the revised manuscript, we have updated Table 2 and Section 4.3 to report Pearson correlation coefficients along with p-values for each dataset. We also include 95% bootstrap confidence intervals. To address multiple testing across the datasets, we apply Bonferroni correction and confirm that the key correlations remain statistically significant at the adjusted threshold. The abstract has been revised to note that the correlations are statistically supported rather than simply 'strong'. revision: yes

  2. Referee: [§3.1–3.2] §3.1–3.2: the score-based diffusion model is trained on finite ID samples to recover p(x, ŷ); no analysis quantifies approximation error of the ODE likelihood estimator in the high-dimensional regimes of the PDE grids or segmentation maps, nor tests whether such error systematically aligns with regions of high regressor failure.

    Authors: We acknowledge this limitation in our original analysis. We have added a dedicated paragraph in Section 3.2 discussing sources of approximation error in the probability flow ODE solver, including numerical integration tolerances and finite-sample effects in high dimensions. We also include a new sensitivity analysis in the appendix examining how likelihood estimates vary with the number of ODE function evaluations on PDE data. However, a comprehensive quantification of how estimator error systematically aligns with regions of high regressor failure would require substantial new theoretical and experimental work beyond the current scope. revision: partial

standing simulated objections not resolved
  • Full quantification of the ODE likelihood estimator's approximation error in high-dimensional regimes and explicit testing of whether this error aligns with high regressor failure regions.

Circularity Check

0 steps flagged

No significant circularity; empirical validation of joint-likelihood OOD score

full rationale

The paper proposes an empirical OOD detection technique that trains a score-based diffusion model on the joint distribution of inputs and model predictions, then reports observed correlation between the resulting likelihood and prediction error across held-out scientific datasets. No derivation chain, uniqueness theorem, or self-citation is invoked to force the central claim; the correlation is presented as an experimental outcome rather than a quantity recovered by construction from fitted parameters or prior self-referential results. The method remains falsifiable on new data distributions and does not reduce the reported reliability score to a renaming or re-fitting of the same quantities used to train the detector.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that score-based diffusion models can reliably approximate the joint density of high-dimensional scientific inputs and model outputs; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Score-based diffusion models can be trained to estimate joint likelihoods of inputs and regression predictions for OOD scoring.
    Invoked in the method description as the basis for the task-aware reliability score.

pith-pipeline@v0.9.0 · 5676 in / 1258 out tokens · 29682 ms · 2026-05-18T12:08:07.976417+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

  1. [2]

    L. Abdi, F. Caetano, A. Valiuddin, C. Viviers, H. Joudeh, and F. van der Sommen. Out-of-distribution detection in medical imaging via diffusion trajectories.arXiv preprint arXiv:2507.23411, 2025

  2. [3]

    Bodnar, W

    C. Bodnar, W. P. Bruinsma, A. Lucic, M. Stanley, A. Allen, J. Brandstetter, P. Garvan, M. Riechert, J. A. Weyn, H. Dong, et al. A foundation model for the earth system.Nature, pages 1–8, 2025

  3. [4]

    Cao and Z

    S. Cao and Z. Zhang. Deep hybrid models for out-of-distribution detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4733–4743, 2022

  4. [5]

    Chali, I

    S. Chali, I. Kucher, M. Duranton, and J.-O. Klein. Improving normalizing flows with the approximate mass for out-of-distribution detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 750–758, 2023

  5. [6]

    Y. Ding, A. Aleksandrauskas, A. Ahmadian, J. Unger, F. Lindsten, and G. Eilertsen. Revisiting likelihood-based out-of-distribution detection by modeling representations.arXiv preprint arXiv:2504.07793, 2025

  6. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  7. [8]

    Drummond and R

    N. Drummond and R. Shearer. The open world assumption. IneSI Workshop: The Closed World of Databases meets the Open World of the Semantic Web, volume 15, page 1, 2006

  8. [9]

    Elsharkawy and Y

    I. Elsharkawy and Y. Kahn. Contrastive Normalizing Flows for Uncertainty-Aware Param- eter Estimation.arXiv preprint arXiv:2505.08709, 2025

  9. [10]

    L. C. Evans.Partial differential equations, volume 19. American Mathematical Society, 2022

  10. [11]

    Fanelli, J

    C. Fanelli, J. Giroux, and Z. Papandreou. ‘flux+ mutability’: a conditional generative approach to one-class classification and anomaly detection.Machine Learning: Science and Technology, 3(4):045012, 2022

  11. [12]

    MERRA-2 tavg1_2d_flx_Nx (M2T1NXFLX): 2D, 1-Hourly, Time-Averaged, Single-Level, Assimilation, Surface Flux Diagnostics, Version 5.12.4

    Global Modeling and Assimilation Office (GMAO). MERRA-2 tavg1_2d_flx_Nx (M2T1NXFLX): 2D, 1-Hourly, Time-Averaged, Single-Level, Assimilation, Surface Flux Diagnostics, Version 5.12.4. Goddard Earth Sciences Data and Information Services Center 60 Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI (GES DISC), Greenbelt, MD, USA, 201...

  12. [13]

    Goodier and N

    J. Goodier and N. D. Campbell. Likelihood-based out-of-distribution detection with denoising diffusion probabilistic models.arXiv preprint arXiv:2310.17432, 2023

  13. [14]

    M. S. Graham, W. H. Pinaya, P.-D. Tudosiu, P. Nachev, S. Ourselin, and J. Cardoso. De- noising diffusion models for out-of-distribution detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2948–2957, 2023

  14. [15]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136, 2016

  15. [18]

    A. Heng, H. Soh, et al. Out-of-distribution detection with a single unconditional diffusion model.Advances in Neural Information Processing Systems, 37:43952–43974, 2024

  16. [19]

    Herde, B

    M. Herde, B. Raonić, T. Rohner, R. Käppeli, R. Molinaro, E. de Bézenac, and S. Mishra. Poseidon: Efficient foundation models for pdes, 2024

  17. [20]

    Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10951–10960, 2020

  18. [21]

    Järve, K

    J. Järve, K. K. Haavel, and M. Kull. Probability density from latent diffusion models for out-of-distribution detection.arXiv preprint arXiv:2508.15737, 2025

  19. [22]

    Kamkari, B

    H. Kamkari, B. L. Ross, J. C. Cresswell, A. L. Caterini, R. Krishnan, and G. Loaiza- Ganem. A geometric explanation of the likelihood OOD detection paradox. InForty-first International Conference on Machine Learning, 2024

  20. [23]

    Karras, M

    T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models.Advances in Neural Information Processing Systems, 35:26565–26577, 2022

  21. [24]

    K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of- distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

  22. [25]

    S. Lee, J. Jo, and S. J. Hwang. Exploring chemical space with score-based out-of-distribution generation. InInternational Conference on Machine Learning, pages 18872–18892. PMLR, 2023. 61 Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

  23. [26]

    Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021

  24. [27]

    W. Liu, X. Wang, J. Owens, and Y. Li. Energy-based out-of-distribution detection.Advances in neural information processing systems, 33:21464–21475, 2020

  25. [28]

    L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators.Nature Machine Intelligence, 3(3):218–229, 2021

  26. [29]

    Mahmood, J

    A. Mahmood, J. Oliva, and M. Styner. Multiscale score matching for out-of-distribution detection.arXiv preprint arXiv:2010.13132, 2020

  27. [30]

    B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats).IEEE transactions on medical imaging, 34(10):1993–2024, 2014

  28. [31]

    Mishra and A

    S. Mishra and A. E. Townsend.Numerical Analysis meets Machine Learning. Handbook of Numerical Analysis. Springer, 2024

  29. [32]

    Molinaro, S

    R. Molinaro, S. Lanthaler, B. Raonić, T. Rohner, V. Armegioiu, S. Simonis, D. Grund, Y. Ramic, Z. Y. Wan, F. Sha, S. Mishra, and L. Zepeda-Núñez. Generative ai for fast and accurate statistical computation of fluids, 2025

  30. [33]

    Nalisnick and et al

    E. Nalisnick and et al. Why Normalizing Flows Fail to Detect Out-of-Distribution Data. InNeurIPS, 2020

  31. [34]

    Nalisnick, A

    E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don’t know? InInternational Conference on Learning Representations, 2019

  32. [35]

    Nalisnick, A

    E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Hybrid models with deep and invertible features. InInternational Conference on Machine Learning, pages 4723–4732. PMLR, 2019

  33. [36]

    Nalisnick, A

    E. Nalisnick, A. Matsukawa, Y. W. Teh, and B. Lakshminarayanan. Detecting out-of- distribution inputs to deep generative models using typicality, 2019

  34. [37]

    Pfaff, M

    T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning Mesh-Based Simulation with Graph Networks, June 2021. arXiv:2010.03409 [cs]

  35. [38]

    Pleiss, A

    G. Pleiss, A. Souza, J. Kim, B. Li, and K. Q. Weinberger. Neural network out-of-distribution detection for regression tasks. 2019

  36. [39]

    Quarteroni and A

    A. Quarteroni and A. Valli.Numerical approximation of Partial differential equations, volume 23. Springer, 1994. 62 Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

  37. [40]

    Raonic, R

    B. Raonic, R. Molinaro, T. De Ryck, T. Rohner, F. Bartolucci, R. Alaifari, S. Mishra, and E. de Bézenac. Convolutional neural operators for robust and accurate learning of pdes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 77187–77200. Curran Associates...

  38. [41]

    J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lak- shminarayanan. Likelihood ratios for out-of-distribution detection.Advances in neural information processing systems, 32, 2019

  39. [42]

    Ronneberger, P

    O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer- assisted intervention, pages 234–241. Springer, 2015

  40. [43]

    Tang and H

    W. Tang and H. Zhao. Score-based diffusion models via stochastic differential equations – a technical tutorial, 2024

  41. [44]

    J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12):5635–5662, Dec 2024

  42. [45]

    Zhang, Q

    J. Zhang, Q. Fu, X. Chen, L. Du, Z. Li, G. Wang, S. Han, D. Zhang, et al. Out-of- distribution detection based on in-distribution data patterns memorization with modern hopfield energy. InThe Eleventh International Conference on Learning Representations, 2022. 63