
arxiv: 2605.28076 · v2 · pith:PLY5OKRBnew · submitted 2026-05-27 · 📊 stat.ML · cs.NA · math.NA · nlin.CD · physics.data-an

Diagnosing the conditional-mean barrier in scientific machine-learning surrogates

Pith reviewed 2026-06-29 10:10 UTC · model grok-4.3

classification 📊 stat.ML · cs.NA · math.NA · nlin.CD · physics.data-an
keywords conditional-mean barrier · scientific machine learning · surrogates · aleatoric uncertainty · distributional losses · residual orthogonality · coefficient of determination · one-to-many mappings

The pith

Squared-loss predictors in scientific machine learning reach a conditional-mean barrier where further improvement requires distributional losses instead of point predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In many scientific problems, the same input can map to multiple outputs because of coarse graining or partial observation. Deterministic models trained with squared loss learn the average response but cannot represent the spread around it. The paper introduces two diagnostics for checking whether a model has reached this barrier: testing whether the residuals are orthogonal to the input features, and comparing the coefficient of determination with its explained-variance ceiling, the largest value attainable given the irreducible conditional variance. It also proves that adding random latent variables to a squared-loss model forces it back to predicting the conditional mean. Recognizing the barrier matters because it tells practitioners when to switch from point regression to methods that model the full conditional distribution, for example in closures for fluid dynamics.

Core claim

The conditional-mean barrier occurs when a squared-loss trained surrogate has learned the conditional expectation of the target given the inputs, after which the error is only irreducible aleatoric variance. The paper provides residual-feature orthogonality and the coefficient of determination against its explained-variance ceiling as diagnostics to locate this barrier in finite data, and demonstrates that introducing latent randomness into a squared-loss predictor causes it to collapse back to the conditional mean. Crossing the barrier requires objectives that score entire distributions rather than single points.
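The barrier can be written as a risk decomposition. The notation here is assumed for illustration and is not taken from the paper: $m(X) = \mathbb{E}[Y \mid X]$ is the conditional mean and $f$ is any square-integrable predictor. The standard identity is:

```latex
\mathbb{E}\big[(Y - f(X))^2\big]
  = \underbrace{\mathbb{E}\big[(f(X) - m(X))^2\big]}_{\text{reducible}}
  + \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]}_{\text{irreducible (aleatoric)}}
```

The first term vanishes exactly when $f = m$ almost surely. The second does not depend on $f$, which is why no point predictor trained with squared loss can push the risk below it.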

What carries the argument

The conditional-mean barrier, detected via residual-feature orthogonality and R-squared against the explained-variance ceiling, marks the transition from reducible to irreducible error in squared-loss training.
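As a hedged illustration of the two diagnostics, the sketch below fits degree-2 and degree-9 polynomials by least squares to a synthetic two-branch law and evaluates both checks on held-out data. The data law, the probe features `x**k`, and the names `sample`, `diagnose`, and `NOISE_VAR` are assumptions made for this example. It is not the paper's equation (22) or its code, and the ceiling is computable here only because the synthetic noise variance is known.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
NOISE_VAR = 0.25  # Var(y | x) of the synthetic law below (assumed known)

def sample(n):
    """Hypothetical two-branch law: y = sin(pi x) +/- 0.5 with a hidden branch."""
    x = rng.uniform(-1.0, 1.0, n)
    branch = rng.choice([-1.0, 1.0], n)  # unobserved branch label
    return x, np.sin(np.pi * x) + 0.5 * branch

x_tr, y_tr = sample(10000)
x_te, y_te = sample(10000)

def diagnose(degree):
    """Fit on the training split, then compute both diagnostics on held-out data."""
    coef = P.polyfit(x_tr, y_tr, degree)
    resid = y_te - P.polyval(x_te, coef)
    # Orthogonality check: correlation of residuals with simple probe features.
    probes = [x_te**k for k in range(1, 6)]
    max_corr = max(abs(np.corrcoef(resid, p)[0, 1]) for p in probes)
    # Coefficient of determination on held-out data.
    r2 = 1.0 - resid.var() / y_te.var()
    return max_corr, r2

# Explained-variance ceiling: 1 - E[Var(y|x)] / Var(y).
r2_ceiling = 1.0 - NOISE_VAR / y_te.var()

corr_lo, r2_lo = diagnose(2)  # underfit
corr_hi, r2_hi = diagnose(9)  # high capacity

print(f"ceiling  R2 = {r2_ceiling:.3f}")
print(f"degree 2 R2 = {r2_lo:.3f}, max |corr| = {corr_lo:.3f}")
print(f"degree 9 R2 = {r2_hi:.3f}, max |corr| = {corr_hi:.3f}")
```

Under these assumptions the degree-2 fit should show a clearly nonzero residual-probe correlation and an R² well under the ceiling, while the degree-9 fit should sit near the ceiling with correlations at the sampling-noise level. In real problems the ceiling is unknown and must itself be estimated.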

If this is right

  • Detecting the barrier allows distinguishing deterministic underfitting from inherent variability in the data.
  • Adding latent randomness to a squared-loss model reverts it to the conditional mean predictor.
  • Distributional losses such as negative log-likelihood or moment matching are needed to model uncertainty beyond the barrier.
  • The diagnostics apply to problems like subgrid forcing in simulations and effective response in materials.
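The collapse under latent randomness can be illustrated numerically. The sketch below is a toy check under stated assumptions, not the paper's proof: a linear-in-parameters predictor `f(x, z) = a + b*x + c*z + d*x*z` is fitted by least squares, where `z` is latent noise drawn independently of the data. The data law and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

x = rng.uniform(-1.0, 1.0, n)
y = x + 0.5 * rng.choice([-1.0, 1.0], n)  # two branches, E[y | x] = x
z = rng.standard_normal(n)                # latent input, independent of (x, y)

# Design matrix for f(x, z) = a + b*x + c*z + d*x*z.
A = np.column_stack([np.ones(n), x, z, x * z])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b, c, d = coef

print(f"a = {a:.3f}, b = {b:.3f}  (conditional-mean part)")
print(f"c = {c:.3f}, d = {d:.3f}  (latent-noise part)")
```

Because `z` carries no information about `y`, the squared-loss fit should drive the latent coefficients `c` and `d` toward zero and recover `b` near 1, so the stochastic predictor reduces to the conditional mean.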

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to time-series forecasting where aleatoric noise is common.
  • Practitioners might integrate these diagnostics into training loops to decide when to switch loss functions.
  • Further work could test whether the barrier location depends on model architecture beyond the loss.

Load-bearing premise

The input features and training data are sufficient for a squared-loss model to reach the conditional mean in finite samples.

What would settle it

If a squared-loss model trained to convergence still shows residuals correlated with the features, or if its R² falls short of the explained-variance ceiling, the barrier has not been reached. Separately, observing that a squared-loss model with added latent variables produces predictions that differ from the conditional mean would falsify the collapse result.

Figures

Figures reproduced from arXiv: 2605.28076 by Junfeng Chen.

Figure 1: The conditional-mean barrier. In the deterministic regime (left), increasing model capacity drives the squared-loss risk …
Figure 2: A controlled two-branch experiment. (a) Samples from (22), the two branches …
Figure 3: Residual–feature diagnostics for the two-branch example, for the degree-2 (underfit) and degree-9 (high-capacity) least-squares fits, …
Figure 4: Diagnostic results for the two-scale Lorenz–96 closure. (a) One-step closure …
Figure 5: Lorenz–96 slow-energy statistics from independent long rollouts of the reference, deterministic-mean, and stochastic closures. (a) Slow …
Original abstract

Many problems in computational science and engineering become one-to-many after coarse graining, partial observation, or inverse reconstruction: a resolved state may not determine a unique subgrid forcing, a structural descriptor may not determine a unique effective response, and a low-resolution observation may correspond to many plausible high-resolution fields. In such settings, deterministic surrogates may learn a well-defined mathematical object while still missing application-relevant uncertainty. This tutorial develops a self-contained module centered on the conditional-mean barrier: the point at which a squared-loss predictor has reached the conditional mean and the remaining error is irreducible aleatoric variance. We give two diagnostics for locating this barrier, residual-feature orthogonality and the coefficient of determination against its explained-variance ceiling, and prove that adding latent randomness to a squared-loss predictor collapses it back to the conditional mean. Crossing the barrier therefore requires a loss that scores distributions rather than point predictions. We briefly organize common distributional objectives, including negative log-likelihood, moment and observable matching, variational objectives, adversarial divergences, and score matching, by the feature of the conditional law each targets. The emphasis is the boundary itself and a finite-data procedure for recognizing it, rather than a survey of methods beyond it. CPU-based demonstrations on a two-branch law and a two-scale Lorenz-96 closure problem show how the diagnostics distinguish deterministic underfitting from residual distributional variability.
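To make concrete what it means for a loss to score distributions rather than points, the sketch below compares mean squared error with a Gaussian negative log-likelihood on a synthetic two-branch law. Everything here is an assumption for illustration: the data law, the helper `gaussian_nll`, and the choice of a Gaussian predictive family, which is still a poor model of a bimodal conditional law.

```python
import numpy as np

def gaussian_nll(y, mean, var):
    """Average negative log-likelihood of y under N(mean, var)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y - mean) ** 2 / var)

rng = np.random.default_rng(2)
n = 20000
x = rng.uniform(-1.0, 1.0, n)
y = x + 0.5 * rng.choice([-1.0, 1.0], n)  # E[y | x] = x, Var(y | x) = 0.25

mean = x  # the conditional mean, assumed already learned

# Squared loss sees only the mean: it is identical for all three forecasts.
mse = np.mean((y - mean) ** 2)

# NLL also scores the predicted spread.
nll_narrow = gaussian_nll(y, mean, 0.01)   # overconfident
nll_matched = gaussian_nll(y, mean, 0.25)  # variance equal to Var(y | x)
nll_wide = gaussian_nll(y, mean, 4.0)      # underconfident

print(f"MSE = {mse:.3f}")
print(f"NLL narrow = {nll_narrow:.3f}, matched = {nll_matched:.3f}, wide = {nll_wide:.3f}")
```

All three forecasts share the same mean and therefore the same squared error, but the likelihood prefers the one whose variance matches the conditional variance. Resolving the two branches themselves would need a richer family, such as a mixture.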

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a self-contained tutorial on the conditional-mean barrier for squared-loss predictors in scientific ML surrogates for one-to-many problems. It defines the barrier as the point where the model has reached the conditional expectation E[Y|X] and the remaining error is irreducible aleatoric variance. Two diagnostics are introduced (residual-feature orthogonality and R² against its explained-variance ceiling), along with a proof that injecting latent randomness into a squared-loss model forces collapse back to the conditional mean. The work argues that crossing the barrier requires distributional losses and organizes common objectives by the conditional-law features they target. CPU demonstrations on a two-branch law and a Lorenz-96 closure illustrate the distinction between underfitting and residual variability.

Significance. If the diagnostics prove reliable, the contribution supplies a practical finite-data procedure for recognizing when deterministic surrogates have exhausted squared-loss capacity, informing the switch to probabilistic modeling in coarse-graining, inverse problems, and subgrid closures. The explicit proof of collapse under randomness and the organization of distributional objectives by targeted features of the conditional law are clear strengths; the emphasis on boundary detection rather than method survey is appropriately focused.

major comments (3)
  1. [§3] §3 (Proof of collapse under latent randomness): The argument is conditioned on exact attainment of the conditional mean by the squared-loss minimizer. No analysis is given for the finite-sample regime in which optimization limits, expressivity, or data insufficiency prevent attainment; the diagnostics would then misattribute underfitting to the barrier. This assumption is load-bearing for the claim that the diagnostics reliably locate the barrier without access to the true conditional law.
  2. [§4.1] §4.1 (Residual-feature orthogonality diagnostic): The test is presented as an independent check, yet its finite-sample distribution and power against underfitting (as opposed to barrier crossing) are not derived or bounded. The two-branch and Lorenz-96 examples may satisfy the attainment assumption, but the general case lacks an independent verification procedure.
  3. [§5] §5 (Demonstrations): Both examples are constructed so that the conditional mean is plausibly reachable; no ablation or counter-example is provided where a squared-loss model is deliberately underfit yet the diagnostics are applied, leaving the risk of misdiagnosis untested.
minor comments (2)
  1. [§4.2] Notation for the explained-variance ceiling in the R² diagnostic should be introduced with an explicit equation reference in §4.2 to avoid ambiguity with standard R².
  2. [§6] The organization of distributional objectives in §6 would benefit from a summary table mapping each objective to the conditional feature it targets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the paper's focus on boundary detection for the conditional-mean barrier. We address each major comment below, agreeing where the points identify areas for clarification or strengthening and outlining targeted revisions.

Point-by-point responses
  1. Referee: [§3] §3 (Proof of collapse under latent randomness): The argument is conditioned on exact attainment of the conditional mean by the squared-loss minimizer. No analysis is given for the finite-sample regime in which optimization limits, expressivity, or data insufficiency prevent attainment; the diagnostics would then misattribute underfitting to the barrier. This assumption is load-bearing for the claim that the diagnostics reliably locate the barrier without access to the true conditional law.

    Authors: We agree that the proof in §3 is stated for the population case in which the squared-loss minimizer exactly attains E[Y|X]. In finite samples, optimization or capacity limits could produce underfitting that the diagnostics might misclassify. The residual-feature orthogonality check is intended to flag such cases via nonzero correlations, but we accept that this does not constitute a formal separation. We will add a short discussion paragraph in §3 clarifying the population assumption, noting that practitioners should first verify that training has converged (e.g., via validation loss plateau), and stating that the diagnostics are most reliable once that check passes. This revision makes the scope explicit without altering the core proof. revision: partial

  2. Referee: [§4.1] §4.1 (Residual-feature orthogonality diagnostic): The test is presented as an independent check, yet its finite-sample distribution and power against underfitting (as opposed to barrier crossing) are not derived or bounded. The two-branch and Lorenz-96 examples may satisfy the attainment assumption, but the general case lacks an independent verification procedure.

    Authors: The referee is correct that we provide no analytic finite-sample distribution or power bounds for the orthogonality diagnostic. Such bounds would require strong assumptions on the joint distribution of features and residuals and lie outside the tutorial's intended scope. The diagnostic is offered as a practical, model-agnostic sample correlation test that can be supplemented by permutation or bootstrap procedures in applications. We will revise the opening of §4.1 to label the procedure explicitly as a heuristic finite-data check rather than a formal statistical test, and we will add a brief remark on using resampling to gauge significance. This preserves the emphasis on usability while acknowledging the theoretical gap. revision: partial
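The resampling idea mentioned in this response could look like the following permutation test. This is a minimal sketch under assumptions, not the authors' procedure: the function name `permutation_pvalue`, the synthetic residuals, and the use of absolute Pearson correlation as the test statistic are all hypothetical choices.

```python
import numpy as np

def permutation_pvalue(resid, feature, n_perm=2000, seed=0):
    """P-value for |corr(resid, feature)| under random shuffling of the residuals."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(resid, feature)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(resid)  # breaks any residual-feature link
        if abs(np.corrcoef(shuffled, feature)[0, 1]) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 2000)
noise = 0.5 * rng.choice([-1.0, 1.0], 2000)

# Residuals that are pure aleatoric noise (barrier reached).
p_barrier = permutation_pvalue(noise, x)
# Residuals that still contain a linear signal in x (underfit).
p_underfit = permutation_pvalue(noise + 0.3 * x, x)

print(f"p (barrier)  = {p_barrier:.4f}")
print(f"p (underfit) = {p_underfit:.4f}")
```

A small p-value suggests the residuals still carry structure in the tested feature. The test only detects dependence along the chosen probe, so a large p-value for one feature does not by itself establish that the barrier has been reached.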

  3. Referee: [§5] §5 (Demonstrations): Both examples are constructed so that the conditional mean is plausibly reachable; no ablation or counter-example is provided where a squared-loss model is deliberately underfit yet the diagnostics are applied, leaving the risk of misdiagnosis untested.

    Authors: We accept that the current demonstrations were chosen to illustrate barrier crossing when the conditional mean is attainable, leaving the underfitting case untested. To close this gap we will add a short ablation subsection in §5 that deliberately restricts model capacity on the two-branch example (e.g., a linear predictor on a nonlinear target). The revised text will report that both diagnostics correctly flag residual feature correlation and an R² well below the variance ceiling, thereby indicating underfitting rather than barrier attainment. This addition directly addresses the requested counter-example while remaining within the CPU-scale setting of the original demonstrations. revision: yes
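A deliberately underfit case of the kind described here can be sketched as follows. The data law and names are hypothetical and this is not the authors' planned ablation. It also shows a practical caveat: in-sample least-squares residuals are orthogonal to the fitted features by construction, so the orthogonality check needs probe features outside the model class or held-out data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000
x = rng.uniform(-1.0, 1.0, n)
y = x**2 + 0.5 * rng.choice([-1.0, 1.0], n)  # E[y | x] = x^2, Var(y | x) = 0.25

# Deliberately underfit model: a straight line fitted by least squares.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Correlation with the fitted feature is zero by construction (in-sample).
corr_fitted = abs(np.corrcoef(resid, x)[0, 1])
# Correlation with a probe outside the model class exposes the underfit.
corr_probe = abs(np.corrcoef(resid, x**2)[0, 1])

r2 = 1.0 - resid.var() / y.var()
r2_ceiling = 1.0 - 0.25 / y.var()  # uses the known synthetic noise variance

print(f"|corr(resid, x)|   = {corr_fitted:.2e}")
print(f"|corr(resid, x^2)| = {corr_probe:.3f}")
print(f"R2 = {r2:.3f} vs ceiling {r2_ceiling:.3f}")
```

In this toy setting the linear fit explains almost none of the variance even though the ceiling is well above zero, and only the quadratic probe reveals the missing structure.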

Circularity Check

0 steps flagged

No significant circularity; diagnostics and proof are independent statistical properties

full rationale

The paper's core derivation introduces residual-feature orthogonality and R²-versus-ceiling diagnostics as direct consequences of the definition of conditional expectation (E[residual | features] = 0 and variance decomposition), without defining them from fitted model outputs. The stated proof that latent randomness forces collapse to the conditional mean under squared loss is a standard optimality result for L2, presented as a mathematical fact rather than a self-referential fit. No self-citations, ansatzes, or renamings are invoked as load-bearing steps for the barrier location procedure. The finite-sample attainment assumption is an external modeling premise, not a reduction of the claimed diagnostics to their own inputs.
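The two identities referred to in this rationale can be written out explicitly. The notation is assumed for illustration and is not the paper's: $m(X) = \mathbb{E}[Y \mid X]$, residual $r = Y - m(X)$, and $g$ any bounded measurable function of the features.

```latex
\mathbb{E}\big[\, r \, g(X) \,\big]
  = \mathbb{E}\big[\, g(X)\, \mathbb{E}[\, Y - m(X) \mid X \,] \,\big] = 0,
\qquad
R^2_{\max}
  = \frac{\operatorname{Var}\big(m(X)\big)}{\operatorname{Var}(Y)}
  = 1 - \frac{\mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]}{\operatorname{Var}(Y)} .
```

The first equality is the tower property of conditional expectation and underlies the orthogonality diagnostic. The second follows from the law of total variance and gives the explained-variance ceiling against which a fitted R² is compared.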

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard probability concepts without introducing new fitted parameters or postulated entities.

axioms (1)
  • [standard math] Standard properties of conditional expectation and the decomposition of variance into explained and aleatoric components.
    Invoked to define the barrier at which squared loss reaches its minimum and remaining error is irreducible.


