arxiv: 2502.14698 · v2 · submitted 2025-02-20 · 💻 cs.LG · cs.AI · stat.AP · stat.ML

General Uncertainty Estimation with Delta Variances

Pith reviewed 2026-05-23 02:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.AP · stat.ML
keywords epistemic uncertainty · uncertainty quantification · neural networks · gradient computation · delta variances · weather simulation · machine learning

The pith

Delta Variances estimate epistemic uncertainty for neural networks and their compositions using one gradient computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Delta Variances as a family of algorithms for quantifying epistemic uncertainty induced by limited data. These algorithms require only a single gradient computation and apply directly to neural networks as well as functions built from them, without any modifications to architecture or training. Special cases of the approach recover existing popular methods, and a unified theoretical view leads to a natural extension whose benefit is shown empirically. The method is demonstrated on a weather simulator whose step function is neural-network based, where it achieves competitive performance.

Core claim

Delta Variances form a family of algorithms for epistemic uncertainty quantification that remain computationally efficient at the cost of one gradient computation. The family applies without change to neural networks and to more general functions composed of neural networks. Multiple theoretical derivations are discussed, under which special cases recover popular techniques and a unified perspective emerges; this perspective yields a natural extension that improves empirical results.
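
As a rough illustration of what "one gradient computation" can buy, the sketch below applies the textbook delta-method quadratic form (gradient transposed, times a parameter covariance, times the gradient) to a toy one-hidden-unit model. The model `predict`, its hand-written gradient `grad_predict`, the point estimate `theta_hat`, and the diagonal covariance `sigma` are all illustrative assumptions that do not come from the paper, and the paper's estimators may choose the covariance-like matrix differently.

```python
import math

def predict(theta, x):
    """Toy one-hidden-unit 'network' (assumed for illustration): w2 * tanh(w1 * x)."""
    w1, w2 = theta
    return w2 * math.tanh(w1 * x)

def grad_predict(theta, x):
    """Analytic gradient of predict with respect to theta = (w1, w2)."""
    w1, w2 = theta
    h = math.tanh(w1 * x)
    return [w2 * (1.0 - h * h) * x, h]

def delta_variance(grad, sigma):
    """Quadratic form grad^T Sigma grad, the first-order variance estimate."""
    n = len(grad)
    return sum(grad[i] * sigma[i][j] * grad[j] for i in range(n) for j in range(n))

# Hypothetical point estimate and parameter covariance.
theta_hat = [0.5, 2.0]
sigma = [[0.04, 0.0], [0.0, 0.01]]

g = grad_predict(theta_hat, 1.5)      # the single gradient evaluation
variance = delta_variance(g, sigma)   # epistemic variance estimate at x = 1.5
print(variance)
```

In a realistic setting the gradient would come from an automatic-differentiation library rather than a hand-written derivative; the quadratic form stays the same.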

What carries the argument

Delta Variances, a family of uncertainty estimators obtained from several theoretical derivations that unify related methods and operate through gradient computations.

If this is right

  • The same procedure works on any function built by composing neural networks.
  • No retraining or architectural modification is required.
  • Special cases match well-known existing uncertainty techniques.
  • The unified view produces an extension that improves performance on the tested simulator.
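
The first bullet, that the same procedure carries over to compositions, can be sketched with a toy rollout in which one learned step function is applied repeatedly, loosely mimicking a simulator loop. Everything here is an assumption made for illustration: the scalar step `tanh(w * s + b)`, the parameter values, and the covariance `sigma`. The gradient is accumulated in forward mode only to keep the example dependency-free; an autodiff library would obtain the same gradient in one backward pass.

```python
import math

def rollout_with_grad(theta, s0, steps):
    """Iterate s <- tanh(w * s + b) and carry d s / d theta along the rollout."""
    w, b = theta
    s, ds_dw, ds_db = s0, 0.0, 0.0
    for _ in range(steps):
        new_s = math.tanh(w * s + b)
        local = 1.0 - new_s * new_s  # derivative of tanh at the pre-activation
        # Chain rule through the previous state (right-hand side uses old values).
        ds_dw, ds_db = local * (s + w * ds_dw), local * (1.0 + w * ds_db)
        s = new_s
    return s, [ds_dw, ds_db]

def quadratic_form(grad, sigma):
    """grad^T Sigma grad."""
    n = len(grad)
    return sum(grad[i] * sigma[i][j] * grad[j] for i in range(n) for j in range(n))

# Hypothetical step-function parameters and their covariance.
theta_hat = [0.8, 0.1]
sigma = [[0.01, 0.002], [0.002, 0.01]]

final_state, grad = rollout_with_grad(theta_hat, s0=0.3, steps=5)
rollout_variance = quadratic_form(grad, sigma)
print(final_state, rollout_variance)
```

The quantity of interest here is the state after five steps, so the gradient already accounts for how a parameter perturbation compounds across the composed steps.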

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-gradient property could make the method attractive for very large models where repeated forward passes are prohibitive.
  • Similar derivations might apply to uncertainty in other gradient-based systems such as physics-informed networks.
  • The approach could be tested on sequential decision tasks where uncertainty must be estimated inside a simulator loop.

Load-bearing premise

The derivations remain valid when the functions involved are arbitrary compositions of neural networks.

What would settle it

A direct comparison on a new neural-network composition task where the single-gradient Delta Variance estimates are less accurate than standard multi-sample methods.
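
A comparison of this kind could be prototyped along the following lines: draw parameters from an assumed Gaussian around the point estimate, push each draw through the model, and compare the sample variance of the outputs with the single-gradient estimate. The toy model, the noise scales in `std`, the sample count, and the seed are all hypothetical choices for this sketch and say nothing about the paper's experiments.

```python
import math
import random
import statistics

def predict(theta, x):
    """Toy model (assumed): w2 * tanh(w1 * x)."""
    w1, w2 = theta
    return w2 * math.tanh(w1 * x)

def grad_predict(theta, x):
    """Analytic gradient of predict with respect to (w1, w2)."""
    w1, w2 = theta
    h = math.tanh(w1 * x)
    return [w2 * (1.0 - h * h) * x, h]

theta_hat = [0.5, 2.0]
std = [0.05, 0.05]  # independent Gaussian parameter noise (assumed)
x = 1.5

# Single-gradient estimate: with a diagonal covariance the quadratic form
# reduces to a sum of squared (gradient * standard deviation) terms.
g = grad_predict(theta_hat, x)
delta_estimate = sum((g[i] * std[i]) ** 2 for i in range(2))

# Multi-sample reference: many forward passes with perturbed parameters.
rng = random.Random(0)
samples = [
    predict([rng.gauss(theta_hat[0], std[0]), rng.gauss(theta_hat[1], std[1])], x)
    for _ in range(20000)
]
mc_estimate = statistics.variance(samples)

relative_gap = abs(mc_estimate - delta_estimate) / delta_estimate
print(delta_estimate, mc_estimate, relative_gap)
```

With small parameter noise the two estimates should sit close together; a task on which the gap stays large as the sample count grows would be the kind of evidence this paragraph asks for.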

Figures

Figures reproduced from arXiv: 2502.14698 by Hado van Hasselt, John Shawe-Taylor, Simon Schmitt.

Figure 1: We compare the computational overhead of training […]
Figure 2: Illustrative survival prediction example. Actual epis[…]
Figure 3: Comparison of variance estimators in terms of their […]
Figure 4: To investigate more intricate quantities of interest, […]
Original abstract

Decision makers may suffer from uncertainty induced by limited data. This may be mitigated by accounting for epistemic uncertainty, which is however challenging to estimate efficiently for large neural networks. To this extent we investigate Delta Variances, a family of algorithms for epistemic uncertainty quantification, that is computationally efficient and convenient to implement. It can be applied to neural networks and more general functions composed of neural networks. As an example we consider a weather simulator with a neural-network-based step function inside -- here Delta Variances empirically obtain competitive results at the cost of a single gradient computation. The approach is convenient as it requires no changes to the neural network architecture or training procedure. We discuss multiple ways to derive Delta Variances theoretically noting that special cases recover popular techniques and present a unified perspective on multiple related methods. Finally we observe that this general perspective gives rise to a natural extension and empirically show its benefit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Delta Variances, a family of methods for epistemic uncertainty quantification applicable to neural networks and to arbitrary compositions of neural networks. It claims that these methods require only a single gradient computation, need no changes to architecture or training, and recover known techniques as special cases via multiple theoretical derivations, which together yield a unified perspective. Empirically, it reports competitive results on a weather-simulator example with an NN-based step function and a benefit from a natural extension of the method.

Significance. If the derivations hold under general conditions and the single-example competitiveness generalizes, the approach would offer a convenient, low-cost way to obtain epistemic uncertainty estimates for composite models without retraining or architectural modification, unifying several existing techniques under one framework.

major comments (3)
  1. [Theoretical derivations (multiple sections referenced in abstract)] The central claim of applicability to general compositions f ∘ g with a single gradient rests on unstated assumptions (e.g., linearity of the outer function, bounded higher-order terms, or specific differentiability conditions). No section explicitly enumerates these assumptions or proves the formula holds beyond special cases.
  2. [Empirical evaluation (weather simulator example)] Empirical support is limited to a single weather-simulator case with an NN step function. No additional experiments test nonlinear outer functions, deeper inner networks, or other compositions to substantiate the generalization claim.
  3. [Unified perspective and extension] The unified perspective and natural extension are presented as arising from the general view, but without explicit comparison tables or ablation studies showing how the extension improves over the base Delta Variances or recovered special cases, the benefit remains under-supported.
minor comments (2)
  1. [Introduction / Methods] Notation for Delta Variances and related quantities should be introduced with a dedicated table or equation block early in the manuscript for clarity.
  2. [Abstract / Theoretical sections] The abstract mentions 'multiple ways to derive Delta Variances' but the manuscript would benefit from a short summary table mapping each derivation to its recovered special cases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.

Point-by-point responses
  1. Referee: [Theoretical derivations (multiple sections referenced in abstract)] The central claim of applicability to general compositions f ∘ g with a single gradient rests on unstated assumptions (e.g., linearity of the outer function, bounded higher-order terms, or specific differentiability conditions). No section explicitly enumerates these assumptions or proves the formula holds beyond special cases.

    Authors: The derivations rely on the standard delta-method approximation, which assumes the outer function is differentiable and that higher-order terms are negligible for small perturbations around the mean. These conditions are implicit in the linearization step used throughout the paper. We will add an explicit subsection enumerating the assumptions and the scope under which the general formula applies to compositions. revision: yes

  2. Referee: [Empirical evaluation (weather simulator example)] Empirical support is limited to a single weather-simulator case with an NN step function. No additional experiments test nonlinear outer functions, deeper inner networks, or other compositions to substantiate the generalization claim.

    Authors: The weather-simulator example was selected to illustrate a practical composite model with an NN-based step function. The results demonstrate competitive epistemic uncertainty estimates at the cost of one gradient computation. We acknowledge the limited scope and will expand the discussion section to address generalization limits and outline conditions under which the approach extends to other compositions, without adding new experiments at this stage. revision: partial

  3. Referee: [Unified perspective and extension] The unified perspective and natural extension are presented as arising from the general view, but without explicit comparison tables or ablation studies showing how the extension improves over the base Delta Variances or recovered special cases, the benefit remains under-supported.

    Authors: The unified view is obtained by recovering existing methods as special cases via the different derivations. The empirical benefit of the extension is shown on the weather example. We will add a comparison table of recovered special cases and an ablation study quantifying the extension's improvement in the revised manuscript. revision: yes
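
The delta-method approximation invoked in the first response can be written out as follows. This is the standard first-order argument, not a transcription of the paper's equations; the symbols are assumptions introduced here: $\hat\theta$ for the trained parameters, $\Sigma$ for a covariance describing parameter uncertainty, $u = f \circ g$ for the quantity of interest, and $J_g$ for the Jacobian of the inner function.

```latex
\begin{align}
u(\theta) &\approx u(\hat\theta) + \nabla_\theta u(\hat\theta)^\top (\theta - \hat\theta), \\
\operatorname{Var}[u(\theta)] &\approx \nabla_\theta u(\hat\theta)^\top \, \Sigma \, \nabla_\theta u(\hat\theta), \\
\nabla_\theta (f \circ g)(\hat\theta) &= J_g(\hat\theta)^\top \, \nabla f\big(g(\hat\theta)\big).
\end{align}
```

The first line is where differentiability and negligible higher-order terms are needed; the third line is the chain rule, which is why a single gradient of the composed function suffices once the linearization is accepted.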

Circularity Check

0 steps flagged

No circularity: derivations presented as independent theoretical unifications

Full rationale

The paper states it derives Delta Variances in multiple ways, with special cases recovering known techniques and a unified perspective on related methods. No quoted equations or text exhibit self-definitional reductions (e.g., a quantity defined in terms of itself), fitted parameters renamed as predictions, or load-bearing self-citations whose justification collapses to the current work. The approach is described as applying to general NN compositions without architectural changes, supported by theoretical discussion and one empirical example, making the derivation chain self-contained against external benchmarks rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.


