General Uncertainty Estimation with Delta Variances
Pith reviewed 2026-05-23 02:36 UTC · model grok-4.3
The pith
Delta Variances estimate epistemic uncertainty for neural networks and their compositions using one gradient computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Delta Variances form a family of algorithms for epistemic uncertainty quantification that are computationally efficient, costing a single gradient computation. The family applies without change to neural networks and to more general functions composed of neural networks. The paper discusses multiple theoretical derivations, under which special cases recover popular techniques and a unified perspective emerges; this perspective yields a natural extension that improves empirical results.
What carries the argument
Delta Variances, a family of uncertainty estimators obtained from several theoretical derivations that unify related methods and operate through gradient computations.
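For orientation, the classical delta method that the name refers to can be written as below. This is the textbook form, offered as a reading aid rather than as the paper's exact estimator. The symbols are assumptions of this sketch: f is the quantity of interest (a network output or a function composed of networks), \hat\theta are the trained parameters, and \Sigma is some estimate of the parameter covariance, whose specific choice this summary does not state.

```latex
\operatorname{Var}\!\left[ f(\hat\theta) \right]
  \;\approx\;
  \nabla_\theta f(\hat\theta)^{\top} \, \Sigma \, \nabla_\theta f(\hat\theta)
```

The right-hand side needs only one gradient of f with respect to the parameters, which is consistent with the single-gradient cost described above.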
If this is right
- The same procedure works on any function built by composing neural networks.
- No retraining or architectural modification is required.
- Special cases match well-known existing uncertainty techniques.
- The unified view produces an extension that improves performance on the tested simulator.
Where Pith is reading between the lines
- The single-gradient property could make the method attractive for very large models where repeated forward passes are prohibitive.
- Similar derivations might apply to uncertainty in other gradient-based systems such as physics-informed networks.
- The approach could be tested on sequential decision tasks where uncertainty must be estimated inside a simulator loop.
Load-bearing premise
The derivations remain valid when the functions involved are arbitrary compositions of neural networks.
What would settle it
A direct comparison on a new neural-network composition task where the single-gradient Delta Variance estimates are less accurate than standard multi-sample methods.
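A comparison of this kind can be sketched on a toy problem. The example below is a hypothetical stand-in, not the paper's weather simulator: `rollout` iterates a one-unit tanh step function, the parameter covariance `cov` is assumed rather than estimated from data, and `numerical_gradient` uses central finite differences in place of a single automatic-differentiation pass. It computes the delta-method variance and a Monte Carlo estimate from sampled parameters so the two can be compared.

```python
import numpy as np

def rollout(theta, x0=0.5, steps=5):
    """Toy 'simulator': repeatedly apply a one-unit tanh step function."""
    w, b = theta
    x = x0
    for _ in range(steps):
        x = np.tanh(w * x + b)
    return x

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences; a stand-in for one autodiff gradient pass."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def delta_variance(f, theta, cov):
    """Textbook delta-method variance: g^T cov g with g the gradient of f."""
    g = numerical_gradient(f, theta)
    return float(g @ cov @ g)

def monte_carlo_variance(f, theta, cov, n_samples=20000, seed=0):
    """Multi-sample reference: variance of f over sampled parameters."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(theta, cov, size=n_samples)
    return float(np.var([f(s) for s in samples]))

theta_hat = np.array([0.8, 0.1])   # hypothetical trained parameters
cov = np.diag([1e-4, 1e-4])        # assumed parameter covariance

dv = delta_variance(rollout, theta_hat, cov)
mc = monte_carlo_variance(rollout, theta_hat, cov)
print(f"delta variance: {dv:.3e}, Monte Carlo variance: {mc:.3e}")
```

With a small assumed covariance, `dv` and `mc` should agree closely. The settling experiment described above would look for a composition where they do not.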
Original abstract
Decision makers may suffer from uncertainty induced by limited data. This may be mitigated by accounting for epistemic uncertainty, which is, however, challenging to estimate efficiently for large neural networks. To this end we investigate Delta Variances, a family of algorithms for epistemic uncertainty quantification that is computationally efficient and convenient to implement. It can be applied to neural networks and more general functions composed of neural networks. As an example we consider a weather simulator with a neural-network-based step function inside; here Delta Variances empirically obtain competitive results at the cost of a single gradient computation. The approach is convenient as it requires no changes to the neural network architecture or training procedure. We discuss multiple ways to derive Delta Variances theoretically, noting that special cases recover popular techniques, and present a unified perspective on multiple related methods. Finally, we observe that this general perspective gives rise to a natural extension and empirically show its benefit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Delta Variances, a family of methods for epistemic uncertainty quantification applicable to neural networks and arbitrary compositions of neural networks. It claims that these methods require only a single gradient computation, need no changes to architecture or training, recover known techniques as special cases via multiple theoretical derivations, provide a unified perspective, and yield competitive empirical results on a weather-simulator example with an NN-based step function, plus a beneficial natural extension.
Significance. If the derivations hold under general conditions and the single-example competitiveness generalizes, the approach would offer a convenient, low-cost way to obtain epistemic uncertainty estimates for composite models without retraining or architectural modification, unifying several existing techniques under one framework.
Major comments (3)
- [Theoretical derivations (multiple sections referenced in abstract)] The central claim of applicability to general compositions f ∘ g with a single gradient rests on unstated assumptions (e.g., linearity of the outer function, bounded higher-order terms, or specific differentiability conditions). No section explicitly enumerates these assumptions or proves the formula holds beyond special cases.
- [Empirical evaluation (weather simulator example)] Empirical support is limited to a single weather-simulator case with an NN step function. No additional experiments test nonlinear outer functions, deeper inner networks, or other compositions to substantiate the generalization claim.
- [Unified perspective and extension] The unified perspective and natural extension are presented as arising from the general view, but without explicit comparison tables or ablation studies showing how the extension improves over the base Delta Variances or recovered special cases, the benefit remains under-supported.
Minor comments (2)
- [Introduction / Methods] Notation for Delta Variances and related quantities should be introduced with a dedicated table or equation block early in the manuscript for clarity.
- [Abstract / Theoretical sections] The abstract mentions 'multiple ways to derive Delta Variances' but the manuscript would benefit from a short summary table mapping each derivation to its recovered special cases.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.
- Referee: [Theoretical derivations (multiple sections referenced in abstract)] The central claim of applicability to general compositions f ∘ g with a single gradient rests on unstated assumptions (e.g., linearity of the outer function, bounded higher-order terms, or specific differentiability conditions). No section explicitly enumerates these assumptions or proves the formula holds beyond special cases.
  Authors: The derivations rely on the standard delta-method approximation, which assumes the outer function is differentiable and that higher-order terms are negligible for small perturbations around the mean. These conditions are implicit in the linearization step used throughout the paper. We will add an explicit subsection enumerating the assumptions and the scope under which the general formula applies to compositions.
  Revision: yes
- Referee: [Empirical evaluation (weather simulator example)] Empirical support is limited to a single weather-simulator case with an NN step function. No additional experiments test nonlinear outer functions, deeper inner networks, or other compositions to substantiate the generalization claim.
  Authors: The weather-simulator example was selected to illustrate a practical composite model with an NN-based step function. The results demonstrate competitive epistemic uncertainty estimates at the cost of one gradient computation. We acknowledge the limited scope and will expand the discussion section to address generalization limits and outline conditions under which the approach extends to other compositions, without adding new experiments at this stage.
  Revision: partial
- Referee: [Unified perspective and extension] The unified perspective and natural extension are presented as arising from the general view, but without explicit comparison tables or ablation studies showing how the extension improves over the base Delta Variances or recovered special cases, the benefit remains under-supported.
  Authors: The unified view is obtained by recovering existing methods as special cases via the different derivations. The empirical benefit of the extension is shown on the weather example. We will add a comparison table of recovered special cases and an ablation study quantifying the extension's improvement in the revised manuscript.
  Revision: yes
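The assumption in the first response, that higher-order terms are negligible for small perturbations, can be illustrated with a case that has a closed form. The sketch below is not from the paper: it takes a scalar parameter θ ~ N(μ, σ²) and the outer function f(θ) = θ², for which the exact variance is 4μ²σ² + 2σ⁴ while the delta-method value is 4μ²σ². All function names are hypothetical.

```python
def delta_variance_square(mu, sigma):
    """Delta-method variance of f(theta) = theta**2: (f'(mu))**2 * sigma**2."""
    grad = 2.0 * mu
    return grad ** 2 * sigma ** 2

def exact_variance_square(mu, sigma):
    """Exact variance of theta**2 when theta ~ N(mu, sigma**2)."""
    return 4.0 * mu ** 2 * sigma ** 2 + 2.0 * sigma ** 4

def relative_gap(mu, sigma):
    """Fraction of the exact variance that the linearization misses."""
    exact = exact_variance_square(mu, sigma)
    return (exact - delta_variance_square(mu, sigma)) / exact

# Sweep the perturbation scale at a fixed mean of 1.0.
for sigma in (0.01, 0.1, 1.0):
    print(f"sigma={sigma}: relative gap = {relative_gap(1.0, sigma):.6f}")
```

The gap grows with σ, which matches the authors' statement that the approximation is intended for small perturbations around the mean.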
Circularity Check
No circularity: derivations presented as independent theoretical unifications
The paper states it derives Delta Variances in multiple ways, with special cases recovering known techniques and a unified perspective on related methods. No quoted equations or text exhibit self-definitional reductions (e.g., a quantity defined in terms of itself), fitted parameters renamed as predictions, or load-bearing self-citations whose justification collapses to the current work. The approach is described as applying to general NN compositions without architectural changes, supported by theoretical discussion and one empirical example, making the derivation chain self-contained against external benchmarks rather than circular by construction.
[52]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[53]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...