'In-Between' Uncertainty in Bayesian Neural Networks
Pith reviewed 2026-05-25 14:42 UTC · model grok-4.3
The pith
Mean-field variational inference fails to produce calibrated uncertainty estimates between separated regions of observations in Bayesian neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mean-field variational inference fails to give calibrated uncertainty estimates in between separated regions of observations. This can lead to catastrophically overconfident predictions when testing on out-of-distribution data. The linearised Laplace approximation can handle in-between uncertainty much better for small network architectures.
What carries the argument
The mean-field variational inference approximation, whose factorized posterior prevents it from expressing uncertainty that varies appropriately between separated data clusters.
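For reference, the factorisation in question can be written out as below. This is the standard fully factorised Gaussian form of MFVI; the symbols $\mu_i$, $\sigma_i$, $D$ and $\phi$ are notation assumed here for illustration and are not taken from the paper.

```latex
q_\phi(\mathbf{w}) = \prod_{i=1}^{D} \mathcal{N}\!\left(w_i \mid \mu_i, \sigma_i^2\right),
\qquad
\phi^\star = \arg\max_{\phi} \;
  \mathbb{E}_{q_\phi(\mathbf{w})}\!\left[\log p(\mathcal{D} \mid \mathbf{w})\right]
  - \mathrm{KL}\!\left(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w})\right)
```

Under this form every weight is independent of every other weight, so the approximate posterior has a diagonal covariance and cannot represent correlations between weights.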
If this is right
- Overconfident predictions arise on inputs that fall between training data clusters when using MFVI.
- Applications such as active learning and Bayesian optimisation can be harmed by this form of miscalibration.
- The linearised Laplace approximation avoids the same overconfidence for small architectures.
- Out-of-distribution robustness requires inference methods that capture in-between uncertainty.
Where Pith is reading between the lines
- Alternative posterior approximations that allow parameter dependence might restore proper uncertainty between data clusters.
- The limitation could become more or less severe as network depth or width increases beyond the small architectures studied.
- Testing on high-dimensional inputs with natural gaps between modes would clarify whether the problem appears in realistic settings.
Load-bearing premise
The observed failure of mean-field variational inference to capture in-between uncertainty is a general property of the mean-field approximation rather than an artifact of the specific network sizes, datasets, or implementation details used.
What would settle it
Running the same experiments on a new dataset with clearly separated observation clusters or on networks larger than those tested would show whether MFVI continues to underestimate uncertainty in the gaps.
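A minimal sketch of such a check is given below, under stated assumptions. It builds a 1-D input set with two separated clusters and compares predictive variance in the gap with predictive variance inside the clusters. The model here is exact Bayesian linear regression on fixed Gaussian bump features, used only as a stand-in reference; it is not the paper's MFVI or linearised Laplace setup, and every name and constant (`rbf_features`, `gap_variance_ratio`, the cluster locations, the lengthscale) is hypothetical.

```python
import numpy as np

def rbf_features(x, centers, lengthscale):
    """Map 1-D inputs to fixed Gaussian bump features (hypothetical stand-in model)."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def predictive_variance(x_train, x_test, centers, lengthscale=0.3,
                        noise_std=0.1, prior_std=1.0):
    """Posterior variance of the latent function under Bayesian linear regression.

    For a Gaussian likelihood this depends only on the input locations,
    not on the observed targets, so no targets are needed here.
    """
    phi_train = rbf_features(x_train, centers, lengthscale)
    phi_test = rbf_features(x_test, centers, lengthscale)
    precision = (phi_train.T @ phi_train) / noise_std**2 \
        + np.eye(len(centers)) / prior_std**2
    cov = np.linalg.inv(precision)
    # Row-wise quadratic form phi(x)^T cov phi(x) for each test input.
    return np.einsum("ij,jk,ik->i", phi_test, cov, phi_test)

def gap_variance_ratio(var_gap, var_data):
    """Mean variance in the gap divided by mean variance inside the clusters."""
    return float(np.mean(var_gap) / np.mean(var_data))

# Two separated clusters of training inputs (assumed locations).
rng = np.random.default_rng(0)
x_train = np.concatenate([rng.uniform(-2.0, -1.0, 20),
                          rng.uniform(1.0, 2.0, 20)])
centers = np.linspace(-2.5, 2.5, 26)

x_gap = np.linspace(-0.25, 0.25, 11)                    # between the clusters
x_data = np.concatenate([np.linspace(-1.75, -1.25, 11),
                         np.linspace(1.25, 1.75, 11)])  # inside the clusters

var_gap = predictive_variance(x_train, x_gap, centers)
var_data = predictive_variance(x_train, x_data, centers)
ratio = gap_variance_ratio(var_gap, var_data)
print(f"gap / in-data variance ratio: {ratio:.1f}")
```

The same `gap_variance_ratio` could be applied to variances produced by any approximate inference method. A ratio well above 1 indicates that uncertainty grows in the gap, while a ratio near or below 1 would be the kind of in-between overconfidence the review describes.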
Original abstract
We describe a limitation in the expressiveness of the predictive uncertainty estimate given by mean-field variational inference (MFVI), a popular approximate inference method for Bayesian neural networks. In particular, MFVI fails to give calibrated uncertainty estimates in between separated regions of observations. This can lead to catastrophically overconfident predictions when testing on out-of-distribution data. Avoiding such overconfidence is critical for active learning, Bayesian optimisation and out-of-distribution robustness. We instead find that a classical technique, the linearised Laplace approximation, can handle 'in-between' uncertainty much better for small network architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that mean-field variational inference (MFVI) for Bayesian neural networks fails to produce calibrated uncertainty estimates in the regions between separated clusters of observations, resulting in catastrophically overconfident predictions on out-of-distribution inputs. It contrasts this with the linearised Laplace approximation, which the authors find handles in-between uncertainty more reliably, at least on small network architectures. The finding is presented as an empirical observation with implications for active learning, Bayesian optimisation, and OOD robustness.
Significance. If the reported limitation of MFVI is shown to be general rather than an artifact of specific architectures or training details, the result would be significant for uncertainty quantification in deep learning, as MFVI remains a popular and scalable approximate inference method. The contrast with linearised Laplace provides a concrete, falsifiable comparison that could guide practitioners. The work does not include machine-checked proofs, parameter-free derivations, or open reproducible code, but the empirical claim is in principle testable via controlled experiments.
major comments (2)
- [Experiments] Experiments section: the central claim attributes the in-between uncertainty failure specifically to the mean-field factorization, yet the reported comparisons are restricted to small architectures without systematic ablations on network width, depth, initialization scale, or optimizer hyperparameters that would isolate the mean-field assumption from capacity or optimization effects.
- [Abstract] Abstract and introduction: the claim that MFVI 'fails to give calibrated uncertainty estimates' is stated without accompanying quantitative metrics, error bars, or explicit experimental protocol in the summary text, making it impossible to assess the magnitude or statistical reliability of the reported overconfidence on OOD data.
minor comments (2)
- Notation for the linearised Laplace approximation could be clarified with an explicit equation relating the Hessian or Jacobian to the predictive variance.
- Figure legends should explicitly label which curves correspond to MFVI versus linearised Laplace to improve readability of the in-between uncertainty comparisons.
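One standard way to write the relation requested in the first minor comment is sketched below. It assumes a scalar-output network $f(x; \theta)$, a MAP estimate $\theta^\star$, a Gaussian observation noise variance $\sigma^2$, and a matrix $H$ that is the Hessian of the negative log posterior at $\theta^\star$ (or a positive semi-definite surrogate such as the generalised Gauss-Newton matrix). This notation is an assumption for illustration and may differ from the paper's.

```latex
f(x_*; \theta) \approx f(x_*; \theta^\star) + J(x_*)^\top (\theta - \theta^\star),
\qquad
J(x_*) = \nabla_\theta f(x_*; \theta)\,\big|_{\theta = \theta^\star}

p(\theta \mid \mathcal{D}) \approx \mathcal{N}\!\left(\theta^\star,\, H^{-1}\right)
\;\;\Longrightarrow\;\;
\mathrm{Var}\!\left[y_* \mid x_*, \mathcal{D}\right]
  \approx J(x_*)^\top H^{-1} J(x_*) + \sigma^2
```

The first term is the variance contributed by weight uncertainty, propagated through the linearised network; the second is the observation noise.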
Simulated Authors' Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to strengthen the empirical support for our claims. We address each major comment below and will incorporate revisions to improve clarity and rigor.
- Referee: [Experiments] Experiments section: the central claim attributes the in-between uncertainty failure specifically to the mean-field factorization, yet the reported comparisons are restricted to small architectures without systematic ablations on network width, depth, initialization scale, or optimizer hyperparameters that would isolate the mean-field assumption from capacity or optimization effects.
  Authors: We agree that systematic ablations would better isolate the contribution of the mean-field assumption. Our choice of small architectures was intended to make the in-between uncertainty failure visually and quantitatively clear without confounding effects from high capacity; however, we acknowledge this limits generalizability. In the revised manuscript we will add experiments varying network width (e.g., 50 to 500 hidden units) and depth (1 to 4 layers), as well as different initialization scales and optimizers, while keeping the mean-field vs. linearised Laplace comparison fixed. These results will be reported with error bars over multiple random seeds.
  Revision: yes
- Referee: [Abstract] Abstract and introduction: the claim that MFVI 'fails to give calibrated uncertainty estimates' is stated without accompanying quantitative metrics, error bars, or explicit experimental protocol in the summary text, making it impossible to assess the magnitude or statistical reliability of the reported overconfidence on OOD data.
  Authors: We accept this criticism. The current abstract is deliberately concise, but it should convey the scale of the effect. In the revision we will expand the abstract to include concrete quantitative indicators (e.g., predictive variance on OOD inputs being orders of magnitude lower than on in-distribution data, and negative log-likelihood values) together with a brief statement of the experimental protocol (toy regression tasks with separated clusters, 5 random seeds). Corresponding numbers and error bars will also be added to the introduction.
  Revision: yes
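The quantitative indicators mentioned in the second response can be sketched as follows. This is an illustrative helper under stated assumptions, not the authors' evaluation code: `gaussian_nll` and `summarise_over_seeds` are hypothetical names, the predictive distribution is assumed Gaussian, and the numbers are made up to show how an overconfident variance inflates the negative log-likelihood.

```python
import numpy as np

def gaussian_nll(y, mean, var):
    """Average negative log-likelihood of y under N(mean, var), per test point."""
    y = np.asarray(y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    return float(np.mean(0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)))

def summarise_over_seeds(values):
    """Mean and standard error of a metric across random seeds."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(len(values)))

# Illustrative numbers only (not from the paper): the same prediction error
# scored with a moderate variance and with a far too small variance.
y_true = np.array([1.0])
pred_mean = np.array([0.0])
nll_calibrated = gaussian_nll(y_true, pred_mean, np.array([1.0]))      # about 1.42
nll_overconfident = gaussian_nll(y_true, pred_mean, np.array([0.01]))  # about 48.6

print(nll_calibrated, nll_overconfident)
```

Reporting `summarise_over_seeds` on per-seed NLL values would give the mean and error bar for each method, which is the kind of figure the referee asks for.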
Circularity Check
No significant circularity; empirical observation of MFVI limitation
The paper reports an empirical finding that MFVI produces overconfident predictions between separated data regions, demonstrated via experiments contrasting it with the linearised Laplace approximation on small networks. No derivation chain, equations, or load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central claim rests on direct experimental observation rather than any theoretical reduction that would invoke the listed circularity patterns. The analysis is therefore self-contained against external benchmarks.
Forward citations
Cited by 2 Pith papers
- Optimality of Sub-network Laplace Approximations: New Results and Methods
  Sub-network Laplace approximations always underestimate full-model predictive variance, and two new gradient-based and greedy selection rules provide theoretically grounded improvements.
- Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
  Derives an optimal low-rank subspace for the Laplace approximation in Bayesian neural networks, presents a scalable, better-performing variant, and introduces a new comparison metric.