arxiv: 1906.11537 · v1 · pith:EM53U57Mnew · submitted 2019-06-27 · stat.ML · cs.AI · cs.LG

'In-Between' Uncertainty in Bayesian Neural Networks

Pith reviewed 2026-05-25 14:42 UTC · model grok-4.3

classification: stat.ML · cs.AI · cs.LG
keywords: Bayesian neural networks, mean-field variational inference, uncertainty estimation, out-of-distribution data, linearised Laplace approximation, predictive uncertainty, approximate inference, calibration

The pith

Mean-field variational inference fails to produce calibrated uncertainty estimates between separated regions of observations in Bayesian neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a specific limitation in mean-field variational inference when used for approximate inference in Bayesian neural networks. This method does not assign appropriate uncertainty to regions that lie between distinct clusters of training observations. As a result, predictions on out-of-distribution inputs can be overconfident in ways that affect downstream tasks. The authors contrast this behavior with the linearised Laplace approximation, which produces more suitable uncertainty estimates for small network sizes. The issue matters because reliable uncertainty is required in active learning, Bayesian optimisation, and robustness to distribution shift.

Core claim

Mean-field variational inference fails to give calibrated uncertainty estimates in between separated regions of observations. This can lead to catastrophically overconfident predictions when testing on out-of-distribution data. The linearised Laplace approximation can handle in-between uncertainty much better for small network architectures.

What carries the argument

The mean-field variational inference approximation, whose factorized posterior prevents it from expressing uncertainty that varies appropriately between separated data clusters.
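
The effect of a factorized posterior can be illustrated in a deliberately simplified setting. The sketch below assumes a one-hidden-layer ReLU network with fixed input weights and independent (mean-field) Gaussian output weights; this setting, the function names, and all numerical values are assumptions chosen for illustration, not the paper's experimental setup. Under independence the predictive variance is a sum of squared ReLU features, which is a convex function of the input, so it cannot be larger in a gap than the average of its values on either side.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def predictive_variance(x, in_w, in_b, out_var):
    """Var[f(x)] for f(x) = sum_i w_i * relu(u_i * x + b_i).

    Assumption: input weights (u_i, b_i) are fixed and the output weights
    w_i are independent Gaussians with variances out_var[i] (mean-field).
    Independence removes all cross terms, so the variances simply add.
    """
    return sum(s2 * relu(u * x + b) ** 2
               for u, b, s2 in zip(in_w, in_b, out_var))


rng = random.Random(0)
n_hidden = 50  # hypothetical width, chosen only for the demo
in_w = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
in_b = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
out_var = [rng.uniform(0.01, 1.0) for _ in range(n_hidden)]

# Midpoint convexity check: count how often the variance halfway between
# two inputs exceeds the average of the variances at those inputs.
violations = 0
for _ in range(1000):
    x1 = rng.uniform(-3.0, 3.0)
    x2 = rng.uniform(-3.0, 3.0)
    mid = predictive_variance(0.5 * (x1 + x2), in_w, in_b, out_var)
    avg = 0.5 * (predictive_variance(x1, in_w, in_b, out_var)
                 + predictive_variance(x2, in_w, in_b, out_var))
    if mid > avg + 1e-9:  # small tolerance for floating-point error
        violations += 1

print("midpoint convexity violations:", violations)
```

The check only covers this restricted model; it says nothing directly about networks with tanh activations or with uncertainty on the input weights.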

If this is right

  • Overconfident predictions arise on inputs that fall between training data clusters when using MFVI.
  • Applications such as active learning and Bayesian optimisation can be harmed by this form of miscalibration.
  • The linearised Laplace approximation avoids the same overconfidence for small architectures.
  • Out-of-distribution robustness requires inference methods that capture in-between uncertainty.
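
To make the cost of overconfidence concrete, the following minimal sketch evaluates a Gaussian predictive log-likelihood, the kind of quantity that underlies test log-likelihood comparisons. The function name and the numbers (a prediction error of 1.0, predictive variances of 1e-4 and 1.0) are illustrative assumptions, not values taken from the paper.

```python
import math


def gaussian_log_likelihood(y, mean, var):
    """Log density of y under a Gaussian prediction N(mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


# Same prediction error in both cases; only the claimed variance differs.
y_true, y_pred = 1.0, 0.0
overconfident = gaussian_log_likelihood(y_true, y_pred, var=1e-4)
calibrated = gaussian_log_likelihood(y_true, y_pred, var=1.0)

print("overconfident:", overconfident)
print("calibrated:   ", calibrated)
```

Because the squared error is divided by the predictive variance, a model that reports a tiny variance while being wrong is penalised by several orders of magnitude more than one whose variance matches its error.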

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alternative posterior approximations that allow parameter dependence might restore proper uncertainty between data clusters.
  • The limitation could become more or less severe as network depth or width increases beyond the small architectures studied.
  • Testing on high-dimensional inputs with natural gaps between modes would clarify whether the problem appears in realistic settings.
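
The first point can be illustrated with a two-unit toy model. The sketch below is a hand-constructed example, not something taken from the paper: it assumes two fixed ReLU features, relu(x + 1) and relu(x), and compares a correlated covariance over the two output weights with its diagonal (mean-field) counterpart. The covariance values are hypothetical and chosen so that the correlated case has zero variance at x = -1 and x = 1 but non-zero variance in between.

```python
def relu(z):
    return max(z, 0.0)


def features(x):
    # Two fixed ReLU features (an assumption made for this toy example).
    return [relu(x + 1.0), relu(x)]


def variance(x, cov):
    """Var[f(x)] = phi(x)^T cov phi(x) for f(x) = w^T phi(x), w ~ N(m, cov)."""
    phi = features(x)
    return sum(phi[i] * cov[i][j] * phi[j]
               for i in range(2) for j in range(2))


# Correlated output weights: w2 is perfectly anti-correlated with w1.
full_cov = [[1.0, -2.0], [-2.0, 4.0]]
# Mean-field version: same marginal variances, correlations dropped.
mean_field_cov = [[1.0, 0.0], [0.0, 4.0]]

for x in (-1.0, 0.0, 1.0):
    print(x, variance(x, full_cov), variance(x, mean_field_cov))
```

With the correlated covariance the variance peaks between the two end points; once the off-diagonal terms are dropped it can only grow as x moves to the right, so the bump in the middle is lost.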

Load-bearing premise

The observed failure of mean-field variational inference to capture in-between uncertainty is a general property of the mean-field approximation rather than an artifact of the specific network sizes, datasets, or implementation details used.

What would settle it

Running the same experiments on a new dataset with clearly separated observation clusters or on networks larger than those tested would show whether MFVI continues to underestimate uncertainty in the gaps.
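
One way to build a dataset with a deliberate gap between observation clusters is sketched below. It assumes a simple protocol (sort by one input feature and hold out the middle portion as the test set); the helper name `gap_split`, the `test_fraction` parameter, and the toy rows are hypothetical and are not claimed to reproduce the paper's exact procedure.

```python
def gap_split(rows, feature, test_fraction=1.0 / 3.0):
    """Return (train_idx, test_idx) with the test set in the middle of `feature`.

    rows: list of feature vectors (lists of floats).
    feature: index of the column used to create the gap.
    """
    order = sorted(range(len(rows)), key=lambda i: rows[i][feature])
    n_test = int(round(len(order) * test_fraction))
    start = (len(order) - n_test) // 2
    test_idx = order[start:start + n_test]
    # Training data is everything outside the held-out middle block.
    train_idx = order[:start] + order[start + n_test:]
    return train_idx, test_idx


# Toy one-dimensional inputs in arbitrary order.
rows = [[5.0], [1.0], [8.0], [3.0], [0.0], [7.0], [4.0], [2.0], [6.0]]
train_idx, test_idx = gap_split(rows, feature=0)

print("train values:", sorted(rows[i][0] for i in train_idx))
print("test values: ", sorted(rows[i][0] for i in test_idx))
```

Evaluating a model on the held-out middle block tests exactly the situation described above: inputs that lie between two regions of training data.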

Figures

Figures reproduced from arXiv: 1906.11537 by Andrew Y. K. Foong, José Miguel Hernández-Lobato, Richard E. Turner, Yingzhen Li.

Figure 1. Mean and two standard deviation bars of the predictive distribution for fθ(x) (without output noise).
Figure 2. Average test log-likelihoods on the standard splits for BNNs with one hidden layer (top) and two hidden layers (bottom). There are 50 hidden units in each layer.
Figure 3. Average test log-likelihoods on the gap splits for BNNs with one hidden layer (top) and two hidden layers (bottom). Note the scale on energy and naval, where MAP and MFVI fail catastrophically. There are 50 hidden units in each layer.
Figure 4. Predictive variances (without observation noise) on the 1D dataset. Black lines show x-locations of the data.
Figure 5. Samples from a 2-hidden unit neural network obtained by HMC. Notice how the position of the kinks varies between samples, leading to larger uncertainty in between the 2 datapoints (x1, y1) and (x2, y2), marked by black crosses. (For some of these samples, only one kink is between x1 and x2; the other is to the left of x1.)
Figure 6. Mean and two standard deviation bars of the predictive distribution for fθ(x) (without output noise) using ReLU activations.
Figure 7. Comparison of MFVI-ReLU and linearised Laplace tanh on the standard splits. Positive difference means Laplace performs better than MFVI.
Figure 8. Comparison of MFVI-ReLU and linearised Laplace tanh on the gap splits. Positive difference means Laplace performs better than MFVI. MFVI fails catastrophically on energy and naval.
Original abstract

We describe a limitation in the expressiveness of the predictive uncertainty estimate given by mean-field variational inference (MFVI), a popular approximate inference method for Bayesian neural networks. In particular, MFVI fails to give calibrated uncertainty estimates in between separated regions of observations. This can lead to catastrophically overconfident predictions when testing on out-of-distribution data. Avoiding such overconfidence is critical for active learning, Bayesian optimisation and out-of-distribution robustness. We instead find that a classical technique, the linearised Laplace approximation, can handle 'in-between' uncertainty much better for small network architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that mean-field variational inference (MFVI) for Bayesian neural networks fails to produce calibrated uncertainty estimates in the regions between separated clusters of observations, resulting in catastrophically overconfident predictions on out-of-distribution inputs. It contrasts this with the linearised Laplace approximation, which the authors find handles in-between uncertainty more reliably, at least on small network architectures. The finding is presented as an empirical observation with implications for active learning, Bayesian optimisation, and OOD robustness.

Significance. If the reported limitation of MFVI is shown to be general rather than an artifact of specific architectures or training details, the result would be significant for uncertainty quantification in deep learning, as MFVI remains a popular and scalable approximate inference method. The contrast with linearised Laplace provides a concrete, falsifiable comparison that could guide practitioners. The work does not include machine-checked proofs, parameter-free derivations, or open reproducible code, but the empirical claim is in principle testable via controlled experiments.

major comments (2)
  1. [Experiments] Experiments section: the central claim attributes the in-between uncertainty failure specifically to the mean-field factorization, yet the reported comparisons are restricted to small architectures without systematic ablations on network width, depth, initialization scale, or optimizer hyperparameters that would isolate the mean-field assumption from capacity or optimization effects.
  2. [Abstract] Abstract and introduction: the claim that MFVI 'fails to give calibrated uncertainty estimates' is stated without accompanying quantitative metrics, error bars, or explicit experimental protocol in the summary text, making it impossible to assess the magnitude or statistical reliability of the reported overconfidence on OOD data.
minor comments (2)
  1. Notation for the linearised Laplace approximation could be clarified with an explicit equation relating the Hessian or Jacobian to the predictive variance.
  2. Figure legends should explicitly label which curves correspond to MFVI versus linearised Laplace to improve readability of the in-between uncertainty comparisons.
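
For reference, the relation requested in minor comment 1 is usually written as follows. This is the standard textbook form of the linearised Laplace predictive variance, given here under assumed notation: θ* denotes the MAP estimate, J(x) the Jacobian of the network output with respect to the parameters at θ*, and H the Hessian of the negative log posterior at θ* (or a positive semi-definite surrogate for it). The paper's own symbols and choice of curvature matrix may differ.

```latex
\begin{align}
  f(x;\theta) &\approx f(x;\theta^{*}) + J(x)\,(\theta - \theta^{*}),
  & J(x) &= \left.\frac{\partial f(x;\theta)}{\partial \theta}\right|_{\theta=\theta^{*}}, \\
  q(\theta) &= \mathcal{N}\!\left(\theta^{*},\, H^{-1}\right), \\
  \operatorname{Var}\!\left[f(x)\right] &\approx J(x)\, H^{-1} J(x)^{\top}.
\end{align}
```

Because H^{-1} is generally a dense matrix, this variance includes cross terms between parameters, which a factorized posterior sets to zero.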

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the empirical support for our claims. We address each major comment below and will incorporate revisions to improve clarity and rigor.

  1. Referee: [Experiments] Experiments section: the central claim attributes the in-between uncertainty failure specifically to the mean-field factorization, yet the reported comparisons are restricted to small architectures without systematic ablations on network width, depth, initialization scale, or optimizer hyperparameters that would isolate the mean-field assumption from capacity or optimization effects.

    Authors: We agree that systematic ablations would better isolate the contribution of the mean-field assumption. Our choice of small architectures was intended to make the in-between uncertainty failure visually and quantitatively clear without confounding effects from high capacity; however, we acknowledge this limits generalizability. In the revised manuscript we will add experiments varying network width (e.g., 50 to 500 hidden units) and depth (1 to 4 layers), as well as different initialization scales and optimizers, while keeping the mean-field vs. linearised Laplace comparison fixed. These results will be reported with error bars over multiple random seeds. revision: yes

  2. Referee: [Abstract] Abstract and introduction: the claim that MFVI 'fails to give calibrated uncertainty estimates' is stated without accompanying quantitative metrics, error bars, or explicit experimental protocol in the summary text, making it impossible to assess the magnitude or statistical reliability of the reported overconfidence on OOD data.

    Authors: We accept this criticism. The current abstract is deliberately concise, but it should convey the scale of the effect. In the revision we will expand the abstract to include concrete quantitative indicators (e.g., predictive variance on OOD inputs being orders of magnitude lower than on in-distribution data, and negative log-likelihood values) together with a brief statement of the experimental protocol (toy regression tasks with separated clusters, 5 random seeds). Corresponding numbers and error bars will also be added to the introduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observation of MFVI limitation

The paper reports an empirical finding that MFVI produces overconfident predictions between separated data regions, demonstrated via experiments contrasting it with the linearised Laplace approximation on small networks. No derivation chain, equations, or load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central claim rests on direct experimental observation rather than any theoretical reduction that would invoke the listed circularity patterns. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no free parameters, axioms, or invented entities; all content is observational.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimality of Sub-network Laplace Approximations: New Results and Methods

    stat.ML 2026-05 conditional novelty 7.0

    Sub-network Laplace approximations always underestimate full-model predictive variance, and two new gradient-based and greedy selection rules provide theoretically grounded improvements.

  2. Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks

    cs.LG 2025-02 unverdicted novelty 6.0

    Derives an optimal low-rank subspace for the Laplace approximation in BNNs, provides a scalable and better-performing variant, and introduces a new comparison metric.
