
arxiv: 1906.09686 · v1 · pith:FG7P25AMnew · submitted 2019-06-24 · 💻 cs.LG · stat.ML

Quality of Uncertainty Quantification for Bayesian Neural Network Inference

Pith reviewed 2026-05-25 17:51 UTC · model grok-4.3

keywords Bayesian neural networks · uncertainty quantification · inference methods · posterior approximation · predictive uncertainty · regression tasks · classification tasks · evaluation metrics

The pith

Common metrics like test log-likelihood can mislead when judging Bayesian neural network uncertainty quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs an empirical comparison of predictive uncertainty from ten standard inference methods applied to Bayesian neural networks on both regression and classification tasks. It finds that widely used scores such as test log-likelihood often fail to track actual quality of the uncertainty estimates. The work further shows that inference techniques built to capture posterior structure do not reliably deliver better approximations. A reader would care because these networks are chosen precisely for their uncertainty outputs in decision settings, so flawed evaluation can lead to selecting models that give unreliable uncertainty. The central object carrying the argument is the set of chosen proxies for uncertainty quality measured across the ten methods.

Core claim

Our experiments demonstrate that commonly used metrics (e.g. test log-likelihood) can be misleading. Our experiments also indicate that inference innovations designed to capture structure in the posterior do not necessarily produce high quality posterior approximations.

What carries the argument

Empirical side-by-side evaluation of predictive uncertainty quality for ten common BNN inference methods, using regression and classification tasks together with chosen proxies for uncertainty quality.

If this is right

  • Test log-likelihood alone cannot be trusted as a signal of good uncertainty quantification in BNNs.
  • Inference methods that target posterior structure may still yield low-quality uncertainty estimates on standard tasks.
  • Evaluation of BNN inference requires multiple complementary metrics beyond likelihood-based scores.
  • Model selection based on current common practice can favor methods that do not actually deliver reliable uncertainty.
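The first point can be illustrated with a minimal sketch. It is not the paper's experiment: the two Gaussian predictives, the sine target, the noise level, and the helper names `gaussian_log_lik` and `predictive_std` are all assumptions chosen for illustration. The two predictives agree wherever test points exist and differ only in a gap with no data, so the average test log-likelihood cannot tell them apart.

```python
import numpy as np

def gaussian_log_lik(y, mean, std):
    """Pointwise log density of y under N(mean, std^2)."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((y - mean) / std) ** 2

def predictive_std(x, gap_std):
    """Hypothetical predictive std: 0.1 near the data, `gap_std` in the gap |x| < 1."""
    in_gap = np.abs(x) < 1.0
    return np.where(in_gap, gap_std, 0.1)

rng = np.random.default_rng(0)
# Test inputs come only from the two data clusters, never from the gap.
x_test = np.concatenate([rng.uniform(-3, -1, 200), rng.uniform(1, 3, 200)])
y_test = np.sin(x_test) + rng.normal(0, 0.1, x_test.shape)
mean = np.sin(x_test)  # both predictives share the same mean (assumption)

# One predictive widens in the gap, the other stays narrow there.
ll_wide = gaussian_log_lik(y_test, mean, predictive_std(x_test, 1.0)).mean()
ll_narrow = gaussian_log_lik(y_test, mean, predictive_std(x_test, 0.1)).mean()

print("test log-likelihood (wide in gap):  ", ll_wide)
print("test log-likelihood (narrow in gap):", ll_narrow)
print("std at x = 0:", predictive_std(np.array([0.0]), 1.0)[0],
      "vs", predictive_std(np.array([0.0]), 0.1)[0])
```

Because every test input lies outside the gap, the two scores are identical even though the predictive standard deviation at x = 0 differs by a factor of ten. This is one simple way a likelihood-based score can miss differences in uncertainty where data is sparse.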

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether the same pattern holds when uncertainty quality is measured directly against downstream decision error rather than the paper's chosen proxies.
  • The findings suggest practitioners should run targeted calibration checks on any new inference method before deployment instead of relying on published likelihood rankings.
  • If the pattern generalizes, research effort may shift from inventing new posterior approximations toward inventing better diagnostic metrics for uncertainty quality.
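One form such a targeted calibration check could take for regression is the empirical coverage of central predictive intervals. The sketch below assumes Gaussian predictives and synthetic data; the function name `interval_coverage` and every numeric value are illustrative and not taken from the paper.

```python
import numpy as np
from statistics import NormalDist

def interval_coverage(y, mean, std, level=0.9):
    """Fraction of targets inside the central `level` Gaussian predictive interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    y, mean, std = np.asarray(y), np.asarray(mean), np.asarray(std)
    inside = np.abs(y - mean) <= z * std
    return float(inside.mean())

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=20000)  # synthetic targets with true std 1

well_calibrated = interval_coverage(y, mean=0.0, std=1.0)  # predictive std matches the data
overconfident = interval_coverage(y, mean=0.0, std=0.5)    # predictive std is too small

print("nominal 0.90, matched std:     ", well_calibrated)
print("nominal 0.90, underestimated std:", overconfident)
```

A predictive that underestimates its variance covers far fewer targets than the nominal level. This kind of check only detects miscalibration on inputs that resemble the evaluation set.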

Load-bearing premise

The chosen regression and classification tasks together with the chosen proxies for uncertainty quality are representative of the settings in which BNN uncertainty estimates are actually used.

What would settle it

A controlled experiment on a new task where an inference method designed to capture posterior structure produces clearly superior uncertainty estimates according to the paper's own proxies, while standard metrics like test log-likelihood still rank it poorly.

Figures

Figures reproduced from arXiv: 1906.09686 by Finale Doshi-Velez, Jiayu Yao, Soumya Ghosh, Weiwei Pan.

Figure 1: A comparison of the posterior predictives. Ground truth (HMC) reveals that our BNN model class perhaps has more flexibility than needed (as indicated by the widening in the predictive posterior where there are no data). BBB, MVG and BBH produce approximate posterior predictives that incorrectly have lower variance but all have test log-likelihoods that are comparable if not higher to that of the ground tru…
Figure 2: Ground truth (HMC) indicates that the a priori model uncertainty is overly high. MVG produces approximate posterior predictives that have lower uncertainty than the ground truth but have test log-likelihoods and ROCs that are identical to the ground truth. More comparisons in Appendix 6.
Figure 3: A comparison of the posterior predictives for Regression with Matched A Priori Uncertainty. Ground truth (HMC) indicates that the model is calibrated. All inference methods produce posterior predictives that model the data well over regions well represented in the training data. Of these methods, all, except for Ensemble and SGHMC, underestimate uncertainty over regions sparsely represented in the training…
Figure 4: A comparison of the posterior predictives for Regression with Mismatched A Priori Uncertainty. Ground truth (HMC) indicates that the prior overestimates the variations in the data. All methods, except for SGHMC, produce approximate posterior predictives that have lower variance but all have test log-likelihoods that are comparable if not higher to that of the ground truth. Panels: HMC, BBB, PBP, BB-α, MVG, MNF, BbH, Drop…
Figure 5: A comparison of the posterior predictives for Regression with Matched A Priori Uncertainty. Ground truth (HMC) indicates that the model is calibrated. With the exception of BB-Alpha, all inference methods produce posterior predictives that model the data well over regions well represented in the training data. Of these methods, all, except for Ensemble and SGHMC, underestimate uncertainty over regions spars…
Figure 6: A comparison of the posterior predictives for Classification with Model Mismatch. Posterior predictive mean over probabilities, posterior predictive standard deviation and posterior predictive mean over labels (from left to right). Ground truth (HMC) indicates that the model is a mismatch for the data. All methods, with the exception of SGHMC, underestimate predictive uncertainty. Panels: HMC, BBB, MVG, MNF, BbH, Dropou…
Figure 7: A comparison of the posterior predictives for Classification with No Model Mismatch. Posterior predictive mean over probabilities, posterior predictive standard deviation and posterior predictive mean over labels (from left to right). Ground truth (HMC) indicates that the model is a good match for the data. All methods, with the exception of SGHMC, underestimate predictive uncertainty. Although all methods…
Figure 8: The posterior predictives of models with best objective function values for Regression with Mismatched A Priori Uncertainty.
Figure 9: The posterior predictives of models with best objective function values for Regression with Matched A Priori Uncertainty.
Figure 10: The posterior predictives of models whose variational parameters are initialized from the empirical mean of HMC samples for Regression with Mismatched A Priori Uncertainty.
Figure 11: The posterior predictives of models whose variational parameters are initialized from the empirical mean of HMC samples with Matched A Priori Uncertainty.
Figure 13: The posterior predictives by approximating HMC posterior samples with a multivariate normal distribution.
… initial step size of 2 × 10⁻³. The acceptance rate α is checked every 100 iterations. The step size is increased by 1.1 times if α > 0.8 or decreased by 0.9 times if α < 0.2. We used 50K iterations, a burn-in of 40K, and a thinning interval of 20. Convergence is verified through trace plots and autocorrelation for…
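The sampler settings quoted in the Figure 13 excerpt can be sketched as a small step-size rule plus a burn-in and thinning schedule. This is only an illustrative reading: treating "decreased by 0.9 times" as multiplication by 0.9 is an assumption, the excerpt does not say how the acceptance rate is computed over each 100-iteration window, and the function names `adapt_step_size` and `retained_iterations` are hypothetical.

```python
def adapt_step_size(step_size, accept_rate, low=0.2, high=0.8, up=1.1, down=0.9):
    """Rescale the step size from the acceptance rate observed in the last window."""
    if accept_rate > high:   # accepting too often: take larger steps
        return step_size * up
    if accept_rate < low:    # accepting too rarely: take smaller steps
        return step_size * down
    return step_size         # acceptance rate within [low, high]: leave unchanged

def retained_iterations(n_iter=50_000, burn_in=40_000, thin=20):
    """Indices of samples kept after discarding burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))

step = 2e-3  # initial step size quoted in the excerpt
print(adapt_step_size(step, accept_rate=0.9))  # grows by a factor of 1.1
print(adapt_step_size(step, accept_rate=0.1))  # shrinks by a factor of 0.9
print(len(retained_iterations()))              # number of samples kept
```

Under these assumptions, 50K iterations with a 40K burn-in and a thinning interval of 20 leave 500 retained samples.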
Original abstract

Bayesian Neural Networks (BNNs) place priors over the parameters in a neural network. Inference in BNNs, however, is difficult; all inference methods for BNNs are approximate. In this work, we empirically compare the quality of predictive uncertainty estimates for 10 common inference methods on both regression and classification tasks. Our experiments demonstrate that commonly used metrics (e.g. test log-likelihood) can be misleading. Our experiments also indicate that inference innovations designed to capture structure in the posterior do not necessarily produce high quality posterior approximations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper empirically compares the quality of predictive uncertainty estimates from 10 common approximate inference methods for Bayesian neural networks across regression and classification tasks. It claims that standard metrics such as test log-likelihood can be misleading and that inference innovations intended to capture posterior structure do not necessarily yield higher-quality posterior approximations.

Significance. If the empirical comparisons hold after addressing experimental details, the work would be significant for highlighting limitations of conventional evaluation practices in BNN uncertainty quantification and for cautioning against assumptions that structural posterior approximations are inherently superior. The breadth of the comparison across 10 methods is a strength.

major comments (2)
  1. [Experimental evaluation] Experimental evaluation (throughout results): The manuscript reports empirical results comparing inference methods but supplies no information on the number of independent runs, hyperparameter tuning protocols, statistical significance testing, or precise definitions of the uncertainty-quality proxies employed. This directly affects verifiability of the central claims that test log-likelihood is misleading and that structure-capturing methods do not improve quality.
  2. [Results and Discussion] Task and metric selection (results section): The headline claims rest on the chosen regression/classification tasks and uncertainty-quality proxies (e.g., calibration, OOD, decision metrics) being representative. No justification or sensitivity analysis is provided for why these specific tasks and proxies generalize to regimes where BNNs are deployed, which is load-bearing for the generalizability of the findings on misleading metrics and ineffective structural approximations.
minor comments (1)
  1. [Abstract] The abstract could more explicitly name the specific uncertainty-quality proxies (beyond the parenthetical examples) used to reach the conclusions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and will revise the manuscript to improve experimental transparency and discussion of generalizability.

Point-by-point responses
  1. Referee: [Experimental evaluation] Experimental evaluation (throughout results): The manuscript reports empirical results comparing inference methods but supplies no information on the number of independent runs, hyperparameter tuning protocols, statistical significance testing, or precise definitions of the uncertainty-quality proxies employed. This directly affects verifiability of the central claims that test log-likelihood is misleading and that structure-capturing methods do not improve quality.

    Authors: We agree that explicit reporting of these details strengthens verifiability. The original manuscript described the experimental setup and provided some hyperparameter information in the appendix, but we will expand the revision to state: (i) all results are averaged over 10 independent random seeds with standard errors; (ii) hyperparameter tuning followed a fixed grid search protocol over learning rate, batch size, and prior variance (detailed in the updated appendix); (iii) statistical significance is assessed via paired Wilcoxon tests on the key metric differences; and (iv) a new subsection will give precise mathematical definitions for each uncertainty-quality proxy (e.g., ECE, OOD AUROC, decision cost). These additions directly address the concern while preserving the reported findings. revision: yes

  2. Referee: [Results and Discussion] Task and metric selection (results section): The headline claims rest on the chosen regression/classification tasks and uncertainty-quality proxies (e.g., calibration, OOD, decision metrics) being representative. No justification or sensitivity analysis is provided for why these specific tasks and proxies generalize to regimes where BNNs are deployed, which is load-bearing for the generalizability of the findings on misleading metrics and ineffective structural approximations.

    Authors: The tasks (UCI regression suites and standard image classification benchmarks) were deliberately selected because they are the most commonly used in the BNN literature, allowing direct comparison with prior studies. We will add an explicit justification paragraph citing this precedent and noting the diversity in input dimensionality and output type. An exhaustive sensitivity analysis over every conceivable deployment regime is outside the scope of a single paper; however, we will include a limitations paragraph acknowledging this and emphasizing that the two headline observations (misleading test log-likelihood and limited benefit from posterior-structure methods) were consistent across all evaluated tasks. If the referee suggests particular additional tasks, we can incorporate a subset in the revision. revision: partial
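The simulated rebuttal names ECE (expected calibration error) as one proxy that would receive a precise definition. As a hedged sketch, the commonly used equal-width-bin form is shown below. Whether the paper uses this form, this bin count, or ECE at all is not confirmed by the text above, and the function name and toy inputs are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b holds confidences in (edges[b], edges[b + 1]]; 0.0 falls in bin 0.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Toy inputs: confidence 0.8 with 80% accuracy versus confidence 0.9 with 50% accuracy.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(calibrated, overconfident)
```

ECE depends on the binning scheme and only looks at the confidence of the predicted label, so it is best read alongside other proxies rather than on its own.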

Circularity Check

0 steps flagged

Empirical comparison contains no circular derivation steps


The paper conducts an empirical evaluation of 10 BNN inference methods on regression and classification tasks, reporting that test log-likelihood can be misleading and that posterior-structure innovations do not guarantee high-quality approximations. No equations, fitted parameters, or self-citations are used to derive the reported quality scores; all results are direct measurements against external task performance and uncertainty proxies. The work is therefore self-contained against external benchmarks with no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the study rests on standard assumptions of supervised learning evaluation.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data

    cs.LG 2026-05 unverdicted novelty 7.0

    SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships ...

  2. LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

    cs.LG 2026-05 unverdicted novelty 5.0

    ACSE estimates LLM prompt uncertainty via adaptive clustering of semantic entropy across multiple responses and uses conformal prediction to bound error rates on accepted answers with distribution-free guarantees.

  3. Robust SGLD algorithm for solving non-convex distributionally robust optimisation problems

    math.OC 2024-03 unverdicted novelty 5.0

    Develops robust SGLD with non-asymptotic convergence bounds for non-convex DRO and applies it to neural network regression under adversarial corruption.
