Quality of Uncertainty Quantification for Bayesian Neural Network Inference
Pith reviewed 2026-05-25 17:51 UTC · model grok-4.3
The pith
Common metrics like test log-likelihood can mislead when judging Bayesian neural network uncertainty quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experiments demonstrate that commonly used metrics (e.g. test log-likelihood) can be misleading. Our experiments also indicate that inference innovations designed to capture structure in the posterior do not necessarily produce high quality posterior approximations.
What carries the argument
Empirical side-by-side evaluation of predictive uncertainty quality for ten common BNN inference methods, using regression and classification tasks together with chosen proxies for uncertainty quality.
If this is right
- Test log-likelihood alone cannot be trusted as a signal of good uncertainty quantification in BNNs.
- Inference methods that target posterior structure may still yield low-quality uncertainty estimates on standard tasks.
- Evaluation of BNN inference requires multiple complementary metrics beyond likelihood-based scores.
- Model selection based on current common practice can favor methods that do not actually deliver reliable uncertainty.
Where Pith is reading between the lines
- Future work could test whether the same pattern holds when uncertainty quality is measured directly against downstream decision error rather than the paper's chosen proxies.
- The findings suggest practitioners should run targeted calibration checks on any new inference method before deployment instead of relying on published likelihood rankings.
- If the pattern generalizes, research effort may shift from inventing new posterior approximations toward inventing better diagnostic metrics for uncertainty quality.
Load-bearing premise
The chosen regression and classification tasks together with the chosen proxies for uncertainty quality are representative of the settings in which BNN uncertainty estimates are actually used.
What would settle it
A controlled experiment on a new task where an inference method designed to capture posterior structure produces clearly superior uncertainty estimates according to the paper's own proxies, while standard metrics like test log-likelihood still rank it poorly.
Figures
read the original abstract
Bayesian Neural Networks (BNNs) place priors over the parameters in a neural network. Inference in BNNs, however, is difficult; all inference methods for BNNs are approximate. In this work, we empirically compare the quality of predictive uncertainty estimates for 10 common inference methods on both regression and classification tasks. Our experiments demonstrate that commonly used metrics (e.g. test log-likelihood) can be misleading. Our experiments also indicate that inference innovations designed to capture structure in the posterior do not necessarily produce high quality posterior approximations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically compares the quality of predictive uncertainty estimates from 10 common approximate inference methods for Bayesian neural networks across regression and classification tasks. It claims that standard metrics such as test log-likelihood can be misleading and that inference innovations intended to capture posterior structure do not necessarily yield higher-quality posterior approximations.
Significance. If the empirical comparisons hold after addressing experimental details, the work would be significant for highlighting limitations of conventional evaluation practices in BNN uncertainty quantification and for cautioning against assumptions that structural posterior approximations are inherently superior. The breadth of the comparison across 10 methods is a strength.
major comments (2)
- [Experimental evaluation] Experimental evaluation (throughout results): The manuscript reports empirical results comparing inference methods but supplies no information on the number of independent runs, hyperparameter tuning protocols, statistical significance testing, or precise definitions of the uncertainty-quality proxies employed. This directly affects verifiability of the central claims that test log-likelihood is misleading and that structure-capturing methods do not improve quality.
- [Results and Discussion] Task and metric selection (results section): The headline claims rest on the chosen regression/classification tasks and uncertainty-quality proxies (e.g., calibration, OOD, decision metrics) being representative. No justification or sensitivity analysis is provided for why these specific tasks and proxies generalize to regimes where BNNs are deployed, which is load-bearing for the generalizability of the findings on misleading metrics and ineffective structural approximations.
minor comments (1)
- [Abstract] The abstract could more explicitly name the specific uncertainty-quality proxies (beyond the parenthetical examples) used to reach the conclusions.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment below and will revise the manuscript to improve experimental transparency and discussion of generalizability.
read point-by-point responses
-
Referee: [Experimental evaluation] Experimental evaluation (throughout results): The manuscript reports empirical results comparing inference methods but supplies no information on the number of independent runs, hyperparameter tuning protocols, statistical significance testing, or precise definitions of the uncertainty-quality proxies employed. This directly affects verifiability of the central claims that test log-likelihood is misleading and that structure-capturing methods do not improve quality.
Authors: We agree that explicit reporting of these details strengthens verifiability. The original manuscript described the experimental setup and provided some hyperparameter information in the appendix, but we will expand the revision to state: (i) all results are averaged over 10 independent random seeds with standard errors; (ii) hyperparameter tuning followed a fixed grid search protocol over learning rate, batch size, and prior variance (detailed in the updated appendix); (iii) statistical significance is assessed via paired Wilcoxon tests on the key metric differences; and (iv) a new subsection will give precise mathematical definitions for each uncertainty-quality proxy (e.g., ECE, OOD AUROC, decision cost). These additions directly address the concern while preserving the reported findings. revision: yes
-
Referee: [Results and Discussion] Task and metric selection (results section): The headline claims rest on the chosen regression/classification tasks and uncertainty-quality proxies (e.g., calibration, OOD, decision metrics) being representative. No justification or sensitivity analysis is provided for why these specific tasks and proxies generalize to regimes where BNNs are deployed, which is load-bearing for the generalizability of the findings on misleading metrics and ineffective structural approximations.
Authors: The tasks (UCI regression suites and standard image classification benchmarks) were deliberately selected because they are the most commonly used in the BNN literature, allowing direct comparison with prior studies. We will add an explicit justification paragraph citing this precedent and noting the diversity in input dimensionality and output type. A exhaustive sensitivity analysis over every conceivable deployment regime is outside the scope of a single paper; however, we will include a limitations paragraph acknowledging this and emphasizing that the two headline observations (misleading test log-likelihood and limited benefit from posterior-structure methods) were consistent across all evaluated tasks. If the referee suggests particular additional tasks, we can incorporate a subset in the revision. revision: partial
Circularity Check
Empirical comparison contains no circular derivation steps
full rationale
The paper conducts an empirical evaluation of 10 BNN inference methods on regression and classification tasks, reporting that test log-likelihood can be misleading and that posterior-structure innovations do not guarantee high-quality approximations. No equations, fitted parameters, or self-citations are used to derive the reported quality scores; all results are direct measurements against external task performance and uncertainty proxies. The work is therefore self-contained against external benchmarks with no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 3 Pith papers
-
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships ...
-
LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy
ACSE estimates LLM prompt uncertainty via adaptive clustering of semantic entropy across multiple responses and uses conformal prediction to bound error rates on accepted answers with distribution-free guarantees.
-
Robust SGLD algorithm for solving non-convex distributionally robust optimisation problems
Develops robust SGLD with non-asymptotic convergence bounds for non-convex DRO and applies it to neural network regression under adversarial corruption.
Reference graph
Works this paper leans on
-
[1]
Weight Uncertainty in Neural Networks
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[2]
Stochastic gradient hamiltonian monte carlo
Chen, T., Fox, E., and Guestrin, C. Stochastic gradient hamiltonian monte carlo. In International Conference on Machine Learning, pp.\ 1683--1691, 2014
work page 2014
-
[3]
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp.\ 1050--1059, 2016
work page 2016
-
[4]
Nonparametric variational inference
Gershman, S., Hoffman, M., and Blei, D. Nonparametric variational inference. arXiv preprint arXiv:1206.4665, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[5]
Practical variational inference for neural networks
Graves, A. Practical variational inference for neural networks. In Advances in neural information processing systems, pp.\ 2348--2356, 2011
work page 2011
-
[6]
Hern \'a ndez-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pp.\ 1861--1869, 2015
work page 2015
-
[7]
M., Li, Y., Rowland, M., Hern \'a ndez-Lobato, D., Bui, T., and Turner, R
Hern \'a ndez-Lobato, J. M., Li, Y., Rowland, M., Hern \'a ndez-Lobato, D., Bui, T., and Turner, R. E. Black-box -divergence minimization. 2016
work page 2016
-
[8]
Simple and scalable predictive uncertainty estimation using deep ensembles
Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp.\ 6402--6413, 2017
work page 2017
-
[9]
LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521 0 (7553): 0 436, 2015
work page 2015
-
[10]
Louizos, C. and Welling, M. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pp.\ 1708--1716, 2016
work page 2016
-
[11]
Multiplicative Normalizing Flows for Variational Bayesian Neural Networks
Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[12]
MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4 0 (3): 0 448--472, 1992
work page 1992
-
[13]
Miller, A. C., Foti, N. J., and Adams, R. P. Variational boosting: Iteratively refining posterior approximations. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.\ 2420--2429. JMLR. org, 2017
work page 2017
-
[14]
Myshkov, P. and Julier, S. Posterior distribution analysis for bayesian inference in neural networks. 2016
work page 2016
-
[15]
Neal, R. M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012
work page 2012
-
[16]
Pawlowski , N., Brock , A., Lee , M. C. H., Rajchl , M., and Glocker , B. Implicit Weight Uncertainty in Neural Networks . ArXiv e-prints, November 2017
work page 2017
-
[17]
Uncertainty in neural networks: Bayesian ensembling
Pearce, T., Zaki, M., Brintrup, A., and Neel, A. Uncertainty in neural networks: Bayesian ensembling. arXiv preprint arXiv:1810.05546, 2018
-
[18]
Hierarchical variational models
Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International Conference on Machine Learning, pp.\ 324--333, 2016
work page 2016
-
[19]
Functional Variational Bayesian Neural Networks
Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1903
-
[20]
Tagasovska, N. and Lopez-Paz, D. Frequentist uncertainty estimates for deep learning. arXiv preprint arXiv:1811.00908, 2018
-
[21]
Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.\ 681--688, 2011
work page 2011
- [22]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.