Unstable Rankings in Bayesian Deep Learning Evaluation
Pith reviewed 2026-05-08 08:30 UTC · model grok-4.3
The pith
Standard point-estimate evaluations of Bayesian deep learning methods yield unreliable rankings when training data is limited.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across six Bayesian deep learning methods and five regression datasets, method rankings are dataset-dependent and fail to stabilize at small training sizes. The same comparison can give P(MCD ≺ Ensemble) = 1.000 at n=50 on one dataset yet remain below 0.95 at n=500 on another. No universal sample-size threshold exists; therefore dataset-specific posterior inference over metrics is required to determine when observed differences are reliable.
What carries the argument
A Bayesian hierarchical model with method-specific variances that treats evaluation metrics as random variables across data realizations, plus a predictive Minimum Detectable Difference (MDD) curve that shows whether a gap of a given size would be detectable at a given training size.
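To make the quantity P(MCD ≺ Ensemble) concrete, here is a minimal sketch of how a posterior superiority probability can be estimated from metric values collected over several data realizations. It is not the paper's hierarchical model. It assumes that each method's metric values are i.i.d. normal with a method-specific mean and variance, that the standard noninformative prior p(mu, sigma^2) ∝ 1/sigma^2 applies, and that "≺" means "has the lower (better) metric". The function names and the NLL values are hypothetical.

```python
import math
import random
import statistics


def posterior_mean_draws(values, n_draws, rng):
    """Draw from the marginal posterior of a method's mean metric.

    Under the normal model with the noninformative prior, the posterior of
    the mean is a Student-t with k - 1 degrees of freedom, centred at the
    sample mean and scaled by s / sqrt(k).
    """
    k = len(values)
    center = statistics.mean(values)
    scale = statistics.stdev(values) / math.sqrt(k)  # method-specific spread
    df = k - 1
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
        draws.append(center + scale * z / math.sqrt(chi2 / df))
    return draws


def prob_a_better(metric_a, metric_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(mean_a < mean_b | data); lower is better."""
    rng = random.Random(seed)
    draws_a = posterior_mean_draws(metric_a, n_draws, rng)
    draws_b = posterior_mean_draws(metric_b, n_draws, rng)
    return sum(a < b for a, b in zip(draws_a, draws_b)) / n_draws


# Hypothetical NLL values over five data realizations at one training size.
mcd_nll = [1.10, 1.05, 1.20, 1.15, 1.08]
ensemble_nll = [1.30, 1.70, 0.90, 1.45, 1.60]
print(f"P(MCD < Ensemble) ~ {prob_a_better(mcd_nll, ensemble_nll):.3f}")
```

Because the two methods get separate variance estimates, a method with noisy metrics widens the posterior of the difference even when its mean looks better, which is the effect a single point estimate hides.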
If this is right
- Evidence for superiority of one method over another must be checked against the probability that the ranking would reverse on new data draws of the same size.
- A method that appears best on one low-data problem may not be distinguishable from alternatives on a different problem at the same size.
- Evaluation reports should include the minimum training size at which a given performance gap becomes detectable for that dataset.
- Declaring one Bayesian method superior on the basis of point metrics alone is unreliable in low-data regimes.
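The third implication above, reporting the training size at which a gap becomes detectable, can be sketched with a classical two-sample minimum detectable difference. This is a frequentist stand-in, not the paper's predictive MDD curve. The normal approximation, the significance level, the power target, the function names, and all standard deviations below are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def minimum_detectable_difference(sd_a, sd_b, runs, alpha=0.05, power=0.80):
    """Classical two-sided MDD for a difference in means.

    sd_a and sd_b are the method-specific standard deviations of the metric
    across data realizations; runs is the number of realizations per method.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    standard_error = math.sqrt(sd_a**2 / runs + sd_b**2 / runs)
    return (z_alpha + z_power) * standard_error


def smallest_detectable_size(sd_by_size, gap, runs):
    """Return the smallest training size whose MDD is at most `gap`.

    sd_by_size maps a training size n to a (sd_a, sd_b) pair. Returns None
    when the gap is not detectable at any listed size.
    """
    for n in sorted(sd_by_size):
        sd_a, sd_b = sd_by_size[n]
        if minimum_detectable_difference(sd_a, sd_b, runs) <= gap:
            return n
    return None


# Hypothetical per-size standard deviations for two methods on one dataset.
sd_by_size = {50: (0.40, 0.90), 100: (0.25, 0.50), 500: (0.08, 0.12)}
print(smallest_detectable_size(sd_by_size, gap=0.20, runs=20))
```

The answer depends on the dataset only through the spread of the metric, so two datasets with different variability will return different sizes for the same gap.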
Where Pith is reading between the lines
- The observed instability may account for conflicting results across papers that compare the same set of Bayesian deep learning techniques.
- Practitioners could apply the same hierarchical model to decide in advance how many training runs are needed before trusting a ranking.
- The framework suggests similar uncertainty quantification would be useful when comparing non-Bayesian methods under data constraints.
- Extending the analysis to classification datasets could reveal whether the same lack of universal thresholds holds outside regression.
Load-bearing premise
The five chosen regression datasets and six Bayesian deep learning methods are sufficiently representative that the lack of a universal sample-size threshold generalizes beyond these cases.
What would settle it
Observing that method superiority probabilities exceed 0.95 for all pairwise comparisons at the same training size n across all five datasets would falsify the claim that no universal threshold exists.
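The falsification test described above is mechanical once the posterior probabilities are tabulated. The sketch below assumes a nested dictionary of the form dataset -> training size -> method pair -> probability, and it counts a pair as resolved when the probability is decisive in either direction. Both the layout and that two-sided reading are assumptions, and the dataset names and the 0.91 value are placeholders rather than reported results.

```python
def universal_threshold(probabilities, level=0.95):
    """Return the smallest training size at which every pairwise comparison
    on every dataset is decisive, or None if no such size exists.

    probabilities[dataset][n][pair] holds P(first method better than second).
    """
    if not probabilities:
        return None
    # Only sizes evaluated on every dataset can serve as a shared threshold.
    shared_sizes = set.intersection(*(set(by_n) for by_n in probabilities.values()))
    for n in sorted(shared_sizes):
        decisive = all(
            max(p, 1.0 - p) > level  # assumption: either direction counts
            for by_n in probabilities.values()
            for p in by_n[n].values()
        )
        if decisive:
            return n
    return None


pair = ("MCD", "Ensemble")
# Placeholder table that mirrors the pattern in the core claim.
table = {
    "dataset_a": {50: {pair: 1.000}, 500: {pair: 1.000}},
    "dataset_b": {50: {pair: 0.62}, 500: {pair: 0.91}},
}
print(universal_threshold(table))
```

A return value of None is consistent with the paper's claim for the tabulated datasets. Any integer result would be a shared threshold and would count against it.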
Original abstract
Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that standard evaluations of Bayesian deep learning (BDL) methods assume reliable metric estimates, and that this assumption fails under data scarcity: method rankings are unreliable at small n and dataset-dependent in ways point estimates cannot reveal. Using a Bayesian hierarchical model with method-specific variances that treats evaluation metrics as random variables, the authors report concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 on one dataset, while the same probability remains below 0.95 even at n=500 on another. They introduce predictive Minimum Detectable Difference (MDD) curves to assess whether observed gaps would be detectable at a given training size. Across six BDL methods and five regression datasets, they conclude that no universal sample-size threshold exists, making dataset-specific posterior inference necessary for reliable superiority claims.
Significance. If the results hold, the work is significant because it demonstrates that uncertainty-aware evaluation is required in low-data BDL settings where current evidence for method superiority and predictive detectability can diverge. The Bayesian hierarchical model and MDD curves provide practitioners with principled, reproducible tools to assess evaluation sufficiency before drawing conclusions, moving beyond point estimates. The explicit treatment of metrics as random variables across data realizations is a clear strength.
major comments (2)
- [Abstract] The central prescriptive claim that 'no universal sample size threshold exists' and therefore 'dataset-specific posterior inference is always required' is supported only by results on five regression datasets. The Bayesian hierarchical model with method-specific variances correctly yields dataset-dependent P(MCD ≺ Ensemble) curves, but the absence of a meta-level prior over datasets means the model cannot support inference about whether a common threshold is absent across the broader space of regression or classification tasks; the conclusion is therefore an untested extrapolation.
- [Abstract] Concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 are reported as evidence, yet the abstract provides no details on data splits, model fitting, or how the hierarchical variances were estimated; this makes it impossible to verify whether post-hoc dataset choices or variance modeling decisions affect the reported divergence between superiority probabilities and detectability thresholds.
minor comments (2)
- The abstract is technically dense; adding a short parenthetical definition or forward reference for the Minimum Detectable Difference curve when it is first mentioned would improve readability for readers outside the immediate subfield.
- Ensure that the five regression datasets and six BDL methods are enumerated with brief descriptions or citations in the main text so that the scope of the empirical study is immediately clear.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback on the abstract. We address each major comment below, with revisions to qualify our claims and improve clarity.
- Referee: [Abstract] The central prescriptive claim that 'no universal sample size threshold exists' and therefore 'dataset-specific posterior inference is always required' is supported only by results on five regression datasets. The Bayesian hierarchical model with method-specific variances correctly yields dataset-dependent P(MCD ≺ Ensemble) curves, but the absence of a meta-level prior over datasets means the model cannot support inference about whether a common threshold is absent across the broader space of regression or classification tasks; the conclusion is therefore an untested extrapolation.
  Authors: We agree that the empirical results are confined to the five regression datasets studied and that the hierarchical model does not incorporate a meta-prior over datasets, so it cannot formally demonstrate the non-existence of a universal threshold across all possible tasks or domains. The recommendation for dataset-specific posterior inference is presented as a practical consequence of the observed dataset-dependence in our experiments rather than a universal proof. We will revise the abstract to state that 'across the datasets considered, no universal sample size threshold exists' and qualify the prescriptive claim accordingly. This constitutes a partial revision focused on wording.
  Revision: partial
- Referee: [Abstract] Concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 are reported as evidence, yet the abstract provides no details on data splits, model fitting, or how the hierarchical variances were estimated; this makes it impossible to verify whether post-hoc dataset choices or variance modeling decisions affect the reported divergence between superiority probabilities and detectability thresholds.
  Authors: The abstract is a high-level summary; complete details on data splits (5-fold cross-validation across multiple realizations), model fitting (MCMC for the hierarchical model), and estimation of method-specific variances are provided in Sections 3 and 4 of the manuscript. To address the concern about self-containment, we will add a concise clause in the abstract referencing the use of hierarchical modeling over multiple data realizations and predictive MDD curves. Full verification remains possible from the main text.
  Revision: yes
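The data-split protocol mentioned in the second response (5-fold cross-validation across multiple realizations) admits more than one reading. One plausible reading, sketched here as an assumption and not as the authors' procedure, draws several random training subsets of size n from a larger pool and splits each subset into five folds. The function name, pool size, and seed are hypothetical.

```python
import random


def realization_folds(n_pool, n_train, n_realizations, n_folds=5, seed=0):
    """Build index folds for repeated small-sample cross-validation.

    Each realization is a random subset of n_train indices from a pool of
    n_pool examples, split into n_folds disjoint folds.
    """
    rng = random.Random(seed)
    realizations = []
    for _ in range(n_realizations):
        subset = rng.sample(range(n_pool), n_train)  # already in random order
        folds = [subset[i::n_folds] for i in range(n_folds)]
        realizations.append(folds)
    return realizations


# Three realizations of size n = 50 drawn from a pool of 1000 examples.
for r, folds in enumerate(realization_folds(1000, 50, 3)):
    for held_out in range(len(folds)):
        train_idx = [i for j, f in enumerate(folds) if j != held_out for i in f]
        test_idx = folds[held_out]
        # A model would be fitted on train_idx and scored on test_idx here.
    print(r, [len(f) for f in folds])
```

The metric values gathered across realizations are what a hierarchical model would then treat as random draws, rather than as a single fixed score.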
Circularity Check
No circularity: empirical comparisons and hierarchical model are self-contained
The paper reports cross-dataset empirical results on six BDL methods and five regression tasks, then applies a Bayesian hierarchical model (with method-specific variances) to obtain posterior probabilities and predictive MDD curves. No equation, parameter fit, or self-citation reduces a reported probability, detectability threshold, or ranking instability claim to a quantity defined by the same data or prior work by construction. The central claim that no universal sample-size threshold exists is an inductive statement over the observed datasets rather than a definitional or fitted tautology. The conclusions are therefore not built into the inputs by construction.