arxiv: 2604.23102 · v1 · submitted 2026-04-25 · 💻 cs.LG

Unstable Rankings in Bayesian Deep Learning Evaluation

Qishi Zhan, Minxuan Hu, Guansu Wang, Jiaxin Liu, Liang He

Pith reviewed 2026-05-08 08:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords Bayesian deep learning · evaluation metrics · ranking instability · hierarchical Bayesian model · minimum detectable difference · low-data regimes · uncertainty-aware evaluation · regression datasets

The pith

Standard point-estimate evaluations of Bayesian deep learning methods yield unreliable rankings when training data is limited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluations of Bayesian deep learning methods typically summarize performance with a single number, yet under limited data these numbers fluctuate enough that one method can rank above another on one dataset while the reverse holds on another. The paper shows that the probability of one method outperforming another can reach 1.000 at n=50 on one dataset but stay below 0.95 even at n=500 on another. Because no fixed training size guarantees stable rankings across the datasets studied, the authors argue that evaluations must incorporate uncertainty over possible data draws. They build a hierarchical model that lets the ranking probability itself be estimated from repeated training runs, together with a curve that predicts how large a gap must be before it becomes detectable at a given sample size.

Core claim

Across six Bayesian deep learning methods and five regression datasets, method rankings are dataset-dependent and fail to stabilize at small training sizes. The same comparison can give P(MCD ≺ Ensemble) = 1.000 at n=50 on one dataset yet remain below 0.95 at n=500 on another. No universal sample-size threshold exists; therefore dataset-specific posterior inference over metrics is required to determine when observed differences are reliable.
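
To make the quantity P(MCD ≺ Ensemble) concrete, here is a minimal sketch. It is not the paper's hierarchical model: it assumes that lower scores are better (as with CRPS), that per-realization scores for each method are available as plain lists, and that a plug-in normal approximation for the difference of mean scores is acceptable. The function name `prob_a_beats_b` and all score values are hypothetical.

```python
from statistics import NormalDist, fmean, stdev

def prob_a_beats_b(scores_a, scores_b):
    """Approximate P(mean score of A < mean score of B); lower is better.

    Assumption: the two means are independent and approximately normal,
    each with its own (method-specific) standard error.
    """
    se_a = stdev(scores_a) / len(scores_a) ** 0.5
    se_b = stdev(scores_b) / len(scores_b) ** 0.5
    gap = fmean(scores_b) - fmean(scores_a)  # positive when A looks better
    return NormalDist().cdf(gap / (se_a**2 + se_b**2) ** 0.5)

# Hypothetical CRPS values over five data realizations (not from the paper).
clear_mcd = [0.41, 0.39, 0.43, 0.40, 0.42]
clear_ens = [0.52, 0.55, 0.50, 0.53, 0.51]
noisy_mcd = [0.41, 0.55, 0.38, 0.60, 0.45]
noisy_ens = [0.50, 0.44, 0.58, 0.40, 0.52]

p_clear = prob_a_beats_b(clear_mcd, clear_ens)
p_noisy = prob_a_beats_b(noisy_mcd, noisy_ens)
print(f"clear gap: {p_clear:.3f}, noisy gap: {p_noisy:.3f}")
```

The first pair yields a probability above 0.95 and the second stays in the inconclusive band, which mirrors the kind of contrast between datasets that the core claim describes.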

What carries the argument

Bayesian hierarchical model with method-specific variances that treats evaluation metrics as random variables across data realizations, plus a predictive Minimum Detectable Difference curve for assessing detectability at given training sizes.
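
One way such a model could be written down is sketched below. This is an illustrative assumption, not the paper's specification: the symbols $y_{m,r}$ (the metric of method $m$ on data realization $r$, for a fixed dataset and training size), $\mu_m$, $\sigma_m$, $\mu_0$ and $\tau$ are introduced here for illustration, and the priors on $\mu_0$, $\tau$ and $\sigma_m$ are left open because the text above does not state them.

```latex
\begin{aligned}
y_{m,r} \mid \mu_m, \sigma_m &\sim \mathcal{N}(\mu_m, \sigma_m^2), \qquad r = 1, \dots, R, \\
\mu_m \mid \mu_0, \tau &\sim \mathcal{N}(\mu_0, \tau^2), \\
P(A \prec B \mid y) &= \Pr(\mu_A < \mu_B \mid y).
\end{aligned}
```

The method-specific $\sigma_m$ lets a noisy method widen its own posterior without inflating the uncertainty of a stable one, and the last line reads the ranking probability directly off the joint posterior of the method means, assuming lower metric values are better.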

If this is right

Evidence for superiority of one method over another must be checked against the probability that the ranking would reverse on new data draws of the same size.
A method that appears best on one low-data problem may not be distinguishable from alternatives on a different problem at the same size.
Evaluation reports should include the minimum training size at which a given performance gap becomes detectable for that dataset.
Current practice of declaring one Bayesian method superior based on point metrics alone is invalid in low-data regimes.
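
The idea of a detectability threshold can be illustrated with a classical power-analysis analogue. This is a sketch under stated assumptions and not the paper's predictive Minimum Detectable Difference: it assumes independent, approximately normal mean scores, a one-sided test, and known per-method standard deviations. The function name, the `alpha` and `power` defaults, and the standard deviations are hypothetical.

```python
from statistics import NormalDist

def minimum_detectable_difference(sd_a, sd_b, n_runs, alpha=0.05, power=0.80):
    """Smallest true gap in mean score detectable at the given level and power.

    Assumption: n_runs independent realizations per method, normal means.
    """
    z = NormalDist()
    se_gap = ((sd_a**2 + sd_b**2) / n_runs) ** 0.5
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * se_gap

# Hypothetical per-realization standard deviations of a metric at each
# training size n (values invented for illustration).
sd_by_n = {50: (0.20, 0.12), 200: (0.10, 0.07), 500: (0.05, 0.04)}
mdd_curve = {
    n: minimum_detectable_difference(sd_a, sd_b, n_runs=50)
    for n, (sd_a, sd_b) in sd_by_n.items()
}
for n, mdd in mdd_curve.items():
    print(f"n={n}: gaps smaller than {mdd:.3f} would likely go undetected")
```

Comparing an observed gap against such a curve is one way to report, per dataset, the training size at which that gap becomes detectable.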

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed instability may account for conflicting results across papers that compare the same set of Bayesian deep learning techniques.
Practitioners could apply the same hierarchical model to decide in advance how many training runs are needed before trusting a ranking.
The framework suggests similar uncertainty quantification would be useful when comparing non-Bayesian methods under data constraints.
Extending the analysis to classification datasets could reveal whether the same lack of universal thresholds holds outside regression.

Load-bearing premise

The five chosen regression datasets and six Bayesian deep learning methods are sufficiently representative that the lack of a universal sample-size threshold generalizes beyond these cases.

What would settle it

Observing that method superiority probabilities exceed 0.95 for all pairwise comparisons at the same training size n across all five datasets would falsify the claim that no universal threshold exists.
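
The falsification test can be phrased as a small check over a table of pairwise probabilities. The sketch below makes one interpretive assumption: a comparison counts as decisive when its probability is above 0.95 or below 0.05, since P(row ≺ column) and P(column ≺ row) are complementary. The function names, dataset labels, and numbers are hypothetical.

```python
def is_decisive(p, level=0.95):
    """True when the comparison is settled in either direction."""
    return p > level or p < 1 - level

def universal_threshold(probs, level=0.95):
    """Smallest n at which every comparison on every dataset is decisive.

    probs maps dataset -> {training size n -> list of pairwise probabilities}.
    Returns None when no shared training size qualifies.
    """
    if not probs:
        return None
    shared_sizes = set.intersection(*(set(by_n) for by_n in probs.values()))
    for n in sorted(shared_sizes):
        if all(is_decisive(p, level) for by_n in probs.values() for p in by_n[n]):
            return n
    return None

# Hypothetical probabilities for two datasets (not the paper's results).
example = {
    "dataset_a": {50: [1.000, 0.02], 200: [1.000, 0.01], 500: [1.000, 0.00]},
    "dataset_b": {50: [0.61, 0.40], 200: [0.80, 0.30], 500: [0.93, 0.04]},
}
print(universal_threshold(example))
```

A return value of `None` on the full set of datasets is consistent with the paper's claim; any integer would be a candidate universal threshold for the sizes examined.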

Figures

Figures reproduced from arXiv: 2604.23102 by Guansu Wang, Jiaxin Liu, Liang He, Minxuan Hu, Qishi Zhan.

**Figure 1.** Standard deviation of CRPS and NLL across …

**Figure 2.** Posterior probability P(MCD ≺ Ensemble) on CRPS across training sizes for three representative datasets. Synthetic and Energy exhibit ranking reversals, while Concrete remains weaker and nearly inconclusive at n = 200. Dashed lines at 0.05 and 0.95 indicate strong posterior support, and the dotted line at 0.50 indicates an inconclusive comparison.

**Figure 3.** Pairwise posterior probability P(row ≺ column) on CRPS for Kin8nm (left) and Concrete (right) at n = 50. Values > 0.95 (dark blue) indicate that the row method reliably outperforms the column method; values < 0.05 (dark red) indicate the opposite; values near 0.5 (white) indicate inconclusive comparisons. The stark contrast between datasets illustrates that method rankings are dataset-dependent and cannot …

**Figure 4.** Predictive Minimum Detectable Difference for MCD vs. Deep Ensem…

**Figure 5.** Posterior predictive checks for CRPS (left) and NLL (right) at …

**Figure 6.** Pairwise posterior probability P(row ≺ column) on CRPS for Kin8nm (left) and Concrete (right) at n = 200. Values > 0.95 (dark blue) indicate that the row method reliably outperforms the column method; values < 0.05 (dark red) indicate the opposite; values near 0.5 (white) indicate inconclusive comparisons. The stark contrast between datasets illustrates that method rankings are dataset-dependent and cannot …

**Figure 7.** Variance decomposition of a neural network evaluation metric under …

**Figure 8.** Distribution of Kendall τ between CRPS and Interval Score method rankings across R = 50 data realizations. Higher τ indicates greater agreement between the two metrics. The dashed line at τ = 1 represents perfect agreement. Consistency increases with n but never reaches 1.0.
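
For readers unfamiliar with the agreement measure in Figure 8, the following is a minimal sketch of Kendall τ between two rankings of the same methods. It assumes rankings without ties (the tau-a variant) and uses invented ranks; the paper's exact computation is not shown in the text above.

```python
from itertools import combinations

def kendall_tau(rank_x, rank_y):
    """Kendall tau-a for two tie-free rankings of the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_x)), 2):
        sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if sign > 0:
            concordant += 1   # both metrics order this pair the same way
        elif sign < 0:
            discordant += 1   # the two metrics disagree on this pair
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical ranks of six methods under two metrics (1 = best).
ranks_crps = [1, 2, 3, 4, 5, 6]
ranks_interval = [1, 3, 2, 4, 5, 6]  # one adjacent pair swapped
tau = kendall_tau(ranks_crps, ranks_interval)
print(f"tau = {tau:.3f}")
```

With six methods there are 15 pairs, so a single swapped pair already pulls τ below 1, which is why a distribution of τ that never reaches 1.0 signals persistent disagreement between metrics.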

Original abstract

Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows concrete cases where the same BDL method comparison yields decisive superiority on one dataset but stays inconclusive on another even at larger n, using a hierarchical model and predictive detectability curves to quantify it.

The main point is that standard point-estimate comparisons of Bayesian deep learning methods can produce rankings that flip or stay uncertain depending on the dataset and training size. The authors treat evaluation metrics as random variables in a hierarchical model with method-specific variances, then add predictive minimum detectable difference curves to check whether an observed gap would actually be detectable at a given n. This produces clear examples: P(MCD better than Ensemble) hits 1.000 at n=50 on one dataset but stays below 0.95 even at n=500 on another. That divergence between apparent superiority and statistical detectability is the useful observation.

The approach is straightforward and directly addresses a practical problem in low-data BDL benchmarking. The experiments cover six methods across five regression datasets, and the reported probabilities make the instability tangible rather than abstract. The math for the hierarchical model and the predictive curves looks standard and reproducible from the description.

The soft spot is the reach of the conclusions. Five datasets demonstrate that instability occurs and that no single threshold worked across them, but that does not establish the absence of a universal threshold in the broader space of tasks. The model does not include a meta-level prior over datasets, so the prescriptive claim that dataset-specific posterior inference is always required is an extrapolation from these cases. A reader would still want the exact priors, data splits, and fitting details to confirm nothing in the variance modeling is driving the spread.

This is for people who run or review empirical comparisons of Bayesian methods, especially in data-scarce settings. It gives them a concrete way to assess whether their evaluation data is sufficient. The work deserves peer review because the core methodological warning is grounded and the tools are usable, even if the generalization needs more datasets to strengthen.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that standard evaluations of Bayesian deep learning methods assume reliable metric estimates, and that this assumption fails under data scarcity: method rankings are unreliable at small n and dataset-dependent in ways point estimates cannot reveal. Using a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables, the authors report concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 on one dataset versus a value that remains below 0.95 even at n=500 on another. They introduce predictive Minimum Detectable Difference curves to assess whether observed gaps would be detectable at a given training size. Across six BDL methods and five regression datasets, they conclude that no universal sample-size threshold exists, making dataset-specific posterior inference necessary for reliable superiority claims.

Significance. If the results hold, the work is significant because it demonstrates that uncertainty-aware evaluation is required in low-data BDL settings where current evidence for method superiority and predictive detectability can diverge. The Bayesian hierarchical model and MDD curves provide practitioners with principled, reproducible tools to assess evaluation sufficiency before drawing conclusions, moving beyond point estimates. The explicit treatment of metrics as random variables across data realizations is a clear strength.

major comments (2)

[Abstract] The central prescriptive claim that 'no universal sample size threshold exists' and therefore 'dataset-specific posterior inference is always required' is supported only by results on five regression datasets. The Bayesian hierarchical model with method-specific variances correctly yields dataset-dependent P(MCD ≺ Ensemble) curves, but the absence of a meta-level prior over datasets means the model cannot support inference about whether a common threshold is absent across the broader space of regression or classification tasks; the conclusion is therefore an untested extrapolation.
[Abstract] Concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 are reported as evidence, yet the abstract provides no details on data splits, model fitting, or how the hierarchical variances were estimated; this makes it impossible to verify whether post-hoc dataset choices or variance modeling decisions affect the reported divergence between superiority probabilities and detectability thresholds.

minor comments (2)

The abstract is technically dense; adding a short parenthetical definition or forward reference for the Minimum Detectable Difference curve when it is first mentioned would improve readability for readers outside the immediate subfield.
Ensure that the five regression datasets and six BDL methods are enumerated with brief descriptions or citations in the main text so that the scope of the empirical study is immediately clear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on the abstract. We address each major comment below, with revisions to qualify our claims and improve clarity.

Referee: [Abstract] The central prescriptive claim that 'no universal sample size threshold exists' and therefore 'dataset-specific posterior inference is always required' is supported only by results on five regression datasets. The Bayesian hierarchical model with method-specific variances correctly yields dataset-dependent P(MCD ≺ Ensemble) curves, but the absence of a meta-level prior over datasets means the model cannot support inference about whether a common threshold is absent across the broader space of regression or classification tasks; the conclusion is therefore an untested extrapolation.

Authors: We agree that the empirical results are confined to the five regression datasets studied and that the hierarchical model does not incorporate a meta-prior over datasets, so it cannot formally demonstrate the non-existence of a universal threshold across all possible tasks or domains. The recommendation for dataset-specific posterior inference is presented as a practical consequence of the observed dataset-dependence in our experiments rather than a universal proof. We will revise the abstract to state that 'across the datasets considered, no universal sample size threshold exists' and qualify the prescriptive claim accordingly. This constitutes a partial revision focused on wording.
revision: partial

Referee: [Abstract] Concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 are reported as evidence, yet the abstract provides no details on data splits, model fitting, or how the hierarchical variances were estimated; this makes it impossible to verify whether post-hoc dataset choices or variance modeling decisions affect the reported divergence between superiority probabilities and detectability thresholds.

Authors: The abstract is a high-level summary; complete details on data splits (5-fold cross-validation across multiple realizations), model fitting (MCMC for the hierarchical model), and estimation of method-specific variances are provided in Sections 3 and 4 of the manuscript. To address the concern about self-containment, we will add a concise clause in the abstract referencing the use of hierarchical modeling over multiple data realizations and predictive MDD curves. Full verification remains possible from the main text.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons and hierarchical model are self-contained

The paper reports cross-dataset empirical results on six BDL methods and five regression tasks, then applies a Bayesian hierarchical model (with method-specific variances) to obtain posterior probabilities and predictive MDD curves. No equation, parameter fit, or self-citation reduces a reported probability, detectability threshold, or ranking instability claim to a quantity defined by the same data or prior work by construction. The central claim that no universal sample-size threshold exists is an inductive statement over the observed datasets rather than a definitional or fitted tautology. The derivation chain therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.
