pith. machine review for the scientific record.

arxiv: 2604.23102 · v1 · submitted 2026-04-25 · 💻 cs.LG

Unstable Rankings in Bayesian Deep Learning Evaluation

Pith reviewed 2026-05-08 08:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords Bayesian deep learning · evaluation metrics · ranking instability · hierarchical Bayesian model · minimum detectable difference · low-data regimes · uncertainty-aware evaluation · regression datasets

The pith

Standard point-estimate evaluations of Bayesian deep learning methods yield unreliable rankings when training data is limited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluations of Bayesian deep learning methods typically use single numbers for performance, yet these numbers fluctuate enough under limited data that one method can rank above another on one dataset while the reverse holds on another. The paper shows that the probability of one method outperforming another can be near 1.0 at n=50 on some data but stay below 0.95 even at n=500 on others. Because no fixed training size guarantees stable rankings across datasets, the authors argue that evaluations must incorporate uncertainty over possible data draws. They build a hierarchical model that lets the ranking probability itself be estimated from repeated training runs, together with a curve that predicts how large a gap must be before it becomes detectable at a given sample size.

Core claim

Across six Bayesian deep learning methods and five regression datasets, method rankings are dataset-dependent and fail to stabilize at small training sizes. The same comparison can give P(MCD ≺ Ensemble) = 1.000 at n=50 on one dataset yet remain below 0.95 at n=500 on another. No universal sample-size threshold exists; therefore dataset-specific posterior inference over metrics is required to determine when observed differences are reliable.
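To make a quantity such as P(MCD ≺ Ensemble) concrete, here is a minimal Python sketch that estimates the probability that one method has the lower mean score, given paired scores over repeated data realizations. It is not the paper's hierarchical model. It assumes a normal approximation to the posterior of the mean paired difference (flat prior, plug-in variance), and the function name `prob_a_better` and all scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def prob_a_better(scores_a, scores_b):
    """Approximate P(A ≺ B): probability that A's mean score is lower than B's.

    Assumes lower is better (e.g. CRPS or NLL), scores paired by data
    realization, and a normal approximation to the posterior of the mean
    paired difference. This is a stand-in, not the paper's model.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    # Posterior mass below zero means A has the lower (better) mean score.
    return NormalDist(mean(diffs), std_err).cdf(0.0)


# Hypothetical scores over four data realizations (illustrative only).
clear_a = [1.0, 1.1, 0.9, 1.0]
clear_b = [2.0, 2.1, 2.2, 1.9]
noisy_a = [1.0, 1.6, 0.7, 1.3]
noisy_b = [1.1, 1.2, 1.0, 1.4]

p_clear = prob_a_better(clear_a, clear_b)
p_noisy = prob_a_better(noisy_a, noisy_b)
print(f"clear gap: {p_clear:.3f}, noisy gap: {p_noisy:.3f}")
```

With a large, consistent gap the probability sits near 1, while a small gap buried in realization-to-realization noise leaves it close to 0.5, which is the inconclusive regime the core claim refers to.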

What carries the argument

Bayesian hierarchical model with method-specific variances that treats evaluation metrics as random variables across data realizations, plus a predictive Minimum Detectable Difference curve for assessing detectability at given training sizes.

If this is right

  • Evidence for superiority of one method over another must be checked against the probability that the ranking would reverse on new data draws of the same size.
  • A method that appears best on one low-data problem may not be distinguishable from alternatives on a different problem at the same size.
  • Evaluation reports should include the minimum training size at which a given performance gap becomes detectable for that dataset.
  • Declaring one Bayesian method superior from point metrics alone is unreliable in low-data regimes.
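The idea of a minimum training size at which a gap becomes detectable can be sketched in Python as follows. This is a simplified stand-in and not the paper's predictive Minimum Detectable Difference: it assumes the mean paired difference is approximately normal, so a gap is "detectable" once it exceeds a one-sided normal quantile times its standard error. The function names, the table `sd_by_n`, and every number in it are hypothetical.

```python
import math
from statistics import NormalDist


def mdd(sd_diff, n_realizations, level=0.95):
    """Smallest mean gap that reaches `level` under a normal approximation.

    sd_diff is the standard deviation of the paired score difference
    across data realizations (an assumed input, not a paper quantity).
    """
    z = NormalDist().inv_cdf(level)
    return z * sd_diff / math.sqrt(n_realizations)


def min_detectable_size(gap, sd_by_n, n_realizations, level=0.95):
    """Return the smallest training size whose MDD is at most `gap`, or None."""
    for n in sorted(sd_by_n):
        if mdd(sd_by_n[n], n_realizations, level) <= gap:
            return n
    return None


# Hypothetical spread of the score difference at each training size.
sd_by_n = {50: 0.40, 100: 0.20, 200: 0.10, 500: 0.05}

for gap in (0.2, 0.05, 0.01):
    n_star = min_detectable_size(gap, sd_by_n, n_realizations=25)
    print(f"gap {gap}: detectable from n = {n_star}")
```

Because the spread table is specific to one dataset, the same gap can map to different minimum sizes on different datasets, which is why the report would have to be made per dataset.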

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed instability may account for conflicting results across papers that compare the same set of Bayesian deep learning techniques.
  • Practitioners could apply the same hierarchical model to decide in advance how many training runs are needed before trusting a ranking.
  • The framework suggests similar uncertainty quantification would be useful when comparing non-Bayesian methods under data constraints.
  • Extending the analysis to classification datasets could reveal whether the same lack of universal thresholds holds outside regression.

Load-bearing premise

The five chosen regression datasets and six Bayesian deep learning methods are sufficiently representative that the lack of a universal sample-size threshold generalizes beyond these cases.

What would settle it

Finding a single training size n at which every pairwise comparison is decisive (superiority probability above 0.95 in one direction) on all five datasets would falsify the claim that no universal threshold exists.
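This falsification test can be written as a short check over a table of pairwise probabilities. The Python sketch below is illustrative only: the nested layout `probs[dataset][n][(a, b)]`, the function name, the dataset labels, and all values are hypothetical, and it assumes a comparison counts as settled when the probability is above the level in either direction.

```python
def sizes_with_all_decisive(probs, level=0.95):
    """Return training sizes at which every pair is decisive on every dataset.

    probs[dataset][n][(a, b)] holds P(a ≺ b). A pair is treated as decisive
    when that probability is above `level` or below 1 - `level`.
    """
    datasets = list(probs)
    # Only sizes evaluated on every dataset can serve as a common threshold.
    common_sizes = set.intersection(*(set(probs[d]) for d in datasets))
    decisive = []
    for n in sorted(common_sizes):
        if all(
            p > level or p < 1.0 - level
            for d in datasets
            for p in probs[d][n].values()
        ):
            decisive.append(n)
    return decisive


pair = ("MCD", "Ensemble")
# Hypothetical tables, not results from the paper.
unstable = {
    "dataset_a": {50: {pair: 1.000}, 500: {pair: 0.999}},
    "dataset_b": {50: {pair: 0.60}, 500: {pair: 0.90}},
}
stable = {
    "dataset_a": {50: {pair: 1.000}, 500: {pair: 0.999}},
    "dataset_b": {50: {pair: 0.60}, 500: {pair: 0.97}},
}

print(sizes_with_all_decisive(unstable))
print(sizes_with_all_decisive(stable))
```

An empty result is consistent with the paper's claim for the sizes tested. A non-empty result on the paper's own five datasets would count against it.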

Figures

Figures reproduced from arXiv: 2604.23102 by Guansu Wang, Jiaxin Liu, Liang He, Minxuan Hu, Qishi Zhan.

Figure 1: Standard deviation of CRPS and NLL across …
Figure 2: Posterior probability P(MCD ≺ Ensemble) on CRPS across training sizes for three representative datasets. Synthetic and Energy exhibit ranking reversals, while Concrete remains weaker and nearly inconclusive at n = 200. Dashed lines at 0.05 and 0.95 indicate strong posterior support, and the dotted line at 0.50 indicates an inconclusive comparison.
Figure 3: Pairwise posterior probability P(row ≺ column) on CRPS for Kin8nm (left) and Concrete (right) at n = 50. Values > 0.95 (dark blue) indicate that the row method reliably outperforms the column method; values < 0.05 (dark red) indicate the opposite; values near 0.5 (white) indicate inconclusive comparisons. The stark contrast between datasets illustrates that method rankings are dataset-dependent and cannot …
Figure 4: Predictive Minimum Detectable Difference for MCD vs. Deep Ensem…
Figure 5: Posterior predictive checks for CRPS (left) and NLL (right) at …
Figure 6: Pairwise posterior probability P(row ≺ column) on CRPS for Kin8nm (left) and Concrete (right) at n = 200. Values > 0.95 (dark blue) indicate that the row method reliably outperforms the column method; values < 0.05 (dark red) indicate the opposite; values near 0.5 (white) indicate inconclusive comparisons. The stark contrast between datasets illustrates that method rankings are dataset-dependent and cannot …
Figure 7: Variance decomposition of a neural network evaluation metric under …
Figure 8: Distribution of Kendall τ between CRPS and Interval Score method rankings across R = 50 data realizations. Higher τ indicates greater agreement between the two metrics. The dashed line at τ = 1 represents perfect agreement. Consistency increases with n but never reaches 1.0.
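
The Kendall τ agreement measure named in the Figure 8 caption can be illustrated with a minimal Python sketch. It computes the tie-free form of τ (concordant minus discordant pairs, divided by the number of pairs) for two rankings of the same methods. The method labels and ranks are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall tau for two rankings of the same items, assuming no ties.

    rank_x and rank_y map each method label to its rank (1 = best).
    """
    methods = list(rank_x)
    concordant = discordant = 0
    for a, b in combinations(methods, 2):
        # Positive product: both metrics order the pair the same way.
        sign = (rank_x[a] - rank_x[b]) * (rank_y[a] - rank_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of four methods under two metrics.
rank_crps = {"A": 1, "B": 2, "C": 3, "D": 4}
rank_interval = {"A": 1, "B": 3, "C": 2, "D": 4}  # B and C swapped

tau = kendall_tau(rank_crps, rank_interval)
print(f"tau = {tau:.3f}")
```

A single swapped pair already pulls τ below 1, so a distribution of τ that never reaches 1.0 means at least one pair of methods changes order between the two metrics in every realization.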
Original abstract

Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that standard evaluations of Bayesian deep learning (BDL) methods assume reliable metric estimates and that this assumption fails under data scarcity: method rankings are unreliable at small n and dataset-dependent in ways point estimates cannot reveal. Using a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables, the authors report probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 on one dataset, while the same probability remains below 0.95 even at n=500 on another. They introduce predictive Minimum Detectable Difference curves to assess whether observed gaps would be detectable at a given training size. Across six BDL methods and five regression datasets, they conclude that no universal sample-size threshold exists, making dataset-specific posterior inference necessary for reliable superiority claims.

Significance. If the results hold, the work is significant because it demonstrates that uncertainty-aware evaluation is required in low-data BDL settings where current evidence for method superiority and predictive detectability can diverge. The Bayesian hierarchical model and MDD curves provide practitioners with principled, reproducible tools to assess evaluation sufficiency before drawing conclusions, moving beyond point estimates. The explicit treatment of metrics as random variables across data realizations is a clear strength.

major comments (2)
  1. [Abstract] Abstract: The central prescriptive claim that 'no universal sample size threshold exists' and therefore 'dataset-specific posterior inference is always required' is supported only by results on five regression datasets. The Bayesian hierarchical model with method-specific variances correctly yields dataset-dependent P(MCD ≺ Ensemble) curves, but the absence of a meta-level prior over datasets means the model cannot support inference about whether a common threshold is absent across the broader space of regression or classification tasks; the conclusion is therefore an untested extrapolation.
  2. [Abstract] Abstract: Concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 are reported as evidence, yet the abstract provides no details on data splits, model fitting, or how the hierarchical variances were estimated; this makes it impossible to verify whether post-hoc dataset choices or variance modeling decisions affect the reported divergence between superiority probabilities and detectability thresholds.
minor comments (2)
  1. The abstract is technically dense; adding a short parenthetical definition or forward reference for the Minimum Detectable Difference curve when it is first mentioned would improve readability for readers outside the immediate subfield.
  2. Ensure that the five regression datasets and six BDL methods are enumerated with brief descriptions or citations in the main text so that the scope of the empirical study is immediately clear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on the abstract. We address each major comment below, with revisions to qualify our claims and improve clarity.

  1. Referee: [Abstract] Abstract: The central prescriptive claim that 'no universal sample size threshold exists' and therefore 'dataset-specific posterior inference is always required' is supported only by results on five regression datasets. The Bayesian hierarchical model with method-specific variances correctly yields dataset-dependent P(MCD ≺ Ensemble) curves, but the absence of a meta-level prior over datasets means the model cannot support inference about whether a common threshold is absent across the broader space of regression or classification tasks; the conclusion is therefore an untested extrapolation.

    Authors: We agree that the empirical results are confined to the five regression datasets studied and that the hierarchical model does not incorporate a meta-prior over datasets, so it cannot formally demonstrate the non-existence of a universal threshold across all possible tasks or domains. The recommendation for dataset-specific posterior inference is presented as a practical consequence of the observed dataset-dependence in our experiments rather than a universal proof. We will revise the abstract to state that 'across the datasets considered, no universal sample size threshold exists' and qualify the prescriptive claim accordingly. This constitutes a partial revision focused on wording. revision: partial

  2. Referee: [Abstract] Abstract: Concrete probabilities such as P(MCD ≺ Ensemble) = 1.000 at n=50 are reported as evidence, yet the abstract provides no details on data splits, model fitting, or how the hierarchical variances were estimated; this makes it impossible to verify whether post-hoc dataset choices or variance modeling decisions affect the reported divergence between superiority probabilities and detectability thresholds.

    Authors: The abstract is a high-level summary; complete details on data splits (5-fold cross-validation across multiple realizations), model fitting (MCMC for the hierarchical model), and estimation of method-specific variances are provided in Sections 3 and 4 of the manuscript. To address the concern about self-containment, we will add a concise clause in the abstract referencing the use of hierarchical modeling over multiple data realizations and predictive MDD curves. Full verification remains possible from the main text. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons and hierarchical model are self-contained

The paper reports cross-dataset empirical results on six BDL methods and five regression tasks, then applies a Bayesian hierarchical model (with method-specific variances) to obtain posterior probabilities and predictive MDD curves. No equation, parameter fit, or self-citation reduces a reported probability, detectability threshold, or ranking instability claim to a quantity defined by the same data or prior work by construction. The central claim that no universal sample-size threshold exists is an inductive statement over the observed datasets rather than a definitional or fitted tautology. The derivation chain therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5517 in / 1078 out tokens · 30661 ms · 2026-05-08T08:30:52.641495+00:00 · methodology
