Is K-fold cross validation the best model selection method for Machine Learning?
Pith reviewed 2026-05-24 04:42 UTC · model grok-4.3
The pith
K-fold CUBV uses PAC-Bayesian bounds on linear classifiers to validate machine learning accuracy while reducing excess false positives on small or heterogeneous data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper derives Probably Approximately Correct (PAC)-Bayesian upper bounds for linear classifiers combined with K-fold CV, then uses them to estimate the actual risk by bounding uncertain predictions with their worst case. Results on simulated and neuroimaging datasets suggest that K-fold CUBV is a robust criterion for detecting effects and for validating accuracy values from machine learning and classical CV schemes, while avoiding excess false positives.
What carries the argument
K-fold CUBV, the combination of K-fold cross-validation with PAC-Bayesian upper bounds on actual risk that applies concentration inequalities to bound uncertain predictions by their worst-case value.
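For orientation, one commonly quoted McAllester-type PAC-Bayesian bound is written out below. It is a generic textbook form, not necessarily the bound derived in the paper. The symbols are assumptions introduced here: $P$ is a prior over classifiers fixed before seeing the data, $Q$ any posterior, $n$ the number of i.i.d. samples, $\hat{R}_S(Q)$ the empirical risk, $R(Q)$ the actual risk, and $\delta$ the allowed failure probability.

```latex
% Generic McAllester-type PAC-Bayesian bound (illustrative, not the paper's exact result).
% With probability at least 1 - \delta over the draw of the sample S of size n,
% simultaneously for all posteriors Q:
\[
  R(Q) \;\le\; \hat{R}_S(Q)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\frac{2\sqrt{n}}{\delta}}{2n}} .
\]
% A worst-case accuracy in the spirit of K-fold CUBV then follows by
% subtracting the deviation term from the empirical accuracy:
\[
  \mathrm{Acc}_{\text{worst}} \;=\; 1 - \hat{R}_S(Q)
  - \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\frac{2\sqrt{n}}{\delta}}{2n}} .
\]
```

The deviation term shrinks roughly as $1/\sqrt{n}$, which is why tightness on small samples is the central practical question.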
If this is right
- K-fold CUBV supplies confidence intervals for accuracy values obtained directly from machine learning classifications.
- The method reduces excess false positives when validating models on small-sample or heterogeneous sources.
- It enables a frequentist-style analysis inside machine learning pipelines without requiring parametric assumptions on accuracy.
- Classical CV schemes can be checked against the K-fold CUBV bound to confirm whether reported accuracy reflects genuine effects.
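The check described in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the mean-difference linear classifier, the function names, the `kl` argument, and the McAllester-type deviation term are all choices made here, and applying an i.i.d. bound to pooled K-fold predictions is itself a simplification.

```python
import math
import numpy as np

def kfold_indices(n, k, rng):
    """Split a random permutation of range(n) into k test folds."""
    return np.array_split(rng.permutation(n), k)

def fit_mean_difference(X, y):
    """Hypothetical linear classifier: separate along the class-mean difference."""
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * w @ (mu0 + mu1)  # threshold at the midpoint between the means
    return w, b

def cv_error(X, y, k, rng):
    """Pooled K-fold misclassification rate."""
    errors = 0
    for test_idx in kfold_indices(len(y), k, rng):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        w, b = fit_mean_difference(X[train], y[train])
        pred = (X[test_idx] @ w + b > 0).astype(int)
        errors += int(np.sum(pred != y[test_idx]))
    return errors / len(y)

def deviation_term(n, delta, kl=0.0):
    """McAllester-type deviation term (assumed stand-in for the paper's bound)."""
    return math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))

def worst_case_accuracy(cv_err, n, delta=0.05, kl=0.0):
    """Accuracy after replacing the CV error by its upper bound."""
    return 1.0 - min(1.0, cv_err + deviation_term(n, delta, kl))
```

A reported accuracy would count as a genuine effect in this sketch only if `worst_case_accuracy(...)` stays above chance level (0.5 for two balanced classes).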
Where Pith is reading between the lines
- The bounding technique might extend beyond linear classifiers if similar concentration inequalities can be derived for other model families.
- Integration into existing cross-validation routines could change how practitioners report statistical significance in applied machine learning.
- Comparison with permutation tests on the same datasets could clarify whether the PAC-Bayesian bound adds information beyond resampling.
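For the permutation-test comparison mentioned in the last bullet, a baseline could look like the sketch below. The function name and the `score_fn` callback are hypothetical; in practice `score_fn` would be a cross-validated accuracy. The `+1` correction in the numerator and denominator keeps the p-value strictly positive when permutations are randomly drawn.

```python
import numpy as np

def permutation_p_value(score_fn, X, y, n_permutations=200, seed=0):
    """Monte Carlo permutation p-value for a score such as CV accuracy.

    score_fn(X, y) -> float is supplied by the caller (an assumption of this
    sketch); larger scores are taken to mean stronger evidence of an effect.
    """
    rng = np.random.default_rng(seed)
    observed = score_fn(X, y)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling the labels breaks any association between X and y.
        if score_fn(X, rng.permutation(y)) >= observed:
            exceed += 1
    # Counting the observed labelling as one permutation avoids p = 0.
    return (exceed + 1) / (n_permutations + 1)
```

Running this and a bound-based criterion on the same datasets would show whether the two agree on which effects are detected.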
Load-bearing premise
The PAC-Bayesian upper bounds for linear classifiers stay useful and not overly conservative when applied to real heterogeneous datasets.
What would settle it
The robustness claim would be falsified by a heterogeneous dataset on which K-fold CUBV produces bounds so conservative that it misses known effects that standard K-fold CV detects without excess false positives.
Original abstract
As a technique that can compactly represent complex patterns, machine learning has significant potential for predictive inference. K-fold cross-validation (CV) is the most common approach to ascertaining the likelihood that a machine learning outcome is generated by chance, and it frequently outperforms conventional hypothesis testing. This improvement uses measures directly obtained from machine learning classifications, such as accuracy, that do not have a parametric description. To approach a frequentist analysis within machine learning pipelines, a permutation test or simple statistics from data partitions (i.e., folds) can be added to estimate confidence intervals. Unfortunately, neither parametric nor non-parametric tests solve the inherent problems of partitioning small sample-size datasets and learning from heterogeneous data sources. The fact that machine learning strongly depends on the learning parameters and the distribution of data across folds recapitulates familiar difficulties around excess false positives and replication. A novel statistical test based on K-fold CV and the Upper Bound of the actual risk (K-fold CUBV) is proposed, where uncertain predictions of machine learning with CV are bounded by the worst case through the evaluation of concentration inequalities. Probably Approximately Correct-Bayesian upper bounds for linear classifiers in combination with K-fold CV are derived and used to estimate the actual risk. The performance with simulated and neuroimaging datasets suggests that K-fold CUBV is a robust criterion for detecting effects and validating accuracy values obtained from machine learning and classical CV schemes, while avoiding excess false positives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript questions whether K-fold cross-validation is the best model selection method for machine learning and proposes K-fold CUBV, which combines K-fold CV with PAC-Bayesian upper bounds on the actual risk derived for linear classifiers. The central claim is that this approach yields a robust criterion for detecting effects and validating ML accuracies on small-sample and heterogeneous data (simulated and neuroimaging datasets) while controlling excess false positives better than standard CV or permutation tests.
Significance. If the derived bounds are shown to be sufficiently tight in practice and the method demonstrably improves false-positive control without loss of power, the work could strengthen validation practices in applied ML domains such as neuroimaging. The use of standard concentration inequalities to produce explicit upper bounds on risk is a methodological strength that aligns with PAC-Bayesian theory.
major comments (2)
- [Results (neuroimaging experiments)] The robustness claim rests on the PAC-Bayesian bounds remaining useful (not overly conservative) on heterogeneous neuroimaging data. The results section on the neuroimaging experiments does not report the numerical values of the derived upper bounds relative to the observed empirical accuracies or risks, so it is impossible to verify whether the bounds stay within a factor of 2–3 of the empirical performance or inflate substantially as is common for PAC-Bayes on non-stationary, high-dimensional data.
- [Method (bound derivation)] The derivation of the PAC-Bayesian upper bounds is stated to be for linear classifiers, yet the abstract and title frame the contribution for general machine learning pipelines. The manuscript does not clarify how (or whether) the bounds extend to non-linear models that are standard in the evaluated neuroimaging tasks, which is load-bearing for the claim that K-fold CUBV improves upon classical CV schemes.
minor comments (2)
- [Methods] Notation for the concentration inequalities and the precise definition of the K-fold CUBV statistic could be introduced with an explicit equation early in the methods section rather than relying on the abstract description.
- [Experiments (simulated data)] The simulated-data experiments would benefit from an explicit statement of the data-generating process parameters and the exact form of the linear classifier used, to allow direct reproduction of the reported false-positive rates.
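To illustrate the kind of explicit specification the last minor comment asks for, the sketch below fixes a null data-generating process and estimates false-positive rates. Every parameter here (`n`, `d`, `k`, `reps`, `delta`), the Gaussian noise model, the mean-difference classifier, and the McAllester-type margin are hypothetical choices for illustration and are not taken from the manuscript.

```python
import math
import numpy as np

def cv_accuracy(X, y, k, rng):
    """K-fold CV accuracy of a mean-difference linear classifier."""
    folds = np.array_split(rng.permutation(len(y)), k)
    correct = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        mu0 = X[train & (y == 0)].mean(axis=0)
        mu1 = X[train & (y == 1)].mean(axis=0)
        w = mu1 - mu0
        b = -0.5 * w @ (mu0 + mu1)
        pred = (X[test_idx] @ w + b > 0).astype(int)
        correct += int(np.sum(pred == y[test_idx]))
    return correct / len(y)

def null_false_positive_rates(n=20, d=5, k=5, reps=200, delta=0.05, seed=0):
    """Fraction of pure-noise datasets declared 'above chance'.

    Returns (naive_rate, corrected_rate): the naive rule accepts any CV
    accuracy above 0.5, the corrected rule first subtracts an assumed
    McAllester-type margin with zero KL term.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)  # balanced labels
    margin = math.sqrt(math.log(2.0 * math.sqrt(n) / delta) / (2.0 * n))
    naive = corrected = 0
    for _ in range(reps):
        X = rng.standard_normal((n, d))  # features carry no class effect
        acc = cv_accuracy(X, y, k, rng)
        naive += acc > 0.5
        corrected += (acc - margin) > 0.5
    return naive / reps, corrected / reps
```

Because the corrected rule only ever lowers the accuracy before thresholding, its false-positive rate cannot exceed the naive one; how much power it costs when a real effect is present is the separate question raised in the major comments.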
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address each major comment point-by-point below, indicating where we agree and will revise the manuscript.
- Referee: [Results (neuroimaging experiments)] The robustness claim rests on the PAC-Bayesian bounds remaining useful (not overly conservative) on heterogeneous neuroimaging data. The results section on the neuroimaging experiments does not report the numerical values of the derived upper bounds relative to the observed empirical accuracies or risks, so it is impossible to verify whether the bounds stay within a factor of 2–3 of the empirical performance or inflate substantially as is common for PAC-Bayes on non-stationary, high-dimensional data.
  Authors: We agree that the numerical values of the PAC-Bayesian upper bounds relative to empirical accuracies are needed to evaluate tightness on heterogeneous data. In the revised manuscript we will add a table (or supplementary table) reporting these values for each neuroimaging dataset, allowing direct assessment of whether the bounds remain within a reasonable factor of the observed risks.
  revision: yes
- Referee: [Method (bound derivation)] The derivation of the PAC-Bayesian upper bounds is stated to be for linear classifiers, yet the abstract and title frame the contribution for general machine learning pipelines. The manuscript does not clarify how (or whether) the bounds extend to non-linear models that are standard in the evaluated neuroimaging tasks, which is load-bearing for the claim that K-fold CUBV improves upon classical CV schemes.
  Authors: The derivation in the methods section is explicitly for linear classifiers using the corresponding concentration inequalities. The title poses a general question about model selection, but the concrete contribution and bounds are for linear models. We will revise the abstract to state this scope clearly and add a short discussion paragraph noting that extensions to non-linear models would require different inequalities and are left for future work. This removes any ambiguity without overclaiming generality.
  revision: yes
Circularity Check
No circularity: bounds derived from standard concentration inequalities; evaluation is empirical
The paper states that PAC-Bayesian upper bounds for linear classifiers combined with K-fold CV are derived from concentration inequalities and then applied to estimate actual risk. This is a standard mathematical derivation step whose inputs are the classifier, the prior, and the empirical risk on the folds; the resulting bound is not defined in terms of the target accuracy or the final performance metric. The subsequent claim that K-fold CUBV is robust rests on reported performance on simulated and neuroimaging datasets, which constitutes external empirical validation rather than a reduction of the bound to its own inputs. No self-citation, fitted-parameter renaming, or self-definitional step is present in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Concentration inequalities and PAC-Bayesian analysis can be applied to bound the actual risk of linear classifiers when combined with K-fold cross-validation partitions.