Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
Pith reviewed 2026-05-25 06:11 UTC · model grok-4.3
The pith
Cross-validation ensembles mix data-exposure effects into their uncertainty estimates, whereas deep ensembles, trained on a fixed dataset, vary only the random seed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deep ensembles formed from a fixed training set and different random seeds yield uncertainty estimates with improved calibration and failure detection compared to 5-fold cross-validation ensembles under matched training conditions, while the cross-validation approach can serve as a closer proxy for segmentation ambiguity on the tested multi-rater datasets.
What carries the argument
The controlled comparison of cross-validation ensembles (members trained on different data subsets) versus deep ensembles (members trained on identical data with different seeds), with disagreement used as the uncertainty signal.
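To make the two constructions concrete, the following is a minimal sketch of which cases each ensemble member trains on, plus one possible disagreement signal. The function names, the interleaved fold assignment, and the choice of per-pixel variance as the disagreement measure are illustrative assumptions; the page does not state the paper's exact splitting code or disagreement metric.

```python
import random

def cv_member_indices(n_cases, k=5, seed=0):
    """Training indices for each member of a K-fold CV ensemble (hypothetical helper)."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint held-out folds
    # Member i trains on every case except those in fold i.
    return [sorted(set(idx) - set(folds[i])) for i in range(k)]

def de_member_indices(n_cases, m=5):
    """Training indices for each deep-ensemble member: the same full set every time."""
    return [list(range(n_cases)) for _ in range(m)]

def disagreement(member_probs):
    """Per-pixel variance of foreground probability across members (assumed metric)."""
    m = len(member_probs)
    out = []
    for vals in zip(*member_probs):
        mean = sum(vals) / m
        out.append(sum((v - mean) ** 2 for v in vals) / m)
    return out

cv = cv_member_indices(100)
de = de_member_indices(100)
```

With 100 cases, every CV member sees 80 cases and no two members see the same 80, while every DE member sees all 100. Any disagreement among DE members can therefore only come from seed-driven effects.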
If this is right
- Deep ensembles should be used when the goal is reliability-oriented tasks such as selective referral or failure detection.
- Cross-validation ensembles can be retained when the aim is to approximate segmentation ambiguity or inter-rater disagreement.
- Studies must distinguish the two ensemble constructions in both method and terminology.
- A lightweight nnU-Net modification makes deep-ensemble training available inside the default pipeline.
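The selective-referral use mentioned in the first bullet can be sketched as follows: rank cases by a case-level uncertainty score, hand the most uncertain fraction to a human, and check the quality of what remains. The function name and all numbers are invented for illustration and are not results from the paper.

```python
def selective_referral(case_uncertainty, case_dice, refer_fraction):
    """Refer the most uncertain cases; return (referred ids, mean Dice of kept cases)."""
    n = len(case_uncertainty)
    n_refer = int(round(refer_fraction * n))
    # Most uncertain cases first.
    order = sorted(range(n), key=lambda i: case_uncertainty[i], reverse=True)
    referred = sorted(order[:n_refer])
    kept = order[n_refer:]
    mean_kept = sum(case_dice[i] for i in kept) / len(kept) if kept else float("nan")
    return referred, mean_kept

# Toy values, invented for illustration only.
uncertainty = [0.02, 0.30, 0.05, 0.40, 0.03]
dice = [0.92, 0.55, 0.90, 0.40, 0.91]
referred, kept_dice = selective_referral(uncertainty, dice, 0.4)
```

In this toy case the two most uncertain cases are also the two worst segmentations, so referring them raises the mean Dice of the retained cases. A useful failure-detection signal behaves this way; an uncertainty score dominated by annotation ambiguity may not.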
Where Pith is reading between the lines
- Past papers that built their ensembles via cross-validation but described them as deep ensembles may need their uncertainty findings re-interpreted.
- The same distinction could matter for non-segmentation tasks or architectures where data-subset effects interact differently with model variability.
- Choosing the wrong ensemble type might alter clinical downstream uses such as when to trust or escalate a prediction.
Load-bearing premise
The three multi-rater segmentation datasets and fixed training setups allow general conclusions about how ensemble construction changes uncertainty interpretation across medical imaging tasks.
What would settle it
A reversal of the calibration or failure-detection advantage when the same comparison is repeated on new datasets or modalities would show the reported preference does not hold generally.
Original abstract
Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
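For readers unfamiliar with how calibration is typically scored, here is a minimal sketch of the expected calibration error (ECE) with equal-width confidence bins. The abstract does not say which calibration metric the paper uses, so treating ECE as the metric is an assumption, as are the function name and the toy inputs.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[b].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Four predictions made at 90% confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Here the model claims 90% confidence but is right 75% of the time, giving an ECE of 0.15. Lower values mean the stated confidence tracks the observed accuracy more closely.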
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that cross-validation (CV) ensembles are frequently mislabeled as deep ensembles (DE) in the medical image segmentation uncertainty literature, even though CV adds data-subset variability on top of seed variability. Through an audit of recent studies and an empirical comparison on three multi-rater datasets (spanning three modalities), a 5-fold CV ensemble is contrasted with a 5-member DE under otherwise identical configurations. Results indicate that DE matches segmentation accuracy while offering better calibration and failure detection, whereas CV ensembles can correlate more strongly with inter-rater variability; the authors recommend matching ensemble type to the uncertainty use case and release an nnU-Net modification for DE training.
Significance. This work underscores the need to carefully select ensemble construction methods when using disagreement as an uncertainty proxy, with potential impact on applications like selective referral in medical imaging. The empirical evaluation across multiple axes (calibration, failure detection, ambiguity, distribution shift) and the provision of reproducible code are strengths that support the practical relevance if the central distinction is confirmed after addressing data exposure differences.
major comments (2)
- [Abstract and §4] The central comparison confounds ensemble type with per-model training data volume: each CV member trains on 4/5 of the data while each DE member trains on the full set. This leaves open whether differences in calibration and failure detection arise from the intended distinction (seed vs. data-exposure variability) or simply from reduced data exposure in CV. No per-member performance metrics or ablation equalizing total training examples seen are reported.
- [§4] The manuscript reports comparative results without error bars, statistical tests, or details on exact data splits and random seeds, which weakens the reliability of claims about DE improving calibration and CV correlating with inter-rater variability.
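The data-volume confound raised in the first major comment can be quantified with simple fold arithmetic. The sketch below assumes equally sized folds; the function name is hypothetical.

```python
from fractions import Fraction

def cv_data_exposure(k):
    """Fractions of the full training set for a K-fold CV ensemble with equal folds."""
    per_member = Fraction(k - 1, k)   # each member leaves out exactly one fold
    pair_shared = Fraction(k - 2, k)  # two members jointly leave out two folds
    return per_member, pair_shared

per_member, pair_shared = cv_data_exposure(5)
print(per_member, pair_shared)  # 4/5 3/5
```

So in the 5-fold setting each CV member trains on 4/5 of the cases, and any two members share only 3/5 of them. In a deep ensemble both fractions are 1. The comparison therefore changes two things at once: how much data each member sees and how much the members' training sets overlap.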
minor comments (2)
- [§5] The discussion could more explicitly address how the findings generalize beyond the three studied datasets and nnU-Net architecture.
- Some figure captions could be expanded to include the exact metrics plotted for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope of our comparison. We respond to each major comment below and commit to targeted revisions that preserve the manuscript's core contribution while improving rigor.
- Referee: [Abstract and §4] The central comparison confounds ensemble type with per-model training data volume: each CV member trains on 4/5 of the data while each DE member trains on the full set. This leaves open whether differences in calibration and failure detection arise from the intended distinction (seed vs. data-exposure variability) or simply from reduced data exposure in CV. No per-member performance metrics or ablation equalizing total training examples seen are reported.
  Authors: The per-model data-volume difference is inherent to standard CV ensemble construction (each member sees 4/5 of the data) versus DE (full data), and forms part of the data-subset variability that our work distinguishes from pure seed variability. Our audit and experiments compare the two approaches as they are actually implemented in the literature. We will add per-member Dice and calibration metrics in the revision to quantify the effect of reduced data exposure. An ablation that equalizes total training examples (e.g., via repeated passes over subsets) would require non-standard training regimes not representative of typical CV usage; we therefore do not plan such an ablation but will explicitly note this design choice and its implications.
  Revision: partial
- Referee: [§4] The manuscript reports comparative results without error bars, statistical tests, or details on exact data splits and random seeds, which weakens the reliability of claims about DE improving calibration and CV correlating with inter-rater variability.
  Authors: We agree that the current results section would benefit from greater statistical transparency. In the revision we will report error bars (standard deviation across repeated runs with varied seeds where computationally feasible), apply appropriate statistical tests (e.g., paired Wilcoxon signed-rank tests) to the primary calibration and failure-detection metrics, and include precise descriptions of data splits and random seeds in the methods and supplementary material (with code release).
  Revision: yes
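The paired Wilcoxon signed-rank test named in the second response can be illustrated with a small hand-rolled exact version. This is a sketch for intuition only: in practice a library routine such as `scipy.stats.wilcoxon` would normally be used. The per-case scores below are invented, and the function name is hypothetical.

```python
from itertools import product

def wilcoxon_signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied |differences| get average ranks.
    All 2**n sign patterns are enumerated, so this is only practical for small n.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2  # mean of W+ under the null hypothesis
    observed_dev = abs(w_plus - expected)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Invented per-case calibration errors (in percent) for six paired test cases.
cv_scores = [12, 15, 9, 20, 14, 11]
de_scores = [10, 11, 8, 14, 9, 4]
w_plus, p_value = wilcoxon_signed_rank_exact(cv_scores, de_scores)
```

With six pairs that all differ in the same direction, the exact two-sided p-value is 2/64 = 0.03125, the smallest value achievable at that sample size. This is why reporting the number of paired cases alongside the test result matters.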
Circularity Check
Purely empirical comparison; no derivation chain present
The paper conducts an audit of prior studies and reports direct experimental measurements (accuracy, calibration, failure detection, inter-rater correlation) on held-out test sets across three datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains are invoked to support any central claim. The comparison between CV and DE is implemented via code changes to nnU-Net, but the results remain observational and falsifiable against external benchmarks. This matches the default expectation of no circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble size = 5
axioms (1)
- domain assumption: Multi-rater annotations serve as valid ground truth for measuring ambiguity
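One way this assumption could be put to work is sketched below: derive a per-pixel ambiguity map from the rater masks and correlate it with the ensemble's uncertainty map. The use of Bernoulli variance for rater variability and of Pearson correlation are illustrative choices, not the paper's stated method, and all values are invented.

```python
import math

def rater_variability(rater_masks):
    """Per-pixel Bernoulli variance p * (1 - p) of binary rater labels."""
    m = len(rater_masks)
    out = []
    for labels in zip(*rater_masks):
        p = sum(labels) / m  # fraction of raters marking the pixel as foreground
        out.append(p * (1 - p))
    return out

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

# Three raters, three pixels; the raters disagree only on the middle pixel.
raters = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]
ensemble_uncertainty = [0.0, 0.2, 0.01]  # invented values
ambiguity = rater_variability(raters)
r = pearson(ambiguity, ensemble_uncertainty)
```

A high correlation here would indicate that the ensemble is uncertain where the raters disagree. Whether that makes rater disagreement a valid ground truth for ambiguity is exactly the assumption this ledger entry flags.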