
arxiv: 2605.18329 · v2 · pith:7YCRD3ZOnew · submitted 2026-05-18 · 💻 cs.CV · cs.LG

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

Pith reviewed 2026-05-25 06:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords uncertainty estimation · deep ensembles · cross-validation · medical image segmentation · epistemic uncertainty · failure detection · calibration · ambiguity modeling

The pith

Cross-validation ensembles mix data-exposure effects into their uncertainty estimates, unlike deep ensembles, whose members all train on the same fixed dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how different ways of building ensembles affect uncertainty estimates in medical image segmentation. It finds that K-fold cross-validation ensembles, often mislabeled as deep ensembles, combine seed variability with differences in training data exposure. On three multi-rater datasets across modalities, deep ensembles trained on the full data with varied seeds match segmentation accuracy while delivering better calibrated uncertainties and stronger failure detection. Cross-validation ensembles sometimes align more closely with inter-rater variability instead. The work supplies a simple nnU-Net change to support deep-ensemble training inside standard pipelines.

Core claim

Deep ensembles formed from a fixed training set and different random seeds yield uncertainty estimates with improved calibration and failure detection compared to 5-fold cross-validation ensembles under matched training conditions, while the cross-validation approach can serve as a closer proxy for segmentation ambiguity on the tested multi-rater datasets.

What carries the argument

The controlled comparison of cross-validation ensembles (members trained on different data subsets) versus deep ensembles (members trained on identical data with different seeds), with disagreement used as the uncertainty signal.
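The difference between the two constructions can be sketched as a training plan. This is a minimal illustration, not the paper's nnU-Net modification: the function names, the per-member seed assignment, and the 100 toy case IDs are all assumptions made for the example.

```python
import random


def cv_ensemble_plan(case_ids, k=5, split_seed=0):
    """One member per fold: each member trains on the other k-1 folds."""
    ids = list(case_ids)
    random.Random(split_seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    plan = []
    for i in range(k):
        held_out = set(folds[i])
        train = [c for c in ids if c not in held_out]
        # Assumption: each member also gets its own seed, so data and seed both vary.
        plan.append({"member": i, "seed": i, "train_ids": sorted(train)})
    return plan


def de_ensemble_plan(case_ids, n_members=5):
    """Every member trains on all cases; only the random seed differs."""
    ids = sorted(case_ids)
    return [{"member": i, "seed": i, "train_ids": ids} for i in range(n_members)]


cases = [f"case_{i:03d}" for i in range(100)]  # hypothetical case IDs
cv_plan = cv_ensemble_plan(cases)
de_plan = de_ensemble_plan(cases)
print([len(m["train_ids"]) for m in cv_plan])  # each CV member sees 4/5 of the cases
print([len(m["train_ids"]) for m in de_plan])  # each DE member sees every case
```

In the CV plan no two members share the same training set, so their disagreement reflects both the seed and the data each member saw. In the DE plan the training set is identical, which leaves the seed as the only source of variation.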

If this is right

  • Deep ensembles should be used when the goal is reliability-oriented tasks such as selective referral or failure detection.
  • Cross-validation ensembles can be retained when the aim is to approximate segmentation ambiguity or inter-rater disagreement.
  • Studies must distinguish the two ensemble constructions in both method and terminology.
  • A lightweight nnU-Net modification makes deep-ensemble training available inside the default pipeline.
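Selective referral, mentioned in the first point above, can be sketched as a risk-coverage computation: cases are ranked by uncertainty, the most uncertain ones are referred, and the mean risk of the retained cases is reported. The function name, the coverage grid, and the toy risk and uncertainty values are assumptions for illustration; the paper's exact protocol is not given on this page.

```python
def referral_curve(risks, uncertainties, coverages=(1.0, 0.8, 0.6, 0.4, 0.2)):
    """Mean risk of retained cases when the most uncertain cases are referred."""
    # Sort case indices from most confident to least confident.
    order = sorted(range(len(risks)), key=lambda i: uncertainties[i])
    curve = []
    for cov in coverages:
        n_keep = max(1, round(cov * len(risks)))
        kept = order[:n_keep]
        curve.append((cov, sum(risks[i] for i in kept) / n_keep))
    return curve


# Toy per-case values (hypothetical): risk is 1 - Dice, uncertainty is a case-level score.
risks = [0.04, 0.05, 0.30, 0.06, 0.20]
uncert = [0.10, 0.20, 0.90, 0.15, 0.70]
curve = referral_curve(risks, uncert)
for cov, risk in curve:
    print(f"coverage {cov:.0%}: mean risk {risk:.4f}")
```

A useful uncertainty score makes the mean risk fall as coverage shrinks, because the referred cases are the ones most likely to be wrong.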

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Past papers that used cross-validation but reported deep-ensemble results may need re-interpretation of their uncertainty findings.
  • The same distinction could matter for non-segmentation tasks or architectures where data-subset effects interact differently with model variability.
  • Choosing the wrong ensemble type might alter clinical downstream uses such as when to trust or escalate a prediction.

Load-bearing premise

The three multi-rater segmentation datasets and fixed training setups allow general conclusions about how ensemble construction changes uncertainty interpretation across medical imaging tasks.

What would settle it

A reversal of the calibration or failure-detection advantage when the same comparison is repeated on new datasets or modalities would show the reported preference does not hold generally.

Figures

Figures reproduced from arXiv: 2605.18329 by Balint Kovacs (DKFZ), Fabian Isensee (DKFZ), Kim-Celine Kahl (DKFZ), Klaus Maier-Hein (DKFZ), Markus Bujotzek (DKFZ), Maximilian Rokuss (DKFZ), Tristan Kirscher (ICube, Institut Strauss, DKFZ), Yannick Kirchhoff (DKFZ).

Figure 1: Per-case Average Calibration Error (ACE) on the in-distribution test set for Cross-Validation (CV) and Deep Ensemble (DE) methods across all datasets.

Figure 2: Referral curves showing the mean risk (1 − Dice) of in-distribution retained cases as a function of coverage for CV and DE methods across all datasets. [Plot not reproduced; x-axis: Coverage (%); y-axis: Mean Risk (1 − Dice) of retained cases (%); panel titles include RIGA and Curvas.]
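For readers unfamiliar with the ACE metric named in the Figure 1 caption, the sketch below uses one common definition: the unweighted mean of |accuracy − confidence| over non-empty, equal-width confidence bins. The bin count, the function name, and the toy inputs are assumptions, and the paper's exact binning may differ.

```python
def average_calibration_error(confidences, correct, n_bins=10):
    """Unweighted mean of |accuracy - confidence| over non-empty equal-width bins."""
    # Per bin: [sum of confidences, number of correct predictions, count].
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[b][0] += conf
        bins[b][1] += float(ok)
        bins[b][2] += 1
    gaps = [abs(s[1] / s[2] - s[0] / s[2]) for s in bins if s[2] > 0]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy pixel-level values for a single case (hypothetical).
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 1, 1, 0, 1, 0]
ace = average_calibration_error(conf, hit)
print(f"ACE = {ace:.3f}")
```

Unlike the expected calibration error, this definition does not weight bins by their size, so a sparsely populated bin counts as much as a dense one.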
read the original abstract

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as ``deep ensembles'' (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology--implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
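One common way to turn ensemble disagreement into a number is the mutual information between the prediction and the member index: the entropy of the averaged prediction minus the average entropy of the individual members. The abstract does not say which disagreement measure the paper uses, so the sketch below is an assumption chosen for illustration, applied to the class probabilities of a single pixel.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def disagreement(member_probs):
    """Mutual information: entropy of the mean prediction minus mean member entropy."""
    n = len(member_probs)
    n_classes = len(member_probs[0])
    mean_p = [sum(m[c] for m in member_probs) / n for c in range(n_classes)]
    return entropy(mean_p) - sum(entropy(m) for m in member_probs) / n


# Toy foreground/background probabilities from five members (hypothetical).
agree = [[0.9, 0.1]] * 5
split = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
print(f"members agree:    {disagreement(agree):.3f}")
print(f"members disagree: {disagreement(split):.3f}")
```

The score is near zero when all members give the same distribution, even if that distribution is itself uncertain, and grows when confident members contradict each other.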

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that cross-validation (CV) ensembles are frequently mislabeled as deep ensembles (DE) in the medical image segmentation uncertainty literature, even though CV introduces data-subset variability in addition to seed variability. Through an audit of recent studies and an empirical comparison on three multi-rater datasets (spanning three modalities), a 5-fold CV ensemble is contrasted with a 5-member DE under identical configurations. Results indicate that DE matches segmentation accuracy while offering better calibration and failure detection, whereas CV ensembles can correlate more strongly with inter-rater variability; the authors recommend matching ensemble type to the uncertainty use case and release an nnU-Net modification for DE training.

Significance. This work underscores the need to carefully select ensemble construction methods when using disagreement as an uncertainty proxy, with potential impact on applications like selective referral in medical imaging. The empirical evaluation across multiple axes (calibration, failure detection, ambiguity, distribution shift) and the provision of reproducible code are strengths that support the practical relevance if the central distinction is confirmed after addressing data exposure differences.

major comments (2)
  1. [Abstract and §4] The central comparison confounds ensemble type with per-model training data volume: each CV member trains on 4/5 of the data while each DE member trains on the full set. This leaves open whether differences in calibration and failure detection arise from the intended distinction (seed vs. data-exposure variability) or simply from reduced data exposure in CV. No per-member performance metrics or ablation equalizing total training examples seen are reported.
  2. [§4] The manuscript reports comparative results without error bars, statistical tests, or details on exact data splits and random seeds, which weakens the reliability of claims about DE improving calibration and CV correlating with inter-rater variability.
minor comments (2)
  1. [§5] The discussion could more explicitly address how the findings generalize beyond the three studied datasets and nnU-Net architecture.
  2. Some figure captions could be expanded to include the exact metrics plotted for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our comparison. We respond to each major comment below and commit to targeted revisions that preserve the manuscript's core contribution while improving rigor.

  1. Referee: [Abstract and §4] The central comparison confounds ensemble type with per-model training data volume: each CV member trains on 4/5 of the data while each DE member trains on the full set. This leaves open whether differences in calibration and failure detection arise from the intended distinction (seed vs. data-exposure variability) or simply from reduced data exposure in CV. No per-member performance metrics or ablation equalizing total training examples seen are reported.

    Authors: The per-model data-volume difference is inherent to standard CV ensemble construction (each member sees 4/5 of the data) versus DE (full data), and forms part of the data-subset variability that our work distinguishes from pure seed variability. Our audit and experiments compare the two approaches as they are actually implemented in the literature. We will add per-member Dice and calibration metrics in the revision to quantify the effect of reduced data exposure. An ablation that equalizes total training examples (e.g., via repeated passes over subsets) would require non-standard training regimes not representative of typical CV usage; we therefore do not plan such an ablation but will explicitly note this design choice and its implications. revision: partial

  2. Referee: [§4] The manuscript reports comparative results without error bars, statistical tests, or details on exact data splits and random seeds, which weakens the reliability of claims about DE improving calibration and CV correlating with inter-rater variability.

    Authors: We agree that the current results section would benefit from greater statistical transparency. In the revision we will report error bars (standard deviation across repeated runs with varied seeds where computationally feasible), apply appropriate statistical tests (e.g., paired Wilcoxon signed-rank tests) to the primary calibration and failure-detection metrics, and include precise descriptions of data splits and random seeds in the methods and supplementary material (with code release). revision: yes
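The paired Wilcoxon signed-rank test named in this response can be sketched with the standard library alone. The version below computes an exact two-sided p-value by enumerating every sign pattern, which is only practical for small samples; a statistics library would normally be used in practice. The per-case values are hypothetical and are not results from the paper.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)


# Hypothetical per-case calibration errors for the same six test cases.
cv_ace = [0.120, 0.150, 0.090, 0.200, 0.110, 0.180]
de_ace = [0.100, 0.110, 0.095, 0.140, 0.080, 0.130]
w_plus, p_value = wilcoxon_signed_rank(cv_ace, de_ace)
print(f"W+ = {w_plus}, exact two-sided p = {p_value}")
```

The test is paired because both ensembles are evaluated on the same cases, so the per-case differences, not the two group means, carry the evidence.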

Circularity Check

0 steps flagged

Purely empirical comparison; no derivation chain present


The paper conducts an audit of prior studies and reports direct experimental measurements (accuracy, calibration, failure detection, inter-rater correlation) on held-out test sets across three datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains are invoked to support any central claim. The comparison between CV and DE is implemented via code changes to nnU-Net, but the results remain observational and falsifiable against external benchmarks. This matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Empirical study with standard machine learning assumptions about data partitioning and random initialization.

free parameters (1)
  • ensemble size = 5
    Fixed at 5 for both CV and DE to enable direct comparison
axioms (1)
  • domain assumption: Multi-rater annotations serve as valid ground truth for measuring ambiguity
    Invoked when evaluating CV correlation with inter-rater variability

pith-pipeline@v0.9.0 · 7858 in / 1094 out tokens · 56842 ms · 2026-05-25T06:11:13.366552+00:00 · methodology

