Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

arxiv: 2605.18329 · v2 · pith:7YCRD3ZOnew · submitted 2026-05-18 · 💻 cs.CV · cs.LG

Tristan Kirscher (ICube, Institut Strauss, DKFZ), Markus Bujotzek (DKFZ), Yannick Kirchhoff (DKFZ), Maximilian Rokuss (DKFZ), Fabian Isensee (DKFZ), Kim-Celine Kahl (DKFZ), Balint Kovacs (DKFZ), Klaus Maier-Hein (DKFZ)

Pith reviewed 2026-05-25 06:11 UTC · model grok-4.3

keywords: uncertainty estimation · deep ensembles · cross-validation · medical image segmentation · epistemic uncertainty · failure detection · calibration · ambiguity modeling

The pith

Cross-validation ensembles mix data-exposure effects into their uncertainty estimates, unlike deep ensembles, whose members all train on the same fixed data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how different ways of building ensembles affect uncertainty estimates in medical image segmentation. It finds that K-fold cross-validation ensembles, often mislabeled as deep ensembles, combine seed variability with differences in training data exposure. On three multi-rater datasets across modalities, deep ensembles trained on the full data with varied seeds match segmentation accuracy while delivering better calibrated uncertainties and stronger failure detection. Cross-validation ensembles sometimes align more closely with inter-rater variability instead. The work supplies a simple nnU-Net change to support deep-ensemble training inside standard pipelines.

Core claim

Deep ensembles formed from a fixed training set and different random seeds yield uncertainty estimates with improved calibration and failure detection compared to 5-fold cross-validation ensembles under matched training conditions, while the cross-validation approach can serve as a closer proxy for segmentation ambiguity on the tested multi-rater datasets.

What carries the argument

The controlled comparison of cross-validation ensembles (members trained on different data subsets) versus deep ensembles (members trained on identical data with different seeds), with disagreement used as the uncertainty signal.
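
The difference between the two constructions can be sketched in a few lines. This is a minimal illustration, not the paper's nnU-Net modification: the function names, the `split_seed` parameter, the per-member `seed` field, and the 100-case dataset are all hypothetical. It shows only which cases each ensemble member would train on.

```python
import random

def cv_member_training_sets(case_ids, k=5, split_seed=0):
    """K-fold CV ensemble: member i trains on every fold except fold i."""
    ids = list(case_ids)
    random.Random(split_seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    members = []
    for held_out in range(k):
        train = [c for f, fold in enumerate(folds) if f != held_out for c in fold]
        # "seed" is a hypothetical per-member initialization seed
        members.append({"seed": held_out, "train": sorted(train)})
    return members

def de_member_training_sets(case_ids, n_members=5):
    """Deep ensemble: every member trains on all cases; only the seed differs."""
    ids = sorted(case_ids)
    return [{"seed": s, "train": ids} for s in range(n_members)]

cases = [f"case_{i:03d}" for i in range(100)]  # hypothetical dataset
cv = cv_member_training_sets(cases)
de = de_member_training_sets(cases)

print([len(m["train"]) for m in cv])  # [80, 80, 80, 80, 80]
print([len(m["train"]) for m in de])  # [100, 100, 100, 100, 100]

# Any two CV members share only 3 of the 5 folds, so their disagreement
# reflects different data exposure as well as different seeds.
shared = set(cv[0]["train"]) & set(cv[1]["train"])
print(len(shared))  # 60
```

With five folds, each CV member sees 80% of the cases and any pair of members overlaps on 60%, whereas all deep-ensemble members see the identical full set.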

If this is right

Deep ensembles should be used when the goal is reliability-oriented tasks such as selective referral or failure detection.
Cross-validation ensembles can be retained when the aim is to approximate segmentation ambiguity or inter-rater disagreement.
Studies must distinguish the two ensemble constructions in both method and terminology.
A lightweight nnU-Net modification makes deep-ensemble training available inside the default pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Past papers that used cross-validation but reported deep-ensemble results may need re-interpretation of their uncertainty findings.
The same distinction could matter for non-segmentation tasks or architectures where data-subset effects interact differently with model variability.
Choosing the wrong ensemble type might alter clinical downstream uses such as when to trust or escalate a prediction.

Load-bearing premise

The three multi-rater segmentation datasets and fixed training setups allow general conclusions about how ensemble construction changes uncertainty interpretation across medical imaging tasks.

What would settle it

A reversal of the calibration or failure-detection advantage when the same comparison is repeated on new datasets or modalities would show the reported preference does not hold generally.

Figures

Figures reproduced from arXiv: 2605.18329 by Tristan Kirscher (ICube, Institut Strauss, DKFZ), Markus Bujotzek (DKFZ), Yannick Kirchhoff (DKFZ), Maximilian Rokuss (DKFZ), Fabian Isensee (DKFZ), Kim-Celine Kahl (DKFZ), Balint Kovacs (DKFZ), Klaus Maier-Hein (DKFZ).

**Figure 1.** Per-case Average Calibration Error (ACE) on the in-distribution test set for Cross-Validation (CV) and Deep Ensemble (DE) methods across all datasets.
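
For readers unfamiliar with the metric, the sketch below computes one common form of ACE for a single case: pixels are grouped into equal-width confidence bins, and the absolute gap between mean confidence and accuracy is averaged over the non-empty bins without weighting by bin size. The bin count, the function name, and the toy inputs are assumptions; the paper's exact binning is not given on this page.

```python
def average_calibration_error(confidences, correct, n_bins=10):
    """Unweighted mean of |accuracy - confidence| over non-empty bins.

    confidences: per-pixel confidence in the predicted label, in [0, 1]
    correct:     per-pixel booleans, True where the prediction is right
    """
    # Each bin accumulates [confidence sum, correct count, pixel count].
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[b][0] += conf
        bins[b][1] += 1.0 if ok else 0.0
        bins[b][2] += 1
    gaps = [abs(acc_sum / n - conf_sum / n) for conf_sum, acc_sum, n in bins if n > 0]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Toy case (hypothetical values): four pixels at 0.95 confidence with three
# correct, and two pixels at 0.55 confidence with one correct.
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [True, True, True, False, True, False]
ace = average_calibration_error(conf, hit)
print(ace)  # approximately 0.125: the mean of the gaps 0.20 and 0.05
```

Because the bins are not weighted by pixel count, sparsely populated bins count as much as dense ones, which is the main difference from the expected calibration error.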

**Figure 2.** Referral curves showing the mean risk (1 − Dice) of in-distribution retained cases as a function of coverage for CV and DE methods across all datasets.
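
A referral curve of this kind can be sketched as follows, assuming each test case has a scalar uncertainty score and a risk of 1 - Dice. Cases are ranked from most to least certain, and the mean risk is reported for the retained fraction at each coverage level. The function name, the coverage grid, and the five example cases are hypothetical and are not results from the paper.

```python
def referral_curve(uncertainty, risk, coverages=(1.0, 0.8, 0.6, 0.4, 0.2)):
    """Mean risk of the retained cases when the most uncertain are referred.

    uncertainty: per-case scalar uncertainty score (higher = less certain)
    risk:        per-case risk, e.g. 1 - Dice
    """
    order = sorted(range(len(risk)), key=lambda i: uncertainty[i])
    curve = []
    for cov in coverages:
        n_keep = max(1, round(cov * len(order)))  # always retain at least one case
        kept = order[:n_keep]
        curve.append((cov, sum(risk[i] for i in kept) / n_keep))
    return curve

# Hypothetical cases in which uncertainty ranks the failures last.
unc = [0.1, 0.2, 0.3, 0.4, 0.5]
rsk = [0.02, 0.03, 0.05, 0.10, 0.30]
curve = referral_curve(unc, rsk)
for cov, mean_risk in curve:
    print(f"coverage {cov:.0%}: mean risk {mean_risk:.3f}")
```

When uncertainty is informative, the mean risk falls as coverage shrinks; a flat curve would mean the uncertainty score does not separate good cases from failures.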

Original abstract

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
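
One common way to turn ensemble disagreement into a per-pixel uncertainty score is the mutual information between the prediction and the ensemble member, computed as the entropy of the averaged prediction minus the average member entropy. The sketch below does this for a binary foreground probability. It is an illustrative assumption: the page does not state which disagreement measure the paper uses, and the function names and inputs are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def disagreement(member_probs):
    """Uncertainty summaries for one pixel from M member foreground probabilities."""
    m = len(member_probs)
    mean_p = sum(member_probs) / m
    total = binary_entropy(mean_p)                               # predictive entropy
    expected = sum(binary_entropy(p) for p in member_probs) / m  # mean member entropy
    return {
        "mean_prob": mean_p,
        "predictive_entropy": total,
        "mutual_information": total - expected,  # rises when members disagree
    }

# Five members that agree, and five confident members that contradict each other.
agree = disagreement([0.9, 0.9, 0.9, 0.9, 0.9])
split = disagreement([0.99, 0.01, 0.99, 0.01, 0.99])
print(agree["mutual_information"])  # close to zero
print(split["mutual_information"])  # clearly positive
```

The same score applies to CV and DE members alike; what differs between the two constructions is the source of the disagreement it measures.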

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CV and DE differ in uncertainty behavior on the tested tasks, but the comparison is undercut by CV models seeing less data per member.

Your colleague should know two things about this paper. First, it correctly identifies that many segmentation papers misuse the term "deep ensemble" to mean cross-validation folds. Second, it shows measurable differences in how those two constructions behave for uncertainty tasks. The comparison itself, though, rests on setups that differ in more than just the ensemble method.

What the work does well is run the same training pipeline on three multi-rater datasets and track uncertainty across calibration, failure detection, inter-rater correlation, and shift robustness. The finding that deep ensembles (same data, different seeds) improve calibration and failure detection while CV sometimes tracks rater disagreement better gives practitioners a concrete reason to pick one over the other depending on the goal. The nnU-Net modification is a small but helpful addition that lets people try the deep ensemble route without changing their whole workflow.

The main soft spot is the one the stress test flags. CV members each train on 80 percent of the data while deep ensemble members train on the full set. The abstract claims "otherwise identical configurations", but that ignores the difference in the amount of data each model sees. No per-member accuracy numbers or ablation that holds total training examples constant appear in the description, so the gaps could trace to less data per CV model rather than the intended distinction between data-subset variability and seed variability. That undercuts how strongly we can recommend one construction for reliability tasks versus ambiguity modeling.

This is aimed at researchers who use or review uncertainty estimates in medical segmentation. A reader who cares about how ensemble choice affects downstream decisions will get value from the numbers, even if they need to treat the causal claims with caution. It is coherent and engages the literature on its own terms, so it deserves a serious referee who can ask for the missing controls on data exposure.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that cross-validation (CV) ensembles are frequently mislabeled as deep ensembles (DE) in the medical image segmentation uncertainty literature, even though CV introduces data-subset variability in addition to seed variability. Through an audit of recent studies and an empirical comparison on three multi-rater datasets (spanning modalities), a 5-fold CV ensemble is contrasted with a 5-member DE under identical configurations. Results indicate that DE matches segmentation accuracy while offering better calibration and failure detection, whereas CV ensembles can correlate more with inter-rater variability; the authors recommend matching ensemble type to the uncertainty use-case and release an nnU-Net modification for DE training.

Significance. This work underscores the need to carefully select ensemble construction methods when using disagreement as an uncertainty proxy, with potential impact on applications like selective referral in medical imaging. The empirical evaluation across multiple axes (calibration, failure detection, ambiguity, distribution shift) and the provision of reproducible code are strengths that support the practical relevance if the central distinction is confirmed after addressing data exposure differences.

major comments (2)

[Abstract and §4] The central comparison confounds ensemble type with per-model training data volume: each CV member trains on 4/5 of the data while each DE member trains on the full set. This leaves open whether differences in calibration and failure detection arise from the intended distinction (seed vs. data-exposure variability) or simply from reduced data exposure in CV. No per-member performance metrics or ablation equalizing total training examples seen are reported.
[§4] The manuscript reports comparative results without error bars, statistical tests, or details on exact data splits and random seeds, which weakens the reliability of claims about DE improving calibration and CV correlating with inter-rater variability.

minor comments (2)

[§5] The discussion could more explicitly address how the findings generalize beyond the three studied datasets and nnU-Net architecture.
Some figure captions could be expanded to include the exact metrics plotted for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our comparison. We respond to each major comment below and commit to targeted revisions that preserve the manuscript's core contribution while improving rigor.

Referee: [Abstract and §4] The central comparison confounds ensemble type with per-model training data volume: each CV member trains on 4/5 of the data while each DE member trains on the full set. This leaves open whether differences in calibration and failure detection arise from the intended distinction (seed vs. data-exposure variability) or simply from reduced data exposure in CV. No per-member performance metrics or ablation equalizing total training examples seen are reported.

Authors: The per-model data-volume difference is inherent to standard CV ensemble construction (each member sees 4/5 of the data) versus DE (full data), and forms part of the data-subset variability that our work distinguishes from pure seed variability. Our audit and experiments compare the two approaches as they are actually implemented in the literature. We will add per-member Dice and calibration metrics in the revision to quantify the effect of reduced data exposure. An ablation that equalizes total training examples (e.g., via repeated passes over subsets) would require non-standard training regimes not representative of typical CV usage; we therefore do not plan such an ablation but will explicitly note this design choice and its implications.

Revision: partial

Referee: [§4] The manuscript reports comparative results without error bars, statistical tests, or details on exact data splits and random seeds, which weakens the reliability of claims about DE improving calibration and CV correlating with inter-rater variability.

Authors: We agree that the current results section would benefit from greater statistical transparency. In the revision we will report error bars (standard deviation across repeated runs with varied seeds where computationally feasible), apply appropriate statistical tests (e.g., paired Wilcoxon signed-rank tests) to the primary calibration and failure-detection metrics, and include precise descriptions of data splits and random seeds in the methods and supplementary material (with code release).

Revision: yes
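
As an illustration of the paired test named in the response, the sketch below implements an exact two-sided Wilcoxon signed-rank test for a small number of paired cases by enumerating all sign assignments. It is a teaching sketch under stated assumptions: the per-case ACE values are invented for the example and are not results from the paper. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test (small samples only).

    Returns (W_plus, p_value). Zero differences are dropped, and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns and count those at least as extreme.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed + 1e-12:
            hits += 1
    return w_plus, hits / 2 ** n

# Hypothetical per-case ACE values (%) for the same eight test cases.
ace_cv = [6.0, 5.5, 7.0, 4.5, 6.5, 5.0, 8.0, 7.5]
ace_de = [5.5, 4.5, 5.5, 2.5, 4.0, 2.0, 4.5, 3.5]
w_plus, p_value = wilcoxon_signed_rank(ace_cv, ace_de)
print(w_plus, p_value)  # 36.0 0.0078125
```

The enumeration grows as 2^n, so this exact form is practical only for roughly twenty pairs or fewer; larger test sets call for the normal approximation that library implementations provide.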

Circularity Check

0 steps flagged

Purely empirical comparison; no derivation chain present

The paper conducts an audit of prior studies and reports direct experimental measurements (accuracy, calibration, failure detection, inter-rater correlation) on held-out test sets across three datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains are invoked to support any central claim. The comparison between CV and DE is implemented via code changes to nnU-Net, but the results remain observational and falsifiable against external benchmarks. This matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Empirical study with standard machine learning assumptions about data partitioning and random initialization.

free parameters (1)

ensemble size = 5
Fixed at 5 for both CV and DE to enable direct comparison

axioms (1)

Domain assumption: Multi-rater annotations serve as valid ground truth for measuring ambiguity
Invoked when evaluating CV correlation with inter-rater variability
