Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net
Pith reviewed 2026-05-10 15:10 UTC · model grok-4.3
The pith
Calibrated checkpoint ensembles produce uncertainty maps that best align with actual errors in nnU-Net radiotherapy segmentations
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Combining post-hoc temperature scaling with checkpoint-based ensembling in nnU-Net produces voxel-wise predictive-entropy uncertainty maps that align most strongly with segmentation errors, measured as AUC over the top 0-5 percent most uncertain voxels. Segmentation performance is unchanged, so the maps can support efficient, targeted manual QA under realistic revision budgets for complex cases such as TMLI.
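To make the quantity in this claim concrete, here is a minimal sketch of predictive entropy for a single voxel, computed from the mean softmax over several checkpoints. It assumes the temperature is applied to each member's logits before averaging; the paper's exact ordering is not stated here, and the function names, logits, and temperature value are all illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of one voxel's class logits after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(member_logits, temperature=1.0):
    """Entropy (in nats) of the mean softmax over ensemble members for one voxel."""
    probs = [softmax(logits, temperature) for logits in member_logits]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)

# Three hypothetical checkpoints, two classes (background, CTV).
agree = [[4.0, -4.0], [3.5, -3.5], [4.2, -4.1]]      # members agree
disagree = [[2.0, -2.0], [-1.5, 1.5], [0.2, -0.1]]   # members disagree
h_agree = predictive_entropy(agree, temperature=1.5)
h_disagree = predictive_entropy(disagree, temperature=1.5)
```

In this toy case the voxel where the checkpoints disagree receives the higher entropy, which is the behaviour the uncertainty map relies on.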
What carries the argument
The calibrated predictive entropy uncertainty map from nnU-Net checkpoint ensembles, which ranks voxels by uncertainty score to direct manual review toward the regions most likely to contain errors.
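The ranking step can be sketched as follows. This is an illustration only: a flat list stands in for a 3D volume, and `select_for_review` and the example values are hypothetical names and data, not taken from the paper.

```python
def select_for_review(uncertainty, budget):
    """Return indices of the most uncertain voxels, at most `budget` of the total."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be a fraction in [0, 1]")
    # The small epsilon tolerates float rounding in budget * n.
    k = int(budget * len(uncertainty) + 1e-9)
    ranked = sorted(range(len(uncertainty)),
                    key=lambda i: uncertainty[i], reverse=True)
    return ranked[:k]

# Twenty hypothetical voxel entropies; a 5 percent budget allows one voxel.
entropy_map = [0.02, 0.65, 0.10, 0.01, 0.40, 0.03, 0.68, 0.05, 0.02, 0.30,
               0.01, 0.04, 0.55, 0.02, 0.06, 0.01, 0.03, 0.02, 0.07, 0.01]
flagged = select_for_review(entropy_map, budget=0.05)
```

A reviewer would then inspect only the flagged indices, so the budget directly caps the amount of manual work.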
If this is right
- Segmentation accuracy remains stable across all tested uncertainty configurations.
- Temperature scaling substantially improves calibration metrics for the uncertainty estimates.
- Uncertainty-error alignment reaches its highest value with calibrated checkpoint-based inference.
- The resulting maps enable more consistent highlighting of regions that require manual edits under budget constraints.
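The first two points fit together because temperature scaling divides logits by a single positive scalar: confidence changes, but the argmax (and therefore the segmentation) does not. The sketch below fits that scalar by minimising validation negative log-likelihood. The grid search is an assumption chosen for brevity (a gradient-based optimiser is more usual), and the validation data are synthetic.

```python
import math

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / temperature for v in z]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))

def argmax(z):
    return max(range(len(z)), key=lambda c: z[c])

# Hypothetical over-confident validation voxels: one prediction in four is wrong.
val_logits = [[2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [2.0, -2.0]] * 5
val_labels = [0, 0, 1, 1] * 5
T = fit_temperature(val_logits, val_labels)
```

Here the fitted temperature comes out above 1, which softens the over-confident probabilities while leaving every predicted label unchanged.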
Where Pith is reading between the lines
- The same calibration-plus-checkpoint strategy could be applied to other nnU-Net medical segmentation tasks to test whether the alignment benefit holds beyond TMLI.
- If the maps prove reliable in prospective use, they could be paired with interactive editing tools to further reduce total clinician review time.
- The approach leaves open whether the highlighted voxels also correlate with downstream clinical metrics such as dose planning accuracy.
Load-bearing premise
Calibrated voxel-wise predictive-entropy maps reliably track actual segmentation errors rather than other sources of variability, and AUC over the top 0-5 percent most uncertain voxels predicts real time savings during manual QA.
What would settle it
An experiment in which calibrated checkpoint ensembles show no AUC gain over a single uncalibrated nnU-Net model at detecting segmentation errors within the top 0-5 percent most uncertain voxels would refute the claimed improvement in alignment.
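One way such a comparison could be scored is sketched below. The exact AUC definition is not given in this summary, so the code assumes one plausible reading: the fraction of all error voxels captured within the top-b most uncertain voxels, integrated over budgets b from 0 to 5 percent and normalised by the budget range. All names and data are hypothetical.

```python
def error_capture_curve(uncertainty, is_error, budgets):
    """Fraction of all error voxels found in the top-b most uncertain voxels."""
    n = len(uncertainty)
    ranked = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    n_errors = sum(is_error)
    curve = []
    for b in budgets:
        k = int(b * n + 1e-9)  # epsilon tolerates float rounding
        hits = sum(is_error[i] for i in ranked[:k])
        curve.append(hits / n_errors if n_errors else 0.0)
    return curve

def budget_auc(uncertainty, is_error, max_budget=0.05, steps=5):
    """Trapezoidal area under the capture curve on [0, max_budget], scaled to [0, 1]."""
    budgets = [max_budget * s / steps for s in range(steps + 1)]
    curve = error_capture_curve(uncertainty, is_error, budgets)
    area = sum((budgets[j + 1] - budgets[j]) * (curve[j] + curve[j + 1]) / 2
               for j in range(steps))
    return area / max_budget

# 100 hypothetical voxels, 5 of them wrong.
uncertainty = [1.0 - i / 100 for i in range(100)]
errors_at_top = [1] * 5 + [0] * 95      # errors sit at the most uncertain voxels
errors_at_bottom = [0] * 95 + [1] * 5   # errors sit at the least uncertain voxels
good = budget_auc(uncertainty, errors_at_top)
bad = budget_auc(uncertainty, errors_at_bottom)
```

Under this definition a configuration whose uncertainty ranks errors first scores higher than one that ranks them last, so a null result would show up as matching scores for the two inference setups.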
Original abstract
Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.
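The abstract does not define its ROI-masked calibration metrics. As one hedged illustration, the sketch below computes expected calibration error (ECE) restricted to voxels inside a region of interest. The equal-width binning, the way the ROI is supplied, and the function name are assumptions.

```python
def roi_masked_ece(confidence, correct, roi_mask, n_bins=10):
    """Expected calibration error over voxels where roi_mask is true."""
    # Each bin holds [voxel count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, ok, inside in zip(confidence, correct, roi_mask):
        if not inside:
            continue
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += ok
    total = sum(count for count, _, _ in bins)
    if total == 0:
        return 0.0
    # sum_b (n_b / N) * |accuracy_b - confidence_b|, written with bin sums.
    return sum(abs(acc_sum - conf_sum) / total
               for count, conf_sum, acc_sum in bins if count)

# Four voxels inside the ROI (one wrong) and four easy voxels outside it.
confidence = [0.95] * 4 + [0.99] * 4
correct = [1, 1, 1, 0] + [1] * 4
roi = [1] * 4 + [0] * 4
masked = roi_masked_ece(confidence, correct, roi)
unmasked = roi_masked_ece(confidence, correct, [1] * 8)
```

In this toy case the masked score is worse than the unmasked one, which illustrates why masking matters: easy voxels far from the target can dilute the calibration error.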
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a budget-aware uncertainty-driven QA framework for nnU-Net-based radiotherapy segmentation of the Clinical Target Volume (CTV) in Total Marrow and Lymph Node Irradiation (TMLI). It compares temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), both individually and in combination, to produce voxel-wise predictive entropy uncertainty maps. The central claims are that segmentation accuracy remains stable across configurations, TS substantially improves calibration (via ROI-masked metrics), and calibrated checkpoint-based inference yields the strongest uncertainty-error alignment, quantified as AUC over the top 0-5% most uncertain voxels, thereby guiding targeted manual review under realistic budget constraints.
Significance. If the results hold, the work offers a practical route to safer clinical deployment of auto-segmentation tools by supplying uncertainty cues that better align with actual errors, potentially reducing manual QA workload in complex cases. The emphasis on efficient ensembling plus post-hoc calibration is a pragmatic strength for budget-aware workflows. However, confinement to a single use case and reliance on a voxel-wise metric without spatial validation limit immediate generalizability and clinical translation.
major comments (2)
- The uncertainty-error alignment is quantified solely via AUC on the top 0-5% most uncertain voxels (predictive entropy after calibration, as stated in the abstract). This voxel-independent selection does not address the contiguous, boundary-driven nature of TMLI segmentation errors (e.g., coherent shifts or lymph-node clusters); no connected-component or region-level analysis is provided to confirm that selected voxels correspond to edits a clinician would perform under budget constraints.
- Evaluation is restricted to a single use case (TMLI) with selection of the best configuration, introducing potential post-hoc bias. The abstract reports no quantitative AUC values, error bars, dataset details (size, splits), or full method descriptions, which are load-bearing for assessing the magnitude and reproducibility of the claimed improvements in alignment and calibration.
minor comments (1)
- Notation for 'uncertainty-error' uses an en-dash in the abstract; ensure consistent hyphenation or en-dash usage in all sections and figures.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, making revisions where they strengthen the work without misrepresenting our results. Our responses focus on clarifying the rationale for our evaluation choices while incorporating additional analyses and reporting improvements.
Point-by-point responses
- Referee: The uncertainty-error alignment is quantified solely via AUC on the top 0-5% most uncertain voxels (predictive entropy after calibration, as stated in the abstract). This voxel-independent selection does not address the contiguous, boundary-driven nature of TMLI segmentation errors (e.g., coherent shifts or lymph-node clusters); no connected-component or region-level analysis is provided to confirm that selected voxels correspond to edits a clinician would perform under budget constraints.
  Authors: We agree that our primary metric (AUC on the top 0-5% uncertain voxels) is voxel-wise and does not explicitly model spatial contiguity, which is a valid point given the boundary-driven and clustered nature of TMLI errors. The voxel-level focus was chosen because budget-aware QA prioritizes identifying individual high-uncertainty locations for efficient manual review, where clinicians can address small contiguous groups without needing full region segmentation upfront. Nevertheless, to better demonstrate clinical relevance, we have added a supplementary connected-component analysis in the revised manuscript. This analysis quantifies the clustering of the most uncertain voxels and their overlap with error regions, and provides visual examples of boundary shifts and lymph-node clusters, confirming that selected voxels frequently align with coherent error areas a clinician would edit.
  Revision: yes
- Referee: Evaluation is restricted to a single use case (TMLI) with selection of the best configuration, introducing potential post-hoc bias. The abstract reports no quantitative AUC values, error bars, dataset details (size, splits), or full method descriptions, which are load-bearing for assessing the magnitude and reproducibility of the claimed improvements in alignment and calibration.
  Authors: We acknowledge the single-use-case limitation as restricting broader generalizability, which we already note in the discussion as a direction for future multi-site validation; TMLI was selected as a representative complex case but cannot be expanded here without additional data. To address post-hoc bias, the revised manuscript reports full results for all configurations (TS, DE, CE, TTA, and combinations) rather than only the best, with selection justified on validation performance. We have also revised the abstract to include quantitative AUC values for key configurations, mention of error bars from the figures, dataset details (patient count and train/validation/test splits), and a brief methods overview, while retaining full descriptions in the main text for reproducibility.
  Revision: partial
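The connected-component analysis mentioned in the first response is not specified in detail. A minimal sketch of the idea, assuming a 2D slice with 4-connectivity (a real 3D analysis would more likely use 6- or 26-connectivity) and hypothetical function names, could look like this:

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected components of a 2D binary mask; returns a list of pixel sets."""
    rows, cols = len(mask), len(mask[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            queue, comp = deque([(r, c)]), set()
            seen.add((r, c))
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                comp.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(comp)
    return components

def region_overlap(flagged, errors):
    """Fraction of flagged components that touch at least one error pixel."""
    comps = connected_components(flagged)
    if not comps:
        return 0.0
    hits = sum(any(errors[y][x] for y, x in comp) for comp in comps)
    return hits / len(comps)

# Hypothetical slice: three flagged regions, two of which touch a true error.
flagged = [[1, 1, 0, 0, 0],
           [0, 0, 0, 0, 1],
           [0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0]]
errors = [[0, 1, 1, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 1],
          [0, 0, 0, 0, 0]]
overlap = region_overlap(flagged, errors)
```

A region-level score of this kind complements the voxel-wise AUC, because it counts regions a clinician would visit instead of individual voxels.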
Circularity Check
No significant circularity detected
The paper applies standard uncertainty quantification techniques (predictive entropy, temperature scaling, deep/checkpoint ensembles, TTA) to nnU-Net and evaluates them using external, non-circular metrics: ROI-masked calibration and AUC on the top 0-5% most uncertain voxels for uncertainty-error alignment. These quantities are computed from held-out data and do not reduce by construction to the paper's fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the derivation chain remains self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: nnU-Net base segmentations are sufficiently accurate for TMLI to serve as the starting point for QA.
- Domain assumption: Predictive entropy from model outputs is a suitable proxy for segmentation error likelihood.