pith. machine review for the scientific record.

arxiv: 2604.11798 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI

Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

Pith reviewed 2026-05-10 15:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords nnU-Net · uncertainty quantification · radiotherapy segmentation · quality assurance · predictive entropy · temperature scaling · checkpoint ensembles · TMLI

The pith

Calibrated checkpoint ensembles produce uncertainty maps that best align with actual errors in nnU-Net radiotherapy segmentations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a budget-aware quality assurance approach for deep-learning segmentation of clinical target volumes in radiotherapy, using nnU-Net as the base model on total marrow and lymph node irradiation cases. It compares temperature scaling, deep ensembles, checkpoint ensembles, and test-time augmentation, both alone and combined, to generate voxel-wise predictive entropy uncertainty maps. While overall segmentation accuracy stays stable, temperature scaling improves calibration and checkpoint ensembles deliver the strongest match between high-uncertainty regions and real segmentation mistakes when measured by AUC on the top 0-5 percent most uncertain voxels. This setup aims to let clinicians focus limited manual review time on the voxels most likely to need correction instead of inspecting entire volumes.
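As a rough illustration of the quantity behind these maps, the sketch below computes the predictive entropy of one voxel's class-probability vector. The probability values are invented for illustration, and nothing here is taken from the paper's own code.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one voxel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Three hypothetical voxels of a binary (background vs. CTV) segmentation.
confident = [0.99, 0.01]
leaning = [0.80, 0.20]
borderline = [0.50, 0.50]

for name, p in [("confident", confident), ("leaning", leaning), ("borderline", borderline)]:
    print(f"{name}: H = {predictive_entropy(p):.3f} nats")
```

Entropy is zero for a one-hot prediction and reaches its maximum, ln 2 for two classes, when the classes are equally likely, which is why voxels near a contested boundary tend to score highest.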

Core claim

Combining post-hoc temperature scaling with checkpoint-based ensembling in nnU-Net produces voxel-wise predictive entropy maps that align most strongly with segmentation errors, measured as AUC over the top 0-5 percent most uncertain voxels. Segmentation performance is unchanged, so the maps can support efficient, targeted manual QA under realistic revision budgets for complex cases such as TMLI.
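A minimal sketch of how the two ingredients might be combined for a single voxel, under assumptions not confirmed by the abstract: each checkpoint's logits are divided by one shared temperature, the resulting softmax outputs are averaged, and entropy is taken on the average. The logits and the temperature value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(member_logits, temperature=1.0):
    """Average the temperature-scaled softmax outputs of several checkpoints."""
    member_probs = [softmax(logits, temperature) for logits in member_logits]
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for ONE voxel from three checkpoints of the same training run.
checkpoint_logits = [[2.0, -1.0], [1.5, 0.0], [-0.5, 1.0]]
T = 1.5  # assumed value; in practice it would be fitted on held-out data

p_raw = ensemble_probs(checkpoint_logits)
p_cal = ensemble_probs(checkpoint_logits, temperature=T)
print("uncalibrated:", [round(p, 3) for p in p_raw], "H =", round(entropy(p_raw), 3))
print("calibrated:  ", [round(p, 3) for p in p_cal], "H =", round(entropy(p_cal), 3))
```

Because dividing logits by a positive temperature never changes which class has the largest logit for a given member, this kind of rescaling is consistent with the claim that segmentation accuracy stays essentially stable while the uncertainty values change.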

What carries the argument

The calibrated predictive entropy uncertainty map from nnU-Net checkpoint ensembles, which ranks voxels by uncertainty score to direct manual review toward the regions most likely to contain errors.
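The ranking step can be sketched as follows. The function name `top_budget_indices`, the flattened ten-voxel map, and the 30 percent budget are all hypothetical choices made to keep the example small; the paper's budgets are in the 0-5 percent range.

```python
def top_budget_indices(uncertainty, budget_fraction):
    """Indices of the most uncertain voxels that fit in the review budget."""
    n_review = max(1, round(budget_fraction * len(uncertainty)))
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    return ranked[:n_review]

# Hypothetical flattened uncertainty map (entropy per voxel) for 10 voxels.
entropy_map = [0.02, 0.65, 0.10, 0.01, 0.69, 0.30, 0.05, 0.03, 0.55, 0.04]
flagged = top_budget_indices(entropy_map, budget_fraction=0.30)
print(flagged)  # [4, 1, 8]
```

Only the ordering of the scores matters for this selection, so any monotone rescaling of the map would flag the same voxels.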

If this is right

  • Segmentation accuracy remains stable across all tested uncertainty configurations.
  • Temperature scaling substantially improves calibration metrics for the uncertainty estimates.
  • Uncertainty-error alignment reaches its highest value with calibrated checkpoint-based inference.
  • The resulting maps enable more consistent highlighting of regions that require manual edits under budget constraints.
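To make the calibration bullet concrete, here is a sketch of expected calibration error (ECE) with an optional region-of-interest mask. The abstract says only that calibration metrics are ROI-masked; the binning scheme, the mask semantics, and every number below are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, roi_mask=None, n_bins=10):
    """ECE: bin-size-weighted mean |accuracy - confidence| over equal-width bins."""
    if roi_mask is None:
        roi_mask = [True] * len(confidences)
    pairs = [(c, ok) for c, ok, m in zip(confidences, correct, roi_mask) if m]
    if not pairs:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for c, ok in pairs:
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / len(pairs) * abs(accuracy - mean_conf)
    return ece

# Hypothetical volume: 10 voxels near the target (8 correct) and 10 easy background voxels.
correct = [True] * 8 + [False] * 2 + [True] * 10
roi = [True] * 10 + [False] * 10
conf_raw = [0.99] * 20               # overconfident everywhere
conf_ts = [0.80] * 10 + [0.99] * 10  # softened inside the ROI (illustrative only)

ece_whole = expected_calibration_error(conf_raw, correct)
ece_roi_raw = expected_calibration_error(conf_raw, correct, roi)
ece_roi_ts = expected_calibration_error(conf_ts, correct, roi)
print(f"whole volume: {ece_whole:.3f}, ROI raw: {ece_roi_raw:.3f}, ROI softened: {ece_roi_ts:.3f}")
```

In this toy case the easy background voxels dilute the miscalibration when the whole volume is scored, which is one plausible reason to restrict the metric to a region of interest.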

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same calibration-plus-checkpoint strategy could be applied to other nnU-Net medical segmentation tasks to test whether the alignment benefit holds beyond TMLI.
  • If the maps prove reliable in prospective use, they could be paired with interactive editing tools to further reduce total clinician review time.
  • The approach leaves open whether the highlighted voxels also correlate with downstream clinical metrics such as dose planning accuracy.

Load-bearing premise

Once calibrated, voxel-wise predictive entropy maps track actual segmentation errors rather than other sources of variability, and AUC over the top 0-5 percent most uncertain voxels predicts real time savings during manual QA.

What would settle it

An experiment showing no gain in AUC for detecting segmentation errors within the top 0-5 percent most uncertain voxels when using calibrated checkpoint ensembles versus a single uncalibrated nnU-Net model would disprove the improved alignment.
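One way such a comparison could be scored is sketched below. The abstract does not define the AUC precisely, so this assumes a curve of "fraction of all error voxels captured" against "fraction of voxels reviewed", integrated by the trapezoid rule from 0 to 5 percent and normalized by the budget range. The function names and the 100-voxel toy data are hypothetical.

```python
def error_capture_curve(uncertainty, is_error, max_budget=0.05, steps=5):
    """Fraction of all errors found when reviewing the top-b most uncertain voxels."""
    n = len(uncertainty)
    ranked = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    total_errors = sum(is_error)
    budgets, captured = [], []
    for k in range(steps + 1):
        budget = max_budget * k / steps
        n_review = int(round(budget * n))
        hits = sum(is_error[i] for i in ranked[:n_review])
        budgets.append(budget)
        captured.append(hits / total_errors if total_errors else 0.0)
    return budgets, captured

def normalized_auc(x, y):
    """Trapezoidal area under y(x), divided by the x-range so the result lies in [0, 1]."""
    area = sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1))
    return area / (x[-1] - x[0])

# Toy volume of 100 voxels with strictly decreasing uncertainty and 4 error voxels.
uncertainty = [1.0 - i / 100 for i in range(100)]
errors_at_top = [1] * 4 + [0] * 96                # errors coincide with high uncertainty
errors_elsewhere = [0] * 50 + [1] * 4 + [0] * 46  # errors sit in mid-uncertainty voxels

auc_aligned = normalized_auc(*error_capture_curve(uncertainty, errors_at_top))
auc_misaligned = normalized_auc(*error_capture_curve(uncertainty, errors_elsewhere))
print(f"aligned: {auc_aligned:.2f}, misaligned: {auc_misaligned:.2f}")
```

Under this assumed definition, the falsifying outcome described above would show up as the calibrated checkpoint ensemble scoring no higher than the single uncalibrated model on the same held-out cases.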

Figures

Figures reproduced from arXiv: 2604.11798 by Damiano Dei, Daniele Loiacono, Lorenzo Mondo, Marta Scorsetti, Nicola Lambri, Pietro Mancosu, Ricardo Coimbra Brioso.

Figure 1: Budget-aware uncertainty visualization on three axial slices from a representative case (AUTOMI…

Original abstract

Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty--error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a budget-aware uncertainty-driven QA framework for nnU-Net-based radiotherapy segmentation of the Clinical Target Volume (CTV) in Total Marrow and Lymph Node Irradiation (TMLI). It compares temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), both individually and in combination, to produce voxel-wise predictive entropy uncertainty maps. The central claims are that segmentation accuracy remains stable across configurations, TS substantially improves calibration (via ROI-masked metrics), and calibrated checkpoint-based inference yields the strongest uncertainty-error alignment, quantified as AUC over the top 0-5% most uncertain voxels, thereby guiding targeted manual review under realistic budget constraints.

Significance. If the results hold, the work offers a practical route to safer clinical deployment of auto-segmentation tools by supplying uncertainty cues that better align with actual errors, potentially reducing manual QA workload in complex cases. The emphasis on efficient ensembling plus post-hoc calibration is a pragmatic strength for budget-aware workflows. However, confinement to a single use case and reliance on a voxel-wise metric without spatial validation limit immediate generalizability and clinical translation.

major comments (2)
  1. The uncertainty-error alignment is quantified solely via AUC on the top 0-5% most uncertain voxels (predictive entropy after calibration, as stated in the abstract). This voxel-independent selection does not address the contiguous, boundary-driven nature of TMLI segmentation errors (e.g., coherent shifts or lymph-node clusters); no connected-component or region-level analysis is provided to confirm that selected voxels correspond to edits a clinician would perform under budget constraints.
  2. Evaluation is restricted to a single use case (TMLI) with selection of the best configuration, introducing potential post-hoc bias. The abstract reports no quantitative AUC values, error bars, dataset details (size, splits), or full method descriptions, which are load-bearing for assessing the magnitude and reproducibility of the claimed improvements in alignment and calibration.
minor comments (1)
  1. The abstract writes 'uncertainty--error' (en-dash) in one place and 'Uncertainty-error' (hyphen) in another; use one form consistently across all sections and figures.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, making revisions where they strengthen the work without misrepresenting our results. Our responses focus on clarifying the rationale for our evaluation choices while incorporating additional analyses and reporting improvements.

Point-by-point responses
  1. Referee: The uncertainty-error alignment is quantified solely via AUC on the top 0-5% most uncertain voxels (predictive entropy after calibration, as stated in the abstract). This voxel-independent selection does not address the contiguous, boundary-driven nature of TMLI segmentation errors (e.g., coherent shifts or lymph-node clusters); no connected-component or region-level analysis is provided to confirm that selected voxels correspond to edits a clinician would perform under budget constraints.

    Authors: We agree that our primary metric (AUC on the top 0-5% uncertain voxels) is voxel-wise and does not explicitly model spatial contiguity, which is a valid point given the boundary-driven and clustered nature of TMLI errors. The voxel-level focus was chosen because budget-aware QA prioritizes identifying individual high-uncertainty locations for efficient manual review, where clinicians can address small contiguous groups without needing full region segmentation upfront. Nevertheless, to better demonstrate clinical relevance, we have added a supplementary connected-component analysis in the revised manuscript. This quantifies the clustering of top uncertain voxels, their overlap with error regions, and provides visual examples of boundary shifts and lymph-node clusters, confirming that selected voxels frequently align with coherent error areas a clinician would edit. revision: yes

  2. Referee: Evaluation is restricted to a single use case (TMLI) with selection of the best configuration, introducing potential post-hoc bias. The abstract reports no quantitative AUC values, error bars, dataset details (size, splits), or full method descriptions, which are load-bearing for assessing the magnitude and reproducibility of the claimed improvements in alignment and calibration.

    Authors: We acknowledge the single-use-case limitation as restricting broader generalizability, which we already note in the discussion as a direction for future multi-site validation; TMLI was selected as a representative complex case but cannot be expanded here without additional data. To address post-hoc bias, the revised manuscript reports full results for all configurations (TS, DE, CE, TTA, and combinations) rather than only the best, with selection justified on validation performance. We have also revised the abstract to include quantitative AUC values for key configurations, mention of error bars from the figures, dataset details (patient count and train/validation/test splits), and a brief methods overview, while retaining full descriptions in the main text for reproducibility. revision: partial
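The connected-component analysis discussed in the first response could look roughly like the sketch below: group the flagged voxels into spatially connected clusters, then measure how much of each cluster overlaps true errors. The 2D slice, the 4-connectivity choice, and the overlap measure are illustrative assumptions, not the analysis in the simulated rebuttal.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected components of a 2D 0/1 grid; returns a list of pixel sets."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                queue, comp = deque([(r, c)]), set()
                seen.add((r, c))
                while queue:  # breadth-first flood fill from the seed pixel
                    y, x = queue.popleft()
                    comp.add((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                components.append(comp)
    return components

# Hypothetical 2D slice: flagged = top-uncertainty voxels, errors = true mistakes.
flagged = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
]
errors = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
]

components = connected_components(flagged)
overlaps = [sum(errors[y][x] for y, x in comp) / len(comp) for comp in components]
for comp, overlap in zip(components, overlaps):
    print(f"size={len(comp)} error_overlap={overlap:.2f}")
```

Isolated single-voxel clusters with zero overlap, like the second component here, are the kind of flag a clinician would likely ignore, which is the gap the referee's first comment points at.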

Circularity Check

0 steps flagged

No significant circularity detected

The paper applies standard uncertainty quantification techniques (predictive entropy, temperature scaling, deep/checkpoint ensembles, TTA) to nnU-Net and evaluates them using external, non-circular metrics: ROI-masked calibration and AUC on the top 0-5% most uncertain voxels for uncertainty-error alignment. These quantities are computed from held-out data and do not reduce by construction to the paper's fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the derivation chain remains self-contained against independent benchmarks.
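The one fitted quantity in such a pipeline is the temperature. As a hedged sketch of why it need not create circularity, the code below fits a temperature on a small validation set by minimizing negative log-likelihood; evaluation would then use separate held-out cases. A simple grid search stands in for the gradient-based optimizers usually used, and the logits and labels are invented.

```python
import math

def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels under temperature-scaled softmax."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_list, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL (grid search, an assumption)."""
    if grid is None:
        grid = [0.5 + 0.1 * k for k in range(46)]  # candidate values from 0.5 to 5.0
    return min(grid, key=lambda t: nll(logits_list, labels, t))

# Hypothetical validation voxels: overconfident logits, one of four predictions is wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
print(f"fitted T = {T:.1f}")
print(f"NLL at T=1: {nll(val_logits, val_labels, 1.0):.3f}, at fitted T: {nll(val_logits, val_labels, T):.3f}")
```

With 75 percent accuracy at a logit gap of 4, the analytic optimum is T = 4 / ln 3, about 3.64, so the grid search should land next to that value.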

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or detailed axioms stated. Implicit domain assumptions include reliability of nnU-Net base models and validity of entropy as uncertainty proxy.

axioms (2)
  • domain assumption nnU-Net base segmentations are sufficiently accurate for TMLI to serve as a starting point for QA
    Framework assumes nnU-Net performance is stable and focuses on uncertainty overlay rather than improving base accuracy.
  • domain assumption Predictive entropy from model outputs is a suitable proxy for segmentation error likelihood
    Central to generating voxel-wise uncertainty maps used for QA prioritization.

pith-pipeline@v0.9.0 · 5544 in / 1378 out tokens · 62919 ms · 2026-05-10T15:10:06.278554+00:00 · methodology
