Pith · machine review for the scientific record

arxiv: 2605.05775 · v2 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords lesion segmentation · PET/CT · compositional generalization · multitracer · autoPET challenge · nnU-Net · Dice score · whole-body imaging

The pith

The autoPET3 challenge shows in-domain multitracer PET/CT lesion segmentation approaches reader agreement while compositional generalization to unseen tracer-center combinations fails due to systematic volume overestimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports the design and outcomes of the third autoPET challenge, which tested automated lesion segmentation on whole-body PET/CT scans in a compositional generalization setting. Training used 1,014 FDG studies from one center and 597 PSMA studies from another, creating the largest public annotated PSMA dataset, while the 200-study test set included two entirely new tracer-center pairings. Seventeen teams submitted 27 mostly nnU-Net-based algorithms, with the winner reaching a mean Dice score of 0.66, 3.18 mL false-negative volume, and 2.78 mL false-positive volume. Analysis at patient and lesion levels supports three conclusions: in-domain performance is already reliable, cross-domain generalization remains limited by overestimation, and case heterogeneity matters more than algorithmic differences among top entries.

Core claim

The top-ranked algorithm achieved a mean DSC of 0.66, an FNV of 3.18 mL, and an FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing false-negative volume by 5 mL relative to the baseline. The key insights: in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; compositional generalization to unseen tracer-center combinations remains an open problem driven mainly by systematic volume overestimation; and heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
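As a concrete reading of the headline metrics, here is a minimal sketch of how DSC, FNV, and FPV can be computed from binary segmentation masks. The connected-component definition of FNV/FPV (volume of reference lesions the prediction misses entirely, and of predicted components with no reference overlap) follows the convention of earlier autoPET editions and is an assumption here, not a quote of this paper's protocol.

```python
import numpy as np
from scipy import ndimage

def dsc(pred, ref):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * inter / denom if denom else 1.0

def false_neg_volume(pred, ref, voxel_ml):
    """Volume (mL) of reference lesions the prediction misses entirely."""
    labels, n = ndimage.label(ref)  # connected components of the reference
    vol = 0
    for i in range(1, n + 1):
        lesion = labels == i
        if not np.logical_and(lesion, pred).any():  # no overlap at all
            vol += lesion.sum()
    return vol * voxel_ml

def false_pos_volume(pred, ref, voxel_ml):
    """Volume (mL) of predicted components with no reference overlap."""
    # Symmetric to FNV with the roles of prediction and reference swapped.
    return false_neg_volume(ref, pred, voxel_ml)
```

With a per-voxel volume in mL, these three numbers reproduce the patient-level view of the leaderboard metrics under the stated assumptions.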

What carries the argument

The compositional generalization benchmark design: a held-out test set of 200 studies covering four tracer-center combinations, two of which are unseen pairings, used to evaluate segmentation algorithms under multitracer, multicenter conditions.

If this is right

  • In-domain multitracer PET/CT segmentation can be treated as reliable for many practical applications.
  • Compositional generalization requires targeted improvements focused on reducing systematic lesion volume overestimation in novel tracer-center settings.
  • Performance differences among leading methods are smaller than the effects of case difficulty and data heterogeneity.
  • The provided training data, including the large new PSMA PET/CT collection, can serve as a standard resource for developing better generalization methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether domain-adaptation techniques or tracer-specific augmentations close the gap on unseen combinations.
  • Adding explicit difficulty scoring for each case might allow models to flag low-confidence segmentations in heterogeneous populations.
  • Expanding the test set with additional unseen combinations from more centers would strengthen claims about the limits of current generalization.

Load-bearing premise

The held-out test set of 200 studies with four specific tracer-center combinations accurately captures real-world compositional generalization without unaccounted biases in lesion distribution, annotation quality, or other confounding factors.

What would settle it

A follow-up experiment that measures inter-reader Dice agreement on the same 200 test studies and finds it substantially below 0.66 would challenge the claim that in-domain performance is already near reader agreement.

Figures

Figures reproduced from arXiv: 2605.05775 by Alexander Jaus, Andreas Mittermeier, Anna Theresa Stüber, Balthasar Schachtner, Christian La Fougère, Clemens C. Cyran, Constantin M. Seibold, Fabian Isensee, Gizem Abaci, Hamza Kalisch, Hussain Alasmawi, Jakob Dexl, Jens Kleesiek, Jens Ricke, Johanna Topalis, Katharina Jeblick, Klaus H. Maier-Hein, Konstantin Nikolaou, Lalith Kumar Shiyam Sundar, Lap Yan Lennon Chan, Matthias P. Fabritius, Maurice Heimer, Maximilian Rokuss, Michael Ingrisch, Pauline Ornela Megne Choudja, Rainer Stiefelhagen, Rudolf A. Werner, Sergios Gatidis, Thomas Küstner, Yixuan Yuan.

Figure 1
Figure 1. Figure 1 view at source ↗
Figure 2
Figure 2. Figure 2: Representative PSMA LMU PET/CT case shown in three orthogonal planes (coronal, sagittal, axial). Top row: CT images displayed with a window of [400, 1800] Hounsfield units. Bottom row: corresponding PET images displayed as SUV with a window of [0, 10]. Red overlays indicate manual lesion annotations. Expert review or adjustment is advised to ensure accurate delineation. Similarly, Hatt et al. (2017) emph… view at source ↗
Figure 3
Figure 3. Figure 3: Performance of all 29 submitted algorithms across the four test conditions, ordered left to right by final leaderboard position. Each column shows the results for one algorithm as a boxplot; rows show the DSC (top), FNV (middle), and FPV (bottom). FPV and FNV are displayed on a symmetric logarithmic scale. Individual test cases are shown as points, colored by domain. Stars indicate the mean across all d… view at source ↗
Figure 4
Figure 4. Figure 4: Bootstrap ranking stability analysis (n=2,000) of all submitted algorithms, shown as violin plots. Lower rank indicates better performance. A clear performance gap is visible after the data-centric baseline, separating algorithms into two tiers. Among the top-performing group, LesionTracer A achieves the most consistent top ranking, while several mid-tier algorithms show overlapping rank distributions, ind… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of the top-ranked algorithm (LesionTracer A, n=50) with second-reader agreement (n=25) in terms of DSC (top), false-negative volume (middle), and false-positive volume (bottom, both in mL, symlog scale) across the four test conditions. Reader subsets were drawn from the same distribution but are not identical to the algorithm test sets, and reading protocols varied across conditions (see main t… view at source ↗
Figure 7
Figure 7. Figure 7: Volume analysis across all datasets. (A) Per-algorithm absolute volume difference (predicted − reference, mL) displayed on a symmetric log scale. Large black circles indicate the median, stars denote the mean, and individual colored dots represent per-case differences. Algorithms oversegment in the composite datasets. (B) Relative volume agreement for the top-18 algorithms, shown as (pred + ϵ)/(ref + ϵ) pl… view at source ↗
Figure 8
Figure 8. Figure 8: Lesion detection sensitivity as a function of the IoU threshold τ. The left end of the abscissa corresponds to the one-voxel criterion; the right end to τ = 0.5, analogous to the recognition in panoptic quality (Kirillov et al., 2019), which enforces one-to-one matching. The figure is based on an assignment strategy that does not penalize multi-assignment. Top 3 teams are highlighted. 0.74 and declines mor… view at source ↗
Figure 9
Figure 9. Figure 9: Lesion detection sensitivity stratified by volume deciles (A) and SUVmax deciles (B) across the four test conditions. Box plots summarize the distribution of per-algorithm sensitivity within each bin; the black line indicates the top-ranked team (LesionTracer). Detection sensitivity increases with both lesion volume and tracer uptake across all conditions. view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative comparison of segmentation predictions from four algorithms (LesionTracer A, IKIM A, HussainAlasmawi A, AiraMatrix A) against the reference annotation on four representative failure cases, shown as coronal maximum intensity projections. Algorithm predictions are shown as colored contours overlaid on the reference (red). Metrics are reported for each algorithm. Failure modes are marked by arrow… view at source ↗
Figure 11
Figure 11. Figure 11: To visualize where lesions, false positives, and false negatives are located across the test sets, we registered all cases to a common reference space. We first resampled every case to the median voxel spacing, then chose the largest-volume case as the reference and padded it to fit all anatomies. Elastic registration was performed using organ masks from TotalSegmentator on the CT images. The same transfo… view at source ↗
read the original abstract

We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports the design and results of the autoPET3 MICCAI 2024 challenge on automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data include 1,014 FDG studies from Tübingen and 597 PSMA studies from Munich (the largest public PSMA dataset); the 200-study held-out test set covers four tracer-center combinations with two unseen pairings. Seventeen teams submitted 27 mostly nnU-Net-based algorithms; the top entry reached mean DSC 0.66, FNV 3.18 mL and FPV 2.78 mL, with stable ranking under bootstrap resampling. The paper concludes that in-domain multitracer performance is sufficient and near reader agreement, that compositional generalization to unseen combinations remains open (driven by systematic volume overestimation), and that case heterogeneity dominates algorithmic differences among top teams.

Significance. If the empirical findings hold, the work supplies a large-scale public benchmark and the largest annotated PSMA PET/CT dataset, demonstrating that in-domain multitracer segmentation is practically viable while compositional generalization across unseen tracer-center pairs is limited by volume bias. The patient- and lesion-level analysis, data-centric award track, and bootstrap stability checks provide concrete evidence that heterogeneity and case difficulty outweigh algorithmic choice, offering actionable guidance for future multitracer, multicenter PET/CT segmentation research.
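The bootstrap ranking stability check mentioned above (n=2,000 in Figure 4) can be sketched as follows, assuming per-case metric values are available for each algorithm. Resampling test cases with replacement and re-ranking on the resampled mean is one standard protocol for this kind of analysis, not necessarily the challenge's exact scheme.

```python
import numpy as np

def bootstrap_rank_stability(scores, n_boot=2000, seed=0):
    """Rank distribution of algorithms under case-level bootstrap.

    scores: (n_algorithms, n_cases) array of per-case metric values,
            higher is better (e.g. per-case Dice).
    Returns an (n_algorithms, n_boot) array of ranks (1 = best).
    """
    rng = np.random.default_rng(seed)
    n_alg, n_cases = scores.shape
    ranks = np.empty((n_alg, n_boot), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n_cases, n_cases)  # resample cases w/ replacement
        means = scores[:, idx].mean(axis=1)
        order = np.argsort(-means)               # best algorithm first
        ranks[order, b] = np.arange(1, n_alg + 1)
    return ranks
```

Plotting each row's rank distribution as a violin reproduces the style of the paper's Figure 4; a narrow distribution pinned at rank 1 is what "stable top ranking" means operationally.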

major comments (2)
  1. [test-set description and results analysis] The central claim that compositional generalization to unseen tracer-center combinations remains an open problem, driven mainly by systematic volume overestimation, rests on performance across the four held-out test conditions. The manuscript does not report explicit statistical matching or controls for lesion-size/number distributions, uptake statistics, or annotation-protocol differences between training and the unseen test combinations; without these, the observed overestimation could reflect test-set mismatch rather than a fundamental generalization failure (see the test-set description and the patient/lesion-level analysis sections).
  2. [methods and results] The claim that in-domain multitracer segmentation is sufficient and probably approaching reader agreement is load-bearing for conclusion (1), yet the manuscript provides only limited detail on exact metric computation (DSC, FNV, FPV at patient vs. lesion level), inter-annotator variability, and annotation protocols; this weakens the ability to interpret how close the top DSC of 0.66 actually is to human performance.
minor comments (2)
  1. [challenge design] The data-centric award category is introduced to isolate data-handling strategies, but the manuscript does not clearly state the fixed baseline model architecture or the precise rules participants followed when submitting under this track.
  2. [figures and tables] Figure and table captions could more explicitly indicate whether reported volumes are aggregated across all four test conditions or broken down by seen vs. unseen combinations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with point-by-point responses and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [test-set description and results analysis] The central claim that compositional generalization to unseen tracer-center combinations remains an open problem, driven mainly by systematic volume overestimation, rests on performance across the four held-out test conditions. The manuscript does not report explicit statistical matching or controls for lesion-size/number distributions, uptake statistics, or annotation-protocol differences between training and the unseen test combinations; without these, the observed overestimation could reflect test-set mismatch rather than a fundamental generalization failure (see the test-set description and the patient/lesion-level analysis sections).

    Authors: We agree that explicit statistical comparisons of lesion-size/number distributions, uptake statistics, and annotation protocols between training and test subsets would help rule out confounding factors. The held-out test set was constructed to cover representative cases from all four tracer-center combinations (including the two unseen pairings) while maintaining clinical realism. The patient- and lesion-level analyses already show that volume overestimation is systematically higher in the unseen conditions than in the seen ones, supporting a generalization interpretation. To directly address the concern, we will add a supplementary table reporting key statistics (mean lesion volume, lesions per case, SUVmax distributions) across training subsets and each test condition, along with brief discussion of any observed shifts. revision: yes
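The supplementary comparison promised above could take a form like the sketch below, assuming per-lesion statistics (volume, lesions per case, SUVmax) are available as arrays for each training subset and test condition. The two-sample Kolmogorov-Smirnov test is an illustrative choice for quantifying distribution shift, not the authors' stated method.

```python
import numpy as np
from scipy import stats

def distribution_shift_report(train_vals, test_vals, name):
    """Two-sample KS test plus summary stats for one lesion statistic."""
    stat, p = stats.ks_2samp(train_vals, test_vals)
    return {
        "statistic": name,
        "train_mean": float(np.mean(train_vals)),
        "test_mean": float(np.mean(test_vals)),
        "ks": float(stat),        # 0 = identical empirical distributions
        "p_value": float(p),      # small p suggests a genuine shift
    }
```

Running this per tracer-center condition for volume, count, and SUVmax would make the "test-set mismatch vs. generalization failure" question empirically checkable.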

  2. Referee: [methods and results] The claim that in-domain multitracer segmentation is sufficient and probably approaching reader agreement is load-bearing for conclusion (1), yet the manuscript provides only limited detail on exact metric computation (DSC, FNV, FPV at patient vs. lesion level), inter-annotator variability, and annotation protocols; this weakens the ability to interpret how close the top DSC of 0.66 actually is to human performance.

    Authors: We will expand the methods section to provide precise definitions and computation details for DSC, FNV, and FPV at both patient and lesion levels, including how lesions are matched for per-lesion metrics. We will also add a more detailed description of the annotation protocols used for the FDG and PSMA datasets. Inter-annotator variability was not quantified as part of the challenge (annotations followed standardized clinical guidelines by experienced nuclear medicine readers). We will revise the relevant conclusion to qualify the statement, noting that while direct inter-reader metrics are unavailable, the small absolute error volumes and the lesion-level analysis (showing most discrepancies involve small or low-uptake lesions) indicate performance approaching practical clinical utility. This revision will be partial, as we cannot retroactively add inter-annotator numbers. revision: partial
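One plausible reading of the lesion-level matching discussed here, consistent with the IoU-threshold analysis in the paper's Figure 8, is sketched below. Matching each reference lesion to its best-overlapping predicted component, without penalizing multi-assignment, is an assumption about the protocol rather than the authors' confirmed definition.

```python
import numpy as np
from scipy import ndimage

def lesion_detection_sensitivity(pred, ref, tau):
    """Fraction of reference lesions detected at IoU >= tau.

    A tiny tau approximates the one-voxel overlap criterion; tau = 0.5
    approaches the one-to-one matching used in panoptic quality.
    """
    ref_lab, n_ref = ndimage.label(ref)
    pred_lab, n_pred = ndimage.label(pred)
    if n_ref == 0:
        return float("nan")
    hits = 0
    for i in range(1, n_ref + 1):
        lesion = ref_lab == i
        best = 0.0
        for j in range(1, n_pred + 1):  # best-overlap match, multi-assignment OK
            comp = pred_lab == j
            inter = np.logical_and(lesion, comp).sum()
            if inter == 0:
                continue
            union = np.logical_or(lesion, comp).sum()
            best = max(best, inter / union)
        if best > 0 and best >= tau:
            hits += 1
    return hits / n_ref
```

Sweeping tau from near zero to 0.5 yields the sensitivity-vs-threshold curve of Figure 8 for a single prediction/reference pair.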

Circularity Check

0 steps flagged

Empirical challenge report with no derivations or self-referential logic

full rationale

The paper is a direct empirical report of a MICCAI challenge benchmark, presenting observed DSC, FNV, and FPV metrics across held-out test conditions and drawing conclusions from those measurements. No mathematical derivations, parameter fits presented as predictions, ansatzes, or uniqueness theorems appear in the provided text. Conclusions (1)-(3) follow from tabulated participant results and bootstrap stability checks rather than reducing to any input by construction. The analysis is self-contained against the challenge's external test set and participant submissions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and challenge report. No free parameters are fitted by the authors, no domain axioms beyond standard statistical assumptions for metric calculation are invoked, and no new entities are postulated.

pith-pipeline@v0.9.0 · 5776 in / 1360 out tokens · 97064 ms · 2026-05-12T03:30:34.541470+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

