Recognition: 2 theorem links
The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization
Pith reviewed 2026-05-12 03:30 UTC · model grok-4.3
The pith
The autoPET3 challenge shows in-domain multitracer PET/CT lesion segmentation approaches reader agreement while compositional generalization to unseen tracer-center combinations fails due to systematic volume overestimation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing false-negative volume by 5 mL relative to the baseline. Three insights follow: in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; compositional generalization to unseen tracer-center combinations remains an open problem, driven mainly by systematic volume overestimation; and heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
What carries the argument
The compositional generalization benchmark design using a held-out test set of 200 studies covering four tracer-center combinations, two of which are unseen pairings, to evaluate segmentation algorithms under multitracer multicentric conditions.
If this is right
- In-domain multitracer PET/CT segmentation can be treated as reliable for many practical applications.
- Compositional generalization requires targeted improvements focused on reducing systematic lesion volume overestimation in novel tracer-center settings.
- Performance differences among leading methods are smaller than the effects of case difficulty and data heterogeneity.
- The provided training data, including the large new PSMA PET/CT collection, can serve as a standard resource for developing better generalization methods.
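The volume-overestimation point can be made concrete with a signed volume error per test condition. The following is a minimal sketch, not the challenge's evaluation code: the `Case` record, the condition labels, and all numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Case:
    condition: str         # hypothetical label, e.g. "seen" or "unseen"
    ref_volume_ml: float   # annotated total lesion volume
    pred_volume_ml: float  # predicted total lesion volume


def signed_volume_bias(cases):
    """Mean signed volume error (prediction minus reference) per condition, in mL.

    Positive values indicate systematic overestimation.
    """
    errors = defaultdict(list)
    for case in cases:
        errors[case.condition].append(case.pred_volume_ml - case.ref_volume_ml)
    return {condition: mean(values) for condition, values in errors.items()}


# Made-up numbers purely for illustration, not challenge results.
cases = [
    Case("seen", 10.0, 10.5),
    Case("seen", 20.0, 19.5),
    Case("unseen", 10.0, 14.0),
    Case("unseen", 20.0, 26.0),
]
bias = signed_volume_bias(cases)
```

A signed error is used rather than an absolute one because over- and under-segmentation cancel in an unbiased model, so a clearly positive mean in one condition points to a systematic shift rather than noise.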
Where Pith is reading between the lines
- Future work could test whether domain-adaptation techniques or tracer-specific augmentations close the gap on unseen combinations.
- Adding explicit difficulty scoring for each case might allow models to flag low-confidence segmentations in heterogeneous populations.
- Expanding the test set with additional unseen combinations from more centers would strengthen claims about the limits of current generalization.
Load-bearing premise
The held-out test set of 200 studies with four specific tracer-center combinations accurately captures real-world compositional generalization without unaccounted biases in lesion distribution, annotation quality, or other confounding factors.
What would settle it
A follow-up experiment that measures inter-reader Dice agreement on the same 200 test studies and finds it substantially below 0.66 would challenge the claim that in-domain performance is already near reader agreement.
Original abstract
We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the design and results of the autoPET3 MICCAI 2024 challenge on automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data include 1,014 FDG studies from Tübingen and 597 PSMA studies from Munich (the largest public PSMA dataset); the 200-study held-out test set covers four tracer-center combinations with two unseen pairings. Seventeen teams submitted 27 mostly nnU-Net-based algorithms; the top entry reached mean DSC 0.66, FNV 3.18 mL and FPV 2.78 mL, with stable ranking under bootstrap resampling. The paper concludes that in-domain multitracer performance is sufficient and near reader agreement, that compositional generalization to unseen combinations remains open (driven by systematic volume overestimation), and that case heterogeneity dominates algorithmic differences among top teams.
Significance. If the empirical findings hold, the work supplies a large-scale public benchmark and the largest annotated PSMA PET/CT dataset, demonstrating that in-domain multitracer segmentation is practically viable while compositional generalization across unseen tracer-center pairs is limited by volume bias. The patient- and lesion-level analysis, data-centric award track, and bootstrap stability checks provide concrete evidence that heterogeneity and case difficulty outweigh algorithmic choice, offering actionable guidance for future multitracer, multicenter PET/CT segmentation research.
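A bootstrap stability check of the kind mentioned above can be sketched as follows. This is an illustration under stated assumptions: algorithms are ranked by mean per-case score, which may differ from the challenge's actual ranking scheme, and the team names and scores are invented.

```python
import random
from statistics import mean


def rank_by_mean(scores, indices):
    """Order algorithm names from best to worst mean score on the given cases."""
    means = {name: mean(vals[i] for i in indices) for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def bootstrap_top_frequency(scores, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which each algorithm ranks first.

    `scores` maps an algorithm name to its per-case scores (higher is better);
    every list must cover the same cases in the same order.
    """
    rng = random.Random(seed)
    n_cases = len(next(iter(scores.values())))
    wins = {name: 0 for name in scores}
    for _ in range(n_boot):
        # Resample cases with replacement, keeping the pairing across algorithms.
        indices = [rng.randrange(n_cases) for _ in range(n_cases)]
        wins[rank_by_mean(scores, indices)[0]] += 1
    return {name: count / n_boot for name, count in wins.items()}


# Made-up per-case Dice scores for three hypothetical submissions.
scores = {
    "team_a": [0.70, 0.65, 0.80, 0.60, 0.75],
    "team_b": [0.60, 0.55, 0.70, 0.50, 0.65],
    "team_c": [0.30, 0.35, 0.40, 0.20, 0.45],
}
top_freq = bootstrap_top_frequency(scores, n_boot=200)
```

Resampling cases rather than algorithms keeps each case's scores paired across submissions, so the check reflects how sensitive the ranking is to the particular test cases drawn.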
major comments (2)
- [test-set description and results analysis] The central claim that compositional generalization to unseen tracer-center combinations remains an open problem, driven mainly by systematic volume overestimation, rests on performance across the four held-out test conditions. The manuscript does not report explicit statistical matching or controls for lesion-size/number distributions, uptake statistics, or annotation-protocol differences between training and the unseen test combinations; without these, the observed overestimation could reflect test-set mismatch rather than a fundamental generalization failure (see the test-set description and the patient/lesion-level analysis sections).
- [methods and results] The claim that in-domain multitracer segmentation is sufficient and probably approaching reader agreement is load-bearing for conclusion (1), yet the manuscript provides only limited detail on exact metric computation (DSC, FNV, FPV at patient vs. lesion level), inter-annotator variability, and annotation protocols; this weakens the ability to interpret how close the top DSC of 0.66 actually is to human performance.
minor comments (2)
- [challenge design] The data-centric award category is introduced to isolate data-handling strategies, but the manuscript does not clearly state the fixed baseline model architecture or the precise rules participants followed when submitting under this track.
- [figures and tables] Figure and table captions could more explicitly indicate whether reported volumes are aggregated across all four test conditions or broken down by seen vs. unseen combinations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with point-by-point responses and indicate where revisions will be made to strengthen the manuscript.
Referee: [test-set description and results analysis] The central claim that compositional generalization to unseen tracer-center combinations remains an open problem, driven mainly by systematic volume overestimation, rests on performance across the four held-out test conditions. The manuscript does not report explicit statistical matching or controls for lesion-size/number distributions, uptake statistics, or annotation-protocol differences between training and the unseen test combinations; without these, the observed overestimation could reflect test-set mismatch rather than a fundamental generalization failure (see the test-set description and the patient/lesion-level analysis sections).
Authors: We agree that explicit statistical comparisons of lesion-size/number distributions, uptake statistics, and annotation protocols between training and test subsets would help rule out confounding factors. The held-out test set was constructed to cover representative cases from all four tracer-center combinations (including the two unseen pairings) while maintaining clinical realism. The patient- and lesion-level analyses already show that volume overestimation is systematically higher in the unseen conditions than in the seen ones, supporting a generalization interpretation. To directly address the concern, we will add a supplementary table reporting key statistics (mean lesion volume, lesions per case, SUVmax distributions) across training subsets and each test condition, along with brief discussion of any observed shifts. revision: yes
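The supplementary statistics proposed in this response could be tabulated along the following lines. This is a sketch only: the record keys (`condition`, `case_id`, `volume_ml`, `suv_max`), the condition names, and the values are hypothetical, and the median is used for SUVmax as one possible summary of its distribution.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_condition(lesions):
    """Per-condition lesion statistics.

    `lesions` is an iterable of dicts with the hypothetical keys
    'condition', 'case_id', 'volume_ml', and 'suv_max'.
    Cases without any lesion do not appear in this input and are
    therefore not counted in 'lesions_per_case'.
    """
    grouped = defaultdict(list)
    for lesion in lesions:
        grouped[lesion["condition"]].append(lesion)

    table = {}
    for condition, rows in grouped.items():
        n_cases = len({row["case_id"] for row in rows})
        table[condition] = {
            "mean_lesion_volume_ml": mean(row["volume_ml"] for row in rows),
            "lesions_per_case": len(rows) / n_cases,
            "median_suv_max": median(row["suv_max"] for row in rows),
        }
    return table


# Made-up lesion records purely for illustration.
lesions = [
    {"condition": "train_fdg", "case_id": "a", "volume_ml": 2.0, "suv_max": 8.0},
    {"condition": "train_fdg", "case_id": "a", "volume_ml": 4.0, "suv_max": 12.0},
    {"condition": "train_fdg", "case_id": "b", "volume_ml": 6.0, "suv_max": 10.0},
    {"condition": "test_unseen", "case_id": "c", "volume_ml": 1.0, "suv_max": 5.0},
]
table = summarize_by_condition(lesions)
```

Comparing these rows between the training subsets and each test condition is what would show whether lesion size, lesion count, or uptake shifted between them.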
Referee: [methods and results] The claim that in-domain multitracer segmentation is sufficient and probably approaching reader agreement is load-bearing for conclusion (1), yet the manuscript provides only limited detail on exact metric computation (DSC, FNV, FPV at patient vs. lesion level), inter-annotator variability, and annotation protocols; this weakens the ability to interpret how close the top DSC of 0.66 actually is to human performance.
Authors: We will expand the methods section to provide precise definitions and computation details for DSC, FNV, and FPV at both patient and lesion levels, including how lesions are matched for per-lesion metrics. We will also add a more detailed description of the annotation protocols used for the FDG and PSMA datasets. Inter-annotator variability was not quantified as part of the challenge (annotations followed standardized clinical guidelines by experienced nuclear medicine readers). We will revise the relevant conclusion to qualify the statement, noting that while direct inter-reader metrics are unavailable, the small absolute error volumes and the lesion-level analysis (showing most discrepancies involve small or low-uptake lesions) indicate performance approaching practical clinical utility. This revision will be partial, as we cannot retroactively add inter-annotator numbers. revision: partial
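Since the exact metric definitions are the point at issue, one plausible reading can be written down explicitly. The sketch below assumes a component-based interpretation: FPV is the volume of predicted connected components that overlap no reference voxel, and FNV is the volume of reference components that overlap no predicted voxel. The 6-connectivity, the empty-mask convention for DSC, and the voxel volume are assumptions, not details confirmed by the manuscript.

```python
from collections import deque

# 6-connectivity: voxels sharing a face are neighbours (an assumption).
NEIGHBOUR_OFFSETS = [
    (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1),
]


def connected_components(voxels):
    """Split a set of (z, y, x) voxel coordinates into connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOUR_OFFSETS:
                neighbour = (z + dz, y + dy, x + dx)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def dice(pred, ref):
    """Dice similarity coefficient between two voxel sets."""
    if not pred and not ref:
        return 1.0  # convention chosen here for two empty masks
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def false_positive_volume(pred, ref, voxel_volume_ml):
    """Volume of predicted components that overlap no reference voxel."""
    return voxel_volume_ml * sum(
        len(c) for c in connected_components(pred) if not (c & ref)
    )


def false_negative_volume(pred, ref, voxel_volume_ml):
    """Volume of reference components that overlap no predicted voxel."""
    return voxel_volume_ml * sum(
        len(c) for c in connected_components(ref) if not (c & pred)
    )


# Toy masks: one detected lesion, one missed lesion, one spurious prediction.
ref = {(0, 0, 0), (0, 0, 1), (5, 5, 5)}
pred = {(0, 0, 1), (0, 0, 2), (9, 9, 9), (9, 9, 8)}
dsc = dice(pred, ref)
fpv = false_positive_volume(pred, ref, voxel_volume_ml=0.5)
fnv = false_negative_volume(pred, ref, voxel_volume_ml=0.5)
```

Under this reading, a partially overlapping lesion lowers DSC but contributes to neither FPV nor FNV, which is why the three metrics are reported together and why the matching rule needs to be stated precisely.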
Circularity Check
Empirical challenge report with no derivations or self-referential logic
The paper is a direct empirical report of a MICCAI challenge benchmark, presenting observed DSC, FNV, and FPV metrics across held-out test conditions and drawing conclusions from those measurements. No mathematical derivations, parameter fits presented as predictions, ansatzes, or uniqueness theorems appear in the provided text. Conclusions (1)-(3) follow from tabulated participant results and bootstrap stability checks rather than reducing to any input by construction. The analysis is self-contained against the challenge's external test set and participant submissions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.