arXiv: 2606.28392 · v1 · pith:EZQOAKGSnew · submitted 2026-06-23 · 💻 cs.CV · cs.AI · cs.LG

RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning

Pith reviewed 2026-06-30 10:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords PET/CT lesion segmentation · large language models · reinforcement learning · false positive reduction · oncology imaging · radiology reports · voxel-level segmentation · candidate classification

The pith

Coupling permissive voxel segmentation with LLM reasoning on textual uptake descriptions reduces false positives in PET/CT lesion detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that a hybrid system can improve lesion segmentation accuracy by first running a high-sensitivity image model to catch all possible uptake regions, then turning those regions into text summaries of intensity, shape, and location for an LLM to judge which ones are real lesions. The LLM can also read the radiology report for extra context, and the authors train it further with reinforcement learning to reward accurate classifications and correct anatomical assignments. A reader would care because physiologic uptake and artifacts often fool pure image models, producing errors that affect cancer staging and treatment decisions. If the claim holds, the added reasoning layer could make automated tools match how clinicians actually interpret scans rather than relying on pixel patterns alone.

Core claim

RADIANT-PET generates candidate uptake regions with a deliberately permissive voxel-level segmentation stage, converts each candidate into a structured textual description covering uptake intensity, morphology, and anatomical context, then uses an LLM to classify the candidate as a true lesion versus a physiologic or artifactual false positive, optionally incorporating the radiology report, and further optimizes the LLM with Group Relative Policy Optimization to reward correct lesion classification and site assignment; the resulting system outperforms image-only baselines on AutoPET and an OSU cohort, with largest gains when reports are supplied.
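
The text-conversion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `describe_candidate` helper, and the choice of summary statistics are assumptions made for illustration. The source only says that each candidate becomes a structured description of uptake intensity, morphology, and anatomical context.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CandidateDescription:
    """Hypothetical schema; the paper's actual fields are not given here."""
    candidate_id: int
    suv_max: float               # uptake intensity
    suv_mean: float              # uptake intensity
    volume_ml: float             # crude morphology summary
    nearest_organ: str           # anatomical context (assumed field)
    distance_to_organ_mm: float  # anatomical context (assumed field)


def describe_candidate(candidate_id, suv_values, voxel_volume_ml,
                       nearest_organ, distance_to_organ_mm):
    """Summarize one candidate region from the SUV values of its voxels."""
    if not suv_values:
        raise ValueError("a candidate region must contain at least one voxel")
    return CandidateDescription(
        candidate_id=candidate_id,
        suv_max=round(max(suv_values), 2),
        suv_mean=round(sum(suv_values) / len(suv_values), 2),
        volume_ml=round(len(suv_values) * voxel_volume_ml, 3),
        nearest_organ=nearest_organ,
        distance_to_organ_mm=distance_to_organ_mm,
    )


def to_prompt_json(description):
    """Serialize the description as JSON text that could be placed in an LLM prompt."""
    return json.dumps(asdict(description), sort_keys=True)


example = describe_candidate(1, [2.0, 4.0, 6.0], 0.5, "liver", 3.0)
print(to_prompt_json(example))
```

A fixed schema like this keeps the LLM input auditable: any cue the adjudicator can use has to appear as an explicit field, which also makes the information loss relative to the image easy to inspect.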

What carries the argument

The lesion-level LLM adjudication step that receives structured textual descriptions of candidate regions and outputs classifications as true lesions or false positives.

If this is right

  • The largest accuracy gains appear when the radiology report is supplied as additional context to the LLM.
  • Physiologic false positives that mimic malignant uptake are suppressed relative to image-only methods.
  • Voxel-level predictions become more consistent with the clinical interpretation expressed in reports.
  • Reinforcement learning improves both lesion classification accuracy and anatomically correct site assignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same text-description-plus-LLM pattern could be tested on other imaging tasks where reports supply high-level context, such as CT or MRI lesion detection.
  • If the method scales, it might lower the volume of cases needing manual radiologist review in high-throughput oncology workflows.
  • Evaluating the approach on datasets with greater variation in scanner types or patient demographics would test whether the textual summaries capture all clinically relevant cues.

Load-bearing premise

Structured textual descriptions of uptake intensity, morphology, and anatomical context contain enough information for the LLM to correctly separate true lesions from physiologic or artifactual false positives.

What would settle it

A held-out test set where adding the LLM adjudication and reinforcement learning produces no measurable reduction in false positive rate or improvement in Dice score compared with the permissive segmentation model alone.
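
The comparison proposed above can be sketched as follows. This is a hedged illustration only: it represents masks as sets of voxel coordinates, and it counts a candidate as a false positive when it shares no voxel with the ground truth, which is an assumed criterion rather than the paper's evaluation protocol.

```python
def dice(pred_voxels, truth_voxels):
    """Dice coefficient between two masks given as collections of voxel coordinates."""
    pred, truth = set(pred_voxels), set(truth_voxels)
    if not pred and not truth:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def false_positive_candidates(candidates, truth_voxels):
    """Count candidates with no overlap with the ground truth (assumed FP criterion)."""
    truth = set(truth_voxels)
    return sum(1 for cand in candidates if not (set(cand) & truth))


def union(candidates):
    """Merge a list of candidate regions into a single voxel set."""
    merged = set()
    for cand in candidates:
        merged |= set(cand)
    return merged


# Toy data: one true lesion (two voxels) and one spurious candidate.
truth = {(0, 0, 0), (0, 0, 1)}
permissive = [[(0, 0, 0), (0, 0, 1)], [(5, 5, 5)]]
adjudicated = [permissive[0]]  # pretend the adjudicator rejected the second candidate

print(dice(union(permissive), truth), false_positive_candidates(permissive, truth))
print(dice(union(adjudicated), truth), false_positive_candidates(adjudicated, truth))
```

If the two printed lines were identical on a real held-out set, the adjudication stage would be adding nothing measurable over the permissive segmenter.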

Figures

Figures reproduced from arXiv: 2606.28392 by Jiasheng Wang, Piyawadee Jongpradubgiat, Simeng Zhu, Tanun Jitwatcharakomol.

Figure 1: Overview of the RADIANT-PET framework. (A) A high-sensitivity segmenter (SUV thresholding or a modified nnU-Net) proposes a permissive uptake mask. (B) Watershed splitting separates the mask into candidate regions. Each region is then converted into a structured JSON description combining radiomics and organ relationships by incorporating TotalSegmentator masks. (C) A local LLM classifies each candidate a…

Figure 2: Sample candidate lesion description and model's reasoning trace before and after RL.

Stage 2 (SUV-valley validation). To ensure that watershed boundaries corresponded to genuine metabolic separations rather than geometric artifacts, we validated each boundary between adjacent regions. For each neighboring pair (i, j), we computed the relative SUV drop at their shared boundary. We required the relative drop…
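
The SUV-valley check in the excerpt above can be sketched under stated assumptions. The excerpt is truncated before it gives the formula or the threshold, so both are hypothetical here: the relative drop is taken as the fall from the lower of the two regional peaks to the boundary SUV, and `min_relative_drop` is a caller-supplied parameter, not a value from the paper.

```python
def relative_suv_drop(peak_i, peak_j, boundary_suv):
    """Relative fall from the lower regional peak to the shared-boundary SUV.

    Assumed definition: (min(peak_i, peak_j) - boundary_suv) / min(peak_i, peak_j).
    """
    lower_peak = min(peak_i, peak_j)
    if lower_peak <= 0:
        raise ValueError("peak SUV values must be positive")
    return (lower_peak - boundary_suv) / lower_peak


def keep_split(peak_i, peak_j, boundary_suv, min_relative_drop):
    """Keep the watershed boundary only if the valley between regions is deep enough.

    min_relative_drop is a hypothetical threshold chosen by the caller.
    """
    return relative_suv_drop(peak_i, peak_j, boundary_suv) >= min_relative_drop


# Illustrative values only; 0.2 is not a threshold taken from the paper.
print(keep_split(10.0, 8.0, 4.0, min_relative_drop=0.2))  # deep valley
print(keep_split(10.0, 8.0, 7.6, min_relative_drop=0.2))  # shallow valley
```

Under this assumed rule, a pair of regions whose boundary fails the check would presumably be merged back into one candidate, so that a single lesion with an irregular shape is not described to the LLM as two.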

Accurate lesion segmentation in PET/CT is critical for oncology, yet remains challenging because physiologic tracer uptake and artifacts can mimic malignant signal. We present RADIANT-PET, a reasoning-augmented framework that couples a high-sensitivity voxel-level segmentation model with lesion-level large language model (LLM) adjudication. Candidate uptake regions are generated with a deliberately permissive segmentation stage, then converted into structured textual descriptions that summarize uptake intensity, morphology, and regional and global anatomical context. An LLM classifies each candidate as true lesion vs. false positive, optionally leveraging the radiology report as additional clinical context. To strengthen lesion-level reasoning, we further optimize a local LLM via reinforcement learning using Group Relative Policy Optimization, rewarding correct lesion classification and anatomically concordant site assignment. Across AutoPET and an OSU test cohort, RADIANT-PET consistently outperforms strong image-only baselines, with the largest improvements observed when radiology reports are provided. Overall, these results demonstrate that LLM-based lesion-level reasoning adds a novel reasoning layer beyond conventional segmentation, suppressing physiologic false positives and aligning voxel-level predictions with clinical interpretation. The project repository is available at: https://github.com/jwang-580/RADIANT-PET.
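
The reinforcement learning signal mentioned in the abstract can be illustrated with a short sketch. Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt. The reward shape below (1.0 for a correct label plus a `site_bonus` for a correct site) is an assumption, as are the label strings and the use of the population standard deviation. The abstract only states that correct classification and concordant site assignment are rewarded.

```python
from statistics import mean, pstdev


def candidate_reward(pred_label, pred_site, true_label, true_site, site_bonus=0.5):
    """Hypothetical reward: 1.0 for the right label, plus a bonus for the right site."""
    reward = 1.0 if pred_label == true_label else 0.0
    # The site bonus applies only to correctly identified true lesions (assumption).
    if reward and true_label == "lesion" and pred_site == true_site:
        reward += site_bonus
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one true liver lesion.
samples = [
    ("lesion", "liver"),
    ("lesion", "lung"),
    ("false_positive", None),
    ("lesion", "liver"),
]
rewards = [candidate_reward(label, site, "lesion", "liver") for label, site in samples]
print(rewards)
print(group_relative_advantages(rewards))
```

Because advantages are computed within each group, no separate value network is needed: responses that beat their siblings are reinforced, and a group with identical rewards contributes no gradient signal.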

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RADIANT-PET, a hybrid framework for PET/CT lesion segmentation that first applies a deliberately permissive voxel-level segmenter to generate candidate regions, converts these into structured textual descriptions of uptake intensity, morphology, and anatomical context (optionally augmented by radiology reports), and then uses an LLM (fine-tuned via Group Relative Policy Optimization reinforcement learning) to classify candidates as true lesions versus physiologic or artifactual false positives. The central claim is that this lesion-level reasoning layer consistently outperforms strong image-only baselines on the AutoPET dataset and an OSU test cohort, with largest gains when reports are provided, by suppressing false positives and better aligning predictions with clinical interpretation.

Significance. If the empirical claims hold, the work would offer a meaningful advance by adding an explicit reasoning stage that addresses a persistent limitation of pure image-based methods in distinguishing malignant from physiologic uptake. The public release of the project repository (https://github.com/jwang-580/RADIANT-PET) is a clear strength supporting reproducibility. The combination of permissive segmentation, structured text conversion, optional report context, and GRPO-based LLM optimization is a coherent technical contribution. However, the absence of any quantitative results, baseline details, or error analysis in the provided manuscript prevents a full assessment of whether the claimed gains are real or merely artifacts of the first-stage segmenter.

major comments (2)
  1. [Abstract] No quantitative performance numbers, baseline comparisons, statistical tests, or error analysis are supplied. Without these data it is impossible to evaluate whether the LLM adjudication step actually delivers the claimed suppression of physiologic false positives or merely inherits permissive behavior from the initial segmenter.
  2. [Abstract] The central claim that 'structured textual descriptions ... contain sufficient information for an LLM to reliably classify candidates' is load-bearing yet unsupported; the manuscript provides no evidence that fine visual cues (heterogeneous uptake patterns, boundary texture, subtle gradients) survive the text conversion process or that the LLM adjudication is not simply benefiting from report leakage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We agree that the current abstract and manuscript lack the quantitative results, baseline details, and supporting analyses needed to substantiate the claims. We will revise accordingly to include these elements.

  1. Referee: [Abstract] No quantitative performance numbers, baseline comparisons, statistical tests, or error analysis are supplied. Without these data it is impossible to evaluate whether the LLM adjudication step actually delivers the claimed suppression of physiologic false positives or merely inherits permissive behavior from the initial segmenter.

    Authors: We agree that the abstract provides only a qualitative summary and that the manuscript as provided lacks the requested quantitative details. In the revision we will expand the abstract to report specific metrics (e.g., Dice, sensitivity, specificity) on AutoPET and the OSU cohort, include baseline comparisons, and note statistical significance. A new results section with error analysis will also be added to the main text.

    Revision: yes

  2. Referee: [Abstract] The central claim that 'structured textual descriptions ... contain sufficient information for an LLM to reliably classify candidates' is load-bearing yet unsupported; the manuscript provides no evidence that fine visual cues (heterogeneous uptake patterns, boundary texture, subtle gradients) survive the text conversion process or that the LLM adjudication is not simply benefiting from report leakage.

    Authors: We acknowledge that the current manuscript supplies no direct evidence or ablation addressing information preservation in the text conversion step or the possibility of report leakage. We will add a dedicated analysis section that includes (1) ablations isolating the contribution of the structured text descriptions versus report context and (2) a discussion of limitations regarding fine visual cues that may not be captured in text. These additions will be reflected in the revised abstract as well.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with external components

The paper describes an applied framework that chains a permissive segmentation stage, text conversion of candidates, LLM classification, and optional RL fine-tuning. No equations, derivations, or fitted parameters are presented that reduce any output to an input by construction. Claims rest on experimental comparisons against image-only baselines on held-out cohorts (AutoPET, OSU), not on self-referential logic or self-citation chains. The method therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that LLMs can perform reliable lesion classification from textual summaries; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Large language models can accurately classify lesion candidates when given structured textual descriptions of uptake intensity, morphology, regional context, and optional radiology reports.
    This premise underpins the adjudication stage and the claim that LLM reasoning suppresses false positives.

pith-pipeline@v0.9.1-grok · 5769 in / 1282 out tokens · 42648 ms · 2026-06-30T10:02:31.989228+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · 3 internal anchors

  1. Agarwal, S., et al.: gpt-oss-120b & gpt-oss-20b model card (2025). https://doi.org/10.48550/arXiv.2508.10925

  2. Barrington, S.F., et al.: Role of imaging in the staging and response assessment of lymphoma: consensus of the International Conference on Malignant Lymphomas Imaging Working Group. Journal of Clinical Oncology 32(27), 3048–3058 (2014). https://doi.org/10.1200/JCO.2013.53.5229

  3. Han, D., Han, M., Unsloth team: Unsloth (2023). http://github.com/unslothai/unsloth

  4. Guo, D., Yang, D., Zhang, H., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature 645, 633–638 (2025). https://doi.org/10.1038/s41586-025-09422-z

  5. Ilyas, H., Mikhaeel, N.G., Dunn, J.T., Rahman, F., Møller, H., Smith, D., Barrington, S.F.: Defining the optimal method for measuring baseline metabolic tumour volume in diffuse large B cell lymphoma. European Journal of Nuclear Medicine and Molecular Imaging 45(7), 1142–1154 (2018). https://doi.org/10.1007/s00259-018-3953-z

  6. Kalisch, H., Hörst, F., Herrmann, K., Kleesiek, J., Seibold, C.: AutoPET III challenge: Incorporating anatomical knowledge into nnUNet for lesion segmentation in PET/CT (2024). https://doi.org/10.48550/arXiv.2409.12155

  7. Kostakoglu, L., Chauvie, S.: PET-derived quantitative metrics for response and prognosis in lymphoma. PET Clinics 14(3), 317–329 (Jul 2019). https://doi.org/10.1016/j.cpet.2019.03.002

  8. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Discovering causal signals in images (2016). https://doi.org/10.48550/arXiv.1605.08179

  9. Rokuss, M., Kovacs, B., Kirchhoff, Y., Xiao, S., Ulrich, C., Maier-Hein, K.H., Isensee, F.: From FDG to PSMA: A hitchhiker's guide to multitracer, multicenter lesion segmentation in PET/CT imaging (2024). https://doi.org/10.48550/arXiv.2409.09478

  10. Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Tversky loss function for image segmentation using 3D fully convolutional deep networks (2017). https://arxiv.org/abs/1706.05721

  11. Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., Bailey, J.: Symmetric cross entropy for robust learning with noisy labels (2019). https://arxiv.org/abs/1908.06112

  12. Wasserthal, J., Meyer, M., Hahn, F., et al.: TotalSegmentator: Robust segmentation of 104 anatomical structures in CT images. Radiology: Artificial Intelligence 5(5) (2023). https://doi.org/10.1148/ryai.230024