RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning

Jiasheng Wang; Piyawadee Jongpradubgiat; Simeng Zhu; Tanun Jitwatcharakomol

arxiv: 2606.28392 · v1 · pith:EZQOAKGSnew · submitted 2026-06-23 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-06-30 10:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.LG

keywords PET/CT lesion segmentation · large language models · reinforcement learning · false positive reduction · oncology imaging · radiology reports · voxel-level segmentation · candidate classification

The pith

Coupling permissive voxel segmentation with LLM reasoning on textual uptake descriptions reduces false positives in PET/CT lesion detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that a hybrid system can improve lesion segmentation accuracy by first running a high-sensitivity image model to catch all possible uptake regions, then turning those regions into text summaries of intensity, shape, and location for an LLM to judge which ones are real lesions. The LLM can also read the radiology report for extra context, and the authors train it further with reinforcement learning to reward accurate classifications and correct anatomical assignments. A reader would care because physiologic uptake and artifacts often fool pure image models, producing errors that affect cancer staging and treatment decisions. If the claim holds, the added reasoning layer could make automated tools match how clinicians actually interpret scans rather than relying on pixel patterns alone.

Core claim

RADIANT-PET generates candidate uptake regions with a deliberately permissive voxel-level segmentation stage and converts each candidate into a structured textual description covering uptake intensity, morphology, and anatomical context. An LLM then classifies each candidate as a true lesion or as a physiologic or artifactual false positive, optionally drawing on the radiology report. The LLM is further optimized with Group Relative Policy Optimization, which rewards correct lesion classification and site assignment. The resulting system outperforms image-only baselines on AutoPET and an OSU cohort, with the largest gains when reports are supplied.
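To make the text-conversion step concrete, here is a minimal sketch of how one candidate region might be serialized and handed to an LLM. Everything in it is an assumption for illustration: the `Candidate` fields, the prompt wording, and the example values are hypothetical and are not taken from the paper or its repository.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical summary of one candidate uptake region."""
    region_id: int
    suv_max: float           # uptake intensity
    suv_mean: float
    volume_ml: float         # simple morphology descriptor
    nearest_organ: str       # anatomical context, e.g. from an organ mask
    overlap_fraction: float  # share of the region lying inside that organ


def describe(c: Candidate) -> str:
    """Serialize a candidate into a structured JSON description."""
    return json.dumps(asdict(c), sort_keys=True)


def build_prompt(c: Candidate, report: Optional[str] = None) -> str:
    """Assemble an adjudication prompt; the report is optional context."""
    parts = [
        "Classify the candidate as 'lesion' or 'false_positive'.",
        "Candidate: " + describe(c),
    ]
    if report:
        parts.append("Radiology report: " + report)
    return "\n".join(parts)


# Illustrative values only, not data from the paper.
cand = Candidate(1, 8.4, 4.1, 3.2, "liver", 0.12)
prompt = build_prompt(cand, report="Hypermetabolic hepatic lesion.")
```

Keeping the report as an optional argument mirrors the two evaluation settings the claim mentions (with and without reports), so the same candidate text can be scored in both.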

What carries the argument

The lesion-level LLM adjudication step that receives structured textual descriptions of candidate regions and outputs classifications as true lesions or false positives.

If this is right

The largest accuracy gains appear when the radiology report is supplied as additional context to the LLM.
Physiologic false positives that mimic malignant uptake are suppressed relative to image-only methods.
Voxel-level predictions become more consistent with the clinical interpretation expressed in reports.
Reinforcement learning improves both lesion classification accuracy and anatomically correct site assignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same text-description-plus-LLM pattern could be tested on other imaging tasks where reports supply high-level context, such as CT or MRI lesion detection.
If the method scales, it might lower the volume of cases needing manual radiologist review in high-throughput oncology workflows.
Evaluating the approach on datasets with greater variation in scanner types or patient demographics would test whether the textual summaries capture all clinically relevant cues.

Load-bearing premise

Structured textual descriptions of uptake intensity, morphology, and anatomical context contain enough information for the LLM to correctly separate true lesions from physiologic or artifactual false positives.

What would settle it

A held-out test set where adding the LLM adjudication and reinforcement learning produces no measurable reduction in false positive rate or improvement in Dice score compared with the permissive segmentation model alone.
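The comparison described here can be sketched with two small metrics: voxel-level Dice and a count of predicted regions that touch no ground-truth voxel. This is an illustrative sketch, not the paper's evaluation code; representing masks as sets of voxel coordinates, treating Dice as 1.0 when both masks are empty, and the toy data are all assumptions.

```python
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]


def dice(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|); 1.0 by convention if both are empty."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def false_positive_regions(regions: List[Set[Voxel]], truth: Set[Voxel]) -> int:
    """Count predicted regions that share no voxel with the ground truth."""
    return sum(1 for r in regions if not (r & truth))


# Toy example: two candidate regions, one of which is a spurious focus.
truth = {(0, 0, 0), (0, 0, 1)}
regions = [{(0, 0, 0), (0, 0, 1), (0, 0, 2)}, {(5, 5, 5)}]

before = set().union(*regions)  # permissive stage alone
after = regions[0]              # spurious region rejected by adjudication

dice_before, dice_after = dice(before, truth), dice(after, truth)
fp_before = false_positive_regions(regions, truth)
fp_after = false_positive_regions([regions[0]], truth)
```

If adjudication left both numbers unchanged on a held-out set, that would be the null result the paragraph above describes.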

Figures

Figures reproduced from arXiv: 2606.28392 by Jiasheng Wang, Piyawadee Jongpradubgiat, Simeng Zhu, Tanun Jitwatcharakomol.

**Figure 1.** Overview of the RADIANT-PET framework. (A) A high-sensitivity segmenter (SUV thresholding or a modified nnU-Net) proposes a permissive uptake mask. (B) Watershed splitting separates the mask into candidate regions. Each region is then converted into a structured JSON description combining radiomics and organ relationships by incorporating TotalSegmentator masks. (C) A local LLM classifies each candidate a…
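Panels (A) and (B) of Figure 1 can be approximated in a few lines. The sketch below thresholds a toy SUV grid and then groups the surviving voxels into candidate regions. Note the substitution: it uses plain 4-connected component labelling on a 2D grid as a simplified stand-in for the watershed splitting named in the caption, and the threshold value and grid are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[bool]]


def permissive_mask(suv: Grid, threshold: float) -> Mask:
    """Mark every voxel whose SUV reaches the (deliberately low) threshold."""
    return [[v >= threshold for v in row] for row in suv]


def label_regions(mask: Mask) -> List[List[Tuple[int, int]]]:
    """Group True voxels into 4-connected regions with breadth-first search."""
    rows = len(mask)
    cols = len(mask[0]) if mask else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


# Toy 2D SUV grid with two separate hot spots (illustrative values).
suv = [
    [0.5, 3.0, 0.4, 0.2],
    [0.6, 3.5, 0.3, 2.8],
    [0.2, 0.4, 0.1, 2.9],
]
candidates = label_regions(permissive_mask(suv, threshold=2.5))
```

Unlike watershed, connected-component labelling cannot split two touching foci, which is presumably why the pipeline in the figure uses watershed for that step.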

**Figure 2.** Sample candidate lesion description and model’s reasoning trace before and after RL. Stage 2 (SUV-valley validation). To ensure that watershed boundaries corresponded to genuine metabolic separations rather than geometric artifacts, we validated each boundary between adjacent regions. For each neighboring pair (i, j), we computed the relative SUV drop at their shared boundary. We required the relative drop…
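The Stage 2 excerpt is cut off before it states how the relative drop is defined or what value is required, so the following is only one plausible reading. It assumes the drop is measured from the lower of the two regional peaks down to the boundary SUV, and it treats the cutoff `min_drop` as a free, hypothetical parameter rather than the paper's value.

```python
def relative_suv_drop(peak_i: float, peak_j: float, boundary_suv: float) -> float:
    """Assumed definition: fall from the lower peak to the boundary, as a fraction."""
    lower_peak = min(peak_i, peak_j)
    if lower_peak <= 0:
        return 0.0
    return max(0.0, (lower_peak - boundary_suv) / lower_peak)


def keep_boundary(peak_i: float, peak_j: float, boundary_suv: float,
                  min_drop: float) -> bool:
    """Keep the split only if the valley between the regions is deep enough."""
    return relative_suv_drop(peak_i, peak_j, boundary_suv) >= min_drop


# Illustrative numbers: a deep valley is kept, a shallow one would be merged.
deep = keep_boundary(8.0, 6.0, boundary_suv=3.0, min_drop=0.3)
shallow = keep_boundary(8.0, 6.0, boundary_suv=5.7, min_drop=0.3)
```

Under this reading, a pair whose boundary fails the test would be merged back into a single candidate before text conversion.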

read the original abstract

Accurate lesion segmentation in PET/CT is critical for oncology, yet remains challenging because physiologic tracer uptake and artifacts can mimic malignant signal. We present RADIANT-PET, a reasoning-augmented framework that couples a high-sensitivity voxel-level segmentation model with lesion-level large language model (LLM) adjudication. Candidate uptake regions are generated with a deliberately permissive segmentation stage, then converted into structured textual descriptions that summarize uptake intensity, morphology, and regional and global anatomical context. An LLM classifies each candidate as true lesion vs. false positive, optionally leveraging the radiology report as additional clinical context. To strengthen lesion-level reasoning, we further optimize a local LLM via reinforcement learning using Group Relative Policy Optimization, rewarding correct lesion classification and anatomically concordant site assignment. Across AutoPET and an OSU test cohort, RADIANT-PET consistently outperforms strong image-only baselines, with the largest improvements observed when radiology reports are provided. Overall, these results demonstrate that LLM-based lesion-level reasoning adds a novel reasoning layer beyond conventional segmentation, suppressing physiologic false positives and aligning voxel-level predictions with clinical interpretation. The project repository is available at: https://github.com/jwang-580/RADIANT-PET.
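The reinforcement-learning signal mentioned in the abstract can be illustrated with the group-relative normalization that gives Group Relative Policy Optimization its name: several completions are sampled for the same prompt, and each reward is standardized against its own group. The reward weights below (1.0 for the class, 0.5 for the site), the label strings, and the use of a population standard deviation are assumptions for this sketch, not the authors' settings.

```python
from statistics import mean, pstdev
from typing import List, Optional


def reward(pred_label: str, true_label: str,
           pred_site: Optional[str], true_site: Optional[str]) -> float:
    """Hypothetical reward: class correctness plus a bonus for a matching site."""
    r = 1.0 if pred_label == true_label else 0.0
    if pred_label == true_label == "lesion" and pred_site == true_site:
        r += 0.5
    return r


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one candidate whose truth is a liver lesion.
samples = [("lesion", "liver"), ("lesion", "lung"),
           ("false_positive", None), ("lesion", "liver")]
rewards = [reward(label, "lesion", site, "liver") for label, site in samples]
advantages = group_relative_advantages(rewards)
```

Answers that get both the class and the site right receive a positive advantage, the wrong-class answer a negative one, and a group with identical rewards contributes no learning signal.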

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RADIANT-PET combines permissive segmentation with text-based LLM adjudication and GRPO tuning, but the abstract supplies no numbers or method details to check whether the reasoning step actually works.

The paper's main contribution is a pipeline that runs a loose voxel segmenter first, converts candidate regions into structured text on intensity, shape, and anatomy, then lets an LLM (optionally with the radiology report) label them as lesions or false positives, with Group Relative Policy Optimization used to tune the local model on classification accuracy. That specific sequence of steps is not something I have seen framed this way before.

It does a clean job stating the clinical problem of physiologic uptake and artifacts, and the decision to add a reasoning layer after detection makes sense as an attempt to mimic how radiologists use context. The GRPO choice is a concrete implementation detail that could be reusable.

The soft spots are straightforward and central. The abstract asserts consistent outperformance over image-only baselines, with bigger gains when reports are supplied, yet it contains zero quantitative results, no baseline descriptions, no statistical tests, and no error analysis. Without those, there is no way to tell whether the text conversion preserves the visual cues that matter or whether any gains come from the permissive first stage instead. The stress-test worry about losing fine-grained uptake patterns or boundary texture in the text summaries therefore still stands; the abstract gives no evidence that this was checked.

This is aimed at groups working on hybrid vision-language systems for oncology imaging. A reader already experimenting with LLMs on medical data might pick up the pipeline structure, but the work is too preliminary for citation. It deserves peer review once the full version includes the experiments and ablations, because the underlying idea is worth testing even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper introduces RADIANT-PET, a hybrid framework for PET/CT lesion segmentation that first applies a deliberately permissive voxel-level segmenter to generate candidate regions, converts these into structured textual descriptions of uptake intensity, morphology, and anatomical context (optionally augmented by radiology reports), and then uses an LLM (fine-tuned via Group Relative Policy Optimization reinforcement learning) to classify candidates as true lesions versus physiologic or artifactual false positives. The central claim is that this lesion-level reasoning layer consistently outperforms strong image-only baselines on the AutoPET dataset and an OSU test cohort, with largest gains when reports are provided, by suppressing false positives and better aligning predictions with clinical interpretation.

Significance. If the empirical claims hold, the work would offer a meaningful advance by adding an explicit reasoning stage that addresses a persistent limitation of pure image-based methods in distinguishing malignant from physiologic uptake. The public release of the project repository (https://github.com/jwang-580/RADIANT-PET) is a clear strength supporting reproducibility. The combination of permissive segmentation, structured text conversion, optional report context, and GRPO-based LLM optimization is a coherent technical contribution. However, the absence of any quantitative results, baseline details, or error analysis in the provided manuscript prevents a full assessment of whether the claimed gains are real or merely artifacts of the first-stage segmenter.

major comments (2)

[Abstract] No quantitative performance numbers, baseline comparisons, statistical tests, or error analysis are supplied. Without these data it is impossible to evaluate whether the LLM adjudication step actually delivers the claimed suppression of physiologic false positives or merely inherits permissive behavior from the initial segmenter.
[Abstract] The central claim that 'structured textual descriptions ... contain sufficient information for an LLM to reliably classify candidates' is load-bearing yet unsupported; the manuscript provides no evidence that fine visual cues (heterogeneous uptake patterns, boundary texture, subtle gradients) survive the text conversion process or that the LLM adjudication is not simply benefiting from report leakage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We agree that the current abstract and manuscript lack the quantitative results, baseline details, and supporting analyses needed to substantiate the claims. We will revise accordingly to include these elements.

Referee: [Abstract] No quantitative performance numbers, baseline comparisons, statistical tests, or error analysis are supplied. Without these data it is impossible to evaluate whether the LLM adjudication step actually delivers the claimed suppression of physiologic false positives or merely inherits permissive behavior from the initial segmenter.

Authors: We agree that the abstract provides only a qualitative summary and that the manuscript as provided lacks the requested quantitative details. In the revision we will expand the abstract to report specific metrics (e.g., Dice, sensitivity, specificity) on AutoPET and the OSU cohort, include baseline comparisons, and note statistical significance. A new results section with error analysis will also be added to the main text.
revision: yes

Referee: [Abstract] The central claim that 'structured textual descriptions ... contain sufficient information for an LLM to reliably classify candidates' is load-bearing yet unsupported; the manuscript provides no evidence that fine visual cues (heterogeneous uptake patterns, boundary texture, subtle gradients) survive the text conversion process or that the LLM adjudication is not simply benefiting from report leakage.

Authors: We acknowledge that the current manuscript supplies no direct evidence or ablation addressing information preservation in the text conversion step or the possibility of report leakage. We will add a dedicated analysis section that includes (1) ablations isolating the contribution of the structured text descriptions versus report context and (2) a discussion of limitations regarding fine visual cues that may not be captured in text. These additions will be reflected in the revised abstract as well.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with external components

The paper describes an applied framework that chains a permissive segmentation stage, text conversion of candidates, LLM classification, and optional RL fine-tuning. No equations, derivations, or fitted parameters are presented that reduce any output to an input by construction. Claims rest on experimental comparisons against image-only baselines on held-out cohorts (AutoPET, OSU), not on self-referential logic or self-citation chains. The method therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that LLMs can perform reliable lesion classification from textual summaries; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Large language models can accurately classify lesion candidates when given structured textual descriptions of uptake intensity, morphology, regional context, and optional radiology reports.
This premise underpins the adjudication stage and the claim that LLM reasoning suppresses false positives.

pith-pipeline@v0.9.1-grok · 5769 in / 1282 out tokens · 42648 ms · 2026-06-30T10:02:31.989228+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · 3 internal anchors

[1] Agarwal, S., et al.: gpt-oss-120b & gpt-oss-20b model card (2025). https://doi.org/10.48550/arXiv.2508.10925
[2] Barrington, S.F., et al.: Role of imaging in the staging and response assessment of lymphoma: consensus of the International Conference on Malignant Lymphomas Imaging Working Group. Journal of Clinical Oncology 32(27), 3048–3058 (2014). https://doi.org/10.1200/JCO.2013.53.5229
[3] Daniel Han, M.H., team, U.: Unsloth (2023), http://github.com/unslothai/unsloth
[4] Guo, D., Yang, D., Zhang, H., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature 645, 633–638 (2025). https://doi.org/10.1038/s41586-025-09422-z
[5] Ilyas, H., Mikhaeel, N.G., Dunn, J.T., Rahman, F., Møller, H., Smith, D., Barrington, S.F.: Defining the optimal method for measuring baseline metabolic tumour volume in diffuse large B cell lymphoma. European Journal of Nuclear Medicine and Molecular Imaging 45(7), 1142–1154 (2018). https://doi.org/10.1007/s00259-018-3953-z
[6] Kalisch, H., Hörst, F., Herrmann, K., Kleesiek, J., Seibold, C.: AutoPET III challenge: Incorporating anatomical knowledge into nnUNet for lesion segmentation in PET/CT (2024). https://doi.org/10.48550/arXiv.2409.12155
[7] Kostakoglu, L., Chauvie, S.: PET-derived quantitative metrics for response and prognosis in lymphoma. PET Clinics 14(3), 317–329 (Jul 2019). https://doi.org/10.1016/j.cpet.2019.03.002
[8] Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Discovering causal signals in images (2016). https://doi.org/10.48550/arXiv.1605.08179
[9] Rokuss, M., Kovacs, B., Kirchhoff, Y., Xiao, S., Ulrich, C., Maier-Hein, K.H., Isensee, F.: From FDG to PSMA: A hitchhiker’s guide to multitracer, multicenter lesion segmentation in PET/CT imaging (2024). https://doi.org/10.48550/arXiv.2409.09478
[10] Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Tversky loss function for image segmentation using 3D fully convolutional deep networks (2017), https://arxiv.org/abs/1706.05721
[11] Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., Bailey, J.: Symmetric cross entropy for robust learning with noisy labels (2019), https://arxiv.org/abs/1908.06112
[12] Wasserthal, J., Meyer, M., Hahn, F., et al.: TotalSegmentator: Robust segmentation of 104 anatomical structures in CT images. Radiology: Artificial Intelligence 5(5) (2023). https://doi.org/10.1148/ryai.230024
