Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening
Recognition: 2 theorem links · Lean theorems
Pith reviewed 2026-05-10 18:45 UTC · model grok-4.3
The pith
Semantic-topological graphs let clinical text guide precise lung lesion segmentation by resolving overlaps without full model retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Semantic-Topological Graph Reasoning framework couples a large language model with a vision foundation model. A Text-to-Vision Intent Distillation module first turns ambiguous clinical reports into precise diagnostic guidance; candidate lesion masks are then modeled as nodes in a dynamic graph whose edges capture spatial and semantic affinities; finally, a Selective Asymmetric Fine-Tuning strategy updates under one percent of parameters. On the LIDC-IDRI dataset this yields an 81.5 percent Dice Similarity Coefficient, more than five points above leading LLM-based tools, with only 0.6 percent variance across five folds on both LIDC-IDRI and LNDb.
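The headline metric can be made concrete. A minimal sketch of the Dice Similarity Coefficient on binary masks; the toy masks below are invented for illustration, not taken from the paper:

```python
def dice(pred, gt):
    """Dice Similarity Coefficient for two binary masks, given as flat 0/1 sequences.

    DSC = 2 * |P intersect G| / (|P| + |G|); returns 1.0 when both masks are empty.
    """
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0

# Toy 1x6 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [1, 1, 0, 1, 0, 0]
gt   = [1, 0, 0, 1, 1, 0]
print(dice(pred, gt))  # 2*2 / (3+3) = 0.666...
```

An 81.5% DSC thus means the predicted and reference lesion masks share roughly 81.5% of their combined foreground mass.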
What carries the argument
The dynamic graph reasoning step in which candidate lesions are nodes and edges represent spatial and semantic affinities, used to select the correct mask and thereby resolve anatomical overlaps.
Load-bearing premise
The Text-to-Vision Intent Distillation module can reliably extract precise diagnostic guidance from semantically ambiguous clinical reports and the graph reasoning step can correctly disambiguate overlapping lesions without introducing selection errors or bias.
What would settle it
Running the same five-fold protocol on a fresh collection of low-contrast pulmonary scans paired with deliberately vague or contradictory reports and finding that Dice scores drop below the baselines or that cross-fold variance rises sharply.
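The stability half of this test is a simple statistic. A minimal sketch, assuming the reported "0.6% variance" denotes the spread of per-fold DSC in percentage points; the per-fold scores below are hypothetical:

```python
import statistics

# Hypothetical per-fold DSC values for one 5-fold run (not from the paper).
fold_dsc = [0.812, 0.818, 0.809, 0.821, 0.815]

mean_dsc = statistics.mean(fold_dsc)        # average DSC across folds
spread = statistics.pstdev(fold_dsc) * 100  # cross-fold spread in DSC percentage points

# The settling test: a sharp rise in this spread on a fresh, deliberately
# ambiguous dataset would undermine the stability claim.
print(f"mean DSC {mean_dsc:.3f}, cross-fold spread {spread:.2f} pp")
```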
Original abstract
Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary nodule segmentation. It integrates LLaMA-3-V with MedSAM via a Text-to-Vision Intent Distillation (TVID) module to extract diagnostic intent from clinical reports, formulates mask selection as dynamic graph reasoning over lesion nodes with spatial/semantic edges to resolve anatomical overlaps, and applies Selective Asymmetric Fine-Tuning (SAFT) to update <1% of parameters. On 5-fold CV of LIDC-IDRI and LNDb, it reports 81.5% DSC (outperforming LISA by >5%) with 0.6% cross-fold variance attributed to SAFT.
Significance. If the performance gains hold and are attributable to the novel TVID and graph components rather than base MedSAM plus SAFT, the work would advance multimodal medical segmentation by addressing semantic ambiguity in free-text reports and enabling parameter-efficient deployment. The low-variance SAFT regularizer is a concrete practical strength.
major comments (2)
- [Experiments / Results] The headline SOTA claim (81.5% DSC, +5% over LISA) is load-bearing on TVID reliably distilling precise guidance from ambiguous reports and graph reasoning selecting lesion masks without selection bias or anatomical error, yet the manuscript supplies no component ablations isolating TVID or graph contributions, no qualitative failure-case analysis on high-ambiguity or overlapping lesions, and no explicit controls confirming LISA received equivalent domain adaptation.
- [Table 2 / §4.2] Table reporting 5-fold results states 0.6% DSC variance but does not report per-fold raw scores, statistical significance tests (e.g., paired t-test or Wilcoxon), or confidence intervals, preventing assessment of whether the reported stability and margin over baselines are robust.
minor comments (2)
- [Abstract / Introduction] The abstract and introduction refer to 'leading LLM-based tools like LISA' without a citation or brief description of LISA's architecture and training regime; this should be added for reproducibility.
- [Method / §3.2] Notation for the dynamic graph (nodes as candidate lesions, edges as affinities) is introduced without an explicit equation or pseudocode; a small diagram or formal definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments identify important areas where additional evidence can strengthen the claims regarding the contributions of TVID and graph reasoning, as well as the statistical robustness of the reported results. We address each point below and will incorporate the requested analyses in the revised manuscript.
Point-by-point responses
Referee: [Experiments / Results] The headline SOTA claim (81.5% DSC, +5% over LISA) is load-bearing on TVID reliably distilling precise guidance from ambiguous reports and graph reasoning selecting lesion masks without selection bias or anatomical error, yet the manuscript supplies no component ablations isolating TVID or graph contributions, no qualitative failure-case analysis on high-ambiguity or overlapping lesions, and no explicit controls confirming LISA received equivalent domain adaptation.
Authors: We agree that isolating the contributions of TVID and the graph reasoning module is necessary to support the performance gains. The revised manuscript will include new ablation experiments that remove TVID (replacing it with direct text embedding) and the graph reasoning step (replacing it with simple mask selection by area), reporting the resulting DSC drops on both LIDC-IDRI and LNDb. We will also add a qualitative section with failure-case visualizations on high-ambiguity reports and overlapping lesions, showing how the full framework resolves cases that simpler baselines mishandle. For the LISA comparison, we confirm that LISA was adapted using the identical SAFT protocol and training schedule as STGR; this detail will be added to §4.1 and the experimental setup to ensure transparency. revision: yes
Referee: [Table 2 / §4.2] Table reporting 5-fold results states 0.6% DSC variance but does not report per-fold raw scores, statistical significance tests (e.g., paired t-test or Wilcoxon), or confidence intervals, preventing assessment of whether the reported stability and margin over baselines are robust.
Authors: We concur that per-fold scores and formal statistical tests are required to substantiate the claimed stability and margins. The revised version will expand Table 2 to list the raw DSC for each of the five folds for STGR and all baselines. We will additionally report 95% confidence intervals and the results of paired Wilcoxon signed-rank tests (with p-values) comparing STGR against each baseline, confirming that the observed improvements are statistically significant and that the 0.6% cross-fold variance is consistent with the low-variance behavior induced by SAFT. revision: yes
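The promised Wilcoxon comparison is feasible even with only five paired folds. A self-contained sketch of an exact two-sided signed-rank test (the per-fold scores are hypothetical; in practice a library routine such as scipy.stats.wilcoxon would be used):

```python
from itertools import product

def wilcoxon_exact_p(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value for small paired samples.

    Assumes no zero differences and no tied absolute differences,
    which is typical for per-fold DSC scores.
    """
    d = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    n = len(d)
    tot = n * (n + 1) // 2
    w_pos = sum(r for r, di in zip(ranks, d) if di > 0)
    w_obs = min(w_pos, tot - w_pos)
    # Null distribution: all 2^n sign patterns on the ranks are equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(wp, tot - wp) <= w_obs:
            hits += 1
    return hits / 2 ** n

# Hypothetical per-fold DSC for STGR vs. a baseline (every fold favors STGR).
stgr = [0.812, 0.818, 0.809, 0.821, 0.815]
base = [0.760, 0.771, 0.755, 0.768, 0.759]
print(wilcoxon_exact_p(stgr, base))  # 0.0625, the smallest attainable p at n=5
```

Note that with n = 5 the exact two-sided p-value cannot fall below 0.0625, so fold-level Wilcoxon tests alone cannot reach p < 0.05; per-case tests or confidence intervals would carry more weight.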
Circularity Check
No circularity: empirical claims rest on cross-validation, not self-referential derivations.
full rationale
The paper presents a descriptive framework (TVID module, dynamic graph reasoning on lesion nodes/edges, SAFT fine-tuning) without any equations, fitted parameters, or first-principles derivations in the provided text. Performance numbers (81.5% DSC, 0.6% variance) are reported from 5-fold CV on LIDC-IDRI/LNDb; these are external empirical measurements, not quantities predicted from the model's own inputs by construction. No self-citations, ansatzes, or uniqueness theorems appear as load-bearing steps. The derivation chain is therefore self-contained and non-circular.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage:
  E(v_i, v_j) = α·IoU + (1-α)·CosineSim
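The linked edge-affinity formula can be sketched directly. A minimal illustration, assuming masks are flat binary arrays and node features are embedding vectors; the α = 0.5 blend and the toy inputs are assumptions, not values from the paper:

```python
import math

def iou(a, b):
    """Intersection-over-union of two binary masks (flat 0/1 sequences)."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def edge_affinity(mask_i, mask_j, feat_i, feat_j, alpha=0.5):
    """E(v_i, v_j) = alpha * IoU + (1 - alpha) * CosineSim, per the linked passage."""
    return alpha * iou(mask_i, mask_j) + (1 - alpha) * cosine(feat_i, feat_j)

# Toy nodes: half-overlapping masks, orthogonal semantic features.
print(edge_affinity([1, 1, 0], [1, 0, 0], [1.0, 0.0], [0.0, 1.0]))  # 0.5*0.5 + 0.5*0 = 0.25
```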
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Tianjiao Hu, Yihua Lan, Yingqi Zhang, Jiashu Xu, Shuai Li, and Chih-Cheng Hung. A lung nodule segmentation model based on the transformer with multiple thresholds and coordinate attention. Scientific Reports, 14(1):31743, 2024.
- [2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer, 2015.
- [3] Zhixiang Lu, Shijie Xu, Kaicheng Yan, Xuyue Cai, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, and Jionglong Su. SkinCLIP-VL: Consistency-aware vision-language learning for multimodal skin cancer diagnosis. arXiv preprint arXiv:2603.21010, 2026.
- [4] Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [5] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024.
- [6] Tao Tang, Shijie Xu, Jionglong Su, and Zhixiang Lu. Causal-SAM-LLM: Large language models as causal reasoners for robust medical segmentation. arXiv preprint arXiv:2507.03585, 2026.
- [7] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- [8] Zhixiang Lu, Yulong Li, Feilong Tang, Zhengyong Jiang, Chong Li, Mian Zhou, Tenglong Li, and Jionglong Su. DeepGB-TB: A risk-balanced cross-attention gradient-boosted convolutional network for rapid, interpretable tuberculosis screening. Proceedings of the AAAI Conference on Artificial Intelligence, 40(46):38989–38997, Mar. 2026.
- [9] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision (ECCV), 2024.
- [10] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shikun Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [11] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.
- [12] Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, and Zhengyong Jiang. SAGE: Sustainable agent-guided expert-tuning for culturally attuned translation in low-resource Southeast Asia. arXiv preprint arXiv:2603.19931, 2026.
- [13] Zhixiang Lu, Chong Zhang, Chenyu Xue, Angelos Stefanidis, Chong Li, Jionglong Su, and Zhengyong Jiang. MERIT: Multilingual expert-reward informed tuning for Chinese-centric low-resource machine translation. arXiv preprint arXiv:2604.04839, 2026.
- [14] Zhixiang Lu, Peichen Ji, Yulong Li, Ding Sun, Chenyu Xue, Haochen Xue, Mian Zhou, Angelos Stefanidis, Jionglong Su, and Zhengyong Jiang. Advancing low-resource machine translation: A unified data selection and scoring optimization framework. In International Conference on Intelligent Computing, pages 482–493. Springer, 2025.
- [15] Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan, Imran Razzak, and Jionglong Su. PRISM: A personality-driven multi-agent framework for social media simulation. arXiv preprint arXiv:2512.19933, 2025.
- [16] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient fine-tuning of quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.
- [17] Minggang Xu, Xihai Tang, Jian Sun, Chong Li, Jionglong Su, and Zhixiang Lu. Attention-based hybrid deep learning framework for modelling the compressive strength of ultra-high-performance geopolymer concrete. Results in Engineering, page 109288, 2026.
- [18] Zhixiang Lu and Jionglong Su. HierRisk: A hierarchical framework for suicide risk prediction on social media. In 2025 IEEE International Conference on Big Data (BigData), pages 8169–8174. IEEE, 2025.
- [19] Zhixiang Lu, Hansheng Zeng, and Yuqi Li. DIEQ: Dynamic identity equilibrium for author disambiguation in KDD Cup 2024 WhoIsWho-IND challenge. In KDD 2024 OAG-Challenge Cup, 2024.
- [20] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [21] Lea Marie Pehrson, Michael Bachmann Nielsen, and Carsten Ammitzbøl Lauridsen. Automatic pulmonary nodule detection applying deep learning or machine learning algorithms to the LIDC-IDRI database: a systematic review. Diagnostics, 9(1):29, 2019.
- [22] João Pedrosa, Guilherme Aresta, Carlos Ferreira, Márcio Rodrigues, Patrícia Leitão, André Silva Carvalho, João Rebelo, Eduardo Negrão, Isabel Ramos, António Cunha, and Aurélio Campilho. LNDb: A lung nodule database on computed tomography. arXiv preprint arXiv:1911.08434, 2019.
- [23] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2016.