GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

Shuo Jiang, Yuhao Hong, Chunbo Jiang, Weihong Chen, Huangwei Chen, Shenghao Zhu, Beining Wu, Mingxuan Liu, Zhu Zhu, Feiwei Qin, Min Tan, Yifei Chen

arxiv: 2605.22619 · v1 · pith:IN5W7JYYnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 06:38 UTC · model grok-4.3

keywords: lesion grounding · 3D CT · radiology report · graph reasoning · octree refinement · medical image segmentation · vision-language grounding · anatomy-aware verification

The pith

GLeVE aligns each free-text lesion description to its exact location in a 3D CT scan by building a relation graph and verifying proposals against anatomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GLeVE to close the gap between unstructured radiology reports and volumetric CT images. It treats every lesion mention as a single semantic unit, then uses graph reasoning to capture organ links, attributes, and relations among lesions. Anatomy-aware proposal generation plus region verification produces one-to-one text-to-lesion matches, after which octree refinement sharpens the boundaries step by step. The method is tested on AbdomenAtlas 3.0 and reports gains over standard multimodal models and report-supervised baselines in both segmentation accuracy and lesion-level localization. If the approach holds, reports could be turned into precise, verifiable lesion maps without requiring dense pixel labels.

Core claim

GLeVE encodes each lesion description as an atomic semantic unit and runs relation-aware graph reasoning over organ attribution, attributes, and inter-lesion connections to create discriminative queries. Anatomy-aware proposal generation with region-level verification enforces strict one-to-one text-lesion alignment. Hierarchical octree refinement then progressively refines boundary delineation, yielding measurable improvements in segmentation and localization on AbdomenAtlas 3.0 compared with classical multimodal foundation models and report-supervised baselines.
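The graph step can be pictured with a minimal sketch. This is not the authors' implementation: it assumes each lesion mention has already been parsed into an organ, a list of attributes, and typed relations to other lesions. The `LesionNode` schema, the relation names, the toy feature vectors, and `relation_weights` are all hypothetical, and one round of relation-weighted neighbor averaging stands in for whatever relation-aware reasoning the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class LesionNode:
    """One lesion mention treated as an atomic semantic unit (hypothetical schema)."""
    name: str
    organ: str
    attributes: list[str]
    feature: list[float]  # toy embedding; a real system would use a text encoder
    # (neighbor name, relation type) pairs, e.g. ("L2", "same_organ")
    relations: list[tuple[str, str]] = field(default_factory=list)


def message_pass(nodes, relation_weights):
    """One round of relation-weighted aggregation: each query is the node's own
    feature plus the weighted mean of its neighbors' features."""
    queries = {}
    for node in nodes.values():
        dim = len(node.feature)
        agg = [0.0] * dim
        total = 0.0
        for neighbor_name, relation in node.relations:
            w = relation_weights.get(relation, 0.0)
            neighbor = nodes[neighbor_name]
            for i in range(dim):
                agg[i] += w * neighbor.feature[i]
            total += w
        if total > 0:
            agg = [a / total for a in agg]
        queries[node.name] = [f + a for f, a in zip(node.feature, agg)]
    return queries


# Toy report with three lesion mentions (illustrative values only).
nodes = {
    "L1": LesionNode("L1", "liver", ["hypodense"], [1.0, 0.0],
                     [("L2", "same_organ")]),
    "L2": LesionNode("L2", "liver", ["hypodense"], [0.0, 1.0],
                     [("L1", "same_organ"), ("L3", "adjacent_organ")]),
    "L3": LesionNode("L3", "kidney", ["cystic"], [1.0, 1.0], []),
}
relation_weights = {"same_organ": 1.0, "adjacent_organ": 0.5}
queries = message_pass(nodes, relation_weights)
```

In this toy setup the two liver lesions share identical attributes, yet their queries differ after aggregation because their relation neighborhoods differ, which is the kind of disambiguation the core claim attributes to the graph module.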

What carries the argument

relation-aware graph reasoning that produces lesion-wise queries, paired with anatomy-aware proposal generation, region-level verification, and hierarchical octree autoregressive refinement
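A rough sketch of the coarse-to-fine idea behind octree refinement follows. It is an illustration under strong assumptions, not the paper's method: a known occupancy function (`in_sphere`, a toy spherical lesion) replaces the learned autoregressive predictor, and the cell layout `(x0, y0, z0, size)` and all function names are invented here. Only cells that straddle the boundary are split into eight children, so resolution is spent where the boundary is.

```python
def classify(cell, inside):
    """Return 'full', 'empty', or 'mixed' by checking every voxel in the cell."""
    x0, y0, z0, size = cell
    hits = sum(
        inside(x, y, z)
        for x in range(x0, x0 + size)
        for y in range(y0, y0 + size)
        for z in range(z0, z0 + size)
    )
    if hits == 0:
        return "empty"
    if hits == size ** 3:
        return "full"
    return "mixed"


def refine(root, inside, min_size=1):
    """Coarse-to-fine refinement: only mixed (boundary) cells are subdivided."""
    leaves, stack = [], [root]
    while stack:
        cell = stack.pop()
        label = classify(cell, inside)
        x0, y0, z0, size = cell
        if label == "mixed" and size > min_size:
            half = size // 2
            for dx in (0, half):
                for dy in (0, half):
                    for dz in (0, half):
                        stack.append((x0 + dx, y0 + dy, z0 + dz, half))
        else:
            leaves.append((cell, label))
    return leaves


def in_sphere(x, y, z, center=(8, 8, 8), radius=5):
    """Toy lesion: voxels whose centers lie within `radius` of `center`."""
    cx, cy, cz = center
    return (x + 0.5 - cx) ** 2 + (y + 0.5 - cy) ** 2 + (z + 0.5 - cz) ** 2 <= radius ** 2


leaves = refine((0, 0, 0, 16), in_sphere)
full_volume = sum(size ** 3 for (_, _, _, size), label in leaves if label == "full")
direct_volume = sum(in_sphere(x, y, z)
                    for x in range(16) for y in range(16) for z in range(16))
```

The leaves tile the 16-voxel cube exactly, and the "full" leaves reproduce the voxel-level mask while using fewer cells than a dense grid, because large empty or fully interior regions stay coarse.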

If this is right

Lesion descriptions become atomic units that support direct one-to-one correspondence with image regions.
Anatomy-aware verification reduces false alignments between text and nearby but unrelated structures.
Octree refinement produces progressively tighter boundaries without retraining the entire model.
The pipeline works with standard radiology reports and does not require additional dense annotations.
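The first two points can be illustrated with a small sketch of anatomy-aware verification followed by one-to-one assignment. The page does not say how GLeVE performs the matching, so everything here is an assumption: the similarity scores are made up, the organ labels per proposal are presumed to come from some organ segmentation, and exhaustive search over permutations is used only because it is easy to read.

```python
from itertools import permutations


def verify(scores, query_organs, proposal_organs, penalty=float("-inf")):
    """Anatomy check: rule out text-proposal pairs whose organs disagree."""
    return [
        [s if query_organs[q] == proposal_organs[p] else penalty
         for p, s in enumerate(row)]
        for q, row in enumerate(scores)
    ]


def assign_one_to_one(scores):
    """Exhaustive search for the best one-to-one assignment of queries to
    proposals (only practical for a handful of lesions)."""
    n_queries, n_props = len(scores), len(scores[0])
    best, best_total = None, float("-inf")
    for perm in permutations(range(n_props), n_queries):
        total = sum(scores[q][p] for q, p in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return best


# Hypothetical similarity scores: rows are lesion queries, columns are proposals.
query_organs = ["liver", "liver", "kidney"]
proposal_organs = ["liver", "kidney", "liver", "kidney"]
scores = [
    [0.90, 0.95, 0.40, 0.10],
    [0.80, 0.20, 0.70, 0.30],
    [0.10, 0.60, 0.20, 0.50],
]

# Without verification, the first liver lesion would pick a kidney proposal.
naive_choice = max(range(len(proposal_organs)), key=lambda p: scores[0][p])

verified = verify(scores, query_organs, proposal_organs)
matching = assign_one_to_one(verified)
```

Here the unverified argmax sends the first liver lesion to proposal 1, which sits in the kidney, while the verified assignment gives each lesion a distinct proposal in the right organ. For more than a few lesions, a Hungarian-style solver such as SciPy's `linear_sum_assignment` would be the usual replacement for the exhaustive search.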

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-plus-verification pattern could be tested on other 3D modalities such as MRI or PET to check whether the benefit is CT-specific.
If the extracted relations prove reliable, the method might support automated checking of report completeness by flagging lesions that lack image matches.
A natural next measurement would be how well the localized lesions support downstream tasks such as change detection across follow-up scans.

Load-bearing premise

Free-text radiology narratives contain enough structured relations about organs and lesions that graph reasoning can extract them reliably without dense pixel supervision or extra manual labels.

What would settle it

Remove the graph reasoning module and run the same experiments on AbdomenAtlas 3.0; if lesion-level localization accuracy drops to the level of plain vision-language baselines, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.22619 by Beining Wu, Chunbo Jiang, Feiwei Qin, Huangwei Chen, Mingxuan Liu, Min Tan, Shenghao Zhu, Shuo Jiang, Weihong Chen, Yifei Chen, Yuhao Hong, Zhu Zhu.

**Figure 1.** Overview of lesion-wise report grounding, illustrating structured semantic modeling and precise CT localization for efficient clinical verification.

**Figure 2.** Architecture of GLeVE, including (a) lesion semantic graph modeling, (b) anatomy-aware proposal verification, and (c) octree-based autoregressive refinement.

**Figure 3.** Qualitative comparison of grounding results under 10% and 100% mask supervision, highlighting localization accuracy and multi-lesion disambiguation.

**Figure 4.** Qualitative ablation results of GLeVE, showing the effect of removing Anatomical-Prior, Proposal & Verification, LeQu, and OcRe on lesion grounding and boundary refinement. Arrows indicate localization center offsets.

Original abstract

Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GLeVE adds graph reasoning over lesion relations plus proposal verification and octree refinement to report-guided CT grounding, but the abstract gives almost no numbers or ablations to judge whether the gains are real.

This paper's main move is to treat lesion descriptions from radiology reports as atomic units and run them through a relation-aware graph that pulls out organ attributions and inter-lesion links. Those queries then drive anatomy-aware proposal generation with region-level verification for one-to-one alignment, followed by hierarchical octree refinement to sharpen boundaries. The claim is that this beats both classical multimodal models and earlier report-supervised baselines on AbdomenAtlas 3.0 without needing dense pixel labels or extra annotations.

Referee Report

2 major / 2 minor

Summary. The paper proposes GLeVE, a graph-guided lesion grounding framework for 3D CT that treats each lesion description as an atomic semantic unit. It encodes organ attribution, attributes, and inter-lesion relations via relation-aware graph reasoning to produce discriminative lesion-wise queries, applies anatomy-aware proposal generation with region-level verification to enforce one-to-one text-lesion alignment, and uses hierarchical octree refinement to improve boundary delineation. Experiments on AbdomenAtlas 3.0 report consistent gains over classical multimodal foundation models and report-supervised baselines in segmentation accuracy and lesion-level localization.

Significance. If the quantitative results and ablations hold, the work could advance report-supervised grounding in medical imaging by showing how graph-based relational modeling and hierarchical refinement can bridge the semantic-spatial gap without dense pixel supervision. This has potential clinical value for verifiable lesion localization from free-text narratives.

major comments (2)

[§3 (Graph Reasoning and Proposal Verification)] The central assumption that free-text radiology reports reliably supply extractable organ attributions and inter-lesion relations for graph reasoning (without dense supervision or extra annotations) is load-bearing for the one-to-one alignment claim and subsequent proposal verification. The manuscript should include concrete examples from the dataset or failure-case analysis showing how ambiguous or implicit relations are handled, as this directly affects whether the graph module produces the promised discriminative queries.
[§4 (Experiments)] The abstract states 'consistent gains' on AbdomenAtlas 3.0 but the provided description lacks quantitative metrics, error bars, ablation studies on the graph and octree components, or discussion of failure modes. These details are required to substantiate the improvements in segmentation accuracy and lesion-level localization over baselines.

minor comments (2)

[§3.1] Notation for the relation-aware graph (e.g., how nodes and edges are formally defined) could be clarified with a small diagram or pseudocode for readers unfamiliar with the specific graph construction.
[Abstract] The abstract would benefit from naming the exact baseline methods and reporting at least one key metric (e.g., Dice or localization IoU) to give readers an immediate sense of the improvement magnitude.
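For readers unfamiliar with the two overlap metrics named in the last comment, here is a minimal sketch of Dice and IoU on voxel sets. It uses the standard definitions; the paper's exact evaluation protocol is not described on this page, and the cube-shaped masks are toy data chosen so the numbers are easy to check by hand.

```python
def dice(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) over sets of voxel coordinates."""
    if not pred and not target:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))


def iou(pred, target):
    """IoU = |A ∩ B| / |A ∪ B| over sets of voxel coordinates."""
    if not pred and not target:
        return 1.0
    return len(pred & target) / len(pred | target)


def cube(start, stop):
    """All integer voxels in [start, stop)^3."""
    r = range(start, stop)
    return {(x, y, z) for x in r for y in r for z in r}


pred = cube(0, 2)      # 8 voxels
target = cube(1, 3)    # 8 voxels, overlapping pred in exactly one voxel
d = dice(pred, target)  # 2 * 1 / 16 = 0.125
j = iou(pred, target)   # 1 / 15
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so reporting either one in the abstract would convey the magnitude of the improvement.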

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional examples, quantitative details, and clarifications where feasible.

Referee: [§3 (Graph Reasoning and Proposal Verification)] The central assumption that free-text radiology reports reliably supply extractable organ attributions and inter-lesion relations for graph reasoning (without dense supervision or extra annotations) is load-bearing for the one-to-one alignment claim and subsequent proposal verification. The manuscript should include concrete examples from the dataset or failure-case analysis showing how ambiguous or implicit relations are handled, as this directly affects whether the graph module produces the promised discriminative queries.

Authors: We agree that explicit examples would help substantiate the graph reasoning module. In the revised version, we will add a dedicated subsection in §3 with concrete report excerpts from AbdomenAtlas 3.0, illustrating how organ attributions and inter-lesion relations (including implicit ones) are parsed and encoded. We will also include a short failure-case analysis highlighting cases where ambiguous phrasing leads to less discriminative queries and how the anatomy-aware verification mitigates this.

Revision: yes

Referee: [§4 (Experiments)] The abstract states 'consistent gains' on AbdomenAtlas 3.0 but the provided description lacks quantitative metrics, error bars, ablation studies on the graph and octree components, or discussion of failure modes. These details are required to substantiate the improvements in segmentation accuracy and lesion-level localization over baselines.

Authors: The full manuscript already reports quantitative metrics (Dice, localization accuracy), error bars from repeated runs, and ablations on the graph and octree modules in §4. However, we acknowledge the abstract is overly concise. We will revise the abstract to include key numerical improvements and add a dedicated paragraph on failure modes and limitations in the experiments section. Ablation tables will be expanded for clarity.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces novel components including relation-aware graph reasoning for lesion-wise queries, anatomy-aware proposal generation with region-level verification, and hierarchical octree refinement. These are presented as independent methodological contributions that do not reduce by construction to fitted parameters, self-definitions, or self-citation chains in the provided abstract or description. No equations or steps are shown that equate predictions to inputs via renaming or ansatz smuggling. The framework is self-contained with gains demonstrated over external baselines, consistent with the default expectation that most papers exhibit no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the approach implicitly relies on standard assumptions in medical vision-language modeling such as the availability of paired report-image data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

echoes: This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

hierarchical octree refinement progressively improves boundary delineation... octree-based autoregressive refinement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

[1] Alabed, S., Anderson, A., Maiter, A., et al.: Large language models for simplifying radiology reports: A systematic review and meta-analysis of patient, public, and clinician evaluations. The Lancet Digital Health (2026)
[2] Bai, F., Du, Y., Huang, T., Meng, M.Q.-H., Zhao, B.: M3D: Advancing 3D medical image analysis with multi-modal large language models. In: International Conference on Learning Representations (2024)
[3] Bai, X., Liu, M., Chen, Y., Yang, H., Tian, Q.: Chest-OMDL: Organ-specific multidisease detection and localization in chest computed tomography using weakly supervised deep learning from free-text radiology report. In: Medical Imaging with Deep Learning (2025)
[4] Bai, X., Liu, M., Song, T., et al.: EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT. arXiv preprint arXiv:2604.24146 (2026)
[5] Bassi, P.R., Li, W., Chen, J., et al.: Learning segmentation from radiology reports. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 305–315 (2025)
[6] Bassi, P.R., Yavuz, M.C., Hamamci, I.E., Er, S., Chen, X., Li, W., Menze, B., Decherchi, S., Cavalli, A., Wang, K., et al.: RadGPT: Constructing 3D image-text tumor datasets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23720–23730 (2025)
[7] Blankemeier, L., Cohen, J.P., Kumar, A., et al.: Merlin: A vision language foundation model for 3D computed tomography. Research Square pp. rs–3 (2024)
[8] Busch, F., Hoffmann, L., Dos Santos, D.P., et al.: Large language models for structured reporting in radiology: Past, present, and future. European Radiology 35(5), 2589–2602 (2025)
[9] Chen, H., Chen, Y., Yan, Z., Ding, M., Li, C., Zhu, Z., Qin, F.: MMLNB: Multi-modal learning for neuroblastoma subtyping classification assisted with textual description generation. arXiv preprint arXiv:2503.12927 (2025)
[10] Chen, Y., Zou, B., Guo, Z., et al.: SCUNet++: Swin-UNet and CNN bottleneck hybrid architecture with multi-fusion dense skip connection for pulmonary embolism CT image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7759–7767 (2024)
[11] Dong, F., Nie, S., Chen, M., et al.: Keyword-based AI assistance in the generation of radiology reports: A pilot study. NPJ Digital Medicine 8(1), 490 (2025)
[12] Gao, Y., Zhou, M., Liu, D., Yan, Z., Zhang, S., Metaxas, D.N.: A data-scalable Transformer for medical image segmentation: Architecture, model efficiency, and benchmark. arXiv preprint arXiv:2203.00131 (2023)
[13] Hamamci, I.E., Er, S., Wang, C., Almas, F., et al.: Generalist foundation models from a multimodal dataset for 3D computed tomography. Nature Biomedical Engineering pp. 1–19 (2026)
[14] Hao, Q., Yu, L., Tian, S., Ye, X., Zhang, L.: R1Seg-3D: Rethinking reasoning segmentation for medical 3D CTs. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 415–425 (2025)
[15] He, J., Li, P., Liu, G., Zhong, S.: Parameter-efficient fine-tuning medical multimodal large language models for medical visual grounding. In: International Symposium on Biomedical Imaging. pp. 1–5. IEEE (2025)
[16] Huang, Z., Wang, H., Deng, Z., Ye, J., Su, Y., Sun, H., et al.: STU-Net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training. arXiv preprint arXiv:2304.06716 (2023)
[17] Ichinose, A., Hatsutani, T., Nakamura, K., et al.: Visual grounding of whole radiology reports for 3D CT images. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 611–621 (2023)
[18] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
[19] Kulsoom, U., Glavin, F.G., Bendechache, M.: Natural language processing and machine learning for analysis of radiology reports-A systematic review. IEEE Access 13, 112215–112254 (2025)
[20] Li, Y., Kong, C., Zhao, G., Zhao, Z.: Automatic radiology report generation with deep learning: a comprehensive review of methods and advances. Artificial Intelligence Review 58(11), 344 (2025)
[21] Liu, A., Xue, R., Cao, X.R., et al.: MedSAM3: Delving into segment anything with medical concepts. arXiv preprint arXiv:2511.19046 (2025)
[22] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017)
[23] Nützel, F., Dombrowski, M., Kainz, B.: Generate to ground: Multimodal text conditioning boosts phrase grounding in medical vision-language models. In: Medical Imaging with Deep Learning (2025)
[24] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 234–241 (2015)
[25] Tang, Y., Yang, D., Li, W., Roth, H.R., Landman, B., Xu, D., Nath, V., Hatamizadeh, A.: Self-supervised pre-training of swin transformers for 3D medical image analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20730–20740 (2022)
[26] Tejani, A.S., Klontzas, M.E., Gatti, A.A., et al.: Checklist for artificial intelligence in medical imaging (CLAIM): 2024 update. Radiology: Artificial Intelligence 6(4), e240300 (2024)
[27] Vilouras, K., Sanchez, P., O’Neil, A.Q., Tsaftaris, S.A.: Zero-shot medical phrase grounding with off-the-shelf diffusion models. IEEE Journal of Biomedical and Health Informatics 29(12), 9051–9059 (2025)
[28] Wasserthal, J., Breit, H.C., Meyer, M.T., et al.: TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence 5(5), e230024 (2023)
[29] Yang, A., Li, A., Yang, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
[30] Yang, H., Zhou, H.Y., Liu, J., et al.: A multimodal vision–language model for generalizable annotation-free pathology localization. Nature Biomedical Engineering pp. 1–15 (2026)
[31] Zhao, Z., Zhang, Y., Wu, C., Zhang, X., Zhou, X., Zhang, Y., Wang, Y., Xie, W.: Large-vocabulary segmentation for medical images with text prompts. NPJ Digital Medicine 8(1), 566 (2025)
[32] Zou, K., Bai, Y., Liu, B., et al.: Uncertainty-aware medical diagnostic phrase identification and grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(12), 11315–11329 (2025)
