Region-Aware Multimodal Large Language Model via SlowFast Tokenization and Pseudo-Mask Guidance for 3D CT Report Generation

Dongyeong Kim; Hyungbin Park; Hyunseok Lim; JiHyun Kim; Jimin Sung; Jinyoung Seo; Namkug Kim; Sunggu Kyung; Wooyoung Jo; Yoojin Nam

arxiv: 2506.23102 · v2 · pith:QOLZPOJFnew · submitted 2025-06-29 · 📡 eess.IV · cs.CV

classification eess.IV cs.CV

keywords generation, model, report, medregion-ct, slowfast, clinically, framework, global

Current CT report generation frameworks predominantly rely on global feature representations, often failing to capture region-specific details and potentially missing certain abnormalities. To overcome this limitation, we propose MedRegion-CT, a region-focused multimodal large language model framework featuring three key innovations. First, we revisit the SlowFast strategy to jointly model global and fine-grained information and adapt it to the medical domain via a Region-based SlowFast Tokenizer that extracts tokens guided by clinically meaningful regions. Second, generated pseudo-masks guide the model to attend to diagnostically important anatomical regions, facilitating a systematic understanding of the overall scan context. Third, quantitative lesion information, including size, diameter, and spatial location, is encoded as structured textual prompts, enabling context-aware and clinically informed report generation. To enable rigorous evaluation, we validate our framework on multi-institutional structured report generation benchmarks. Experimental results demonstrate that MedRegion-CT achieves state-of-the-art performance, outperforming existing approaches in both linguistic quality and clinical accuracy. All code is publicly available at: https://github.com/babbu3682/MedRegion-CT.
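
The third component in the abstract, encoding quantitative lesion information as structured textual prompts, can be illustrated with a minimal sketch. The abstract does not give the actual prompt template, so everything here is an assumption made for illustration: the `Lesion` container, the `lesion_to_prompt` helper, the field names, the units, and the example values are all hypothetical and are not taken from the MedRegion-CT code.

```python
from dataclasses import dataclass


@dataclass
class Lesion:
    """Hypothetical container for one lesion's quantitative measurements."""
    kind: str            # illustrative lesion label, e.g. "nodule"
    volume_mm3: float    # lesion size as a volume, in cubic millimetres
    diameter_mm: float   # longest diameter, in millimetres
    location: str        # free-text anatomical location


def lesion_to_prompt(lesions):
    """Render lesion measurements as one structured text line per lesion.

    The line format is an assumption; the paper's real template may differ.
    """
    if not lesions:
        return "No measurable lesions detected."
    lines = []
    for i, les in enumerate(lesions, start=1):
        lines.append(
            f"Lesion {i}: type={les.kind}; size={les.volume_mm3:.1f} mm^3; "
            f"diameter={les.diameter_mm:.1f} mm; location={les.location}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values, not data from the paper.
    example = [Lesion("nodule", 523.6, 10.0, "right upper lobe")]
    print(lesion_to_prompt(example))
```

In a pipeline of this kind, a string like the one produced above would presumably be concatenated with the visual tokens and the instruction text before being passed to the language model; how MedRegion-CT does this in practice is documented in the linked repository rather than in the abstract.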

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation
cs.CV 2026-05 unverdicted novelty 6.0

The paper diagnoses Template Collapse in 3D medical VLMs for CT reports and introduces CLarGen, a decoupled detection-plus-synthesis framework that raises macro-F1 from 0.189 to 0.487.

Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.

Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
cs.CV 2026-04 unverdicted novelty 6.0

DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.