Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Andrea M. Bejar; Debesh Jha; Gorkem Durak; Halil Ertugrul Aktas; Quoc-Huy Trinh; Ulas Bagci; Vanshali Sharma

arxiv: 2606.17213 · v1 · pith:WMGGVN4Cnew · submitted 2026-06-15 · 💻 cs.CL · cs.CV

Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha, Gorkem Durak, Ulas Bagci

Pith reviewed 2026-06-27 03:17 UTC · model grok-4.3

keywords: LLM adaptation; 3D CT report generation; parameter-efficient fine-tuning; diagnostic priors; volumetric imaging; frozen LLM; medical report generation; RAD3D-Prefix

The pith

Freezing large LLMs and training only lightweight layers with diagnostic priors outperforms full fine-tuning for 3D CT report generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to adapt large language models to generate reports from volumetric CT scans without overfitting on small medical datasets. It introduces RAD3D-Prefix, a module that feeds multi-label diagnostic classification logits into the model alongside image embeddings to connect visual features with clinical terms. The central result is that full fine-tuning helps mainly smaller models, while for LLMs of roughly one billion parameters and up, keeping the LLM frozen and updating just the projection layers delivers a better balance of report quality, out-of-domain generalization, and efficiency. Experiments across model sizes, automatic metrics, and a clinical reader study support this claim, with far fewer trainable parameters than fully fine-tuned alternatives.

Core claim

For LLMs of approximately 1B parameters and larger, freezing the LLM and training only a lightweight projection layer that conditions on multi-label diagnostic classification logits produces higher-quality reports from 3D CT volumes than full fine-tuning, with stronger out-of-domain generalization and substantially lower computational cost.

What carries the argument

RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that combines image embeddings with multi-label diagnostic classification logits while the LLM remains frozen.
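A minimal sketch of the conditioning idea, assuming the simplest reading of the description above: the image embedding and the diagnostic logits are concatenated and passed through one trainable linear projection that outputs a short sequence of prefix vectors in the LLM's embedding width. The function names (`init_linear`, `build_prefix`), the toy dimensions, and the single-layer projection are illustrative assumptions, not the authors' architecture; the paper describes three projection variants that are not reproduced here.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x (plain Python lists)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Hypothetical trainable projection parameters with a small random init."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def build_prefix(image_embedding, diag_logits, weight, bias, prefix_len, llm_dim):
    """Project [image embedding ; diagnostic logits] into prefix_len vectors of width llm_dim."""
    joint = list(image_embedding) + list(diag_logits)  # concatenate the two inputs
    flat = linear(joint, weight, bias)                  # the only trainable step in this sketch
    # Reshape the flat output into prefix_len vectors for the frozen LLM.
    return [flat[i * llm_dim:(i + 1) * llm_dim] for i in range(prefix_len)]

# Toy sizes chosen for illustration only (assumptions, not values from the paper).
IMG_DIM, NUM_LABELS, LLM_DIM, PREFIX_LEN = 8, 18, 4, 3

rng = random.Random(0)
weight, bias = init_linear(IMG_DIM + NUM_LABELS, PREFIX_LEN * LLM_DIM, rng)
image_embedding = [rng.gauss(0, 1) for _ in range(IMG_DIM)]
diag_logits = [rng.gauss(0, 1) for _ in range(NUM_LABELS)]

prefix = build_prefix(image_embedding, diag_logits, weight, bias, PREFIX_LEN, LLM_DIM)
print(len(prefix), len(prefix[0]))  # number of prefix vectors, width of each
```

In a real system the prefix vectors would be prepended to the token embeddings of a frozen LLM, and only `weight` and `bias` would receive gradient updates.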

If this is right

Fine-tuning the full LLM benefits smaller models most.
Freezing larger LLMs reduces overfitting on limited medical data.
RAD3D-Prefix achieves higher scores on automatic metrics and clinical reader evaluations than comparable baselines.
The approach maintains strong performance on data from different distributions.
Training requires substantially fewer trainable parameters than full fine-tuning.
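The parameter-efficiency point can be made concrete with a rough count. The sketch below compares trainable parameters when the LLM is frozen versus fully fine-tuned, using the two ends of the size range quoted in the abstract (96.1M and 1.6B). The projector shape (a two-layer MLP with the dimensions shown) is a hypothetical stand-in, and the vision encoder is left out of the count entirely.

```python
def linear_params(in_dim, out_dim):
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim

def trainable_summary(llm_params, projector_params, freeze_llm):
    """Return (trainable parameter count, trainable share of LLM + projector)."""
    trainable = projector_params if freeze_llm else projector_params + llm_params
    total = llm_params + projector_params
    return trainable, trainable / total

# Hypothetical projector: (image embedding 512 + 18 logits) -> 1024 -> 2048.
projector = linear_params(512 + 18, 1024) + linear_params(1024, 2048)

# LLM sizes taken from the range reported in the abstract.
for name, llm_params in (("96.1M", 96_100_000), ("1.6B", 1_600_000_000)):
    frozen, frozen_share = trainable_summary(llm_params, projector, freeze_llm=True)
    tuned, tuned_share = trainable_summary(llm_params, projector, freeze_llm=False)
    print(f"{name}: frozen={frozen:,} ({frozen_share:.4%})  fine-tuned={tuned:,} ({tuned_share:.0%})")
```

Under these assumed dimensions the frozen setting trains only the projector, so its trainable share shrinks as the LLM grows, while full fine-tuning always trains everything.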

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning strategy could extend to report generation from other 3D modalities such as MRI.
Diagnostic priors might reduce hallucinations across additional medical text-generation settings.
Lower parameter counts could make such systems practical in hospitals with limited GPU resources.
Combining the frozen-LLM approach with even larger base models may further improve results without raising training costs.

Load-bearing premise

The multi-label diagnostic classification logits accurately capture and preserve critical clinical details without introducing classification errors or biases that affect the reports.

What would settle it

If a reader study or error analysis finds that reports produced with the diagnostic logits contain more clinically significant mistakes traceable to upstream classification errors than reports from fully fine-tuned models, the claimed performance advantage would be undermined.

Figures

Figures reproduced from arXiv: 2606.17213 by Andrea M. Bejar, Debesh Jha, Gorkem Durak, Halil Ertugrul Aktas, Quoc-Huy Trinh, Ulas Bagci, Vanshali Sharma.

**Figure 2.** Three variations of the proposed projection module: (a) …

**Figure 3.** Overview of the proposed RAD3D-Prefix model. The model aligns the image encoder’s output and the classification logits to the text embedding space via a lightweight projection network.

**Figure 4.** Radar plots showing the impact of fine-tuning (solid) and freezing (dashed) …

**Figure 5.** Qualitative example of the baseline and RAD3D-Prefix. Matching sentence …

**Figure 6.** UMAP visualizations across different projection networks: V-2 (top) and V-3 …

**Figure 7.** Classification results on 18 and 21 multi-abnormality labels of the CT-RATE …

**Figure 8.** Forest plots of mean differences (95% CIs) for RAD3D-Prefix on (a) CT-RATE …

**Figure 9.** More qualitative examples of the baseline and our proposed method. Matching …

**Figure 10.** Qualitative sample comparing outcomes of the three variants, …

**Figure 11.** Samples of GREEN Summary for V-2 and V-3 variants.

**Figure 12.** Performance analysis with respect to increasing trainable parameters, influ…

Original abstract

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAD3D-Prefix conditions a frozen LLM on diagnostic logits for 3D CT report generation, and the scaling study shows that freezing beats fine-tuning above roughly 1B parameters, but the upstream classifier's accuracy is not validated.

The new piece is RAD3D-Prefix, which concatenates image embeddings with multi-label diagnostic logits before feeding them to the LLM, plus the scaling sweep across models from 96M to 1.6B parameters. The main finding is that fine-tuning helps smaller models while freezing bigger ones and training only the projection layers gives a better efficiency-generalization trade-off.

The paper does a solid job on the practical side. It tests the intuition that full fine-tuning on limited medical data risks overfitting and hallucination, then shows that the method uses far fewer trainable parameters than full fine-tuning while beating other parameter-efficient baselines on automatic metrics and a reader study. The out-of-domain generalization result is the part that would interest people actually deploying these systems.

The soft spot is exactly the one the stress-test flags. The logits are injected directly into the prompt for a frozen LLM, yet the abstract gives no per-class precision or recall for the upstream classifier on the test split, and no ablation that holds the projection layers fixed while dropping the logits. If the classifier errs on subtle findings, those errors propagate without correction. Dataset sizes and concrete metric values are also missing from the summary, which makes it hard to judge how large the reported gains actually are.

This is for groups working on efficient VLM adaptation in medical imaging. A reader who cares about parameter counts and scaling behavior in report generation would get something concrete from it. The experiments are defined clearly enough that it deserves a serious referee, even if the classifier validation will need to be added.

Referee Report

1 major / 2 minor

Summary. The manuscript investigates parameter-efficient adaptation of LLMs for 3D CT radiology report generation. It introduces RAD3D-Prefix, a lightweight module that concatenates image embeddings with multi-label diagnostic classification logits to condition a frozen LLM, thereby minimizing trainable parameters and overfitting risk. Through scaling experiments on models ranging from 96.1M to 1.6B parameters, the central claim is that full fine-tuning benefits smaller LLMs while freezing larger (~1B+) LLMs and training only projection layers with the diagnostic prior yields superior performance, generalization, and efficiency trade-offs versus baselines, supported by automatic metrics and a clinical reader study.

Significance. If the empirical claims hold after addressing validation gaps, the work would offer actionable insights into LLM scaling for medical volumetric data, emphasizing diagnostic priors for semantic bridging and parameter efficiency. The systematic size sweep and reader study provide concrete evidence for the efficiency claims.

major comments (1)

[Methods (RAD3D-Prefix) and Results] The claim that RAD3D-Prefix (image embeddings + multi-label diagnostic logits) provides a superior trade-off for ~1B+ LLMs rests on the logits accurately capturing clinical details without propagating errors. No section reports the upstream classifier's per-class precision/recall or error rate on the report-generation test split, and no ablation holds projection layers fixed while removing the logits component. This is load-bearing for the central scaling and generalization claims.
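The validation requested here is straightforward to compute once the classifier's logits and the ground-truth labels for the test split are available. The sketch below shows one way to derive per-class and macro-averaged precision and recall from multi-label logits. The function names, the decision threshold of 0 (equivalent to a sigmoid probability of 0.5), and the toy data are assumptions for illustration, not the paper's evaluation protocol.

```python
def per_class_precision_recall(logits, labels, threshold=0.0):
    """Per-class (precision, recall) for multi-label predictions.

    logits and labels are lists of per-sample lists; a logit above
    `threshold` counts as a positive prediction for that class.
    """
    num_classes = len(labels[0])
    results = []
    for c in range(num_classes):
        tp = fp = fn = 0
        for sample_logits, sample_labels in zip(logits, labels):
            predicted = sample_logits[c] > threshold
            actual = sample_labels[c] == 1
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual:
                fn += 1
        # Report 0.0 when a class has no predicted or no actual positives.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((precision, recall))
    return results

def macro_average(results):
    """Unweighted mean of per-class precision and recall."""
    n = len(results)
    return sum(p for p, _ in results) / n, sum(r for _, r in results) / n

# Toy example: 4 scans, 3 hypothetical abnormality labels.
logits = [[2.1, -0.3, 0.8], [-1.2, 1.5, 0.4], [0.7, -2.0, -0.9], [1.3, 0.2, -0.1]]
labels = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

results = per_class_precision_recall(logits, labels)
for c, (p, r) in enumerate(results):
    print(f"class {c}: precision={p:.2f} recall={r:.2f}")
print("macro:", macro_average(results))
```

Classes with low recall in such a table would mark findings where a frozen LLM is most likely to inherit a missed diagnosis from the prior.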

minor comments (2)

[Abstract] Abstract supplies no dataset sizes, specific metric values, exclusion criteria, or error analysis, hindering immediate assessment of empirical robustness.
[Abstract] The invented term RAD3D-Prefix is used without an explicit component breakdown or diagram reference on first use.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern regarding validation of the diagnostic logits component is well-taken and directly relevant to the central claims about RAD3D-Prefix. We address it below and commit to revisions that strengthen the evidence.

Referee: [Methods (RAD3D-Prefix) and Results] The claim that RAD3D-Prefix (image embeddings + multi-label diagnostic logits) provides a superior trade-off for ~1B+ LLMs rests on the logits accurately capturing clinical details without propagating errors. No section reports the upstream classifier's per-class precision/recall or error rate on the report-generation test split, and no ablation holds projection layers fixed while removing the logits component. This is load-bearing for the central scaling and generalization claims.

Authors: We agree that explicit reporting of the upstream multi-label classifier's performance on the report-generation test split is necessary to substantiate that the logits provide reliable clinical priors without substantial error propagation. The current manuscript does not include per-class precision, recall, or F1 scores for this classifier on the held-out test data, nor does it contain an ablation that isolates the logits contribution while holding the projection layers fixed. In the revised version we will add: (1) a dedicated subsection reporting the classifier's per-class and macro-averaged metrics on the test split, and (2) an ablation experiment that compares RAD3D-Prefix against an otherwise identical configuration using only image embeddings (projection layers fixed). These additions will allow readers to directly assess the incremental value and potential error contribution of the diagnostic logits.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on independent experimental results

The paper reports an empirical study comparing LLM fine-tuning strategies and the RAD3D-Prefix module on 3D CT report generation tasks. All load-bearing claims (superior trade-offs for freezing larger LLMs, out-of-domain generalization, parameter efficiency) are justified by automatic metrics, ablation tables, and a clinical reader study rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-definitional loops, fitted-input predictions, or self-citation chains appear in the provided text; the diagnostic-logit component is an external input whose accuracy is assumed but not derived from the paper's own results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into model internals; the ledger therefore records only the high-level assumptions visible in the text.

axioms (1)

Domain assumption: Parameter-efficient adaptation of frozen LLMs can bridge the semantic gap between volumetric image features and clinical terminology without overfitting on small medical datasets.
This is the study's premise that keeping the LLM frozen while adding diagnostic logits suffices for clinical factuality.

invented entities (1)

RAD3D-Prefix (no independent evidence)
Purpose: lightweight diagnostic-prior conditioning framework that integrates image embeddings with multi-label diagnostic classification logits.
New module introduced by the authors to minimize trainable parameters.

pith-pipeline@v0.9.1-grok · 5830 in / 1357 out tokens · 50327 ms · 2026-06-27T03:17:29.557895+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 3 canonical work pages

[1]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michi...

2005
[2]

Biomedlm: A 2.7B parameter language model trained on biomedical text. arXiv preprint arXiv:2403.18421, 2024

Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, et al. Biomedlm: A 2.7B parameter language model trained on biomedical text. arXiv preprint arXiv:2403.18421, 2024

arXiv 2024
[3]

Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

2020
[4]

Large language model with region-guided referring and grounding for ct report generation. arXiv preprint arXiv:2411.15539, 2024

Zhixuan Chen, Yequan Bie, Haibo Jin, and Hao Chen. Large language model with region-guided referring and grounding for ct report generation. arXiv preprint arXiv:2411.15539, 2024

arXiv 2024
[5]

Large language model with region-guided referring and grounding for ct report generation. IEEE Transactions on Medical Imaging, 2025

Zhixuan Chen, Yequan Bie, Haibo Jin, and Hao Chen. Large language model with region-guided referring and grounding for ct report generation. IEEE Transactions on Medical Imaging, 2025

2025
[6]

Instructblip: towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: towards general-purpose vision-language models with instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

2023
[7]

Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes. arXiv preprint arXiv:2408.11965, 2024

Theo Di Piazza, Carole Lazarus, Olivier Nempont, and Loic Boussel. Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes. arXiv preprint arXiv:2408.11965, 2024

arXiv 2024
[8]

From images to textual prompts: Zero-shot visual question answering with frozen large language models

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven Hoi. From images to textual prompts: Zero-shot visual question answering with frozen large language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10867–10877, 2023.

2023
[9]

A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities. CoRR, 2024

Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, et al. A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities. CoRR, 2024

2024
[10]

Ct2rep: Automated radiology report generation for 3d medical imaging

Ibrahim Ethem Hamamci, Sezgin Er, and Bjoern Menze. Ct2rep: Automated radiology report generation for 3d medical imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–486. Springer, 2024

2024
[11]

Generatect: Text-conditional generation of 3d chest ct volumes

Ibrahim Ethem Hamamci, Sezgin Er, Anjany Sekuboyina, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Doğan, Muhammed Furkan Dasdelen, et al. Generatect: Text-conditional generation of 3d chest ct volumes. In European Conference on Computer Vision, pages 126–143. Springer, 2024

2024
[12]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

2022
[13]

Inspect: a multimodal dataset for pulmonary embolism diagnosis and prognosis. arXiv preprint arXiv:2311.10798, 2023

Shih-Cheng Huang, Zepeng Huo, Ethan Steinberg, Chia-Chun Chiang, Matthew P Lungren, Curtis P Langlotz, Serena Yeung, Nigam H Shah, and Jason A Fries. Inspect: a multimodal dataset for pulmonary embolism diagnosis and prognosis. arXiv preprint arXiv:2311.10798, 2023

arXiv 2023
[14]

Unified language-vision pretraining in LLM with dynamic discrete visual tokenization

Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin CHEN, Chengru Song, dai meng, Di ZHANG, Wenwu Ou, Kun Gai, and Yadong MU. Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FlvtjAB0gl

2024
[15]

Vilt: Vision-and-language transformer without convolution or region supervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International conference on machine learning, pages 5583–5594. PMLR, 2021

2021
[16]

E3d-gpt: Enhanced 3d visual foundation for medical vision-language model. arXiv preprint arXiv:2410.14200, 2024

Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, and S Kevin Zhou. E3d-gpt: Enhanced 3d visual foundation for medical vision-language model. arXiv preprint arXiv:2410.14200, 2024

arXiv 2024
[17]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023

2023
[18]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.

2022
[19]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[20]

Dtllm-vlt: Diverse text generation for visual language tracking based on llm

Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. Dtllm-vlt: Diverse text generation for visual language tracking based on llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7283–7292, 2024

2024
[21]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

2004
[22]

Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

2023
[23]

BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), 09 2022

Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), 09 2022. ISSN 1477-4054. doi: 10.1093/bib/bbac409. URL https://doi.org/10.1093/bib/bbac409. bbac409

work page doi:10.1093/bib/bbac409 2022
[24]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog

AI Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. Retrieved December, 20:2024, 2024

2024
[25]

Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021

Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021

arXiv 2021
[26]

Med-flamingo: a multimodal medical few-shot learner

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023

2023
[27]

GREEN: Generative radiology report evaluation and error notation

Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, and Jean-Benoit Delbrouck. GREEN: Generative radiology report evaluation and error notation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Associati...

work page doi:10.18653/v1/2024.findings-emnlp.21 2024
[28]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Com...

work page doi:10.3115/1073083 2002
[29]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021

2021
[30]

Multitask prompted training enables zero-shot task generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2021

2021
[31]

Med-2e3: A 2d-enhanced 3d medical multimodal large language model. arXiv preprint arXiv:2411.12783, 2024

Yiming Shi, Xun Zhu, Ying Hu, Chenyi Guo, Miao Li, and Ji Wu. Med-2e3: A 2d-enhanced 3d medical multimodal large language model. arXiv preprint arXiv:2411.12783, 2024

arXiv 2024
[32]

Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

2023
[33]

R2gengpt: Radiology report generation with frozen llms. Meta-Radiology, 1(3):100033, 2023

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. R2gengpt: Radiology report generation with frozen llms. Meta-Radiology, 1(3):100033, 2023

2023
[34]

Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[35]

Pubmed 2.0. Medical reference services quarterly, 39(4):382–387, 2020

Jacob White. Pubmed 2.0. Medical reference services quarterly, 39(4):382–387, 2020

2020
[36]

Bertscore: Evaluating text generation with bert

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019

2019
[37]

Uncovering knowledge gaps in radiology report generation models through knowledge graphs. arXiv preprint arXiv:2408.14397, 2024

Xiaoman Zhang, Julián N Acosta, Hong-Yu Zhou, and Pranav Rajpurkar. Uncovering knowledge gaps in radiology report generation models through knowledge graphs. arXiv preprint arXiv:2408.14397, 2024

arXiv 2024

[1] [1]

METEOR: An automatic metric for MT eval- uation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT eval- uation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Sum- marization, pages 65–72, Ann Arbor, Michi...

2005

[2] [2]

Biomedlm: A 2.7 b parameter language model trained on biomedical text.arXiv preprint arXiv:2403.18421, 2024

Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, et al. Biomedlm: A 2.7 b parameter language model trained on biomedical text.arXiv preprint arXiv:2403.18421, 2024

arXiv 2024

[3] [3]

Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

1901

[4] [4]

Large language model with region-guided referring and grounding for ct report generation.arXiv preprint arXiv:2411.15539, 2024

Zhixuan Chen, Yequan Bie, Haibo Jin, and Hao Chen. Large language model with region-guided referring and grounding for ct report generation.arXiv preprint arXiv:2411.15539, 2024

arXiv 2024

[5] [5]

Large language model with region-guided referring and grounding for ct report generation.IEEE Transactions on Medical Imaging, 2025

Zhixuan Chen, Yequan Bie, Haibo Jin, and Hao Chen. Large language model with region-guided referring and grounding for ct report generation.IEEE Transactions on Medical Imaging, 2025

2025

[6] [6]

Instructblip: towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: towards general-purpose vision-language models with instruction tuning. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

2023

[7] [7]

Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes.arXiv preprint arXiv:2408.11965, 2024

Theo Di Piazza, Carole Lazarus, Olivier Nempont, and Loic Boussel. Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes.arXiv preprint arXiv:2408.11965, 2024

arXiv 2024

[8] [8]

From images to textual prompts: Zero-shot vi- sual question answering with frozen large language models

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven Hoi. From images to textual prompts: Zero-shot vi- sual question answering with frozen large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10867– 10877, 2023. 23

2023

[9] Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, et al. A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities. CoRR, 2024.

[10] Ibrahim Ethem Hamamci, Sezgin Er, and Bjoern Menze. Ct2rep: Automated radiology report generation for 3d medical imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–486. Springer, 2024.

[11] Ibrahim Ethem Hamamci, Sezgin Er, Anjany Sekuboyina, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Doğan, Muhammed Furkan Dasdelen, et al. Generatect: Text-conditional generation of 3d chest ct volumes. In European Conference on Computer Vision, pages 126–143. Springer, 2024.

[12] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

[13] Shih-Cheng Huang, Zepeng Huo, Ethan Steinberg, Chia-Chun Chiang, Matthew P Lungren, Curtis P Langlotz, Serena Yeung, Nigam H Shah, and Jason A Fries. Inspect: a multimodal dataset for pulmonary embolism diagnosis and prognosis. arXiv preprint arXiv:2311.10798, 2023.

[14] Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin CHEN, Chengru Song, dai meng, Di ZHANG, Wenwu Ou, Kun Gai, and Yadong MU. Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FlvtjAB0gl

[15] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International conference on machine learning, pages 5583–5594. PMLR, 2021.

[16] Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, and S Kevin Zhou. E3d-gpt: Enhanced 3d visual foundation for medical vision-language model. arXiv preprint arXiv:2410.14200, 2024.

[17] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023.

[18] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.

[19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

[20] Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. Dtllm-vlt: Diverse text generation for visual language tracking based on llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7283–7292, 2024.

[21] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

[23] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), 09 2022. ISSN 1477-4054. doi: 10.1093/bib/bbac409. URL https://doi.org/10.1093/bib/bbac409. bbac409.

[24] AI Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. Retrieved December, 20:2024, 2024.

[25] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.

[26] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023.

[27] Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, and Jean-Benoit Delbrouck. GREEN: Generative radiology report evaluation and error notation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Associati... doi:10.18653/v1/2024.findings-emnlp.21, 2024

[28] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Com... doi:10.3115/1073083

[29] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.

[30] Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2021.

[31] Yiming Shi, Xun Zhu, Ying Hu, Chenyi Guo, Miao Li, and Ji Wu. Med-2e3: A 2d-enhanced 3d medical multimodal large language model. arXiv preprint arXiv:2411.12783, 2024.

[32] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

[33] Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. R2gengpt: Radiology report generation with frozen llms. Meta-Radiology, 1(3):100033, 2023.

[34] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[35] Jacob White. Pubmed 2.0. Medical reference services quarterly, 39(4):382–387, 2020.

[36] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019.

[37] Xiaoman Zhang, Julián N Acosta, Hong-Yu Zhou, and Pranav Rajpurkar. Uncovering knowledge gaps in radiology report generation models through knowledge graphs. arXiv preprint arXiv:2408.14397, 2024.