pith. sign in

arxiv: 2606.17213 · v1 · pith:WMGGVN4Cnew · submitted 2026-06-15 · 💻 cs.CL · cs.CV

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Pith reviewed 2026-06-27 03:17 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords LLM adaptation3D CT report generationparameter-efficient fine-tuningdiagnostic priorsvolumetric imagingfrozen LLMmedical report generationRAD3D-Prefix
0
0 comments X

The pith

Freezing large LLMs and training only lightweight layers with diagnostic priors outperforms full fine-tuning for 3D CT report generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to adapt large language models to generate reports from volumetric CT scans without overfitting on small medical datasets. It introduces RAD3D-Prefix, a module that feeds multi-label diagnostic classification results into the model to connect visual features with clinical terms. The central result is that full fine-tuning helps only smaller models, while for LLMs of roughly one billion parameters and up, keeping the LLM frozen and updating just the projection layers delivers better report quality, out-of-domain generalization, and efficiency. Experiments across model sizes, automatic metrics, and a clinical reader study support the claim that this uses far fewer trainable parameters than alternatives.

Core claim

For LLMs of approximately 1B parameters and larger, freezing the LLM and training only a lightweight projection layer that conditions on multi-label diagnostic classification logits produces higher-quality reports from 3D CT volumes than full fine-tuning, with stronger out-of-domain generalization and substantially lower computational cost.

What carries the argument

RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that combines image embeddings with multi-label diagnostic classification logits while the LLM remains frozen.

If this is right

  • Fine-tuning the full LLM benefits smaller models most.
  • Freezing larger LLMs reduces overfitting on limited medical data.
  • RAD3D-Prefix achieves higher scores on automatic metrics and clinical reader evaluations than comparable baselines.
  • The approach maintains strong performance on data from different distributions.
  • Training requires substantially fewer parameters than full fine-tuning or other efficient methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning strategy could extend to report generation from other 3D modalities such as MRI.
  • Diagnostic priors might reduce hallucinations across additional medical text-generation settings.
  • Lower parameter counts could make such systems practical in hospitals with limited GPU resources.
  • Combining the frozen-LLM approach with even larger base models may further improve results without raising training costs.

Load-bearing premise

The multi-label diagnostic classification logits accurately capture and preserve critical clinical details without introducing classification errors or biases that affect the reports.

What would settle it

If a reader study or error analysis finds that reports produced with the diagnostic logits contain more clinically significant mistakes traceable to upstream classification errors than reports from fully fine-tuned models, the performance advantage would be falsified.

Figures

Figures reproduced from arXiv: 2606.17213 by Andrea M. Bejar, Debesh Jha, Gorkem Durak, Halil Ertugrul Aktas, Quoc-Huy Trinh, Ulas Bagci, Vanshali Sharma.

Figure 1
Figure 1. Figure 1: Three critical challenges in report generation: (a) Semantic Clinical Gap. (b) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Three variations of the proposed projection module: (a) [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed RAD3D-Prefix model. The model aligns the image encoder’s output and the classification logits to the text embedding space via a lightweight projection network. textual semantics. While recent work [5] has explored region-guided mechanisms, our work investigates parameter-efficient and diagnostic-prior conditioning for 3D CT report generation. 2.2 Vision Projector in Large Vision La… view at source ↗
Figure 4
Figure 4. Figure 4: Radar plots showing the impact of fine-tuning (solid) and freezing (dashed) [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative example of the baseline and RAD3D-Prefix. Matching sentence [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: UMAP visualizations across different projection networks: V-2 (top) and V-3 [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Classification results on 18 and 21 multi-abnormality labels of the CT-RATE [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Forest plots of mean differences (95% CIs) for RAD3D-Prefix on (a) CT-RATE [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: More qualitative examples of the baseline and our proposed method. Matching [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative sample comparing outcomes of the three variants, [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Samples of GREEN Summary for V-2 and V-3 variants. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Performance analysis with respect to increasing trainable parameters, influ [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
read the original abstract

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+ LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript investigates parameter-efficient adaptation of LLMs for 3D CT radiology report generation. It introduces RAD3D-Prefix, a lightweight module that concatenates image embeddings with multi-label diagnostic classification logits to condition a frozen LLM, thereby minimizing trainable parameters and overfitting risk. Through scaling experiments on models ranging from 96.1M to 1.6B parameters, the central claim is that full fine-tuning benefits smaller LLMs while freezing larger (~1B+) LLMs and training only projection layers with the diagnostic prior yields superior performance, generalization, and efficiency trade-offs versus baselines, supported by automatic metrics and a clinical reader study.

Significance. If the empirical claims hold after addressing validation gaps, the work would offer actionable insights into LLM scaling for medical volumetric data, emphasizing diagnostic priors for semantic bridging and parameter efficiency. The systematic size sweep and reader study provide concrete evidence for the efficiency claims.

major comments (1)
  1. [Methods (RAD3D-Prefix) and Results] The claim that RAD3D-Prefix (image embeddings + multi-label diagnostic logits) provides a superior trade-off for ~1B+ LLMs rests on the logits accurately capturing clinical details without propagating errors. No section reports the upstream classifier's per-class precision/recall or error rate on the report-generation test split, and no ablation holds projection layers fixed while removing the logits component. This is load-bearing for the central scaling and generalization claims.
minor comments (2)
  1. [Abstract] Abstract supplies no dataset sizes, specific metric values, exclusion criteria, or error analysis, hindering immediate assessment of empirical robustness.
  2. [Abstract] The invented term RAD3D-Prefix is used without an explicit component breakdown or diagram reference on first use.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern regarding validation of the diagnostic logits component is well-taken and directly relevant to the central claims about RAD3D-Prefix. We address it below and commit to revisions that strengthen the evidence.

read point-by-point responses
  1. Referee: [Methods (RAD3D-Prefix) and Results] The claim that RAD3D-Prefix (image embeddings + multi-label diagnostic logits) provides a superior trade-off for ~1B+ LLMs rests on the logits accurately capturing clinical details without propagating errors. No section reports the upstream classifier's per-class precision/recall or error rate on the report-generation test split, and no ablation holds projection layers fixed while removing the logits component. This is load-bearing for the central scaling and generalization claims.

    Authors: We agree that explicit reporting of the upstream multi-label classifier's performance on the report-generation test split is necessary to substantiate that the logits provide reliable clinical priors without substantial error propagation. The current manuscript does not include per-class precision, recall, or F1 scores for this classifier on the held-out test data, nor does it contain an ablation that isolates the logits contribution while holding the projection layers fixed. In the revised version we will add: (1) a dedicated subsection reporting the classifier's per-class and macro-averaged metrics on the test split, and (2) an ablation experiment that compares RAD3D-Prefix against an otherwise identical configuration using only image embeddings (projection layers fixed). These additions will allow readers to directly assess the incremental value and potential error contribution of the diagnostic logits. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on independent experimental results

full rationale

The paper reports an empirical study comparing LLM fine-tuning strategies and the RAD3D-Prefix module on 3D CT report generation tasks. All load-bearing claims (superior trade-offs for freezing larger LLMs, out-of-domain generalization, parameter efficiency) are justified by automatic metrics, ablation tables, and a clinical reader study rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-definitional loops, fitted-input predictions, or self-citation chains appear in the provided text; the diagnostic-logit component is an external input whose accuracy is assumed but not derived from the paper's own results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review yields limited visibility into model internals; the ledger therefore records only the high-level assumptions visible in the text.

axioms (1)
  • domain assumption Parameter-efficient adaptation of frozen LLMs can bridge the semantic gap between volumetric image features and clinical terminology without overfitting on small medical datasets
    The study premise that keeping the LLM frozen while adding diagnostic logits suffices for clinical factuality.
invented entities (1)
  • RAD3D-Prefix no independent evidence
    purpose: lightweight diagnostic-prior conditioning framework that integrates image embeddings with multi-label diagnostic classification logits
    New module introduced by the authors to minimize trainable parameters.

pith-pipeline@v0.9.1-grok · 5830 in / 1357 out tokens · 50327 ms · 2026-06-27T03:17:29.557895+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 3 canonical work pages

  1. [1]

    METEOR: An automatic metric for MT eval- uation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT eval- uation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Sum- marization, pages 65–72, Ann Arbor, Michi...

  2. [2]

    Biomedlm: A 2.7 b parameter language model trained on biomedical text.arXiv preprint arXiv:2403.18421, 2024

    Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, et al. Biomedlm: A 2.7 b parameter language model trained on biomedical text.arXiv preprint arXiv:2403.18421, 2024

  3. [3]

    Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

  4. [4]

    Large language model with region-guided referring and grounding for ct report generation.arXiv preprint arXiv:2411.15539, 2024

    Zhixuan Chen, Yequan Bie, Haibo Jin, and Hao Chen. Large language model with region-guided referring and grounding for ct report generation.arXiv preprint arXiv:2411.15539, 2024

  5. [5]

    Large language model with region-guided referring and grounding for ct report generation.IEEE Transactions on Medical Imaging, 2025

    Zhixuan Chen, Yequan Bie, Haibo Jin, and Hao Chen. Large language model with region-guided referring and grounding for ct report generation.IEEE Transactions on Medical Imaging, 2025

  6. [6]

    Instructblip: towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: towards general-purpose vision-language models with instruction tuning. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

  7. [7]

    Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes.arXiv preprint arXiv:2408.11965, 2024

    Theo Di Piazza, Carole Lazarus, Olivier Nempont, and Loic Boussel. Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes.arXiv preprint arXiv:2408.11965, 2024

  8. [8]

    From images to textual prompts: Zero-shot vi- sual question answering with frozen large language models

    Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven Hoi. From images to textual prompts: Zero-shot vi- sual question answering with frozen large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10867– 10877, 2023. 23

  9. [9]

    A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities.CoRR, 2024

    Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sev- val Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, et al. A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities.CoRR, 2024

  10. [10]

    Ct2rep: Automated ra- diology report generation for 3d medical imaging

    Ibrahim Ethem Hamamci, Sezgin Er, and Bjoern Menze. Ct2rep: Automated ra- diology report generation for 3d medical imaging. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–486. Springer, 2024

  11. [11]

    Generatect: Text-conditional generation of 3d chest ct volumes

    Ibrahim Ethem Hamamci, Sezgin Er, Anjany Sekuboyina, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Do ˘gan, Muhammed Furkan Dasdelen, et al. Generatect: Text-conditional generation of 3d chest ct volumes. InEuropean Conference on Computer Vision, pages 126–143. Springer, 2024

  12. [12]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  13. [13]

    In- spect: a multimodal dataset for pulmonary embolism diagnosis and prognosis.arXiv preprint arXiv:2311.10798, 2023

    Shih-Cheng Huang, Zepeng Huo, Ethan Steinberg, Chia-Chun Chiang, Matthew P Lungren, Curtis P Langlotz, Serena Yeung, Nigam H Shah, and Jason A Fries. In- spect: a multimodal dataset for pulmonary embolism diagnosis and prognosis.arXiv preprint arXiv:2311.10798, 2023

  14. [14]

    Unified language-vision pretraining in LLM with dynamic discrete visual tokenization

    Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin CHEN, Chengru Song, dai meng, Di ZHANG, Wenwu Ou, Kun Gai, and Yadong MU. Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. InThe Twelfth International Conference on Learning Represen- tations, 2024. URLhttps://openreview.net/forum?id=FlvtjAB0gl

  15. [15]

    Vilt: Vision-and-language transformer without convolution or region supervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021

  16. [16]

    E3d-gpt: Enhanced 3d visual foundation for medical vision-language model.arXiv preprint arXiv:2410.14200, 2024

    Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, and S Kevin Zhou. E3d-gpt: Enhanced 3d visual foundation for medical vision-language model.arXiv preprint arXiv:2410.14200, 2024

  17. [17]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  18. [18]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and genera- tion

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and genera- tion. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022. 24

  19. [19]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language mod- els

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language mod- els. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  20. [20]

    Dtllm-vlt: Diverse text generation for visual language tracking based on llm

    Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. Dtllm-vlt: Diverse text generation for visual language tracking based on llm. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 7283–7292, 2024

  21. [21]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. As- sociation for Computational Linguistics. URLhttps://aclanthology.org/ W04-1013/

  22. [22]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  23. [23]

    BioGPT: generative pre-trained transformer for biomedical text generation and mining.Briefings in Bioinformatics, 23(6), 09 2022

    Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie- Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining.Briefings in Bioinformatics, 23(6), 09 2022. ISSN 1477-4054. doi: 10.1093/bib/bbac409. URLhttps://doi.org/10.1093/bib/bbac409. bbac409

  24. [24]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models.Meta AI Blog

    AI Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models.Meta AI Blog. Retrieved December, 20:2024, 2024

  25. [25]

    Clipcap: Clip prefix for image captioning.arXiv preprint arXiv:2111.09734, 2021

    Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning.arXiv preprint arXiv:2111.09734, 2021

  26. [26]

    Med-flamingo: a multimodal medical few-shot learner

    Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine Learning for Health (ML4H), pages 353–367. PMLR, 2023

  27. [27]

    Chaudhari, and Jean-Benoit Delbrouck

    Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Chris- tian Bluethgen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, and Jean-Benoit Delbrouck. GREEN: Generative radiology report evaluation and error notation. In Yaser Al-Onaizan, Mohit Bansal, and Yun- Nung Chen, editors,Findings of the Associati...

  28. [28]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Char- niak, and Dekang Lin, editors,Proceedings of the 40th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Com...

  29. [29]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Rad- ford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInter- national conference on machine learning, pages 8821–8831. Pmlr, 2021

  30. [30]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. InInternational Confer- ence on Learning Representations, 2021

  31. [31]

    Med-2e3: A 2d-enhanced 3d medical multimodal large language model.arXiv preprint arXiv:2411.12783, 2024

    Yiming Shi, Xun Zhu, Ying Hu, Chenyi Guo, Miao Li, and Ji Wu. Med-2e3: A 2d-enhanced 3d medical multimodal large language model.arXiv preprint arXiv:2411.12783, 2024

  32. [32]

    Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

  33. [33]

    R2gengpt: Radiology report generation with frozen llms.Meta-Radiology, 1(3):100033, 2023

    Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. R2gengpt: Radiology report generation with frozen llms.Meta-Radiology, 1(3):100033, 2023

  34. [34]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35: 24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35: 24824–24837, 2022

  35. [35]

    Pubmed 2.0.Medical reference services quarterly, 39(4):382–387, 2020

    Jacob White. Pubmed 2.0.Medical reference services quarterly, 39(4):382–387, 2020

  36. [36]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. InInternational Conference on Learning Representations, 2019

  37. [37]

    Uncov- ering knowledge gaps in radiology report generation models through knowledge graphs.arXiv preprint arXiv:2408.14397, 2024

    Xiaoman Zhang, Julián N Acosta, Hong-Yu Zhou, and Pranav Rajpurkar. Uncov- ering knowledge gaps in radiology report generation models through knowledge graphs.arXiv preprint arXiv:2408.14397, 2024