pith. sign in

arxiv: 2408.16213 · v1 · submitted 2024-08-29 · 💻 cs.CV · cs.AI· cs.CL

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

Pith reviewed 2026-05-23 21:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords multi-modal LLMchest X-raymedical report generationvisual groundingvisual question answeringchain-of-thought promptingmulti-task learning
0
0 comments X

The pith

A single multi-modal LLM trained on conversational visual instructions performs chest X-ray report generation, visual grounding, and VQA while reaching state-of-the-art clinical accuracy in reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that one multi-modal large language model, when trained on a dataset merging multiple chest X-ray tasks into conversational format, can handle medical report generation, visual grounding, and visual question answering without separate specialized models for each. A sympathetic reader would care because this multi-task setup could simplify AI tools in healthcare by cutting the need for multiple systems. If correct, it means one model adapts to different input scenarios like single or multiple images and still produces clinically accurate outputs through chain-of-thought reasoning that identifies findings first.

Core claim

M4CXR is trained on a visual instruction-following dataset that integrates various task-specific datasets in conversational format. This enables the model to support medical report generation by using chain-of-thought prompting to identify findings before generating reports, visual grounding, and visual question answering. The model achieves state-of-the-art clinical accuracy in report generation and performs at levels comparable to specialized models in the other tasks, with adaptability to single-image, multi-image, and multi-study contexts.

What carries the argument

The integrated visual instruction-following dataset in conversational format that trains one multi-modal LLM to handle multiple CXR tasks, combined with chain-of-thought prompting that first identifies findings before report generation.

Load-bearing premise

Combining different task datasets into one conversational training set lets the model learn all tasks without losing accuracy on any individual one.

What would settle it

A test on a held-out clinical benchmark for medical report generation where M4CXR is run without chain-of-thought prompting and its accuracy falls below that of prior specialized models.

Figures

Figures reproduced from arXiv: 2408.16213 by Byungmu Yoon, Jihun Hyun, Jonggwon Park, Kyoyun Choi, Soobum Kim.

Figure 1
Figure 1. Figure 1: Overview of the multi-tasking capabilities of M4CXR. Facilitated by CoT prompting in MRG, M4CXR produces [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) The architecture of M4CXR. Utilizing the LLaVA framework, it allows visual tokens from each image to be [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example of multi-turn CoT prompting. M4CXR [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Examples of M4CXR’s performance in (a) visual grounding and (b) VQA. The images are selected from the test splits [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Examples of medical report generation across various scenarios. For the same study, the top left shows the result for [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Examples of visual grounding. The ground-truth bounding box is represented by a yellow solid box, while the predic [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Examples of medical report explanations using easy language. The left shows the results from M4CXR, while the [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Examples of medical report summarization. The left shows the results from M4CXR, while the right shows the results [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Examples of medical treatment recommendation. The left shows the results from M4CXR, while the right shows the [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
read the original abstract

The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the multi-tasking capabilities of LLMs or lacking clinical accuracy. This paper presents M4CXR, a multi-modal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multi-image, and multi-study contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and also demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR's versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents M4CXR, a multi-modal LLM for chest X-ray interpretation trained on an integrated visual instruction-following dataset in conversational format. It supports multiple tasks including medical report generation (MRG), visual grounding, and VQA. The central claim is that M4CXR achieves state-of-the-art clinical accuracy in MRG via a chain-of-thought prompting strategy that first identifies findings in the CXR image and then generates the corresponding report; the model is also adaptable to single-image, multi-image, and multi-study MRG scenarios and performs comparably to specialized models in visual grounding while showing strong VQA results.

Significance. If the performance claims are substantiated with rigorous, isolated evaluations, the work would be significant for demonstrating that a single multi-modal LLM can handle diverse CXR tasks in a conversational format while preserving clinical accuracy, potentially reducing reliance on task-specific models. The conversational training approach and CoT strategy for MRG represent potentially reusable design choices if shown to generalize.

major comments (2)
  1. [Abstract] Abstract: the assertion of 'state-of-the-art clinical accuracy in MRG' is unsupported by any numerical metrics, baseline comparisons, dataset specifications, or evaluation protocols, preventing verification of the claim.
  2. [Abstract] Abstract: the attribution of SOTA clinical accuracy specifically to the chain-of-thought prompting strategy (identifying findings then generating reports) lacks isolating evidence such as an ablation comparing CoT vs. standard prompting on the identical model and test set.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly naming the clinical accuracy metric (e.g., CheXbert F1 or RadGraph) and the primary baselines used to declare SOTA.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the claims require supporting details to be verifiable and will revise the abstract in the next version to address both points.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of 'state-of-the-art clinical accuracy in MRG' is unsupported by any numerical metrics, baseline comparisons, dataset specifications, or evaluation protocols, preventing verification of the claim.

    Authors: We agree that the abstract should be self-contained. The main text (Sections 4.1–4.3 and associated tables) reports the specific clinical accuracy metrics (via CheXbert/RadGraph factuality), baseline comparisons on MIMIC-CXR and other datasets, and the evaluation protocols used. We will revise the abstract to include concise numerical highlights and a brief mention of the evaluation setup. revision: yes

  2. Referee: [Abstract] Abstract: the attribution of SOTA clinical accuracy specifically to the chain-of-thought prompting strategy (identifying findings then generating reports) lacks isolating evidence such as an ablation comparing CoT vs. standard prompting on the identical model and test set.

    Authors: The paper presents the CoT strategy as the prompting method employed for the reported MRG results (Section 3.2). We do not provide an explicit ablation isolating CoT versus standard prompting on the same model and test set. We will revise the abstract to describe the prompting approach factually without claiming isolated causation for the SOTA result. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims benchmarked externally

full rationale

The paper presents an empirical multi-modal LLM trained on an integrated visual instruction-following dataset in conversational format, then evaluated on standard tasks including MRG, visual grounding, and VQA. The SOTA claim for clinical accuracy in MRG is attributed to a chain-of-thought prompting strategy at inference time and is positioned as a performance result against prior external models. No equations, parameter fits, or derivations are described that reduce by construction to the inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation relies on external benchmarks, satisfying the condition for a self-contained result without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The claim depends on the effectiveness of the chain-of-thought strategy and the multi-task dataset integration for achieving SOTA results.

free parameters (1)
  • model training parameters
    LLM training involves many hyperparameters fitted during optimization.
axioms (1)
  • domain assumption Multi-task training on conversational data improves performance across tasks including clinical accuracy in report generation
    Central to the paper's approach as described in the abstract.

pith-pipeline@v0.9.0 · 5786 in / 1115 out tokens · 38164 ms · 2026-05-23T21:59:58.690284+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction

    cs.CV 2025-04 unverdicted novelty 6.0

    RA-RRG extracts key phrases with LLMs, retrieves them via multimodal similarity, and conditions report generation on them to achieve SOTA CheXbert scores and competitive RadGraph F1 on MIMIC-CXR and IU X-ray while sup...

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716--23736

  4. [4]

    Bae, S.; Kyung, D.; Ryu, J.; Cho, E.; Lee, G.; Kweon, S.; Oh, J.; Ji, L.; Chang, E.; Kim, T.; et al. 2024. EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems, 36

  5. [5]

    Bannur, S.; Bouzid, K.; Castro, D. C.; Schwaighofer, A.; Bond-Taylor, S.; Ilse, M.; Pérez-García, F.; Salvatelli, V.; Sharma, H.; Meissen, F.; Ranjit, M.; Srivastav, S.; Gong, J.; Falck, F.; Oktay, O.; Thieme, A.; Lungren, M. P.; Wetscherek, M. T.; Alvarez-Valle, J.; and Hyland, S. L. 2024. MAIRA-2: Grounded Radiology Report Generation. arXiv:2406.04449

  6. [6]

    C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al

    Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D. C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. 2022 a . Making the most of text semantics to improve biomedical vision--language processing. In European conference on computer vision, 1--21. Springer

  7. [7]

    T.; Naumann, T.; Nori, A.; Alvarez Valle, J.; Poon, H.; and Oktay, O

    Boecking, B.; Usuyama, N.; Bannur, S.; Coelho de Castro, D.; Schwaighofer, A.; Hyland, S.; Wetscherek, M. T.; Naumann, T.; Nori, A.; Alvarez Valle, J.; Poon, H.; and Oktay, O. 2022 b . MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing (version 0.1). https://doi.org/10.13026/b90j-vb87

  8. [8]

    Cha, J.; Kang, W.; Mun, J.; and Roh, B. 2024. Honeybee: Locality-enhanced projector for multimodal llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13817--13827

  9. [9]

    Chaves, J. M. Z.; Huang, S.-C.; Xu, Y.; Xu, H.; Usuyama, N.; Zhang, S.; Wang, F.; Xie, Y.; Khademi, M.; Yang, Z.; Awadalla, H.; Gong, J.; Hu, H.; Yang, J.; Li, C.; Gao, J.; Gu, Y.; Wong, C.; Wei, M.; Naumann, T.; Chen, M.; Lungren, M. P.; Chaudhari, A.; Yeung-Levy, S.; Langlotz, C. P.; Wang, S.; and Poon, H. 2024. Towards a clinically accessible radiology...

  10. [10]

    Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; and Elhoseiny, M. 2023 a . MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv:2310.09478

  11. [11]

    Chen, Z.; Song, Y.; Chang, T.-H.; and Wan, X. 2022. Generating Radiology Reports via Memory-driven Transformer. arXiv:2010.16056

  12. [12]

    A vision- language foundation model to enhance efficiency of chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

    Chen, Z.; Varma, M.; Delbrouck, J.-B.; Paschali, M.; Blankemeier, L.; Veen, D. V.; Valanarasu, J. M. J.; Youssef, A.; Cohen, J. P.; Reis, E. P.; Tsai, E. B.; Johnston, A.; Olsen, C.; Abraham, T. M.; Gatidis, S.; Chaudhari, A. S.; and Langlotz, C. 2024. CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation. arXiv:2401.12208

  13. [13]

    Chen, Z.; Zhou, Y.; Tran, A.; Zhao, J.; Wan, L.; Ooi, G. S. K.; Cheng, L. T.-E.; Thng, C. H.; Xu, X.; Liu, Y.; et al. 2023 b . Medical phrase grounding with region-phrase context contrastive alignment. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 371--381. Springer

  14. [14]

    Chowdhury, M. E. H.; Rahman, T.; Khandakar, A.; Mazhar, R.; Kadir, M. A.; Mahbub, Z. B.; Islam, K. R.; Khan, M. S.; Iqbal, A.; Emadi, N. A.; Reaz, M. B. I.; and Islam, M. T. 2020. Can AI Help in Screening Viral and COVID-19 Pneumonia? IEEE Access, 8: 132665--132676

  15. [15]

    Dao, T. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691

  16. [16]

    Degerli, A.; Ahishali, M.; Yamac, M.; Kiranyaz, S.; Chowdhury, M. E. H.; Hameed, K.; Hamid, T.; Mazhar, R.; and Gabbouj, M. 2021. COVID-19 infection map generation and detection from chest X-ray images. Health Information Science and Systems, 9(1): 15

  17. [17]

    Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; and Li, H. 2022. TransVG: End-to-End Visual Grounding with Transformers. arXiv:2104.08541

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685

  19. [19]

    M.; and Zhu, Y

    Hu, X.; Gu, L.; An, Q.; Zhang, M.; Liu, L.; Kobayashi, K.; Harada, T.; Summers, R. M.; and Zhu, Y. 2023. Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23. ACM

  20. [20]

    L.; Bannur, S.; Bouzid, K.; Castro, D

    Hyland, S. L.; Bannur, S.; Bouzid, K.; Castro, D. C.; Ranjit, M.; Schwaighofer, A.; Pérez-García, F.; Salvatelli, V.; Srivastav, S.; Thieme, A.; Codella, N.; Lungren, M. P.; Wetscherek, M. T.; Oktay, O.; and Alvarez-Valle, J. 2024. MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv:2311.13668

  21. [21]

    CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison

    Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; Seekins, J.; Mong, D. A.; Halabi, S. S.; Sandberg, J. K.; Jones, R.; Larson, D. B.; Langlotz, C. P.; Patel, B. N.; Lungren, M. P.; and Ng, A. Y. 2019. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comp...

  22. [22]

    Jain, S.; Agrawal, A.; Saporta, A.; Truong, S.; Duong, D. N. D. N.; Bui, T.; Chambon, P.; Zhang, Y.; Lungren, M.; Ng, A.; Langlotz, C.; Rajpurkar, P.; and Rajpurkar, P. 2021. RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. In Vanschoren, J.; and Yeung, S., eds., Proceedings of the Neural Information Processing Systems Track on...

  23. [23]

    Mistral 7B

    Jiang, A. Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D. S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; Lavaud, L. R.; Lachaux, M.-A.; Stock, P.; Scao, T. L.; Lavril, T.; Wang, T.; Lacroix, T.; and Sayed, W. E. 2023. Mistral 7B. arXiv:2310.06825

  24. [24]

    Jin, H.; Che, H.; Lin, Y.; and Chen, H. 2024. PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3): 2607--2615

  25. [25]

    Jing, B.; Xie, P.; and Xing, E. 2018. On the Automatic Generation of Medical Imaging Reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics

  26. [26]

    Johnson, A. E. W.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C.; Mark, R. G.; and Horng, S. 2019. MIMIC-CXR: A large publicly available database of labeled chest radiographs. CoRR, abs/1901.07042

  27. [27]

    J.; Chang, J.; and Ye, J

    Lee, S.; Kim, W. J.; Chang, J.; and Ye, J. C. 2024. LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation. arXiv:2305.11490

  28. [28]

    Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023 a . Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, 19730--19742. PMLR

  29. [29]

    Li, M.; Lin, B.; Chen, Z.; Lin, H.; Liang, X.; and Chang, X. 2023 b . Dynamic graph enhanced contrastive learning for chest x-ray report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3334--3343

  30. [30]

    Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74--81

  31. [31]

    Liu, B.; Zhan, L.; Xu, L.; Ma, L.; Yang, Y.; and Wu, X. 2021. SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering. CoRR, abs/2102.09542

  32. [32]

    Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2024. Visual instruction tuning. Advances in neural information processing systems, 36

  33. [33]

    Liu, J.; Lian, J.; and Yu, Y. 2020. ChestX-Det10: Chest X-ray Dataset on Detection of Thoracic Abnormalities. arXiv:2006.10550

  34. [34]

    Q.; Lam, K.; Le, L

    Nguyen, H. Q.; Lam, K.; Le, L. T.; Pham, H. H.; Tran, D. Q.; Nguyen, D. B.; Le, D. D.; Pham, C. M.; Tong, H. T. T.; Dinh, D. H.; Do, C. D.; Doan, L. T.; Nguyen, C. N.; Nguyen, B. T.; Nguyen, Q. V.; Hoang, A. D.; Phan, H. N.; Nguyen, A. T.; Ho, P. H.; Ngo, D. T.; Nguyen, N. T.; Nguyen, N. T.; Dao, M.; and Vu, V. 2022. VinDr-CXR: An open dataset of chest X-...

  35. [35]

    Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311--318

  36. [36]

    Pellegrini, C.; Özsoy, E.; Busam, B.; Navab, N.; and Keicher, M. 2023. RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance. arXiv:2311.18681

  37. [37]

    Pellegrini, C.; Özsoy, E.; Busam, B.; Navab, N.; and Keicher, M. 2024. RaDialog Instruct Dataset (version 1.1.0). PhysioNet

  38. [38]

    C.; Schwaighofer, A.; Lungren, M

    Pérez-García, F.; Sharma, H.; Bond-Taylor, S.; Bouzid, K.; Salvatelli, V.; Ilse, M.; Bannur, S.; Castro, D. C.; Schwaighofer, A.; Lungren, M. P.; Wetscherek, M.; Codella, N.; Hyland, S. L.; Alvarez-Valle, J.; and Oktay, O. 2024. RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision. arXiv:2401.10815

  39. [39]

    Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67

  40. [40]

    Rahman, T.; Khandakar, A.; Qiblawey, Y.; Tahir, A.; Kiranyaz, S.; Abul Kashem, S.; Islam, M.; Al Maadeed, S.; Zughaier, S.; Khan, M.; and Chowdhury, M. 2021. Exploring the Effect of Image Enhancement Techniques on COVID-19 Detection using Chest X-rays Images. Computers in Biology and Medicine, 104319

  41. [41]

    P.; de Paiva, J

    Reis, E. P.; de Paiva, J. P.; da Silva, M. C.; Ribeiro, G. A.; Paiva, V. F.; Bulgarelli, L.; Lee, H. M.; Santos, P. V.; Brito, V. M.; Amaral, L. T.; Beraldo, G. L.; Haidar Filho, J. N.; Teles, G. B.; Szarf, G.; Pollard, T.; Johnson, A. E.; Celi, L. A.; and Amaro, E. J. 2022. BRAX, Brazilian labeled chest x-ray dataset. Scientific Data, 9(1): 487

  42. [42]

    Shih, G.; Wu, C.; Halabi, S.; Kohli, M.; Prevedello, L.; Cook, T.; Sharma, A.; Amorosa, J.; Arteaga, V.; Galperin-Aizenberg, M.; Gill, R.; Godoy, M.; Hobbs, S.; Jeudy, J.; Laroia, A.; Shah, P.; Vummidi, D.; Yaddanapudi, K.; and Stein, A. 2019. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumon...

  43. [43]

    Shin, H.-C.; Roberts, K.; Lu, L.; Demner-Fushman, D.; Yao, J.; and Summers, R. M. 2016. Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  44. [44]

    Shiraishi, J.; Katsuragawa, S.; Ikezoe, J.; Matsumoto, T.; Kobayashi, T.; Komatsu, K.; Matsui, M.; Fujita, H.; Kodera, Y.; and Doi, K. 2000. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules. AJR Am J Roentgenol, 174(1): 71--74

  45. [45]

    Y.; and Lungren, M

    Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A. Y.; and Lungren, M. P. 2020. CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. arXiv:2004.09167

  46. [46]

    Tahir, A.; Chowdhury, M.; Qiblawey, Y.; Khandakar, A.; Rahman, T.; Kiranyaz, S.; Khurshid, U.; Ibtehaz, N.; Mahmud, S.; and Ezeddin, M. 2021. COVID-QU-Ex Dataset. https://www.kaggle.com/datasets/tahirahmed/covidquex

  47. [47]

    Tanida, T.; M \"u ller, P.; Kaissis, G.; and Rueckert, D. 2023. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7433--7442

  48. [48]

    Xraygpt: Chest radiographs summarization using medical vision-language models

    Thawkar, O.; Shaker, A.; Mullappilly, S. S.; Cholakkal, H.; Anwer, R. M.; Khan, S.; Laaksonen, J.; and Khan, F. S. 2023. XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models. arXiv:2306.07971

  49. [49]

    Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.-C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. 2024. Towards generalist biomedical AI. NEJM AI, 1(3): AIoa2300138

  50. [50]

    Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; and Summers, R. M. 2017. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3462--3471

  51. [51]

    Wang, Z.; Liu, L.; Wang, L.; and Zhou, L. 2023. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11558--11567

  52. [52]

    V.; Zhou, D.; et al

    Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824--24837

  53. [53]

    Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; and Xie, W. 2023. Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data. arXiv:2308.02463

  54. [54]

    T.; Agu, N

    Wu, J. T.; Agu, N. N.; Lourentzou, I.; Sharma, A.; Paguio, J. A.; Yao, J. S.; Dee, E. C.; Mitchell, W.; Kashyap, S.; Giovannini, A.; Celi, L. A.; and Moradi, M. 2021. Chest ImaGenome Dataset for Clinical Reasoning. arXiv:2108.00316

  55. [55]

    Yang, L.; Xu, S.; Sellergren, A.; Kohlberger, T.; Zhou, Y.; Ktena, I.; Kiraly, A.; Ahmed, F.; Hormozdiari, F.; Jaroensri, T.; et al. 2024. Advancing multimodal medical capabilities of Gemini. arXiv:2405.03162

  56. [56]

    You, H.; Zhang, H.; Gan, Z.; Du, X.; Zhang, B.; Wang, Z.; Cao, L.; Chang, S.-F.; and Yang, Y. 2024. Ferret: Refer and Ground Anything Anywhere at Any Granularity. In The Twelfth International Conference on Learning Representations

  57. [57]

    K.; Baek, W.; and Roh, B

    You, K.; Gu, J.; Ham, J.; Park, B.; Kim, J.; Hong, E. K.; Baek, W.; and Roh, B. 2023. CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. In Medical Image Computing and Computer Assisted Intervention -- MICCAI 2023, 101--111. Springer Nature Switzerland

  58. [58]

    P.; Fonseca, E

    Yu, F.; Endo, M.; Krishnan, R.; Pan, I.; Tsai, A.; Reis, E. P.; Fonseca, E. K. U. N.; Lee, H. M. H.; Abad, Z. S. H.; Ng, A. Y.; et al. 2023. Evaluating progress in automatic chest x-ray radiology report generation. Patterns, 4(9)

  59. [59]

    Zawacki, A.; Wu, C.; Shih, G.; Elliott, J.; Fomitchev, M.; Hussain, M.; ParasLakhani; Culliton, P.; and Bao, S. 2019. SIIM-ACR Pneumothorax Segmentation

  60. [60]

    Zhang, Y.; Wang, X.; Xu, Z.; Yu, Q.; Yuille, A.; and Xu, D. 2020. When Radiology Report Generation Meets Knowledge Graph. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07): 12910--12917

  61. [61]

    Zheng, Y.; Gan, W.; Chen, Z.; Qi, Z.; Liang, Q.; and Yu, P. S. 2024. Large Language Models for Medicine: A Survey. arXiv:2405.13055

  62. [62]

    Zhou, Z.; Shi, M.; Wei, M.; Alabi, O.; Yue, Z.; and Vercauteren, T. 2024. Large Model driven Radiology Report Generation with Clinical Quality Reinforcement Learning. arXiv:2403.06728

  63. [63]

    Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592