pith. sign in

arxiv: 2606.06407 · v1 · pith:K7DDQLJGnew · submitted 2026-06-04 · 💻 cs.CV · cs.IR· cs.LG· eess.IV

A Vision-language Framework for Comparative Reasoning in Radiology

Pith reviewed 2026-06-28 01:39 UTC · model grok-4.3

classification 💻 cs.CV cs.IRcs.LGeess.IV
keywords comparative reasoningradiologyentity-aware retrievalvision-language modelreport decompositionmedical imaging AIlongitudinal interpretation
0
0 comments X

The pith

Radiology comparison can be learned as entity-aware cross-image reasoning from routine clinical reports at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats radiological diagnosis and follow-up as the task of comparing a current study against prior exams and analogous reference cases. It decomposes existing image-report pairs into anatomical structures, abnormal findings, and pathological conditions to create supervision signals without new manual labels. These signals train an entity-conditioned retriever and a vision-language model for generating descriptions of interval change. The resulting models improve retrieval and generation metrics on internal, external, and cross-center tests, including clinically confusable cases. The work therefore claims that scalable comparative reasoning aligned with real radiological practice can be extracted directly from existing clinical archives.

Core claim

By decomposing radiology reports into anatomical structures, abnormal findings, and pathological conditions, entity-aware models can be trained on more than 690,000 images to perform controllable retrieval of clinically analogous cases and to generate accurate interpretations of temporal change, with consistent gains over baselines in both retrieval recall and longitudinal accuracy across modalities and institutions.

What carries the argument

Entity-conditioned retrieval and generation, where an encoder conditions on report-derived entities to select reference cases and to produce comparative visual question answers.

If this is right

  • MedReCo achieves the highest Recall@1 across all 12 internal retrieval settings and raises external retrieval by an average of 6.0 percentage points.
  • MedReCo-VLM raises longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT.
  • Performance remains superior in clinically confusable differential diagnosis groups.
  • The same entity-decomposition pipeline works across eight institutions, four countries, and seven imaging modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition-plus-conditioning pattern could be tested on non-radiology image-report corpora such as pathology slides or ophthalmology photographs.
  • If the entity labels prove noisy, hybrid human-AI verification loops on a small fraction of reports might restore accuracy while retaining most of the scale advantage.
  • The framework supplies a concrete route to measure how much additional clinical alignment is gained by explicit comparison modeling versus single-image interpretation.

Load-bearing premise

Automatic decomposition of reports into anatomical structures, abnormal findings, and pathological conditions supplies sufficiently accurate and unbiased entity labels for model supervision.

What would settle it

If a manually verified subset of decomposed reports shows low entity-label accuracy and retraining on the corrected labels eliminates the reported performance gains, the central claim would be falsified.

read the original abstract

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision--language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper formulates radiological comparison as an entity-aware cross-image reasoning task and introduces MedReCo-DB (>690k images from >160k patients across 8 institutions) derived from routine image-report pairs. Reports are automatically decomposed into anatomical structures, abnormal findings, and pathological conditions to supervise entity-conditioned retrieval (MedReCo) and comparative VQA/generation (MedReCo-VLM). The work claims state-of-the-art Recall@1 across all 12 internal settings, +6.0 pp mean external retrieval improvement, and large gains (13.0-46.5 pp) in longitudinal follow-up accuracy on chest radiographs and CT.

Significance. If the entity labels prove reliable, the scale and multi-center construction of MedReCo-DB together with the entity-aware models would represent a substantive step toward clinically aligned comparative reasoning in medical imaging AI, moving beyond isolated interpretation. The multi-institutional, multi-modality scope and held-out evaluations are positive features.

major comments (2)
  1. [Abstract] Abstract: the central supervision signal is created by automatic decomposition of reports into anatomical structures, abnormal findings, and pathological conditions, yet the manuscript provides no accuracy metrics, inter-annotator agreement, human validation study, or error analysis for this decomposition step. Because every reported gain (Recall@1, longitudinal accuracy) is conditioned on these labels, the absence of validation directly undermines the claim that the models have learned genuine entity-aware comparative reasoning rather than artifacts of the extraction process.
  2. [Abstract] Abstract and methods description: the abstract asserts consistent outperformance on internal, external, and cross-center tests with specific percentage gains, but supplies no ablation studies, statistical tests, error bars, or dataset-construction validation (e.g., confirmation that the held-out splits preserve entity distributions). These omissions make it impossible to determine whether the reported improvements are robust or sensitive to the particular decomposition pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for validation of the report decomposition and additional robustness checks. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central supervision signal is created by automatic decomposition of reports into anatomical structures, abnormal findings, and pathological conditions, yet the manuscript provides no accuracy metrics, inter-annotator agreement, human validation study, or error analysis for this decomposition step. Because every reported gain (Recall@1, longitudinal accuracy) is conditioned on these labels, the absence of validation directly undermines the claim that the models have learned genuine entity-aware comparative reasoning rather than artifacts of the extraction process.

    Authors: We acknowledge that the manuscript does not report quantitative validation metrics, inter-annotator agreement, or error analysis for the automatic decomposition step. While the pipeline builds on established medical NLP methods and the overall scale provides indirect support, we agree this leaves the entity-aware claims open to the concern raised. In revision we will add a new subsection with a human validation study on a stratified sample of 1,000 reports (two radiologists per report), reporting per-entity accuracy, Cohen's kappa, and a categorized error analysis. This will allow readers to assess whether gains reflect genuine reasoning. revision: yes

  2. Referee: [Abstract] Abstract and methods description: the abstract asserts consistent outperformance on internal, external, and cross-center tests with specific percentage gains, but supplies no ablation studies, statistical tests, error bars, or dataset-construction validation (e.g., confirmation that the held-out splits preserve entity distributions). These omissions make it impossible to determine whether the reported improvements are robust or sensitive to the particular decomposition pipeline.

    Authors: The manuscript contains component ablations and multi-center held-out results, yet we agree that statistical significance testing, error bars across runs, and explicit verification that entity distributions are preserved in the splits are not reported. In the revision we will add: (i) results from three random seeds with standard-error bars, (ii) paired statistical tests (McNemar or t-tests) for all key comparisons, and (iii) a table comparing entity-type frequencies and KL divergence between training and test partitions. These additions will directly demonstrate robustness independent of any single decomposition run. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and held-out evaluation

full rationale

The paper constructs MedReCo-DB from routine image-report pairs across multiple institutions, decomposes reports to create entity labels for supervision, trains MedReCo and MedReCo-VLM, and reports performance on internal/external/cross-center held-out evaluations. No step reduces a claimed prediction or result to a fitted parameter, self-citation chain, or input by construction; the central claims rest on new data collection and measurable improvements on unseen cases rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides limited technical detail; primary domain assumption is the reliability of report decomposition for supervision. No explicit free parameters or invented physical entities are described.

axioms (1)
  • domain assumption Radiology reports can be decomposed into anatomical structures, abnormal findings and pathological conditions to yield reliable supervision signals for entity-conditioned tasks.
    Invoked to create training labels for retrieval and comparative VQA; if noisy or biased, downstream performance claims are undermined.

pith-pipeline@v0.9.1-grok · 5838 in / 1386 out tokens · 53839 ms · 2026-06-28T01:39:19.001901+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 14 canonical work pages · 8 internal anchors

  1. [1]

    Content-based image retrieval in radiology: current status and future directions.Journal of digital imaging, 24(2):208–222, 2011

    Ceyhun Burak Akgül, Daniel L Rubin, Sandy Napel, Christopher F Beaulieu, Hayit Greenspan, and Burak Acar. Content-based image retrieval in radiology: current status and future directions.Journal of digital imaging, 24(2):208–222, 2011

  2. [2]

    Introducing claude 4, 2025

    Anthropic. Introducing claude 4, 2025

  3. [3]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  4. [4]

    Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, et al

    Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, et al. Maira-2: Grounded radiology report generation.arXiv preprint arXiv:2406.04449, 2024

  5. [5]

    Chexpert plus: Augmenting a large chest x-ray dataset with text radiology reports, patient demographics and additional image formats

    Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P Langlotz. Chexpert plus: Augmenting a large chest x-ray dataset with text radiology reports, patient demographics and additional image formats. arXiv preprint arXiv:2405.19538, 2024

  6. [6]

    Bimcv-r: A landmark dataset for 3d ct text-image retrieval

    Yinda Chen, Che Liu, Xiaoyu Liu, Rossella Arcucci, and Zhiwei Xiong. Bimcv-r: A landmark dataset for 3d ct text-image retrieval. InInternationalconferenceonmedicalimagecomputingandcomputer-assisted intervention, pages 124–134. Springer, 2024

  7. [7]

    Generating radiology reports via memory- driven transformer

    Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory- driven transformer. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Nov. 2020

  8. [8]

    Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

    Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

  9. [9]

    Content-based image retrieval by using deep learning for interstitial lung disease diagnosis with chest ct.Radiology, 302(1):187–197, 2022

    Jooae Choe, Hye Jeon Hwang, Joon Beom Seo, Sang Min Lee, Jihye Yun, Min-Ju Kim, Jewon Jeong, Youngsoo Lee, Kiok Jin, Rohee Park, Jihoon Kim, Howook Jeon, Namkug Kim, Jaeyoun Yi, Donghoon Yu, and Byeongsoo Kim. Content-based image retrieval by using deep learning for interstitial lung disease diagnosis with chest ct.Radiology, 302(1):187–197, 2022

  10. [10]

    Controllable chest x-ray report generation from longitudinal representations

    Francesco Dalla Serra, Chaoyang Wang, Fani Deligianni, Jeff Dalton, and Alison O’Neil. Controllable chest x-ray report generation from longitudinal representations. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4891–4904, 2023

  11. [11]

    Radgraph-xl: A large-scale expert-annotated dataset for entity and relation extraction from radiology reports

    Jean-Benoit Delbrouck, Pierre Chambon, Zhihong Chen, Maya Varma, Andrew Johnston, Louis Blanke- meier, Dave Van Veen, Tan Bui, Steven Truong, and Curtis Langlotz. Radgraph-xl: A large-scale expert-annotated dataset for entity and relation extraction from radiology reports. InFindings of the Association for Computational Linguistics, pages 12902–12915, 2024

  12. [12]

    Eisenhauer, Patrick Therasse, Jan Bogaerts, Lawrence H

    Elizabeth A. Eisenhauer, Patrick Therasse, Jan Bogaerts, Lawrence H. Schwartz, Daniel Sargent, Robert Ford, Janet Dancey, Susan Arbuck, Steve Gwyther, Margaret Mooney, Larry Rubinstein, Lalitha Shankar, |19 Lori Dodd, Robert Kaplan, Denis Lacombe, and Jaap Verweij. New response evaluation criteria in solid tumours: Revised recist guideline (version 1.1).E...

  13. [13]

    3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.Advances in Neural Information Processing Systems, 38, 2026

    Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, and Zuozhu Liu. 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.Advances in Neural Information Processing Systems, 38, 2026

  14. [14]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  15. [15]

    Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

    Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Omer Faruk Durugol, Benjamin Hou, Suprosanna Shit, et al. Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

  16. [16]

    Deep metric learning using triplet network

    Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. InSimilarity-based pattern recognition: third international workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3, pages 84–92, 2015

  17. [17]

    Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images.PhysioNet, Feb

    Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, liangchen liu, Kazuma Kobayashi, Tatsuya Harada, Ronald Summers, and Yingying Zhu. Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images.PhysioNet, Feb. 2025. Version 1.0.1

  18. [18]

    Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

    Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M Summers, and Yingying Zhu. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. InProceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4156–4165, 2023

  19. [19]

    Lungren, and Serena Yeung

    Shih-Cheng Huang, Liyue Shen, Matthew P. Lungren, and Serena Yeung. Gloria: A multimodal global- local representation learning framework for label-efficient medical image recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951, 2021

  20. [20]

    Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison

    Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019

  21. [21]

    nnu-net: a self-configuring method for deep learning-based biomedical image segmentation.Nature methods, 18(2):203–211, 2021

    Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation.Nature methods, 18(2):203–211, 2021

  22. [22]

    Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation.Advancesin Neural Information Processing Systems, 35:36722–36732, 2022

    Yuanfeng Ji, Haotian Bai, Chongjian Ge, Jie Yang, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhanng, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation.Advancesin Neural Information Processing Systems, 35:36722–36732, 2022

  23. [23]

    Hulu-med: A transparent generalist model towards holistic medical vision-language understanding

    Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668, 2025

  24. [24]

    On the automatic generation of medical imaging reports

    Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic generation of medical imaging reports. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), pages 2577–2586, 2018

  25. [25]

    Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

  26. [26]

    Unibrain: Universal brain mri diagnosis with hierarchical knowledge-enhanced pre-training.Computerized Medical Imaging and Graphics, 122:102516, 2025

    Jiayu Lei, Lisong Dai, Haoyun Jiang, Chaoyi Wu, Xiaoman Zhang, Yao Zhang, Jiangchao Yao, Weidi Xie, Yanyong Zhang, Yuehua Li, Ya Zhang, and Yanfeng Wang. Unibrain: Universal brain mri diagnosis with hierarchical knowledge-enhanced pre-training.Computerized Medical Imaging and Graphics, 122:102516, 2025

  27. [27]

    Knowledge-driven encode, retrieve, paraphrase for medical image report generation

    Christy Y Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 6666–6673, 2019. |20

  28. [28]

    Ultrasound report generation with cross-modality feature alignment via unsupervised guidance

    Jun Li, Tongkun Su, Baoliang Zhao, Faqin Lv, Qiong Wang, Nassir Navab, Ying Hu, and Zhongliang Jiang. Ultrasound report generation with cross-modality feature alignment via unsupervised guidance. IEEE Transactions on Medical Imaging, 2024

  29. [29]

    RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    Wenxuan Li, Pedro RAS Bassi, Xinze Zhou, Jakob Wasserthal, Alan L Yuille, and Zongwei Zhou. Radthinking: A dataset for longitudinal clinical reasoning in radiology.arXiv preprint arXiv:2605.10761, 2026

  30. [30]

    Pmc- clip: Contrastive language-image pre-training using biomedical documents

    Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc- clip: Contrastive language-image pre-training using biomedical documents. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 525–536. Springer, 2023

  31. [31]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  32. [32]

    Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine.arXiv preprint arXiv:2308.09442, 2023

    Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine.arXiv preprint arXiv:2308.09442, 2023

  33. [33]

    Segment anything in medical images

    Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature communications, 15(1):654, 2024

  34. [34]

    Mmxu: A multi-modal and multi-x-ray understanding dataset for disease progression

    Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, and Xiaofan Zhang. Mmxu: A multi-modal and multi-x-ray understanding dataset for disease progression. InFindings of the Association for Computational Linguistics: ACL 2025, pages 9785–9803, 2025

  35. [35]

    A review of content-based image retrieval systems in medical applications—clinical benefits and future directions.International Journal of Medical Informatics, 73(1):1–23, 2004

    Henning Müller, Nicolas Michoux, David Bandon, and Antoine Geissbuhler. A review of content-based image retrieval systems in medical applications—clinical benefits and future directions.International Journal of Medical Informatics, 73(1):1–23, 2004

  36. [36]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  37. [37]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InAssociation for Computational Linguistics, pages 311–318, 2002

  38. [38]

    CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

    Eva Prakash, Yunhe Gao, Chong Wang, Justin Xu, Neal Prakash, Arne Michalson, Seena Dehkharghani, Eun Kyoung Hong, Julie Bauml, Roger Boodoo, Jean-Benoit Delbrouck, Sophie Ostmeier, and Curtis Langlotz. Chextemporal: A dataset for temporally-grounded reasoning in chest radiography.arXiv preprint arXiv:2605.11304, 2026

  39. [39]

    CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

    Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning.arXiv preprint arXiv:1711.05225, 2017

  40. [40]

    Biolord: Learning ontological representations from definitions for biomedical concepts and their textual descriptions

    François Remy, Kris Demuynck, and Thomas Demeester. Biolord: Learning ontological representations from definitions for biomedical concepts and their textual descriptions. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 1454–1465, 2022

  41. [41]

    Scaling vision with sparse mixture of experts.Advances in Neural Information Processing Systems, 34:8583–8595, 2021

    Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Su- sano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts.Advances in Neural Information Processing Systems, 34:8583–8595, 2021

  42. [42]

    Anna A. J. Roelofs, Nico Karssemeijer, Nicola Wedekind, Christoph Beck, Sabine van Woudenberg, Peter R. Snoeren, Jan H. C. L. Hendriks, Marco Rosselli Del Turco, Nils Bjurstam, Horst Junkermann, David Beijerinck, Bruno Seradour, Cees J. G. Evertsz, Linda van Erning, and Mireille J. M. Broeders. Importance of comparison of current and prior mammograms in b...

  43. [43]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  44. [44]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. |21

  45. [45]

    Hergen: Elevating radiology report generation with longitudinal data

    Fuying Wang, Shenghui Du, and Lequan Yu. Hergen: Elevating radiology report generation with longitudinal data. InComputer Vision –ECCV 2024, pages 183–200. Springer, 2024

  46. [46]

    Ai-driven smart patient retrieval for precision oncology

    Yan-Ran Joyce Wang and Akshay S Chaudhari. Ai-driven smart patient retrieval for precision oncology. Nature Reviews Cancer, pages 1–3, 2026

  47. [47]

    Medclip: Contrastive learning from unpaired medical images and text

    Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. InProceedings of the Conference on Empirical Methods in Natural Language Processing, page 3876, 2022

  48. [48]

    Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 16(1):7866, 2025

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 16(1):7866, 2025

  49. [49]

    Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

    Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning.arXiv preprint arXiv:2506.07044, 2025

  50. [50]

    A multimodal biomedical foundation model trained from fifteen million image–text pairs.NEJM AI, 2(1):AIoa2400640, 2025

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs.NEJM AI, 2(1):AIoa2400640, 2025

  51. [51]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. InProceedings of the International Conference on Learning Representations

  52. [52]

    Towards scalable language-image pre-training for 3d medical imaging

    Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, and Todd Hollon. Towards scalable language-image pre-training for 3d medical imaging. arXiv preprint arXiv:2505.21862, 2025

  53. [53]

    Ratescore: A metric for radiology report generation

    Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Ratescore: A metric for radiology report generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 15004–15019, 2024

  54. [54]

    Rethinking whole-body ct image interpretation: An abnormality-centric approach

    Ziheng Zhao, Lisong Dai, Ya Zhang, Weidi Xie, and Yanfeng Wang. Rethinking whole-body ct image interpretation: An abnormality-centric approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5179–5189, 2026

  55. [55]

    Large-vocabulary segmentation for medical images with text prompts.NPJ Digital Medicine, 8(1):566, 2025

    Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, and Weidi Xie. Large-vocabulary segmentation for medical images with text prompts.NPJ Digital Medicine, 8(1):566, 2025

  56. [56]

    1113” and “free

    Qingqing Zhu, Tejas Sudharshan Mathai, Pritam Mukherjee, Yifan Peng, Ronald M Summers, and Zhiyong Lu. Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 189–198. Springer, 2023. |22 S1 Supplementary Dataset and Construction Details ...

  57. [57]

    Aspects for comparison include, but are not limited to: presence/absence, location, size, morphology, margins, internal characteristics, associated findings, or potential etiology

    Question Focus: Questions should focus specifically on the differential manifestations of the abnormality [abnormal] between the two images. Aspects for comparison include, but are not limited to: presence/absence, location, size, morphology, margins, internal characteristics, associated findings, or potential etiology. The overall imaging appearance of t...

  58. [58]

    Comprehensive Coverage: Questions should cover all identifiable dimensions of difference, ensuring variety and that no key comparative points are omitted

  59. [59]

    Answer and Rationale: Each question must include the correct answer (e.g., A, B, C, D) accompanied by a concise rationale explaining the basis for the judgment

  60. [60]

    image comparison

    Perspective of Expression: The wording of questions and rationales should adopt an "image comparison" perspective. References to the cases must be specific; for example, use "the imaging from Case A" and "the imaging from Case B," avoiding generic terms like "one image" or "the other image." Additionally, refrain from using phrases such as "the report ind...

  61. [61]

    Avoid content related to unchanged features or aspects irrelevant to the comparison

    Relevance and Specificity: All questions must strictly pertain to the actual differences present between the two images. Avoid content related to unchanged features or aspects irrelevant to the comparison

  62. [62]

    question

    Information Fidelity: Questions, options, answers, and rationales must be strictly based on the provided factual information. Do not fabricate or introduce details not mentioned. Output Format (JSON): Each question must strictly adhere to the following JSON structure: { "question": "Question content", "condition": "[anatomy]_and_[abnormal]", "content_type...