A Vision-language Framework for Comparative Reasoning in Radiology

Lisong Dai; Pengcheng Qiu; Tengfei Zhang; Weidi Xie; Xiaoman Zhang; Yanfeng Wang; Ya Zhang; Ziheng Zhao

arXiv: 2606.06407 · v1 · pith:K7DDQLJGnew · submitted 2026-06-04 · cs.CV · cs.IR · cs.LG · eess.IV

Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

Pith reviewed 2026-06-28 01:39 UTC · model grok-4.3

keywords: comparative reasoning, radiology, entity-aware retrieval, vision-language model, report decomposition, medical imaging AI, longitudinal interpretation

The pith

Radiology comparison can be learned as entity-aware cross-image reasoning from routine clinical reports at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats radiological diagnosis and follow-up as the task of comparing a current study against prior exams and analogous reference cases. It decomposes existing image-report pairs into anatomical structures, abnormal findings, and pathological conditions to create supervision signals without new manual labels. These signals train an entity-conditioned retriever and a vision-language model for generating descriptions of interval change. The resulting models improve retrieval and generation metrics on internal, external, and cross-center tests, including clinically confusable cases. The work therefore claims that scalable comparative reasoning aligned with real radiological practice can be extracted directly from existing clinical archives.

Core claim

By decomposing radiology reports into anatomical structures, abnormal findings, and pathological conditions, entity-aware models can be trained on more than 690,000 images to perform controllable retrieval of clinically analogous cases and to generate accurate interpretations of temporal change, with consistent gains over baselines in both retrieval recall and longitudinal accuracy across modalities and institutions.

What carries the argument

Entity-conditioned retrieval and generation, where an encoder conditions on report-derived entities to select reference cases and to produce comparative visual question answers.
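As a rough illustration of what entity conditioning could mean at retrieval time, the sketch below ranks gallery cases by cosine similarity of entity-specific embeddings, so the same query returns different neighbours depending on which entity is requested. Everything here is hypothetical: the `rank_cases` helper, the per-entity embedding dictionaries, and the toy vectors are assumptions made for illustration, not the MedReCo encoder described in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_cases(query, gallery, entity):
    """Rank gallery case ids by similarity to the query under one entity.

    query:   dict mapping entity name -> embedding (hypothetical format)
    gallery: dict mapping case id -> dict of entity name -> embedding
    Cases that lack the requested entity are skipped.
    """
    scored = [
        (cosine(query[entity], emb[entity]), case_id)
        for case_id, emb in gallery.items()
        if entity in emb
    ]
    # Highest similarity first.
    return [case_id for _, case_id in sorted(scored, reverse=True)]

# Toy 2-D embeddings, invented for this example.
query = {"lung": [1.0, 0.0], "pleural effusion": [0.0, 1.0]}
gallery = {
    "case_a": {"lung": [0.9, 0.1], "pleural effusion": [1.0, 0.0]},
    "case_b": {"lung": [0.1, 0.9], "pleural effusion": [0.1, 0.9]},
}

ranking_lung = rank_cases(query, gallery, "lung")
ranking_effusion = rank_cases(query, gallery, "pleural effusion")
```

In this toy setup the top match flips between `case_a` and `case_b` when the conditioning entity changes, which is the "controllable" behaviour the summary refers to.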

If this is right

MedReCo achieves the highest Recall@1 across all 12 internal retrieval settings and raises external retrieval by an average of 6.0 percentage points.
MedReCo-VLM raises longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT.
Performance remains superior in clinically confusable differential diagnosis groups.
The same entity-decomposition pipeline works across eight institutions, four countries, and seven imaging modalities.
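For readers unfamiliar with the retrieval metric quoted above, a minimal sketch of Recall@1 and of a percentage-point gain follows. It assumes the common definition (a query counts as a hit if its top-ranked case is among the relevant ones); the paper's exact relevance criterion is not given in the abstract, and the `recall_at_k` helper and toy rankings are illustrative only.

```python
def recall_at_k(rankings, relevant, k=1):
    """Fraction of queries with at least one relevant case in the top k.

    rankings: dict mapping query id -> ordered list of retrieved case ids
    relevant: dict mapping query id -> set of case ids counted as correct
    """
    hits = sum(
        1
        for query_id, ranked in rankings.items()
        if set(ranked[:k]) & relevant.get(query_id, set())
    )
    return hits / len(rankings)

# Toy data, invented for this example.
relevant = {"q1": {"a"}, "q2": {"d"}, "q3": {"e"}, "q4": {"f"}}
baseline = {"q1": ["a", "b"], "q2": ["x", "d"], "q3": ["y", "e"], "q4": ["f", "z"]}
model = {"q1": ["a", "b"], "q2": ["d", "x"], "q3": ["y", "e"], "q4": ["f", "z"]}

baseline_r1 = recall_at_k(baseline, relevant, k=1)
model_r1 = recall_at_k(model, relevant, k=1)

# A percentage-point gain is the absolute difference of the two rates.
gain_pp = (model_r1 - baseline_r1) * 100
```

Note that a gain reported in percentage points is an absolute difference between two rates, not a relative improvement.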

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decomposition-plus-conditioning pattern could be tested on non-radiology image-report corpora such as pathology slides or ophthalmology photographs.
If the entity labels prove noisy, hybrid human-AI verification loops on a small fraction of reports might restore accuracy while retaining most of the scale advantage.
The framework supplies a concrete route to measure how much additional clinical alignment is gained by explicit comparison modeling versus single-image interpretation.

Load-bearing premise

Automatic decomposition of reports into anatomical structures, abnormal findings, and pathological conditions supplies sufficiently accurate and unbiased entity labels for model supervision.

What would settle it

If a manually verified subset of decomposed reports shows low entity-label accuracy and retraining on the corrected labels eliminates the reported performance gains, the central claim would be falsified.
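The first half of that test, measuring entity-label accuracy on a verified subset, could be sketched as below. This is an assumed evaluation setup, not one taken from the paper: each report is represented as a set of hypothetical `(entity_type, entity_name)` tuples, and the automatic labels are scored against manually verified ones with micro-averaged precision and recall.

```python
def entity_precision_recall(auto_labels, verified_labels):
    """Micro-averaged precision and recall of automatically extracted entities.

    Both arguments are lists with one set per report; each set holds
    (entity_type, entity_name) tuples. The format is hypothetical.
    """
    tp = fp = fn = 0
    for auto, gold in zip(auto_labels, verified_labels):
        tp += len(auto & gold)   # extracted and confirmed
        fp += len(auto - gold)   # extracted but not confirmed
        fn += len(gold - auto)   # confirmed but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Two toy reports, invented for this example.
auto_labels = [
    {("anatomy", "left lung"), ("finding", "nodule"), ("condition", "pneumonia")},
    {("anatomy", "liver")},
]
verified_labels = [
    {("anatomy", "left lung"), ("finding", "nodule")},
    {("anatomy", "liver"), ("finding", "cyst")},
]

precision, recall = entity_precision_recall(auto_labels, verified_labels)
```

The second half of the test, retraining on corrected labels and comparing downstream metrics, requires the full training pipeline and is not sketched here.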

Original abstract

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A large new comparative radiology dataset with entity-aware models reports gains over baselines, but the automatic report decomposition used for supervision is not validated.

The punchline on this paper is that it introduces a large multi-institutional dataset for comparative radiology reasoning and entity-aware models that report gains over baselines, but the automatic decomposition of reports into entities for supervision is not validated.

The work is new in constructing MedReCo-DB from routine image-report pairs, totaling more than 690,000 images from over 160,000 patients across eight institutions in four countries and seven modalities. It decomposes reports into anatomical structures, abnormal findings, and pathological conditions to supervise entity-conditioned retrieval and comparative visual question answering. The resulting MedReCo model for retrieval achieves the highest Recall@1 in all 12 internal settings and a mean improvement of 6.0 percentage points on external retrieval, while MedReCo-VLM for generation shows substantial gains in longitudinal follow-up accuracy on chest radiographs and CT.

It does well by targeting a genuine mismatch in current AI, which focuses on isolated images while practice involves comparisons to priors and references. The cross-center evaluations and focus on clinically confusable cases add practical relevance.

The main soft spot is the unvalidated decomposition step. The abstract describes using the decomposition for supervision but gives no accuracy metrics or validation for it, even though the entity-aware claims rest on those labels. If the labels contain noise or bias, the reported improvements could be overstated. Additional details on methods, ablations, and run-to-run variability would strengthen the paper.

This paper is for researchers in medical imaging AI looking to develop more clinically aligned systems, particularly those interested in multi-image reasoning or new datasets. Readers working on vision-language models for radiology would find the resource and approach useful.

It deserves a serious referee because the scale and the problem it tackles are significant.

Recommendation: send it to peer review, asking for validation on the report decomposition.

Referee Report

2 major / 0 minor

Summary. The paper formulates radiological comparison as an entity-aware cross-image reasoning task and introduces MedReCo-DB (>690k images from >160k patients across 8 institutions) derived from routine image-report pairs. Reports are automatically decomposed into anatomical structures, abnormal findings, and pathological conditions to supervise entity-conditioned retrieval (MedReCo) and comparative VQA/generation (MedReCo-VLM). The work claims state-of-the-art Recall@1 across all 12 internal settings, +6.0 pp mean external retrieval improvement, and large gains (13.0-46.5 pp) in longitudinal follow-up accuracy on chest radiographs and CT.

Significance. If the entity labels prove reliable, the scale and multi-center construction of MedReCo-DB together with the entity-aware models would represent a substantive step toward clinically aligned comparative reasoning in medical imaging AI, moving beyond isolated interpretation. The multi-institutional, multi-modality scope and held-out evaluations are positive features.

major comments (2)

[Abstract] Abstract: the central supervision signal is created by automatic decomposition of reports into anatomical structures, abnormal findings, and pathological conditions, yet the manuscript provides no accuracy metrics, inter-annotator agreement, human validation study, or error analysis for this decomposition step. Because every reported gain (Recall@1, longitudinal accuracy) is conditioned on these labels, the absence of validation directly undermines the claim that the models have learned genuine entity-aware comparative reasoning rather than artifacts of the extraction process.
[Abstract] Abstract and methods description: the abstract asserts consistent outperformance on internal, external, and cross-center tests with specific percentage gains, but supplies no ablation studies, statistical tests, error bars, or dataset-construction validation (e.g., confirmation that the held-out splits preserve entity distributions). These omissions make it impossible to determine whether the reported improvements are robust or sensitive to the particular decomposition pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for validation of the report decomposition and additional robustness checks. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

Referee: [Abstract] Abstract: the central supervision signal is created by automatic decomposition of reports into anatomical structures, abnormal findings, and pathological conditions, yet the manuscript provides no accuracy metrics, inter-annotator agreement, human validation study, or error analysis for this decomposition step. Because every reported gain (Recall@1, longitudinal accuracy) is conditioned on these labels, the absence of validation directly undermines the claim that the models have learned genuine entity-aware comparative reasoning rather than artifacts of the extraction process.

Authors: We acknowledge that the manuscript does not report quantitative validation metrics, inter-annotator agreement, or error analysis for the automatic decomposition step. While the pipeline builds on established medical NLP methods and the overall scale provides indirect support, we agree this leaves the entity-aware claims open to the concern raised. In revision we will add a new subsection with a human validation study on a stratified sample of 1,000 reports (two radiologists per report), reporting per-entity accuracy, Cohen's kappa, and a categorized error analysis. This will allow readers to assess whether gains reflect genuine reasoning.
Revision: yes
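Cohen's kappa, mentioned in the response above, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch follows; the `cohens_kappa` helper and the ten toy judgments are illustrative assumptions and say nothing about how the authors would run their validation study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy judgments on ten extracted entities ("c" = correct, "w" = wrong).
rater_a = ["c", "c", "c", "c", "c", "c", "w", "w", "w", "w"]
rater_b = ["c", "c", "c", "c", "c", "w", "w", "w", "w", "c"]

kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters agree on 8 of 10 items, but because chance agreement is 0.52, kappa comes out near 0.58 rather than 0.8.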
Referee: [Abstract] Abstract and methods description: the abstract asserts consistent outperformance on internal, external, and cross-center tests with specific percentage gains, but supplies no ablation studies, statistical tests, error bars, or dataset-construction validation (e.g., confirmation that the held-out splits preserve entity distributions). These omissions make it impossible to determine whether the reported improvements are robust or sensitive to the particular decomposition pipeline.

Authors: The manuscript contains component ablations and multi-center held-out results, yet we agree that statistical significance testing, error bars across runs, and explicit verification that entity distributions are preserved in the splits are not reported. In the revision we will add: (i) results from three random seeds with standard-error bars, (ii) paired statistical tests (McNemar or t-tests) for all key comparisons, and (iii) a table comparing entity-type frequencies and KL divergence between training and test partitions. These additions will directly demonstrate robustness independent of any single decomposition run.
Revision: yes
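Two of the checks promised above can be sketched in a few lines: the KL divergence between entity-type frequencies in two partitions, and an exact McNemar test on paired per-case outcomes. Both helpers, the smoothing constant `eps`, and all counts are assumptions for illustration; none of them come from the paper.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats between two count dictionaries.

    Zero probabilities in Q are clamped to eps, so the result is an
    approximation when Q lacks a category that P contains.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts.get(key, 0) / p_total
        q = max(q_counts.get(key, 0) / q_total, eps)
        if p > 0:
            kl += p * math.log(p / q)
    return kl

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases only the first model got right
    c: cases only the second model got right
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy entity-type counts with identical proportions in both partitions.
train_counts = {"anatomy": 500, "finding": 300, "condition": 200}
test_counts = {"anatomy": 50, "finding": 30, "condition": 20}
kl = kl_divergence(train_counts, test_counts)

# Toy discordant counts: 8 cases favour one model, 2 favour the other.
p_value = mcnemar_exact(8, 2)
```

A KL divergence near zero would indicate that the split preserves the entity-type mix. The McNemar test uses only the cases where the two models disagree, which is why it suits paired comparisons on the same test set.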

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and held-out evaluation

The paper constructs MedReCo-DB from routine image-report pairs across multiple institutions, decomposes reports to create entity labels for supervision, trains MedReCo and MedReCo-VLM, and reports performance on internal/external/cross-center held-out evaluations. No step reduces a claimed prediction or result to a fitted parameter, self-citation chain, or input by construction; the central claims rest on new data collection and measurable improvements on unseen cases rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides limited technical detail; primary domain assumption is the reliability of report decomposition for supervision. No explicit free parameters or invented physical entities are described.

axioms (1)

Domain assumption: radiology reports can be decomposed into anatomical structures, abnormal findings and pathological conditions to yield reliable supervision signals for entity-conditioned tasks.
Invoked to create training labels for retrieval and comparative VQA; if noisy or biased, downstream performance claims are undermined.

pith-pipeline@v0.9.1-grok · 5838 in / 1386 out tokens · 53839 ms · 2026-06-28T01:39:19.001901+00:00 · methodology

Reference graph

Works this paper leans on

62 extracted references · 14 canonical work pages · 8 internal anchors

[1]

Content-based image retrieval in radiology: current status and future directions.Journal of digital imaging, 24(2):208–222, 2011

Ceyhun Burak Akgül, Daniel L Rubin, Sandy Napel, Christopher F Beaulieu, Hayit Greenspan, and Burak Acar. Content-based image retrieval in radiology: current status and future directions.Journal of digital imaging, 24(2):208–222, 2011

2011
[2]

Introducing claude 4, 2025

Anthropic. Introducing claude 4, 2025

2025
[3]

Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

2005
[4]

Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, et al

Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, et al. Maira-2: Grounded radiology report generation.arXiv preprint arXiv:2406.04449, 2024

work page arXiv 2024
[5]

Chexpert plus: Augmenting a large chest x-ray dataset with text radiology reports, patient demographics and additional image formats

Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P Langlotz. Chexpert plus: Augmenting a large chest x-ray dataset with text radiology reports, patient demographics and additional image formats. arXiv preprint arXiv:2405.19538, 2024

work page arXiv 2024
[6]

Bimcv-r: A landmark dataset for 3d ct text-image retrieval

Yinda Chen, Che Liu, Xiaoyu Liu, Rossella Arcucci, and Zhiwei Xiong. Bimcv-r: A landmark dataset for 3d ct text-image retrieval. InInternationalconferenceonmedicalimagecomputingandcomputer-assisted intervention, pages 124–134. Springer, 2024

2024
[7]

Generating radiology reports via memory- driven transformer

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory- driven transformer. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Nov. 2020

2020
[8]

Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

work page arXiv 2024
[9]

Content-based image retrieval by using deep learning for interstitial lung disease diagnosis with chest ct.Radiology, 302(1):187–197, 2022

Jooae Choe, Hye Jeon Hwang, Joon Beom Seo, Sang Min Lee, Jihye Yun, Min-Ju Kim, Jewon Jeong, Youngsoo Lee, Kiok Jin, Rohee Park, Jihoon Kim, Howook Jeon, Namkug Kim, Jaeyoun Yi, Donghoon Yu, and Byeongsoo Kim. Content-based image retrieval by using deep learning for interstitial lung disease diagnosis with chest ct.Radiology, 302(1):187–197, 2022

2022
[10]

Controllable chest x-ray report generation from longitudinal representations

Francesco Dalla Serra, Chaoyang Wang, Fani Deligianni, Jeff Dalton, and Alison O’Neil. Controllable chest x-ray report generation from longitudinal representations. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4891–4904, 2023

2023
[11]

Radgraph-xl: A large-scale expert-annotated dataset for entity and relation extraction from radiology reports

Jean-Benoit Delbrouck, Pierre Chambon, Zhihong Chen, Maya Varma, Andrew Johnston, Louis Blanke- meier, Dave Van Veen, Tan Bui, Steven Truong, and Curtis Langlotz. Radgraph-xl: A large-scale expert-annotated dataset for entity and relation extraction from radiology reports. InFindings of the Association for Computational Linguistics, pages 12902–12915, 2024

2024
[12]

Eisenhauer, Patrick Therasse, Jan Bogaerts, Lawrence H

Elizabeth A. Eisenhauer, Patrick Therasse, Jan Bogaerts, Lawrence H. Schwartz, Daniel Sargent, Robert Ford, Janet Dancey, Susan Arbuck, Steve Gwyther, Margaret Mooney, Larry Rubinstein, Lalitha Shankar, |19 Lori Dodd, Robert Kaplan, Denis Lacombe, and Jaap Verweij. New response evaluation criteria in solid tumours: Revised recist guideline (version 1.1).E...

2009
[13]

3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.Advances in Neural Information Processing Systems, 38, 2026

Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, and Zuozhu Liu. 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.Advances in Neural Information Processing Systems, 38, 2026

2026
[14]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Omer Faruk Durugol, Benjamin Hou, Suprosanna Shit, et al. Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

2026
[16]

Deep metric learning using triplet network

Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. InSimilarity-based pattern recognition: third international workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3, pages 84–92, 2015

2015
[17]

Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images.PhysioNet, Feb

Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, liangchen liu, Kazuma Kobayashi, Tatsuya Harada, Ronald Summers, and Yingying Zhu. Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images.PhysioNet, Feb. 2025. Version 1.0.1

2025
[18]

Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M Summers, and Yingying Zhu. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. InProceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4156–4165, 2023

2023
[19]

Lungren, and Serena Yeung

Shih-Cheng Huang, Liyue Shen, Matthew P. Lungren, and Serena Yeung. Gloria: A multimodal global- local representation learning framework for label-efficient medical image recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951, 2021

2021
[20]

Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019

2019
[21]

nnu-net: a self-configuring method for deep learning-based biomedical image segmentation.Nature methods, 18(2):203–211, 2021

Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation.Nature methods, 18(2):203–211, 2021

2021
[22]

Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation.Advancesin Neural Information Processing Systems, 35:36722–36732, 2022

Yuanfeng Ji, Haotian Bai, Chongjian Ge, Jie Yang, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhanng, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation.Advancesin Neural Information Processing Systems, 35:36722–36732, 2022

2022
[23]

Hulu-med: A transparent generalist model towards holistic medical vision-language understanding

Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668, 2025

work page arXiv 2025
[24]

On the automatic generation of medical imaging reports

Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic generation of medical imaging reports. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), pages 2577–2586, 2018

2018
[25]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

2019
[26]

Unibrain: Universal brain mri diagnosis with hierarchical knowledge-enhanced pre-training.Computerized Medical Imaging and Graphics, 122:102516, 2025

Jiayu Lei, Lisong Dai, Haoyun Jiang, Chaoyi Wu, Xiaoman Zhang, Yao Zhang, Jiangchao Yao, Weidi Xie, Yanyong Zhang, Yuehua Li, Ya Zhang, and Yanfeng Wang. Unibrain: Universal brain mri diagnosis with hierarchical knowledge-enhanced pre-training.Computerized Medical Imaging and Graphics, 122:102516, 2025

2025
[27]

Knowledge-driven encode, retrieve, paraphrase for medical image report generation

Christy Y Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 6666–6673, 2019. |20

2019
[28]

Ultrasound report generation with cross-modality feature alignment via unsupervised guidance

Jun Li, Tongkun Su, Baoliang Zhao, Faqin Lv, Qiong Wang, Nassir Navab, Ying Hu, and Zhongliang Jiang. Ultrasound report generation with cross-modality feature alignment via unsupervised guidance. IEEE Transactions on Medical Imaging, 2024

2024
[29]

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

Wenxuan Li, Pedro RAS Bassi, Xinze Zhou, Jakob Wasserthal, Alan L Yuille, and Zongwei Zhou. Radthinking: A dataset for longitudinal clinical reasoning in radiology.arXiv preprint arXiv:2605.10761, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Pmc- clip: Contrastive language-image pre-training using biomedical documents

Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc- clip: Contrastive language-image pre-training using biomedical documents. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 525–536. Springer, 2023

2023
[31]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine.arXiv preprint arXiv:2308.09442, 2023

Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine.arXiv preprint arXiv:2308.09442, 2023

work page arXiv 2023
[33]

Segment anything in medical images

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature communications, 15(1):654, 2024

2024
[34]

Mmxu: A multi-modal and multi-x-ray understanding dataset for disease progression

Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, and Xiaofan Zhang. Mmxu: A multi-modal and multi-x-ray understanding dataset for disease progression. InFindings of the Association for Computational Linguistics: ACL 2025, pages 9785–9803, 2025

2025
[35]

A review of content-based image retrieval systems in medical applications—clinical benefits and future directions.International Journal of Medical Informatics, 73(1):1–23, 2004

Henning Müller, Nicolas Michoux, David Bandon, and Antoine Geissbuhler. A review of content-based image retrieval systems in medical applications—clinical benefits and future directions.International Journal of Medical Informatics, 73(1):1–23, 2004

2004
[36]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[37]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InAssociation for Computational Linguistics, pages 311–318, 2002

2002
[38]

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

Eva Prakash, Yunhe Gao, Chong Wang, Justin Xu, Neal Prakash, Arne Michalson, Seena Dehkharghani, Eun Kyoung Hong, Julie Bauml, Roger Boodoo, Jean-Benoit Delbrouck, Sophie Ostmeier, and Curtis Langlotz. Chextemporal: A dataset for temporally-grounded reasoning in chest radiography.arXiv preprint arXiv:2605.11304, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning.arXiv preprint arXiv:1711.05225, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Biolord: Learning ontological representations from definitions for biomedical concepts and their textual descriptions

François Remy, Kris Demuynck, and Thomas Demeester. Biolord: Learning ontological representations from definitions for biomedical concepts and their textual descriptions. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 1454–1465, 2022

2022
[41]

Scaling vision with sparse mixture of experts.Advances in Neural Information Processing Systems, 34:8583–8595, 2021

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Su- sano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts.Advances in Neural Information Processing Systems, 34:8583–8595, 2021

2021
[42]

Anna A. J. Roelofs, Nico Karssemeijer, Nicola Wedekind, Christoph Beck, Sabine van Woudenberg, Peter R. Snoeren, Jan H. C. L. Hendriks, Marco Rosselli Del Turco, Nils Bjurstam, Horst Junkermann, David Beijerinck, Bruno Seradour, Cees J. G. Evertsz, Linda van Erning, and Mireille J. M. Broeders. Importance of comparison of current and prior mammograms in b...

2007
[43]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024. |21

2024
[45]

Hergen: Elevating radiology report generation with longitudinal data

Fuying Wang, Shenghui Du, and Lequan Yu. Hergen: Elevating radiology report generation with longitudinal data. InComputer Vision –ECCV 2024, pages 183–200. Springer, 2024

2024
[46]

Ai-driven smart patient retrieval for precision oncology

Yan-Ran Joyce Wang and Akshay S Chaudhari. Ai-driven smart patient retrieval for precision oncology. Nature Reviews Cancer, pages 1–3, 2026

2026
[47]

Medclip: Contrastive learning from unpaired medical images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. InProceedings of the Conference on Empirical Methods in Natural Language Processing, page 3876, 2022

2022
[48]

Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 16(1):7866, 2025

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 16(1):7866, 2025

2025
[49]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025

2025
[50]

A multimodal biomedical foundation model trained from fifteen million image–text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI, 2(1):AIoa2400640, 2025

2025
[51]

Bertscore: Evaluating text generation with bert

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In Proceedings of the International Conference on Learning Representations
[52]

Towards scalable language-image pre-training for 3d medical imaging

Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, and Todd Hollon. Towards scalable language-image pre-training for 3d medical imaging. arXiv preprint arXiv:2505.21862, 2025

2025
[53]

Ratescore: A metric for radiology report generation

Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Ratescore: A metric for radiology report generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 15004–15019, 2024

2024
[54]

Rethinking whole-body ct image interpretation: An abnormality-centric approach

Ziheng Zhao, Lisong Dai, Ya Zhang, Weidi Xie, and Yanfeng Wang. Rethinking whole-body ct image interpretation: An abnormality-centric approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5179–5189, 2026

2026
[55]

Large-vocabulary segmentation for medical images with text prompts

Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, and Weidi Xie. Large-vocabulary segmentation for medical images with text prompts. NPJ Digital Medicine, 8(1):566, 2025

2025
[56]

Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports

Qingqing Zhu, Tejas Sudharshan Mathai, Pritam Mukherjee, Yifan Peng, Ronald M Summers, and Zhiyong Lu. Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 189–198. Springer, 2023.

S1 Supplementary Dataset and Construction Details ...

2023
[57]

Question Focus: Questions should focus specifically on the differential manifestations of the abnormality [abnormal] between the two images. Aspects for comparison include, but are not limited to: presence/absence, location, size, morphology, margins, internal characteristics, associated findings, or potential etiology. The overall imaging appearance of t...
[58]

Comprehensive Coverage: Questions should cover all identifiable dimensions of difference, ensuring variety and that no key comparative points are omitted
[59]

Answer and Rationale: Each question must include the correct answer (e.g., A, B, C, D) accompanied by a concise rationale explaining the basis for the judgment
[60]

Perspective of Expression: The wording of questions and rationales should adopt an "image comparison" perspective. References to the cases must be specific; for example, use "the imaging from Case A" and "the imaging from Case B," avoiding generic terms like "one image" or "the other image." Additionally, refrain from using phrases such as "the report ind...
[61]

Relevance and Specificity: All questions must strictly pertain to the actual differences present between the two images. Avoid content related to unchanged features or aspects irrelevant to the comparison
[62]

Information Fidelity: Questions, options, answers, and rationales must be strictly based on the provided factual information. Do not fabricate or introduce details not mentioned. Output Format (JSON): Each question must strictly adhere to the following JSON structure: { "question": "Question content", "condition": "[anatomy]_and_[abnormal]", "content_type...

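As a rough illustration of the output-format requirement above, the following Python sketch checks one generated question against the two fields that are visible in the truncated text, "question" and "condition". The function name check_question_record and the regular expression are assumptions made for this example and are not taken from the paper; any further keys in the JSON structure are elided in the source and are deliberately not checked.

```python
import json
import re

# Hypothetical pattern for the "[anatomy]_and_[abnormal]" condition string.
# Only the two keys visible in the truncated format text are validated.
CONDITION_PATTERN = re.compile(r"^(?P<anatomy>.+)_and_(?P<abnormal>.+)$")


def check_question_record(raw: str) -> dict:
    """Parse one generated question and check its two visible fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")

    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("'question' must be a non-empty string")

    condition = record.get("condition")
    if not isinstance(condition, str):
        raise ValueError("'condition' must be a string")

    match = CONDITION_PATTERN.match(condition)
    if match is None:
        raise ValueError("'condition' must look like '<anatomy>_and_<abnormal>'")

    # Return the two entities so a caller can cross-check them against
    # the anatomy and abnormality that were used to build the prompt.
    return {"anatomy": match.group("anatomy"), "abnormal": match.group("abnormal")}
```

A check of this kind could be run on each generated item before it is admitted to a question set, so that malformed records are rejected early rather than silently kept.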
