UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

Andreas M. Rauschecker; Chih-Hua Liu; Junayd Lateef; Madhumita Sushil; Shiv Ghosh; Yannan Yu

arXiv: 2605.17140 · v2 · pith:BQX76E42new · submitted 2026-05-16 · 💻 cs.CV · cs.AI · cs.CL

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

Pith reviewed 2026-05-21 08:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords visual question answering · brain MRI · glioma · vision-language models · modality collapse · neuro-oncology · MRI sequences · VQA benchmark

The pith

Vision-language models cannot effectively process multi-sequence 3D brain tumor MRIs and instead over-rely on language priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new visual question answering dataset based on glioma MRI studies to benchmark how vision-language models handle complex medical imaging. It tests current state-of-the-art models on questions that require synthesizing information from multiple 3D sequences. The evaluation reveals that these models suppress visual features from the scans and default to language-based reasoning. This matters because reliable AI assistance could help scale expert interpretation of brain tumors amid growing demand and limited radiologist availability. The findings point to a fundamental limitation that must be overcome for safe clinical use of such models.

Core claim

The UCSF-PDGM-VQA dataset provides 2,387 QA pairs from 473 glioma-related MRI studies, and baseline evaluations of six VLMs and one LLM demonstrate that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, resulting in a suppression of visual features and over-reliance on language priors that causes modality collapse.

What carries the argument

The UCSF-PDGM-VQA benchmark, which consists of clinically relevant QA pairs designed to test synthesis across multiple 3D MRI sequences in glioma cases.

If this is right

Development of domain-specific VLMs is required to handle multi-sequence medical imaging without modality collapse.
Current VLMs pose reliability and safety risks if deployed for clinical brain tumor interpretation.
The dataset serves as a tool to track progress toward robust models for neuro-oncology.
Addressing this issue could enable semi-automated systems that reduce the time and cognitive load on radiologists.
Specialized benchmarks like this are needed to identify and fix gaps in VLM capabilities for other complex imaging domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved performance on this dataset might indicate better handling of other multi-modal medical data such as combined imaging and patient history.
Similar modality collapse issues could appear in VLMs applied to other 3D medical scans like CT or PET.
Creating interactive VQA systems based on these models would require first resolving the visual processing deficiencies shown here.

Load-bearing premise

The QA pairs generated from the UCSF-PDGM dataset accurately reflect the clinically relevant tasks that radiologists perform when interpreting multi-sequence glioma MRIs.

What would settle it

Finding a VLM that achieves high accuracy on questions relying on visual integration across MRI sequences, even after controlling for possible language shortcuts, would indicate that the models can process the scans effectively.
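One way to make that control concrete is to score every question twice, once with the MRI input and once with the images withheld, and compare the two runs. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation code: the `Prediction` record, the exact-match scoring in `normalize`, and the names `visual_gain` and `agreement_rate` are all hypothetical. A visual gain near zero together with a high agreement rate would be consistent with a model that ignores the scans.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One QA item answered twice by the same model (hypothetical record)."""
    question_id: str
    gold: str
    with_images: str  # answer when the MRI input is provided
    text_only: str    # answer when the images are withheld


def normalize(answer: str) -> str:
    # Assumption: exact match after lowercasing and whitespace cleanup.
    return " ".join(answer.strip().lower().split())


def accuracy(preds, field: str) -> float:
    """Fraction of items whose answer in `field` matches the gold answer."""
    if not preds:
        raise ValueError("no predictions to score")
    hits = sum(normalize(getattr(p, field)) == normalize(p.gold) for p in preds)
    return hits / len(preds)


def visual_gain(preds) -> float:
    """Accuracy with images minus accuracy of the text-only run."""
    return accuracy(preds, "with_images") - accuracy(preds, "text_only")


def agreement_rate(preds) -> float:
    """Fraction of items where the answer is the same with and without images."""
    if not preds:
        raise ValueError("no predictions to score")
    same = sum(normalize(p.with_images) == normalize(p.text_only) for p in preds)
    return same / len(preds)
```

Exact-match scoring only suits closed-form answers; free-text answers would need a different scorer, which this sketch does not attempt.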

Figures

Figures reproduced from arXiv: 2605.17140 by Andreas M. Rauschecker, Chih-Hua Liu, Junayd Lateef, Madhumita Sushil, Shiv Ghosh, Yannan Yu.

**Figure 2.** Example of Axial MRI montage. … [Sellergren et al., 2025], and the closed-weight GPT5-mini model [Singh et al., 2026]. Since these models cannot process an entire MRI study at once, and several models also do not support multi-slice input, initial experiments evaluated different input representations for robust model performance. The most informative slices were selected as those containing the highest tumor …

Abstract

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New VQA dataset from UCSF-PDGM gliomas with baselines on VLMs, but the modality collapse claim needs tighter evidence that questions actually force 3D multi-sequence reasoning.

Hey, the main takeaway is that this paper releases a new VQA benchmark with 2,387 pairs drawn from the public UCSF-PDGM glioma MRI studies and shows current VLMs performing poorly on them, which the authors link to modality collapse. That dataset construction is the clearest new piece here, and it fills a gap, since few existing benchmarks target neuro-oncology interpretation tasks that mix multiple sequences and volumes. Running baselines across six VLMs plus one LLM gives a practical starting point that others can compare against, and basing everything on a public source dataset is a plus for anyone who wants to reproduce or extend the work. The clinical framing around radiologist workload and the need for scalable tools also lands as reasonable motivation.

The soft spot sits in how much the results actually isolate visual processing failure. The central claim of suppression of visual features and over-reliance on language priors only holds if the QA pairs require integrating information across sequences and 3D space rather than being answerable from general glioma knowledge or single-slice cues. Without more detail on question generation, answer verification steps, or controls for language bias, it is hard to rule out that the poor scores reflect dataset artifacts instead. That concern is moderate rather than decisive, but it does limit how far the interpretation can be taken right now.

This paper is mainly for groups building or testing VLMs in medical imaging, especially those focused on oncology or radiology support tools. A reader who needs a new benchmark to measure progress in that niche would get direct value from the dataset itself.

I would send it to peer review so the methods can be clarified and the dataset release can be confirmed.

Referee Report

2 major / 1 minor

Summary. The paper introduces the UCSF-PDGM-VQA dataset with 2,387 QA pairs derived from 473 glioma MRI studies in the public UCSF-PDGM collection. It reports baseline evaluations of six state-of-the-art VLMs plus one LLM and concludes that current models cannot effectively process multi-sequence 3D MRI scans, resulting in visual feature suppression and over-reliance on language priors (modality collapse).

Significance. If the QA pairs genuinely require cross-sequence 3D visual reasoning, the benchmark could usefully expose limitations in medical VLMs and motivate domain-specific improvements. The grounding in a public dataset and provision of baselines are strengths that support reproducibility.

major comments (2)

§3 (Dataset Construction): The manuscript provides insufficient detail on the question-generation process, answer verification, and any explicit controls ensuring that correct answers require integration across multiple MRI sequences and 3D volumes rather than single-slice or language-prior cues. This is load-bearing for the modality-collapse claim.
§4 (Baseline Experiments): The reported results on the six VLMs and one LLM do not describe the precise evaluation metrics, language-bias controls, or statistical tests used, so the evidence for visual-feature suppression remains difficult to isolate from possible dataset artifacts.

minor comments (1)

Abstract: Adding the concrete performance numbers (e.g., accuracy or F1 scores) for the baseline models would give readers immediate context for the claimed failure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive review of our manuscript introducing the UCSF-PDGM-VQA dataset. Their comments have helped us identify areas where additional clarity is needed. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [§3 (Dataset Construction)] The manuscript provides insufficient detail on the question-generation process, answer verification, and any explicit controls ensuring that correct answers require integration across multiple MRI sequences and 3D volumes rather than single-slice or language-prior cues. This is load-bearing for the modality-collapse claim.

Authors: We agree with the referee that more detail on the dataset construction is necessary to fully substantiate the modality-collapse claim. In the revised manuscript, we will provide an expanded description of the question-generation process, including how questions were designed to require integration of information across multiple MRI sequences and 3D volumes. We will also elaborate on the answer verification steps and introduce explicit controls, such as comparisons with single-sequence questions, to rule out reliance on language priors or single-slice cues. These additions will be incorporated into §3.

Revision: yes

Referee: [§4 (Baseline Experiments)] The reported results on the six VLMs and one LLM do not describe the precise evaluation metrics, language-bias controls, or statistical tests used, so the evidence for visual-feature suppression remains difficult to isolate from possible dataset artifacts.

Authors: We acknowledge that the baseline experiments section would benefit from more precise descriptions of the evaluation methodology. In the revision, we will specify the exact metrics used (such as accuracy and F1-score), detail the language-bias controls implemented (including text-only baselines), and report the statistical tests applied to assess the significance of visual feature suppression. This will help isolate the effects from potential dataset artifacts and will be added to §4.

Revision: yes
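The rebuttal does not name the statistical test it will report. As one hedged possibility, a paired bootstrap over questions gives a confidence interval for the accuracy gap between an image-conditioned run and a text-only run. Everything in the sketch below (the function name `paired_bootstrap_ci`, the 0/1 correctness lists, the resample count) is an illustrative assumption rather than the authors' protocol.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(a) - accuracy(b) on paired items.

    correct_a, correct_b: equal-length lists of 0/1 flags, one per question,
    e.g. a = run with images, b = text-only run (hypothetical inputs).
    Returns (observed_difference, ci_low, ci_high).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)

    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, low, high
```

An interval that contains zero would mean the data cannot distinguish the image-conditioned run from the text-only run at the chosen level. Resampling whole questions keeps the pairing intact, which matters because both runs answer the same items.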

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and benchmarking study

This is a dataset introduction and empirical benchmarking paper with no mathematical derivations, equations, fitted parameters, or predictions. The UCSF-PDGM-VQA dataset is constructed from the public UCSF-PDGM source, providing external grounding. Model evaluations on the 2,387 QA pairs yield the observation of modality collapse as a direct empirical result, not a reduction to inputs by construction. No self-definitional steps, self-citation load-bearing arguments, or ansatz smuggling are present. The derivation chain is self-contained against external benchmarks and public data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmarking paper. No free parameters, mathematical axioms, or new invented entities are introduced or required for the central claim.

pith-pipeline@v0.9.0 · 5795 in / 1100 out tokens · 69891 ms · 2026-05-21T08:56:44.397072+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "The UCSF-PDGM-VQA dataset curated in this study includes 2,387 question-answer pairs... from 473 brain MRI studies."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

[1] It is About "Time": Academic Neuroradiologist Time Distribution for Interpreting Brain MRIs. Academic Radiology, 2018.

[2] Krupinski, E. A.; Berbaum, K. S.; Caldwell, R. T.; Schartz, K. M.; Kim, J. Journal of the American College of Radiology.

[3] Xin, Yu; Ates, Gorkem Can; Gong, Kuang; Shao, Wei. Med3d. 2025.

[4] Learning neuroimaging models from health system-scale data. Nature Biomedical Engineering.

[5] Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems.

[6] VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge. 2025.

[7] Calabrese, Evan; Villanueva-Meyer, Javier E.; Rudie, Jeffrey D.; Rauschecker, Andreas M.; Baid, Ujjwal; Bakas, Spyridon; Cha, Soonmee; Mongan, John T.; Hess, Christopher P. Radiology: Artificial Intelligence, 2022. doi:10.1148/ryai.220058

[8] Calabrese, Evan; Villanueva-Meyer, Javier E.; Rudie, Jeffrey D.; Rauschecker, Andreas M.; Baid, Ujjwal; Bakas, Spyridon; Cha, Soonmee; Mongan, John T.; Hess, Christopher P. 2022. doi:10.7937/tcia.bdgf-8v37

[9] Bai, Fan; Du, Yuxin; Huang, Tiejun; Meng, Max Q-H; Zhao, Bo. M3

[10] A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 2018.
[11] Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

[12] VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging. 2024.

[13] Dcformer: Efficient 3D vision-language modeling with decomposed convolutions. arXiv preprint arXiv:2502.05091.

[14] Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[15] Andrew Sellergren; Sahar Kazemzadeh; Tiam Jaroensri; Atilla Kiraly; Madeleine Traverse; Timo Kohlberger; Shawn Xu; Fayaz Jamil; Cían Hughes; Charles Lau; Justin Chen; Fereshteh Mahvar; Liron Yatziv; Tiffany Chen; Bram Sterling; Stefanie Anna Baby; Susanna Maria Baby; Jeremy Lai; Samuel Schmidgall; L... Steiner; Can Kirmizibayrak; Rory Pilgrim; Daniel Golden; Lin Yang.

[16] Sheng Zhang; Yanbo Xu; Naoto Usuyama; Hanwen Xu; Jaspreet Bagga; Robert Tinn; Sam Preston; Rajesh Rao; Mu Wei; Naveen Valluri; Cliff Wong; Andrea Tupini; Yu Wang; Matt Mazzola; Swadheen Shukla; Lars Liden; Jianfeng Gao; Angela Crabtree; Brian Piening; Carlo Bifulco; Matthew P. Lungren; Tristan... NEJM AI 2(1), 2400640 (2025). https://doi.org/10.1056/AIoa2400640 https://ai.nejm.org/doi/pdf/10.1056/AIoa2400640

[17] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. 2025.

[18] Qwen3 Technical Report. 2025.

[19] Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights. arXiv preprint arXiv:2507.08036.

[20] Moor, Michael; Huang, Qian; Wu, Shirley; Yasunaga, Michihiro; Dalmia, Yash; Leskovec, Jure; Zakka, Cyril; Reis, Eduardo Pontes; Rajpurkar, Pranav. Med-. 2023.
[21] Taming vision-language models for medical image analysis: A comprehensive review. arXiv preprint arXiv:2506.18378.

[22] Eslami, Sedigheh; Meinel, Christoph; De Melo, Gerard. Pub

[23] Multi-task paired masking with alignment modeling for medical vision-language pre-training. IEEE Transactions on Multimedia, 2023.

[24] Dhinagar, Nikhil J.; Jagad, Chirag; Senthilkumar, Pavithra; Thomopoulos, Sophia I.; Khan, Mahir H.; Liew, Sook-Lei; the ENIGMA-Stroke Recovery Working Group; Banaj, Nerisa; Boric, Michael R.; Boyd, Lara A.; Brodtmann, Amy; Cassidy, Jessica M.; Conforto, Adriana B.; Cramer, Steven C.; Dula, Adrienne N.; Geranmay... 2026.

[25] He, Xuehai; Zhang, Yichen; Mou, Luntian; Xing, Eric; Xie, Pengtao. Path

[26] Zhang, Xiaoman; Wu, Chaoyi; Zhao, Ziheng; Lin, Weixiong; Zhang, Ya; Wang, Yanfeng; Xie, Weidi.

[27] Hamamci, Ibrahim Ethem; Er, Sezgin; Wang, Chenyu; Almas, Furkan; Simsek, Ayse Gulnihan; Esirgun, Sevval Nil; Dogan, Irem; Durugol, Omer Faruk; Hou, Benjamin; Shit, Suprosanna; Dai, Weicheng; Xu, Murong; Reynaud, Hadrien; Dasdelen, Muhammed Furkan; Wittmann, Bastian; Amiranashvili, Tamaz; Simsar, Enis; Sim... Generalist foundation models from a multimodal dataset for 3D computed tomography. doi:10.1038/s41551-025-01599-y

[28] Pillar-0: A New Frontier for Radiology Foundation Models. arXiv preprint arXiv:2511.17803.

[29] Blankemeier, Louis; Kumar, Ashwin; Cohen, Joseph Paul; Liu, Jiaming; Liu, Longchao; Van Veen, Dave; Gardezi, Syed Jamal Safdar; Yu, Hongkun; Paschali, Magdalini; Chen, Zhihong; Delbrouck, Jean-Benoit; Reis, Eduardo; Holland, Robbie; Truyts, Cesar; Bluethgen, Christian; Wu, Yufu; Lian, Long; Jensen, Malte ... Zaharchuk, Greg; Willis, Marc; Yala, Adam; Johnston, Andrew; Boutin, Robert D ... doi:10.1038/s41586-026-10181-8

[30] Yikuan Li; Ramsey M. Wehbe; Faraz S. Ahmad; Hanyin Wang; Yuan Luo. Clinical-. arXiv:2201.11838.
[31] Atlas: Multi-Scale Attention Improves Long Context Image Modeling. arXiv preprint arXiv:2503.12355.

[32] Boecking, Benedikt; Usuyama, Naoto; Bannur, Shruthi; Castro, Daniel C.; Schwaighofer, Anton; Hyland, Stephanie; Wetscherek, Maria; Naumann, Tristan; Nori, Aditya; Alvarez-Valle, Javier; Poon, Hoifung; Oktay, Ozan. Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Procee... doi:10.1007/978-3-031-20059-5_1

[33] Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial. Nature Medicine.

[34] LLaMA: Open and Efficient Foundation Language Models. 2023.

[35] Chiang, Wei-Lin; Li, Zhuohan; Lin, Zi; Sheng, Ying; Wu, Zhanghao; Zhang, Hao; Zheng, Lianmin; Zhuang, Siyuan; Zhuang, Yonghao; Gonzalez, Joseph E.; Stoica, Ion; Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90...

[36] Florence: A New Foundation Model for Computer Vision. 2021.

[37] Davit: Dual attention vision transformers. European Conference on Computer Vision, 2022.

[38] Unified contrastive learning in image-text-label space. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] Language Models are Unsupervised Multitask Learners. 2019.

[40] OVQA: A clinically generated visual question answering dataset. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[41] Visual Dialog for Radiology: Data Curation and First Steps. ViGIL@NeurIPS.

[42] Asma. Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain. 2021.

[43] Bae, Seongsu; Kyung, Daeun; Ryu, Jaehee; Cho, Eunbyeol; Lee, Gyubok; Kweon, Sunjun; Oh, Jeongwoo; Ji, Lei; Chang, Eric I-Chao; Kim, Tackeun; Choi, Edward. Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.

[44] Hu, Xinyue; Gu, Lin; An, Qiyuan; Zhang, Mengliang; Liu, Liangchen; Kobayashi, Kazuma; Harada, Tatsuya; Summers, Ronald; Zhu, Yingying. February 2025. doi:10.13026/e6dd-cn74

[45] Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning.

[46] Current clinical brain tumor imaging. Neurosurgery, 2017.

[47] GPT-4o System Card. 2024.

[48] OpenAI. 2025.

[49] MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. 2024.

[50] Hatamizadeh, Ali; Nath, Vishwesh; Tang, Yucheng; Yang, Dong; Roth, Holger R; Xu, Daguang. 2021.
[51] Price, Mackenzie; Ballard, Christine; Benedetti, Julia; Neff, Corey; Cioffi, Gino; Waite, Kristin A; Kruchko, Carol; Barnholtz-Sloan, Jill S; Ostrom, Quinn T.

[52] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric P.; Zhang, Hao; Gonzalez, Joseph E.; Stoica, Ion. Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.

[53] Mirage the illusion of visual understanding. arXiv preprint arXiv:2603.21687.

[54] Qwen2.5-VL Technical Report. 2025.

[55] Qwen2.5 Technical Report. 2025.

[56] OpenAI GPT-5 System Card. 2026.

[57] The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification. 2021.

[58] Gemma 3 Technical Report. 2025.

[1] [1]

Academic Radiology , volume=

It is About" Time": Academic Neuroradiologist Time Distribution for Interpreting Brain MRIs , author=. Academic Radiology , volume=. 2018 , publisher=

work page 2018

[2] [2]

Krupinski, E. A. and Berbaum, K. S. and Caldwell, R. T. and Schartz, K. M. and Kim, J. , title =. Journal of the American College of Radiology , year =

work page

[3] [3]

Xin, Yu and Ates, Gorkem Can and Gong, Kuang and Shao, Wei , journal=. Med3d. 2025 , publisher=

work page 2025

[4] [4]

, author=

Learning neuroimaging models from health system-scale data. , author=. Nature biomedical engineering , year=

work page

[5] [5]

Advances in Neural Information Processing Systems , volume=

Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=

work page

[6] [6]

2025 , eprint=

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge , author=. 2025 , eprint=

work page 2025

[7] [7]

and Rudie, Jeffrey D

Calabrese, Evan and Villanueva-Meyer, Javier E. and Rudie, Jeffrey D. and Rauschecker, Andreas M. and Baid, Ujjwal and Bakas, Spyridon and Cha, Soonmee and Mongan, John T. and Hess, Christopher P. , title =. Radiology: Artificial Intelligence , volume =. 2022 , doi =. https://doi.org/10.1148/ryai.220058 , abstract =

work page doi:10.1148/ryai.220058 2022

[8] [8]

and Rudie, Jeffrey D

Calabrese, Evan and Villanueva-Meyer, Javier E. and Rudie, Jeffrey D. and Rauschecker, Andreas M. and Baid, Ujjwal and Bakas, Spyridon and Cha, Soonmee and Mongan, John T. and Hess, Christopher P. , title =. 2022 , publisher =. doi:10.7937/tcia.bdgf-8v37 , url =

work page doi:10.7937/tcia.bdgf-8v37 2022

[9] [9]

Bai, Fan and Du, Yuxin and Huang, Tiejun and Meng, Max Q-H and Zhao, Bo , journal=. M3

work page

[10] [10]

Scientific data , volume=

A dataset of clinically generated visual questions and answers about radiology images , author=. Scientific data , volume=. 2018 , publisher=

work page 2018

[11] [11]

2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=

Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering , author=. 2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=. 2021 , organization=

work page 2021

[12] [12]

2024 , eprint=

VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging , author=. 2024 , eprint=

work page 2024

[13] [13]

arXiv preprint arXiv:2502.05091 , year=

Dcformer: Efficient 3D vision-language modeling with decomposed convolutions , author=. arXiv preprint arXiv:2502.05091 , year=

work page arXiv

[14] [14]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page

[15] [15]

Steiner and Can Kirmizibayrak and Rory Pilgrim and Daniel Golden and Lin Yang , journal=

Andrew Sellergren and Sahar Kazemzadeh and Tiam Jaroensri and Atilla Kiraly and Madeleine Traverse and Timo Kohlberger and Shawn Xu and Fayaz Jamil and Cían Hughes and Charles Lau and Justin Chen and Fereshteh Mahvar and Liron Yatziv and Tiffany Chen and Bram Sterling and Stefanie Anna Baby and Susanna Maria Baby and Jeremy Lai and Samuel Schmidgall and L...

work page

[16] [16]

NEJM AI2(1), 2400640 (2025) https: //doi.org/10.1056/AIoa2400640 https://ai.nejm.org/doi/pdf/10.1056/AIoa2400640

Sheng Zhang and Yanbo Xu and Naoto Usuyama and Hanwen Xu and Jaspreet Bagga and Robert Tinn and Sam Preston and Rajesh Rao and Mu Wei and Naveen Valluri and Cliff Wong and Andrea Tupini and Yu Wang and Matt Mazzola and Swadheen Shukla and Lars Liden and Jianfeng Gao and Angela Crabtree and Brian Piening and Carlo Bifulco and Matthew P. Lungren and Tristan...

work page doi:10.1056/aioa2400640 2025

[17] [17]

2025 , eprint=

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning , author=. 2025 , eprint=

work page 2025

[18] [18]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[19] [19]

arXiv preprint arXiv:2507.08036 , year=

Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights , author=. arXiv preprint arXiv:2507.08036 , year=

work page arXiv

[20] [20]

Moor, Michael and Huang, Qian and Wu, Shirley and Yasunaga, Michihiro and Dalmia, Yash and Leskovec, Jure and Zakka, Cyril and Reis, Eduardo Pontes and Rajpurkar, Pranav , booktitle=. Med-. 2023 , organization=

work page 2023

[21] [21]

arXiv preprint arXiv:2506.18378 , year=

Taming vision-language models for medical image analysis: A comprehensive review , author=. arXiv preprint arXiv:2506.18378 , year=

work page arXiv

[22] Eslami, Sedigheh; Meinel, Christoph; De Melo, Gerard. Pub

[23] Multi-task paired masking with alignment modeling for medical vision-language pre-training. IEEE Transactions on Multimedia, 2023.

[24] Dhinagar, Nikhil J.; Jagad, Chirag; Senthilkumar, Pavithra; Thomopoulos, Sophia I.; Khan, Mahir H.; Liew, Sook-Lei; the ENIGMA-Stroke Recovery Working Group; Banaj, Nerisa; Boric, Michael R.; Boyd, Lara A.; Brodtmann, Amy; Cassidy, Jessica M.; Conforto, Adriana B.; Cramer, Steven C.; Dula, Adrienne N.; Geranmay... 2026.

[25] He, Xuehai; Zhang, Yichen; Mou, Luntian; Xing, Eric; Xie, Pengtao. Path

[26] Zhang, Xiaoman; Wu, Chaoyi; Zhao, Ziheng; Lin, Weixiong; Zhang, Ya; Wang, Yanfeng; Xie, Weidi.

[27] Hamamci, Ibrahim Ethem; Er, Sezgin; Wang, Chenyu; Almas, Furkan; Simsek, Ayse Gulnihan; Esirgun, Sevval Nil; Dogan, Irem; Durugol, Omer Faruk; Hou, Benjamin; Shit, Suprosanna; Dai, Weicheng; Xu, Murong; Reynaud, Hadrien; Dasdelen, Muhammed Furkan; Wittmann, Bastian; Amiranashvili, Tamaz; Simsar, Enis; Sim... Generalist foundation models from a multimodal dataset for 3D computed tomography. doi:10.1038/s41551-025-01599-y

[28] Pillar-0: A New Frontier for Radiology Foundation Models. arXiv preprint arXiv:2511.17803.

[29] Blankemeier, Louis; Kumar, Ashwin; Cohen, Joseph Paul; Liu, Jiaming; Liu, Longchao; Van Veen, Dave; Gardezi, Syed Jamal Safdar; Yu, Hongkun; Paschali, Magdalini; Chen, Zhihong; Delbrouck, Jean-Benoit; Reis, Eduardo; Holland, Robbie; Truyts, Cesar; Bluethgen, Christian; Wu, Yufu; Lian, Long; Jensen, Malte ... [...] Zaharchuk, Greg; Willis, Marc; Yala, Adam; Johnston, Andrew; Boutin, Robert D ... doi:10.1038/s41586-026-10181-8

[30] Yikuan Li, Ramsey M. Wehbe, Faraz S. Ahmad, Hanyin Wang, Yuan Luo. Clinical-. arXiv:2201.11838.

[31] Atlas: Multi-Scale Attention Improves Long Context Image Modeling. arXiv preprint arXiv:2503.12355.

[32] Boecking, Benedikt; Usuyama, Naoto; Bannur, Shruthi; Castro, Daniel C.; Schwaighofer, Anton; Hyland, Stephanie; Wetscherek, Maria; Naumann, Tristan; Nori, Aditya; Alvarez-Valle, Javier; Poon, Hoifung; Oktay, Ozan. Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Procee... doi:10.1007/978-3-031-20059-5_1

[33] Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial. Nature Medicine.

[34] LLaMA: Open and Efficient Foundation Language Models. 2023.

[35] Chiang, Wei-Lin; Li, Zhuohan; Lin, Zi; Sheng, Ying; Wu, Zhanghao; Zhang, Hao; Zheng, Lianmin; Zhuang, Siyuan; Zhuang, Yonghao; Gonzalez, Joseph E.; Stoica, Ion; Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90...

[36] Florence: A New Foundation Model for Computer Vision. 2021.

[37] DaViT: Dual attention vision transformers. European Conference on Computer Vision, 2022.

[38] Unified contrastive learning in image-text-label space. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] Language Models are Unsupervised Multitask Learners. 2019.

[40] OVQA: A clinically generated visual question answering dataset. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[41] Visual Dialog for Radiology: Data Curation and First Steps. ViGIL@NeurIPS.

[42] Asma. Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain. 2021.

[43] Bae, Seongsu; Kyung, Daeun; Ryu, Jaehee; Cho, Eunbyeol; Lee, Gyubok; Kweon, Sunjun; Oh, Jeongwoo; Ji, Lei; Chang, Eric I-Chao; Kim, Tackeun; Choi, Edward. Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.

[44] Hu, Xinyue; Gu, Lin; An, Qiyuan; Zhang, Mengliang; Liu, Liangchen; Kobayashi, Kazuma; Harada, Tatsuya; Summers, Ronald; Zhu, Yingying. February 2025. doi:10.13026/e6dd-cn74

[45] Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning.

[46] Current clinical brain tumor imaging. Neurosurgery, 2017.

[47] GPT-4o System Card. 2024.

[48] OpenAI. 2025.

[49] MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. 2024.

[50] Hatamizadeh, Ali; Nath, Vishwesh; Tang, Yucheng; Yang, Dong; Roth, Holger R; Xu, Daguang. 2021.

[51] Price, Mackenzie; Ballard, Christine; Benedetti, Julia; Neff, Corey; Cioffi, Gino; Waite, Kristin A; Kruchko, Carol; Barnholtz-Sloan, Jill S; Ostrom, Quinn T.

[52] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric P.; Zhang, Hao; Gonzalez, Joseph E.; Stoica, Ion. Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.

[53] Mirage the illusion of visual understanding. arXiv preprint arXiv:2603.21687.

[54] Qwen2.5-VL Technical Report. 2025.

[55] Qwen2.5 Technical Report. 2025.

[56] OpenAI GPT-5 System Card. 2026.

[57] The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification. 2021.

[58] Gemma 3 Technical Report. 2025.