NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
Pith reviewed 2026-05-21 06:50 UTC · model grok-4.3
The pith
AI models lag behind text-only baselines on a new 3D brain MRI question benchmark
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NeuroQA supplies 56,953 verified QA pairs from full 3D brain MRI volumes of 12,977 subjects spanning ages 5-104 and five clinical domains. It uses 131 image-grounded templates answerable from a 3-plane viewer and 72 image-informed templates whose ground truth comes from volumetry or clinical instruments. Every pair is checked by a 38-rule pipeline and two rounds of expert review, with zero same-subject contradictions across templates. On closed-format test items the best zero-shot VLM reaches 47.5 percent accuracy and a supervised 3D CNN baseline reaches 43.7 percent, both below the 49.4 percent text-only majority-template floor.
What carries the argument
The 38-rule deterministic pipeline, plus answer-distribution refinement and a separate image-grounding protocol. Together these force questions to require the MRI volume while preserving clinical validity.
If this is right
- The benchmark enables systematic testing of 11 reasoning skills in Yes/No, multiple-choice, and open formats using full 3D volumes rather than 2D slices.
- It supports model development across Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment with subject-level splits to avoid leakage.
- Public QA pairs for open datasets and reproducible scripts for restricted ones allow broad use while respecting data agreements.
- Clinician evaluation of 100 frozen test items on a three-plane viewer confirms alignment with real diagnostic practice.
- The held-out private test set and online leaderboard provide a stable way to track progress on image-grounded medical VQA.
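The subject-level splits mentioned in the list above can be illustrated with a short sketch. It assumes a hash-based assignment, which is one common way to keep every QA pair of a subject in the same split; the field name `subject_id`, the split fractions, and the toy items are hypothetical and are not taken from the NeuroQA release.

```python
import hashlib

def assign_split(subject_id: str, val_frac: float = 0.1, test_frac: float = 0.2) -> str:
    """Map a subject ID to a split deterministically via a hash bucket."""
    digest = hashlib.sha256(subject_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

def split_items(items):
    """Group QA items by split; all items of one subject land in the same split."""
    splits = {"train": [], "val": [], "test": []}
    for item in items:
        splits[assign_split(item["subject_id"])].append(item)
    return splits

# Hypothetical toy data: 200 questions spread over 50 subjects.
items = [{"subject_id": f"sub-{i % 50:03d}", "question": f"q{i}"} for i in range(200)]
splits = split_items(items)
subjects = {name: {it["subject_id"] for it in rows} for name, rows in splits.items()}
```

Because the split depends only on the subject ID, adding new questions for an existing subject cannot move that subject across splits, which is the property that prevents leakage.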
Where Pith is reading between the lines
- Models that improve on NeuroQA could support more reliable automated review of brain scans in radiology workflows.
- The use of quantitative volumetry for ground truth offers a template for building similar benchmarks in other 3D medical imaging modalities.
- Persistent gaps may point to the need for architectural changes in VLMs to better process true volumetric data instead of slice projections.
- The verification approach could be adapted to create image-grounded QA resources for additional clinical prediction tasks.
Load-bearing premise
The answer-distribution refinement and image-grounding protocol successfully remove text-only shortcuts while preserving clinical validity without introducing selection biases that affect model rankings.
What would settle it
Finding that model accuracy on closed-format items stays the same or rises when the actual MRI volumes are replaced by blank or noise images would show the questions do not require visual input.
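A minimal sketch of the ablation described above, under stated assumptions: `predict` is a hypothetical callable that takes a question and a volume and returns an answer string, and volumes are flat lists of voxel intensities for simplicity (real volumes would be 3D arrays). The two toy models exist only to show how the comparison reads.

```python
import random

def accuracy(predict, items, transform=lambda volume: volume):
    """Fraction of items answered correctly after `transform` is applied to each volume."""
    correct = sum(
        predict(item["question"], transform(item["volume"])) == item["answer"]
        for item in items
    )
    return correct / len(items)

def blank(volume):
    """Replace every voxel with zero."""
    return [0.0] * len(volume)

def noise(volume, seed=0):
    """Replace every voxel with uniform noise (same seed for every item)."""
    rng = random.Random(seed)
    return [rng.random() for _ in volume]

def image_necessity_report(predict, items):
    """Accuracy with real, blank, and noise inputs."""
    return {
        "real": accuracy(predict, items),
        "blank": accuracy(predict, items, blank),
        "noise": accuracy(predict, items, noise),
    }

# Hypothetical toy models: one reads the volume, one ignores it.
def looks_at_image(question, volume):
    return "yes" if sum(volume) / len(volume) > 0.5 else "no"

def ignores_image(question, volume):
    return "yes"

items = [
    {"question": "Is the region bright?", "volume": [0.9] * 8, "answer": "yes"},
    {"question": "Is the region bright?", "volume": [0.9] * 8, "answer": "yes"},
    {"question": "Is the region bright?", "volume": [0.1] * 8, "answer": "no"},
    {"question": "Is the region bright?", "volume": [0.1] * 8, "answer": "no"},
]
visual_report = image_necessity_report(looks_at_image, items)
text_report = image_necessity_report(ignores_image, items)
```

In this toy setup the model that reads the volume drops from perfect accuracy to chance when the input is blanked, while the model that ignores the volume scores the same in all three conditions. The second pattern on real benchmark items is what would indicate that questions do not require visual input.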
Original abstract
We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.
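The abstract does not spell out how the text-only majority-template floor is computed. One plausible reading, sketched below as an assumption, is a baseline that ignores the image and always predicts the most frequent training answer for each template. The field names `template_id` and `answer` and the toy items are hypothetical.

```python
from collections import Counter, defaultdict

def majority_template_baseline(train_items, test_items):
    """Accuracy of predicting, per template, its most frequent training answer."""
    counts = defaultdict(Counter)
    for item in train_items:
        counts[item["template_id"]][item["answer"]] += 1
    majority = {t: c.most_common(1)[0][0] for t, c in counts.items()}
    # Templates unseen in training get no prediction and count as wrong.
    correct = sum(majority.get(item["template_id"]) == item["answer"] for item in test_items)
    return correct / len(test_items)

# Hypothetical toy data with two templates.
train = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "A"},
]
test = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "B"},
]
floor = majority_template_baseline(train, test)
```

A model that scores below this kind of floor, as the reported VLM and 3D CNN do, is not yet extracting more from the volume than template identity alone provides.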
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents NeuroQA, a large-scale benchmark for 3D brain MRI visual question answering comprising 56,953 QA pairs from 12,977 subjects across 12 datasets and five clinical domains (Alzheimer's, Parkinson's, tumors, white matter disease, neurodevelopment). It defines 203 templates (131 image-grounded via 3-plane viewer, 72 image-informed via volumetry or clinical scores), generated through a 38-rule deterministic pipeline with two rounds of expert review that guarantees zero same-subject contradictions. Answer-distribution refinement reduces closed-format text-only accuracy from >80% to 44.6%, and baselines show the best zero-shot VLM at 47.5% and a supervised 3D CNN at 43.7%, both below the 49.4% text-only majority-template floor. The work includes a separate image-grounding protocol, clinician evaluation of 100 items, subject-level splits, a held-out private test set, and a two-tier release (public QA for open datasets, reproducible scripts for DUA-restricted data).
Significance. If the construction and validation hold, NeuroQA supplies a valuable, clinically grounded resource that moves beyond 2D-slice or narrow-label medical VQA by pairing every question with full 3D volumes. The 38-rule pipeline, two rounds of expert review, zero same-subject contradictions, and independent clinician assessment on a three-plane viewer provide concrete, reproducible support for data quality. The public/private release strategy, subject-level splits, and online leaderboard further enhance utility and reproducibility for the community. The reported gap between VLMs/CNNs and the majority baseline, if free of refinement artifacts, would usefully quantify current limitations in image-grounded 3D reasoning.
major comments (1)
- [Abstract] Abstract and methods description of answer-distribution refinement: the process that reduces closed-format text-only accuracy from >80% to 44.6% and yields the 47.5% VLM vs. 49.4% majority-template comparison is presented at a high level without explicit selection/reweighting criteria or verification that clinical validity and image necessity are preserved. Because this step is load-bearing for the central claim that current models fall short specifically on image-grounded 3D understanding (rather than an artifact of balancing), additional detail or pseudocode would be required to rule out selection biases that could affect VLM vs. text-only rankings.
minor comments (1)
- [Abstract] The abstract states that an image-grounding protocol is released with the benchmark; a brief pointer to its location or a one-sentence summary of its procedure in the main text would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of NeuroQA and the constructive feedback on clarifying the answer-distribution refinement. We address the single major comment below and will incorporate the requested details in the revised manuscript.
- Referee: [Abstract] Abstract and methods description of answer-distribution refinement: the process that reduces closed-format text-only accuracy from >80% to 44.6% and yields the 47.5% VLM vs. 49.4% majority-template comparison is presented at a high level without explicit selection/reweighting criteria or verification that clinical validity and image necessity are preserved. Because this step is load-bearing for the central claim that current models fall short specifically on image-grounded 3D understanding (rather than an artifact of balancing), additional detail or pseudocode would be required to rule out selection biases that could affect VLM vs. text-only rankings.
  Authors: We agree that the abstract presents the refinement at a high level and that explicit criteria are needed to support the central claim. In the revised manuscript we will expand Section 3.3 with the precise selection and reweighting rules, including pseudocode for the iterative per-template distribution adjustment. The added description will document that adjustments are performed only after the two rounds of expert review, that clinical validity is preserved by retaining only pairs that remain consistent with FreeSurfer measurements and radiology reports, and that image necessity is independently verified by the released image-grounding protocol (which flags items answerable without the volume). We will also include a short verification table showing that post-refinement text-only accuracy drops uniformly across domains without altering the relative difficulty ordering between image-grounded and image-informed templates. These additions will confirm that the observed gap (47.5% VLM vs. 49.4% majority baseline) is not an artifact of the balancing procedure.
  Revision: yes
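The actual selection and reweighting rules are not given on this page, so the following is not the authors' procedure. Purely as an illustration of what an iterative per-template distribution adjustment could look like, this sketch downsamples the most frequent answer of each template until its share falls to a cap. The function name, the `max_share` parameter, the choice to drop single-answer templates, and the field names are all assumptions.

```python
import random
from collections import defaultdict

def cap_majority_answer(items, max_share=0.5, seed=0):
    """Per template, drop random items of the most frequent answer until its share <= cap."""
    rng = random.Random(seed)
    by_template = defaultdict(lambda: defaultdict(list))
    for item in items:
        by_template[item["template_id"]][item["answer"]].append(item)
    kept = []
    for answers in by_template.values():
        if len(answers) < 2:
            continue  # single-answer template: trivially guessable, dropped in this sketch
        groups = [rng.sample(group, len(group)) for group in answers.values()]  # shuffled copies
        cap = max(max_share, 1 / len(groups))  # a cap below 1/k is unreachable with k answers
        while True:
            total = sum(len(group) for group in groups)
            largest = max(groups, key=len)
            if len(largest) / total <= cap:
                break
            largest.pop()  # remove one item from the currently largest answer group
        kept.extend(item for group in groups for item in group)
    return kept

# Hypothetical toy data: T1 is skewed, T2 is balanced, T3 has a single answer.
items = (
    [{"template_id": "T1", "answer": "yes"}] * 8
    + [{"template_id": "T1", "answer": "no"}] * 2
    + [{"template_id": "T2", "answer": a} for a in "ABC" for _ in range(3)]
    + [{"template_id": "T3", "answer": "yes"}] * 5
)
refined = cap_majority_answer(items)
```

A sketch like this only balances answer frequencies. It says nothing about clinical validity or image necessity, which is why the referee asks for the real criteria and a separate verification.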
Circularity Check
NeuroQA benchmark construction shows no circularity
The paper presents an empirical benchmark construction using a deterministic 38-rule pipeline, expert review, and answer-distribution refinement to generate and validate 56,953 QA pairs from 3D MRI volumes. No mathematical derivations, equations, or fitted parameters are claimed as predictions; the reported accuracies (e.g., VLM at 47.5% below 49.4% majority floor) are direct empirical measurements on the constructed test set. The refinement step is a data-processing choice to reduce text-only shortcuts, not a self-definitional loop or self-citation that bears the central claim. The work is self-contained as a dataset release with verifiable generation scripts and splits, independent of any prior author results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: FreeSurfer measurements and radiology report fields provide reliable ground truth for QA verification.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Clifford R Jack Jr, Ronald C Petersen, Yue Cheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Stephen C Waring, Eric G Tangalos, and Emre Kokmen. Prediction of ad with mri-based hippocampal volume in mild cognitive impairment.Neurology, 52(7):1397–1397, 1999
work page 1999
-
[2]
James H Cole, Riccardo E Marioni, Sarah E Harris, and Ian J Deary. Brain age and other bodily ‘ages’: implications for neuropsychiatry.Molecular psychiatry, 24(2):266–281, 2019
work page 2019
-
[3]
Mohamad Habes, Guray Erus, Jon B Toledo, Tianhao Zhang, Nick Bryan, Lenore J Launer, Yves Rosseel, Deborah Janowitz, Jimit Doshi, Sandra Van der Auwera, et al. White matter hyperintensities and imaging patterns of brain ageing in the general population.Brain, 139(4):1164–1179, 2016
work page 2016
-
[4]
Longitudinal brain volume changes in major depressive disorder
Dilara Yüksel, Jennifer Engelen, Verena Schuster, Bruno Dietsche, Carsten Konrad, Andreas Jansen, Udo Dannlowski, Tilo Kircher, and Axel Krug. Longitudinal brain volume changes in major depressive disorder. Journal of Neural Transmission, 125(10):1433–1447, 2018
work page 2018
-
[5]
Mapping cortical brain asymmetry in 17,141 healthy individuals worldwide via the enigma consortium
Xiang-Zhen Kong, Samuel R Mathias, Tulio Guadalupe, ENIGMA Laterality Working Group, David C Glahn, Barbara Franke, Fabrice Crivello, Nathalie Tzourio-Mazoyer, Simon E Fisher, Paul M Thompson, et al. Mapping cortical brain asymmetry in 17,141 healthy individuals worldwide via the enigma consortium. Proceedings of the National Academy of Sciences, 115(22):...
work page 2018
-
[6]
Foundation models for generalist medical artificial intelligence.Nature, 616 (7956):259–265, 2023
Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence.Nature, 616 (7956):259–265, 2023. 10
work page 2023
-
[7]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018
work page 2018
-
[10]
PathVQA: 30000+ Questions for Medical Visual Question Answering
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[11]
Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding
Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, and Pranav Rajpurkar. Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding. InBiocomputing 2026: Proceedings of the Pacific Symposium, pages 251–264. World Scientific, 2025
work page 2026
-
[12]
arXiv preprint arXiv:2603.21687 , year=
Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026
-
[13]
Don’t just assume; look and answer: Overcoming priors for visual question answering
Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4971–4980, 2018
work page 2018
-
[14]
Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, and Yizhou Sun. A multimodal llm approach for visual question answering on multiparametric 3d brain mri.arXiv preprint arXiv:2509.25889, 2025
-
[15]
Yuli Wang, Jian Peng, Yuwei Dai, Craig Jones, Haris Sair, Jinglai Shen, Nicolas Loizou, Jing Wu, Wen-Chi Hsu, Maliha Imami, et al. Enhancing vision-language models for medical imaging: bridging the 3d gap with innovative slice selection.Advances in Neural Information Processing Systems, 37:99947–99964, 2024
work page 2024
-
[16]
Zhihao Peng, Cheng Wang, Shengyuan Liu, Zhiying Liang, Zanting Ye, Minjie Ju, PeterYM Woo, and Yixuan Yuan. Omnibrainbench: A comprehensive multimodal benchmark for brain imaging analysis across multi-stage clinical tasks.arXiv preprint arXiv:2511.00846, 2025
-
[17]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017
work page 2017
-
[18]
Rahul S Desikan, Florent Ségonne, Bruce Fischl, Brian T Quinn, Bradford C Dickerson, Deborah Blacker, Randy L Buckner, Anders M Dale, R Paul Maguire, Bradley T Hyman, et al. An automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest.Neuroimage, 31(3):968–980, 2006
work page 2006
-
[19]
Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering
Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021
work page 2021
-
[20]
Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, et al. Mimic-ext-mimic-cxr-vqa: a complex, diverse, and large-scale visual question answering dataset for chest x-ray images.PhysioNet, 2024
work page 2024
-
[21]
Vqa-med: Overview of the medical visual question answering task at imageclef 2019
Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. InProceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019
work page 2019
-
[22]
Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. InProceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021
work page 2021
-
[23]
Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, and Bo Yan. Microvqa++: High-quality microscopy reasoning dataset with weakly supervised graphs for multimodal large language model.arXiv preprint arXiv:2511.11407, 2025. 11
-
[24]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc- vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis
Bo Liu, Ke Zou, Li-Ming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao- Ming Wu, and Huazhu Fu. Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21310–21320, 2025
work page 2025
-
[26]
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Radqa: A question answering dataset to improve comprehension of radiology reports
Sarvesh Soni, Meghana Gudala, Atieh Pajouhi, and Kirk Roberts. Radqa: A question answering dataset to improve comprehension of radiology reports. InProceedings of the thirteenth language resources and evaluation conference, pages 6250–6259, 2022
work page 2022
-
[28]
Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, R Summers, and Yingying Zhu. Medicaldiff-vqa: a large-scale medical dataset for difference visual question answering on chest x-ray images.PhysioNet, 12:13, 2023
work page 2023
-
[29]
Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R Roth, and Marius George Linguraru. Lumen: Longitudinal multi-modal radiology model for prognosis and diagnosis.arXiv preprint arXiv:2602.21142, 2026
-
[30]
Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020
work page 2020
-
[31]
Visual hallucinations of multi-modal large language models
Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual hallucinations of multi-modal large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9614–9631, 2024
work page 2024
-
[32]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022
work page 2022
-
[33]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
work page 2021
-
[34]
Large language models encode clinical knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
work page 2023
-
[35]
Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, and Mingkun Xu. Mm-neuroonco: A multimodal benchmark and instruction dataset for mri-based brain tumor diagnosis.arXiv preprint arXiv:2602.22955, 2026
-
[36]
Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, and Zuozhu Liu. 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.arXiv preprint arXiv:2506.11147, 2025
-
[37]
The claude 3 model family: Opus, sonnet, haiku.Technical Report, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku.Technical Report, 2024
work page 2024
-
[38]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
work page 2023
-
[39]
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Nau- mann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023
work page 2023
-
[40]
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025. 12
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Towards generalist biomedical ai.Nejm Ai, 1(3): AIoa2300138, 2024
Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai.Nejm Ai, 1(3): AIoa2300138, 2024
work page 2024
-
[42]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special- purpose tuning? case study in medicine.arXiv preprint arXiv:2311.16452, 2023
-
[45]
Zijian Dong, Ruilin Li, Yilei Wu, Thuan T Nguyen, Joanna S Chong, Fang Ji, Nathanael R Tong, Christopher L Chen, and Juan H Zhou. Brain-jepa: Brain dynamics foundation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Systems, 37:86048– 86073, 2024
work page 2024
-
[46]
Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding.arXiv preprint arXiv:2302.12971, 2023
-
[47]
Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, and Ehsan Adeli. Geosae: Geometric prior-guided layer-wise sparse autoencoder annotation of brain mri foundation models, 2026. URL https://arxiv.org/abs/2605.01829
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[48]
Vladimir S Fonov, Alan C Evans, Robert C McKinstry, C Robert Almli, and DL Collins. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood.NeuroImage, 47:S102, 2009
work page 2009
-
[49]
Freesurfer.Neuroimage, 62(2):774–781, 2012
Bruce Fischl. Freesurfer.Neuroimage, 62(2):774–781, 2012
work page 2012
-
[50]
Autorg-brain: Grounded report generation for brain mri.arXiv preprint arXiv:2407.16684, 2024
Jiayu Lei, Xiaoman Zhang, Chaoyi Wu, Lisong Dai, Ya Zhang, Yanyong Zhang, Yanfeng Wang, Weidi Xie, and Yuehua Li. Autorg-brain: Grounded report generation for brain mri.arXiv preprint arXiv:2407.16684, 2024
-
[51]
John C Morris. The clinical dementia rating (cdr) current version and scoring rules.Neurology, 43(11): 2412–2412, 1993
work page 1993
-
[52]
Christopher G Goetz, Barbara C Tilley, Stephanie R Shaftman, Glenn T Stebbins, Stanley Fahn, Pablo Martinez-Martin, Werner Poewe, Cristina Sampaio, Matthew B Stern, Richard Dodel, et al. Movement disorder society-sponsored revision of the unified parkinson’s disease rating scale (mds-updrs): scale presentation and clinimetric testing results.Movement diso...
work page 2008
-
[53]
Mark Endo, Favour Nerrise, Qingyu Zhao, Edith V Sullivan, Li Fei-Fei, Victor W Henderson, Kilian M Pohl, Kathleen L Poston, and Ehsan Adeli. Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases.Nature machine intelligence, 6(9):1034–1045, 2024
work page 2024
-
[54]
Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, Clifford R Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3):201–209, 2010
work page 2010
-
[55]
Kathryn A Ellis, Ashley I Bush, David Darby, Daniela De Fazio, Jonathan Foster, Peter Hudson, Nicola T Lautenschlager, Nat Lenzo, Ralph N Martins, Paul Maruff, et al. The australian imaging, biomarkers and lifestyle (aibl) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease...
work page 2009
-
[56]
The parkinson progression marker initiative (ppmi).Progress in neurobiology, 95(4):629–635, 2011
Kenneth Marek, Danna Jennings, Shirley Lasch, Andrew Siderowf, Caroline Tanner, Tanya Simuni, Chris Coffey, Karl Kieburtz, Emily Flagg, Sohini Chowdhury, et al. The parkinson progression marker initiative (ppmi).Progress in neurobiology, 95(4):629–635, 2011
work page 2011
-
[57]
Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C Kitamura, Sarthak Pati, et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification.arXiv preprint arXiv:2107.02314, 2021. 13
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[58]
Dominic LaBella, Katherine Schumacher, Michael Mix, Kevin Leu, Shan McBurney-Lin, Pierre Nedelec, Javier Villanueva-Meyer, David R Raleigh, Jonathan Shapey, Tom Vercauteren, et al. The 2024 brain tumor segmentation challenge meningioma radiotherapy (brats-men-rt) dataset.Scientific Data, 2026
work page 2024
-
[59]
Hugo J Kuijf, J Matthijs Biesbroek, Jeroen De Bresser, Rutger Heinen, Simon Andermatt, Mariana Bento, Matt Berseth, Mikhail Belyaev, M Jorge Cardoso, Adria Casamitjana, et al. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge. IEEE transactions on medical imaging, 38(11):2556–2...
work page 2019
-
[60]
Roberto Souza, Oeslle Lucena, Julia Garrafa, David Gobbi, Marina Saluzzi, Simone Appenzeller, Letícia Rittner, Richard Frayne, and Roberto Lotufo. An open, multi-vendor, multi-field-strength brain mr dataset and analysis of publicly available skull stripping methods agreement.NeuroImage, 170:482–494, 2018
work page 2018
-
[61]
The wu-minn human connectome project: an overview.Neuroimage, 80: 62–79, 2013
David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The wu-minn human connectome project: an overview.Neuroimage, 80: 62–79, 2013
work page 2013
-
[62]
Betty Jo Casey, Tariq Cannonier, May I Conley, Alexandra O Cohen, Deanna M Barch, Mary M Heitzeg, Mary E Soules, Theresa Teslovich, Danielle V Dellarco, Hugh Garavan, et al. The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites.Developmental cognitive neuroscience, 32:43–54, 2018
work page 2018
-
[63]
Ixi dataset-information extraction from images project (epsrc gr/s21533/02)[internet]
D Hill, S Williams, D Hawkes, and SM Smith. Ixi dataset-information extraction from images project (epsrc gr/s21533/02)[internet]. 2006 [cited 2013 may 7], 2006
work page 2006
-
[64]
Mohammad H. Abbasi and Ehsan Adeli. sMRI Processing Pipeline: A lightweight, end-to-end workflow for structural brain MRI preprocessing and quality control, 2025. URL https://doi.org/10.5281/ zenodo.17503175. Zenodo, doi: 10.5281/zenodo.17503175
-
[65]
Ziad S Nasreddine, Natalie A Phillips, Valérie Bédirian, Simon Charbonneau, Victor Whitehead, Isabelle Collin, Jeffrey L Cummings, and Howard Chertkow. The Montreal Cognitive Assessment, MoCA: a brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53(4):695–699, 2005
-
[66]
Marshal F Folstein, Susan E Folstein, and Paul R McHugh. Mini-mental state. Journal of Psychiatric Research, 12(3):189–198, 1975
-
[67]
Noelle E Carlozzi, David S Tulsky, Robert V Kail, and Jennifer L Beaumont. VI. NIH Toolbox Cognition Battery (CB): measuring processing speed. Monographs of the Society for Research in Child Development, 78(4):88–102, 2013
-
[68]
Thomas M Achenbach. Manual for the ASEBA school-age forms and profiles. University of Vermont Research Center for Children, Youth, and Families, 2001
-
[69]
Margaret M Hoehn and Melvin D Yahr. Parkinsonism: onset, progression, and mortality. Neurology, 17(5):427–442, 1967
-
[70]
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019
-
[71]
Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960
-
[72]
Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, and Hao Chen. MedBookVQA: A systematic and comprehensive medical benchmark derived from open-access book. arXiv preprint arXiv:2506.00855, 2025
-
[73]
Fan Bai, Yuxin Du, Tiejun Huang, Max Q-H Meng, and Bo Zhao. M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578, 2024
-
[74]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
-
[75]
J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
Contents of Appendix/Supplementary Material: A Limitations and Future Work; B Complete Quality Rule List; C Dataset Composition and Splits; D Leaderboard and Public Release; E Expert Review Details; E.1 First Round: Te...
-
[76]
Laterality removal (BraTS-GLI/MEN, 786 QA removed). Expert review flagged laterality labels as unreliable. Systematic verification of 220 unilateral BraTS-GLI subjects showed only 49% agreement between report-stated and voxel-based laterality, effectively random, because BraTS reports and NIfTI images may originate from different processing stages or conve...
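The agreement check described in this item can be illustrated with a minimal sketch. The function name `agreement_rate` and the toy labels below are hypothetical and not taken from the paper; the sketch only assumes that each subject has one report-stated and one voxel-based laterality label. For two roughly balanced labels, a rate near 0.5 is what chance alone would give, which is why 49% reads as effectively random.

```python
def agreement_rate(report_labels, voxel_labels):
    """Fraction of subjects whose two laterality labels match.

    Both arguments are equal-length sequences of labels such as
    "left" or "right" (hypothetical encoding, assumed for illustration).
    """
    if len(report_labels) != len(voxel_labels):
        raise ValueError("label sequences must have the same length")
    if not report_labels:
        raise ValueError("at least one subject is required")
    matches = sum(r == v for r, v in zip(report_labels, voxel_labels))
    return matches / len(report_labels)


# Toy data: four subjects, two of which agree.
report = ["left", "right", "left", "right"]
voxel = ["left", "left", "right", "right"]
rate = agreement_rate(report, voxel)
print(rate)
```

A chance-corrected statistic such as Cohen's kappa [71] would be a natural refinement when the two labels are not balanced.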
-
[77]
Cerebellopontine angle fix (BraTS-MEN, 28 QA corrected). Fifteen BraTS-MEN subjects with cerebellopontine angle (CPA) lesions were incorrectly labeled as “cerebellar” in Location questions. CPA lesions are extra-axial posterior fossa masses, anatomically distinct from cerebellar parenchymal lesions. Corrected to “posterior fossa.”
Table 6: Per-dataset c...
-
[78]
Subtle asymmetry removal (1,524 QA removed). 57% of laterality questions had less than 10% volume asymmetry between left and right structures, a difference imperceptible on visual inspection. These questions were removed across 7 datasets; questions with clearly visible asymmetry (>10% difference) were retained.
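The 10% threshold in this item can be sketched as a simple filter. The source does not state how the asymmetry percentage is normalized, so the definition below (absolute difference relative to the larger side) is an assumption, and the function names, field names, and volumes are hypothetical.

```python
def asymmetry_percent(left_volume, right_volume):
    """Percent volume difference relative to the larger side (assumed definition)."""
    larger = max(left_volume, right_volume)
    if larger <= 0:
        raise ValueError("at least one volume must be positive")
    return 100.0 * abs(left_volume - right_volume) / larger


def keep_laterality_question(left_volume, right_volume, threshold_percent=10.0):
    """Keep the question only when asymmetry strictly exceeds the threshold."""
    return asymmetry_percent(left_volume, right_volume) > threshold_percent


# Toy QA items with made-up volumes in cubic millimetres.
qa_items = [
    {"id": "q1", "left_mm3": 3500.0, "right_mm3": 3400.0},  # about 2.9% asymmetry
    {"id": "q2", "left_mm3": 3500.0, "right_mm3": 2800.0},  # 20% asymmetry
]
kept = [
    q["id"]
    for q in qa_items
    if keep_laterality_question(q["left_mm3"], q["right_mm3"])
]
print(kept)  # → ['q2']
```

Normalizing by the mean of the two sides instead of the larger side would shift borderline cases slightly, so the choice matters only for items close to the threshold.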
-
[79]
Signal wording standardization (654 QA reworded). The neuroradiologist recommended standard clinical phrasing: “What is the T1 signal intensity of the lesion?” was revised to “What is the signal intensity of the lesion on T1-weighted imaging?” No answers changed; only question text was reworded to match radiology reporting conventions.
Figure 6: Phase 1...
-
[80]
Anatomy questions. Both reviewers rated Anatomy questions as correct and highly relevant. The neuroradiologist noted that axial views are atypical for hippocampal assessment; however, this applies only to the 2D survey visualization. NEUROQA provides full 3D volumetric input to models, including all orientations.