pith. sign in

arxiv: 2606.21020 · v1 · pith:OWSSBIK4new · submitted 2026-06-19 · 💻 cs.CV · cs.AI

CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays

Pith reviewed 2026-06-26 15:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords chest X-rayvision-language modelslesion perceptionbenchmarkmedical imagingvisual groundingdomain adaptation
0
0 comments X

The pith

Vision-language models detect chest X-ray lesions at coarse level but lose accuracy on contours and attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CheXpercept, a benchmark that tests vision-language models on chest X-ray lesion perception through three sequential levels that follow a radiologist's typical process. The first level checks basic detection of a lesion's presence, the second evaluates the exact contour of the lesion, and the third extracts semantic attributes such as size or shape. Benchmarking fourteen models reveals solid results only on the initial detection step, followed by sharp drops in performance on the finer tasks. Medical-tuned models show virtually no gain over general models, indicating that existing adaptation methods fail to build genuine visual expertise for these images.

Core claim

The paper establishes that existing vision-language models achieve adequate performance only at the coarse level of lesion detection in chest X-rays, with accuracy degrading precipitously on deeper visual tasks of contour evaluation and attribute extraction, and that medical VLMs exhibit almost no perceptual advantage over general-domain counterparts.

What carries the argument

CheXpercept, a sequential multi-level perception benchmark built from 10,400 QA items across 2,100 chest X-rays and seven lesion types via a semi-automated pipeline reviewed by six medical experts.

If this is right

  • Clinical use of VLMs for chest X-ray analysis requires explicit improvements in fine-grained visual grounding beyond classification accuracy.
  • Domain adaptation techniques for medical VLMs must incorporate mechanisms that preserve perceptual detail across task depths.
  • Benchmark design for medical imaging should prioritize sequential workflows that test progressive levels of visual analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multi-level structure could be adapted to evaluate perception in other imaging modalities such as CT or MRI.
  • Training objectives for future models may need to include direct supervision on lesion contours and attributes rather than relying on classification alone.
  • The gap between coarse and fine performance suggests that scaling model size or data volume alone is unlikely to resolve the perceptual shortfall.

Load-bearing premise

The semi-automated generation pipeline and expert review produce QA items that accurately capture expert-level lesion perception and mirror a radiologist's cognitive workflow across the three levels.

What would settle it

A model that maintains high accuracy on the contour evaluation and attribute extraction levels, comparable to expert radiologists, would show the claim of systemic failure in current domain adaptation does not hold.

Figures

Figures reproduced from arXiv: 2606.21020 by Edward Choi, Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Jeong Yun Jang, Nalee Kim, Sang Hoon Seo.

Figure 1
Figure 1. Figure 1: Overview of the CheXpercept benchmark. The evaluation systematically advances from [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of the 20 lung sub-regions. They are formed by intersecting the basic upper, [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the suboptimal mask generation process. First, disjoint lung sub-regions [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: End-to-end illustration of a single deformation step on a pneumonia case in which one [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: System message used for all evaluations. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Stage 1 (detection) prompt templates. C.2 Stage 2: contour evaluation This stage is evaluated along both the RR (suboptimal mask) and RF (optimal mask) paths. The model is provided with a chest X-ray in which the candidate lesion mask is rendered as a colored overlay. Specifically, the model is asked to determine whether the presented mask requires major revision ( [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Stage 2 (contour evaluation) prompt template. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Stage 3 (contour revision) prompt templates. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Stage 4 (attribute extraction) prompt templates. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Evaluation pipeline for a lesion-free (LF) consolidation case. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Evaluation pipeline for an effusion revision-free (RF) case. Because the optimal mask is [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Evaluation pipeline for a cardiomegaly revision-required (RR) case. Ground-truth answers [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Annotation panel shown to experts during optimal-mask filtering. [PITH_FULL_IMAGE:figures/full_fig_p031_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Decision sheet used during optimal-mask filtering. [PITH_FULL_IMAGE:figures/full_fig_p032_14.png] view at source ↗
read the original abstract

The evaluation of vision-language models (VLMs) for chest X-ray (CXR) analysis has largely been limited to disease-presence classification without visual grounding. Such evaluations fail to verify the expert-level lesion perception necessary to ensure the clinical reliability of VLMs. To address these limitations, we introduce CheXpercept, a sequential, multi-level perception benchmark that mirrors a radiologist's cognitive workflow across coarse-level detection, fine-level contour evaluation and revision, and semantic-level attribute extraction. To ensure high clinical fidelity at scale, we construct the dataset using a semi-automated generation pipeline paired with a review by six medical experts. CheXpercept contains 10,400 QA items derived from 2,100 CXRs, covering seven clinically critical pulmonary and cardiac lesions. To demonstrate the current landscape of VLM perception, we benchmark 14 general and medical VLMs on CheXpercept. The models achieve adequate performance only at the coarse level, with accuracy degrading precipitously on deeper visual tasks. Notably, medical VLMs show almost no perceptual advantage over their general-domain counterparts, highlighting a systemic flaw in current domain adaptation. The code and dataset will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CheXpercept, a sequential multi-level benchmark for vision-language models on chest X-ray lesion perception. It consists of 10,400 QA items derived from 2,100 CXRs covering seven lesions, constructed via a semi-automated pipeline with review by six medical experts. The benchmark probes three levels mirroring radiologist workflow: coarse detection, fine contour evaluation/revision, and semantic attribute extraction. Benchmarking 14 general and medical VLMs shows adequate performance only at the coarse level with precipitous degradation on finer tasks, and no perceptual advantage for medical VLMs over general-domain counterparts. Code and dataset are to be released publicly.

Significance. If the benchmark items validly capture expert-level lesion perception without pipeline artifacts, the work would highlight a systemic limitation in current VLM domain adaptation for clinical CXR analysis. The public release of code and dataset strengthens reproducibility and enables future extensions or ablations by the community.

major comments (2)
  1. [Methods / Dataset Construction] Dataset construction (described in the methods): the semi-automated pipeline with six-expert review is presented as ensuring clinical fidelity, but no inter-rater agreement statistics, no comparison against fully manual expert annotations, and no ablation on cueing or ambiguity in fine-level contour/semantic questions are reported. This directly bears on the central claim that performance degradation reflects true perceptual deficits rather than benchmark construction artifacts.
  2. [Experiments / Results] Results section: the headline finding of 'almost no perceptual advantage' for medical VLMs and 'precipitous' degradation on deeper tasks is presented without error bars, per-model breakdowns by lesion type, or controls for question difficulty distribution across the three levels. These omissions make it impossible to assess whether the observed pattern is robust or driven by a subset of items.
minor comments (2)
  1. [Abstract] Abstract states headline results but supplies no quantitative scores or exclusion criteria; moving a concise summary table or key metrics into the abstract would improve readability.
  2. [Introduction / Benchmark Design] Notation for the three perception levels is introduced without an explicit diagram or example QA triple per level; adding one would clarify the sequential workflow claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Methods / Dataset Construction] Dataset construction (described in the methods): the semi-automated pipeline with six-expert review is presented as ensuring clinical fidelity, but no inter-rater agreement statistics, no comparison against fully manual expert annotations, and no ablation on cueing or ambiguity in fine-level contour/semantic questions are reported. This directly bears on the central claim that performance degradation reflects true perceptual deficits rather than benchmark construction artifacts.

    Authors: We agree that inter-rater agreement statistics would strengthen validation of clinical fidelity and will compute and report them from the expert review process in the revised methods section. We will also add ablations examining cueing effects and question ambiguity to the supplementary material. A direct comparison to fully manual annotations was not performed due to the design of the semi-automated pipeline for scalability; we will explicitly discuss this as a limitation in the revision. revision: partial

  2. Referee: [Experiments / Results] Results section: the headline finding of 'almost no perceptual advantage' for medical VLMs and 'precipitous' degradation on deeper tasks is presented without error bars, per-model breakdowns by lesion type, or controls for question difficulty distribution across the three levels. These omissions make it impossible to assess whether the observed pattern is robust or driven by a subset of items.

    Authors: We agree these additions are needed to demonstrate robustness. We will add error bars to all reported metrics, include per-model breakdowns by lesion type in the results, and provide an analysis of question difficulty distribution across the three levels to rule out imbalance as a confounding factor. revision: yes

standing simulated objections not resolved
  • Direct comparison against fully manual expert annotations

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential predictions

full rationale

The paper introduces CheXpercept as a new benchmark dataset and evaluates existing VLMs on it. The abstract and provided text describe a semi-automated QA generation pipeline reviewed by experts, followed by model benchmarking. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear. The central claims rest on empirical results from the benchmark rather than any derivation chain that reduces to its own construction by definition. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that expert-reviewed semi-automated QA items faithfully represent radiologist perception; no free parameters, axioms, or invented entities are introduced beyond standard dataset construction practices.

pith-pipeline@v0.9.1-grok · 5764 in / 1134 out tokens · 26470 ms · 2026-06-26T15:03:09.623807+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 2 canonical work pages

  1. [1]

    Mimic-ext-mimic-cxr-vqa: a complex, diverse, and large-scale visual question answering dataset for chest x-ray images.PhysioNet, 2024

    Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, et al. Mimic-ext-mimic-cxr-vqa: a complex, diverse, and large-scale visual question answering dataset for chest x-ray images.PhysioNet, 2024

  2. [2]

    Maira-2: Grounded radiology report generation.arXiv preprint arXiv:2406.04449, 2024

    Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, et al. Maira-2: Grounded radiology report generation.arXiv preprint arXiv:2406.04449, 2024

  3. [3]

    Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain

    Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. InProceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021

  4. [4]

    Radiotransformer: A cascaded global-focal transformer for visual attention–guided disease classification

    Moinak Bhattacharya, Shubham Jain, and Prateek Prasanna. Radiotransformer: A cascaded global-focal transformer for visual attention–guided disease classification. InEuropean Conference on Computer Vision, pages 679–698. Springer, 2022

  5. [5]

    Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  6. [6]

    Towards injecting medical visual knowledge into multimodal llms at scale

    Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. Towards injecting medical visual knowledge into multimodal llms at scale. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7346–7370, 2024

  7. [7]

    A vision-language foundation model to enhance efficiency of chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

    Zhihong Chen, Maya Varma, Justin Xu, Magdalini Paschali, Dave Van Veen, Andrew Johnston, Alaa Youssef, Louis Blankemeier, Christian Bluethgen, Stephan Altmayer, et al. A vision-language foundation model to enhance efficiency of chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

  8. [8]

    Instruction-guided lesion segmentation for chest x-rays with automatically generated large-scale dataset

    Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, and Edward Choi. Instruction-guided lesion segmentation for chest x-rays with automatically generated large-scale dataset. arXiv preprint arXiv:2511.15186, 2025

  9. [9]

    MIMIC-CXR-Ext-ILS: Lesion Segmentation Masks and Instruction-Answer Pairs for Chest X-rays

    Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, and Edward Choi. MIMIC-CXR-Ext-ILS: Lesion Segmentation Masks and Instruction-Answer Pairs for Chest X-rays. PhysioNet, March 2026. doi: 10.13026/8ejy-4t06. URL https://doi.org/10.13026/8ejy-4t06. Version 1.0.0

  10. [10]

    Chexmask-u: Quantifying uncertainty in landmark-based anatomical segmentation for x-ray images.arXiv preprint arXiv:2512.10715, 2025

    Matias Cosarinsky, Nicolas Gaggion, Rodrigo Echeveste, and Enzo Ferrante. Chexmask-u: Quantifying uncertainty in landmark-based anatomical segmentation for x-ray images.arXiv preprint arXiv:2512.10715, 2025

  11. [11]

    Perceptual and interpretive error in diagnostic radiology—causes and potential solutions.Academic radiology, 26(6):833–845, 2019

    Andrew J Degnan, Emily H Ghobadi, Peter Hardy, Elizabeth Krupinski, Elena P Scali, Lindsay Stratchko, Adam Ulano, Eric Walker, Ashish P Wasnik, and William F Auffermann. Perceptual and interpretive error in diagnostic radiology—causes and potential solutions.Academic radiology, 26(6):833–845, 2019

  12. [12]

    Medrax: Medical reasoning agent for chest x-ray.arXiv preprint arXiv:2502.02673, 2025

    Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, and Bo Wang. Medrax: Medical reasoning agent for chest x-ray.arXiv preprint arXiv:2502.02673, 2025

  13. [13]

    Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  14. [14]

    Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering

    Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M Summers, and Yingying Zhu. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4156–...

  15. [15]

    Maira-1: A specialised large multimodal model for radiology report generation.arXiv preprint arXiv:2311.13668, 2023

    Stephanie L Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, et al. Maira-1: A specialised large multimodal model for radiology report generation.arXiv preprint arXiv:2311.13668, 2023

  16. [16]

    Hulu-med: A transparent generalist model towards holistic medical vision-language understanding.arXiv preprint arXiv:2510.08668, 2025

    Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding.arXiv preprint arXiv:2510.08668, 2025. 10

  17. [17]

    MIMIC-CXR-JPG - chest radiographs with structured labels.PhysioNet, March 2024

    Alistair Johnson, Matthew Lungren, Yifan Peng, Zhiyong Lu, Roger Mark, Seth Berkowitz, and Steven Horng. MIMIC-CXR-JPG - chest radiographs with structured labels.PhysioNet, March 2024. doi: 10.13026/jsn5-t979. URLhttps://doi.org/10.13026/jsn5-t979. Version 2.1.0

  18. [18]

    Mimic-cxr database

    Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr database. PhysioNet10, 13026(C2JT1Q):5, 2024

  19. [19]

    Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

  20. [20]

    A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

  21. [21]

    Cxrea- sonbench: A benchmark for evaluating structured diagnostic reasoning in chest x-rays.arXiv preprint arXiv:2505.18087, 2025

    Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, and Edward Choi. Cxrea- sonbench: A benchmark for evaluating structured diagnostic reasoning in chest x-rays.arXiv preprint arXiv:2505.18087, 2025

  22. [22]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Nau- mann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  23. [23]

    Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis

    Bo Liu, Ke Zou, Li-Ming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao- Ming Wu, and Huazhu Fu. Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21310–21320, 2025

  24. [24]

    Lunguage: A benchmark for structured and sequential chest x-ray interpretation.arXiv preprint arXiv:2505.21190, 2025

    Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Hyuk Gi Hong, Jung-Oh Lee, Hangyul Yoon, Eun Woo Doe, Jiyoun Kim, Harshita Sharma, et al. Lunguage: A benchmark for structured and sequential chest x-ray interpretation.arXiv preprint arXiv:2505.21190, 2025

  25. [25]

    Reasoning visual language model for chest x-ray analysis.arXiv preprint arXiv:2510.23968, 2025

    Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, et al. Reasoning visual language model for chest x-ray analysis.arXiv preprint arXiv:2510.23968, 2025

  26. [26]

    Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding

    Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, and Pranav Rajpurkar. Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding. InBiocomputing 2026: Proceedings of the Pacific Symposium, pages 251–264. World Scientific, 2025

  27. [27]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/ blog?id=qwen3.5

  28. [28]

    Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026

    Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026. URL https: //qwen.ai/blog?id=qwen3.6-27b

  29. [29]

    Interpretation of plain chest roentgenogram.Chest, 141(2):545–558, 2012

    Suhail Raoof, David Feigin, Arthur Sung, Sabiha Raoof, Lavanya Irugulpati, and Edward C Rosenow III. Interpretation of plain chest roentgenogram.Chest, 141(2):545–558, 2012

  30. [30]

    Accurate fine-grained segmentation of human anatomy in radiographs via volumetric pseudo-labeling.arXiv preprint arXiv:2306.03934, 2023

    Constantin Seibold, Alexander Jaus, Matthias A Fink, Moon Kim, Simon Reiß, Ken Herrmann, Jens Kleesiek, and Rainer Stiefelhagen. Accurate fine-grained segmentation of human anatomy in radiographs via volumetric pseudo-labeling.arXiv preprint arXiv:2306.03934, 2023

  31. [31]

    Medgemma technical report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  32. [32]

    Medgemma 1.5 technical report.arXiv preprint arXiv:2604.05081, 2026

    Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger, Fayaz Jamil, Madeleine Traverse, Alberto Tono, Bashir Sadjad, Lin Yang, Charles Lau, et al. Medgemma 1.5 technical report.arXiv preprint arXiv:2604.05081, 2026

  33. [33]

    Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 11

  35. [35]

    Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  36. [36]

    Internvl3

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  37. [37]

    Following the di- agnostic trace: Visual cognition-guided cooperative network for chest x-ray diagnosis.arXiv preprint arXiv:2602.21657, 2026

    Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang, and Jun Feng. Following the di- agnostic trace: Visual cognition-guided cooperative network for chest x-ray diagnosis.arXiv preprint arXiv:2602.21657, 2026

  38. [38]

    Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning.arXiv preprint arXiv:2506.07044, 2025

    Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning.arXiv preprint arXiv:2506.07044, 2025

  39. [39]

    Pmc- vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc- vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

  40. [40]

    None” fraction in Stage 3 is strictly capped below 50% by design, ensuring that a naive always-“None

    Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025. 12 A Benchmark Statistics A.1 Per-lesion composition Table 4 reports the per-lesion composition of CheXpercept. Each lesion contributes...

  41. [41]

    38 Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...