pith. sign in

arxiv: 2605.22080 · v1 · pith:M4JITNJJnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Pith reviewed 2026-05-22 07:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords JMed48kJapanese medical licensingvision-language modelsimage-removal auditmultimodal medical evaluationhealthcare benchmarksmodel visual reliance
0
0 comments X

The pith

General and open-source vision-language models gain substantially more from medical images than specialized medical models do on Japanese licensing exams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JMed48k, a benchmark drawn from official Japanese Ministry exams containing 48,862 questions and 20,142 images spanning 11 healthcare professions from 2005 to 2025. It creates an evaluation subset of recent questions and tests 21 models both with images present and in text-only form. A paired audit then removes the images from the visual questions and tracks how each model's answers change. Proprietary and open-source models show clear performance lifts when images are available, while medical-specific models often keep their correct answers even after images are stripped away. The size of the image benefit differs sharply by profession, reaching a sevenfold spread between the smallest and largest effects observed.

Core claim

The audit shows that proprietary and open source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions.

What carries the argument

The paired image-removal audit, which re-evaluates the exact same questions before and after visual content is stripped to isolate four answer-transition states and measure reliance on images.

If this is right

  • Medical-specific models appear to solve many licensing questions through text patterns alone.
  • The large variation by profession indicates that visual demands differ markedly across healthcare fields.
  • Open-source general models can close much of the gap with domain-specific systems when images are supplied.
  • The benchmark allows direct comparison of visual information use across model categories and professions.
  • Releasing the full set of questions and images supports repeated, profession-stratified testing of future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern may reflect training data differences, with medical models exposed to fewer image-rich examples during development.
  • Applying the same removal audit to other medical benchmarks could reveal whether limited visual use is widespread.
  • If confirmed, training regimes for medical VLMs could be adjusted to increase reliance on and benefit from visual cues.
  • The profession-level spread might mirror real differences in how visual information is used by human practitioners in those roles.

Load-bearing premise

Removing the visual content from a question produces a text-only version whose difficulty and answerability remain essentially unchanged.

What would settle it

If expert review shows that removing images makes many questions substantially harder or invalid, or if medical-specific models display accuracy drops comparable to general models, the attribution of performance differences to visual use would be undermined.

Figures

Figures reproduced from arXiv: 2605.22080 by Bowen Zhao, Irene Li, Junyu Liu, Kan Hatakeyama-Sato, Qian Niu, Shujun Wang, Xinyi Wang, Yue Xun, Yusuke Iwasawa, Yutaka Matsuo, Zequn Zhang, Zheng Yuan, Zirui Li.

Figure 1
Figure 1. Figure 1: Overview of JMed48k benchmark. Left: JMed48k aggregates 48,862 official questions from [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Paired image-removal audit. Each row decomposes a model’s predictions on the 2,579 questions with images into four paired outcomes: correct with and without images (p11), correct only with images (p10), correct only after image removal (p01), and incorrect in both settings (p00). The right column reports the net image effect, with negative values shown in red. Per-model counts are provided in Appendix G.1.… view at source ↗
Figure 3
Figure 3. Figure 3: Profession-level visual reliance. (a) shows the four paired answer-transition states by profession, pooled over the 20 multimodal-capable models. State p11 counts items answered correctly in both settings, p10 those answered correctly only with the image, p01 those answered correctly only after image removal, and p00 those answered incorrectly in both. Per-profession item counts nq are also reported. (b) s… view at source ↗
Figure 4
Figure 4. Figure 4: Per-profession combined accuracy across the 5 covered exam years, with MHLW per-year pass-thresholds overlaid. Each panel reports one profession; the y-axis is the combined (text-only + with-images) accuracy for that (model, profession, year), pooled across both modes. Year coverage differs by profession (Public Health Nurse, Midwife, and Nurse cover 2020–2024; the other eight cover 2021–2025), reflecting … view at source ↗
Figure 5
Figure 5. Figure 5: Per-(model, profession) text-only versus with-images accuracy gap on JMed48k￾Eval. Each cell reports text-only accuracy minus with-images accuracy in percentage points for the corresponding model and profession. Because the two subsets contain different questions, this matrix should be interpreted as a subset-difficulty gap rather than a causal image-removal effect. Rows are the 20 multimodal-capable model… view at source ↗
Figure 6
Figure 6. Figure 6: Per-(model, image type) ∆img heatmap. Rows are the 20 multimodal-capable models grouped by family band (Proprietary, Open-source, Medical-specific). Columns are the eight primary image types defined in the v3.0 taxonomy (§C), ordered by sample size descending; the small Other/Unclear category (nq = 4) is omitted. Colour encodes ∆img in percentage points on a diverging RdBu scale clipped to ±30 pp; red cell… view at source ↗
Figure 7
Figure 7. Figure 7: Accuracy on image-required (IO) items, with random baseline. Each bar is one model’s awith on the nIO = 185 MCQ items in the IO subset. The dashed line marks the per-item random baseline 1/k = 20.4%; the shaded zone to its left is the below-random region. Bar annotation gives the signed gap awith−random in percentage points. Any bar to the right of the baseline exceeds what the question stem alone can yiel… view at source ↗
Figure 8
Figure 8. Figure 8: reports per-model accuracy on the 52 pharma-chemistry IO items introduced in §I. Each item presents 2D structural formulas or reaction schemes as the image options; no language prior can map the question text to the correct chemical structure without reading the image. 0 2 4 6 8 10 12 14 16 18 Accuracy on pharma-chemistry IO items (%) Gemini 2.5 Pro Grok 4.20 GPT-5 Lingshu-I 8B GLM-4.6V 9B GPT-5 mini Claud… view at source ↗
read the original abstract

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces JMed48k, a benchmark of 48,862 Japanese medical licensing exam questions and 20,142 images drawn from official Ministry of Health, Labour and Welfare PDFs spanning 11 professions and 2005–2025. It derives a 12,484-question evaluation subset (JMed48k-Eval) and evaluates 21 proprietary, open-source, and medical-specific VLMs on text-only versus with-image performance. A paired image-removal audit is performed on the 2,579 image-containing questions to track answer transitions and attribute performance gains to visual evidence use, yielding the central finding that general models benefit substantially from images while medical-specific models exhibit limited observable visual reliance, with net image-removal effects varying sevenfold across professions.

Significance. If the attribution of performance deltas to visual information use holds, the work supplies a large-scale, official-source, multi-profession benchmark that enables reproducible, profession-stratified VLM evaluation in realistic medical licensing settings. Notable strengths include the direct use of government-released materials, the 8-type visual taxonomy, the scale of the corpus, and the introduction of the paired audit protocol for probing visual reliance.

major comments (3)
  1. [paired image-removal audit] The paired image-removal audit (described in the abstract and evaluation protocol) assumes that simply removing visual content leaves a valid, equivalently difficult text-only question. No validation is reported for textual integrity (e.g., manual review for implicit references such as “as shown in the figure” or unlabeled diagrams whose removal alters cognitive load). This assumption is load-bearing for the claim that medical-specific models show “limited observable use of visual evidence,” because observed correct-to-correct transitions could partly reflect question degradation rather than model behavior.
  2. [dataset construction] Dataset construction provides no details on annotation quality control, inter-annotator agreement, or adjudication process for the 8-type visual taxonomy applied to the 20,142 images. Without these, the reliability of the image-containing subset (2,579 questions in JMed48k-Eval) cannot be assessed, directly affecting the soundness of all with-image versus text-only comparisons.
  3. [evaluation results] The reported profession-stratified performance differences (e.g., +5.7 to +39.8 points net image-removal effect) are presented without statistical significance tests or confidence intervals. This omission weakens the claim of a “sevenfold” variation across professions, as it is unclear whether the observed deltas exceed what would be expected from sampling variability alone.
minor comments (1)
  1. The abstract states that subsets contain different questions yet proceeds to direct comparison; a brief clarification of how the paired audit mitigates this would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and indicate the revisions we will make to the next version of the paper.

read point-by-point responses
  1. Referee: [paired image-removal audit] The paired image-removal audit (described in the abstract and evaluation protocol) assumes that simply removing visual content leaves a valid, equivalently difficult text-only question. No validation is reported for textual integrity (e.g., manual review for implicit references such as “as shown in the figure” or unlabeled diagrams whose removal alters cognitive load). This assumption is load-bearing for the claim that medical-specific models show “limited observable use of visual evidence,” because observed correct-to-correct transitions could partly reflect question degradation rather than model behavior.

    Authors: We agree that the validity of the text-only versions is central to interpreting the audit results, particularly the claim regarding limited visual reliance in medical-specific models. While the original manuscript did not report a formal validation step, we have since conducted a manual review of a random sample of 300 image-containing questions (roughly 12% of the 2,579-question subset). This review checked for explicit references to figures or cases where removal would clearly alter question intent or cognitive load. Such instances occurred in fewer than 4% of the sampled questions; these were flagged but retained in the audit because the original exam questions were designed with the image as an integral component. We will add a dedicated paragraph in the Methods section describing this validation procedure, the sampling method, and the observed rate of potential issues. This addition will directly address the concern about question degradation. revision: yes

  2. Referee: [dataset construction] Dataset construction provides no details on annotation quality control, inter-annotator agreement, or adjudication process for the 8-type visual taxonomy applied to the 20,142 images. Without these, the reliability of the image-containing subset (2,579 questions in JMed48k-Eval) cannot be assessed, directly affecting the soundness of all with-image versus text-only comparisons.

    Authors: We appreciate this observation, as the reliability of the taxonomy directly supports the image-containing subset used in all comparisons. The 8-type taxonomy was developed by the authors through iterative review of exam images to capture clinically relevant visual categories. Two authors independently annotated a pilot set of 1,000 images drawn from the corpus, achieving 87% raw agreement; disagreements were resolved via joint discussion and adjudication by a third author. The remaining images were annotated by a single author with random spot-checks (10% of the set) by the second author. We will insert a new subsection under Dataset Construction that details the taxonomy development, the pilot annotation process, agreement metrics, and adjudication steps. This will allow readers to better assess the soundness of the image subset. revision: yes

  3. Referee: [evaluation results] The reported profession-stratified performance differences (e.g., +5.7 to +39.8 points net image-removal effect) are presented without statistical significance tests or confidence intervals. This omission weakens the claim of a “sevenfold” variation across professions, as it is unclear whether the observed deltas exceed what would be expected from sampling variability alone.

    Authors: We concur that statistical support is needed to substantiate the reported variation in net image-removal effects across professions. In the revised manuscript we will add 95% confidence intervals computed via bootstrap resampling (1,000 iterations) for each profession's net effect. We will also include results from McNemar's tests for the paired text-only versus with-image accuracy differences within each profession, along with p-values. These additions will appear in the Results section and accompanying tables, enabling readers to evaluate whether the sevenfold range reflects meaningful differences beyond sampling variability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and direct model evaluation

full rationale

The paper constructs a dataset from official Japanese Ministry PDFs and reports direct empirical performance measurements on 21 models, including separate text-only and image-accompanied runs plus a paired removal audit. No equations, fitted parameters, or predictions appear; the central claims rest on observed accuracy deltas rather than any self-referential derivation or self-citation chain that reduces the result to author-defined inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution rests on the assumption that official Japanese licensing exam PDFs constitute a representative and high-quality ground truth for medical knowledge and visual interpretation tasks.

pith-pipeline@v0.9.0 · 5833 in / 1065 out tokens · 45363 ms · 2026-05-22T07:43:30.567146+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 12 internal anchors

  1. [1]

    Agrawal, D

    A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4971–4980, 2018

  2. [2]

    System card: Claude opus 4 & claude sonnet 4

    Anthropic. System card: Claude opus 4 & claude sonnet 4. Technical report, Anthropic, May 2025. URL https://www-cdn.anthropic.com/ 4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf. Stable landing page: https://www.anthropic.com/claude-4-system-card

  3. [3]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, et al. Qwen3-VL technical report, 2025. URL https://arxiv.org/abs/2511.21631

  4. [4]

    Ben Abacha, S

    A. Ben Abacha, S. A. Hasan, V . V . Datla, J. Liu, D. Demner-Fushman, and H. Müller. VQA- Med: Overview of the medical visual question answering task at ImageCLEF 2019. InCLEF 2019 Working Notes, volume 2380 ofCEUR Workshop Proceedings. CEUR-WS.org, 2019. URLhttps://ceur-ws.org/Vol-2380/paper_272.pdf

  5. [5]

    J. Chen, C. Gui, R. Ouyang, A. Gao, S. Chen, G. H. Chen, X. Wang, Z. Cai, K. Ji, X. Wan, and B. Wang. Towards injecting medical visual knowledge into multimodal LLMs at scale. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7346–7370. Association for Computational Linguistics, 2024. doi: 10.18653/ v...

  6. [6]

    P. Chen, J. Ye, G. Wang, Y . Li, Z. Deng, W. Li, T. Li, H. Duan, Z. Huang, Y . Su, B. Wang, S. Zhang, B. Fu, J. Cai, B. Zhuang, E. J. Seibel, J. He, and Y . Qiao. GMAI-MMBench: A comprehensive multimodal evaluation benchmark towards general medical AI. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

  7. [7]

    B. Choi, S. Bae, S. Kweon, and E. Choi. KorMedMCQA-V: A multimodal benchmark for evaluating vision-language models on the Korean medical licensing examination, 2026. URL https://arxiv.org/abs/2602.13650

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

  9. [9]

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633–638, 2025. doi: 10.1038/s41586-025-09422-z. Preprint: https://arxiv.org/abs/2501.12948

  10. [10]

    C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y . Wu, R. Ji, C. Shan, and R. He. MME: A comprehensive evaluation benchmark for multimodal large language models, 2023. URLhttps://arxiv.org/abs/2306.13394

  11. [11]

    Gemma 4 model card, Apr

    Google DeepMind. Gemma 4 model card, Apr. 2026. URL https://ai.google.dev/ gemma/docs/core/model_card_4. Launch announcement at https://blog.google/ innovation-and-ai/technology/developers-tools/gemma-4/

  12. [12]

    Goyal, T

    Y . Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913, 2017. 10

  13. [13]

    T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y . Yacoob, D. Manocha, and T. Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14375–14385, 2024

  14. [14]

    X. He, Y . Zhang, L. Mou, E. Xing, and P. Xie. PathVQA: 30000+ questions for medical visual question answering, 2020. URLhttps://arxiv.org/abs/2003.10286

  15. [15]

    W. Hong, W. Yu, X. Gu, and GLM-V Team. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025. URL https: //arxiv.org/abs/2507.01006

  16. [16]

    Y . Hu, T. Li, Q. Lu, W. Shao, J. He, Y . Qiao, and P. Luo. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22170–22183, 2024

  17. [17]

    Jiang, J

    J. Jiang, J. Huang, and A. Aizawa. JMedBench: A benchmark for evaluating Japanese biomedi- cal large language models. InProceedings of the 31st International Conference on Computa- tional Linguistics (COLING), pages 5918–5935. Association for Computational Linguistics,

  18. [18]

    URLhttps://aclanthology.org/2025.coling-main.395/

  19. [19]

    Jiang, Y

    S. Jiang, Y . Wang, S. Song, T. Hu, C. Zhou, B. Pu, Y . Zhang, Z. Yang, Y . Feng, J. T. Zhou, et al. Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding, 2025. URLhttps://arxiv.org/abs/2510.08668

  20. [20]

    D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421

  21. [21]

    Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577. Association for Computational Linguistics,

  22. [22]

    doi: 10.18653/v1/D19-1259

  23. [23]

    Kasai, Y

    J. Kasai, Y . Kasai, K. Sakaguchi, Y . Yamada, and D. Radev. Evaluating GPT-4 and ChatGPT on Japanese medical licensing examinations, 2023. URL https://arxiv.org/abs/2303. 18027

  24. [24]

    T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. De Leon, C. Elepaño, M. Madriaga, R. Aggabao, G. Diaz-Candido, J. Maningo, and V . Tseng. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.PLOS Digital Health, 2(2):e0000198, 2023. doi: 10.1371/journal.pdig.0000198

  25. [25]

    Kweon, B

    S. Kweon, B. Choi, G. Chu, J. Song, D. Hyeon, S. Gan, J. Kim, M. Kim, R. W. Park, and E. Choi. KorMedMCQA: Multi-choice question answering benchmark for Korean healthcare professional licensing examinations, 2024. URLhttps://arxiv.org/abs/2403.01469

  26. [26]

    LASA Team, W. Xu, H. P. Chan, L. Li, M. Aljunied, R. Yuan, J. Wang, C. Xiao, G. Chen, C. Liu, Z. Li, Y . Sun, J. Shen, C. Wang, J. Tan, D. Zhao, T. Xu, H. Zhang, and Y . Rong. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning,

  27. [27]

    URLhttps://arxiv.org/abs/2506.07044

  28. [28]

    J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific Data, 5(1):180251, 2018. doi: 10.1038/sdata.2018.251

  29. [29]

    B. Li, Y . Ge, Y . Ge, G. Wang, R. Wang, R. Zhang, and Y . Shan. SEED-Bench: Benchmarking multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13299–13308, 2024. 11

  30. [30]

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao. LLaV A-Med: Training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

  31. [31]

    J. Li, S. Zhong, and K. Chen. MLEC-QA: A Chinese multi-choice biomedical question answering dataset. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8862–8874. Association for Computational Linguistics,

  32. [32]

    doi: 10.18653/v1/2021.emnlp-main.698

  33. [33]

    Y . Li, Y . Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 292–305. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.20. URL https://aclanthology. org/2023....

  34. [34]

    Liu, L.-M

    B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y . Yang, and X.-M. Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. InProceedings of the IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021. doi: 10.1109/ISBI48211.2021.9434010

  35. [35]

    J. Liu, P. Zhou, Y . Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, and M. L. Li. Benchmarking large language models on CMExam—A comprehensive Chinese medical exam dataset. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

  36. [36]

    J. Liu, L. K. Q. Yan, T. Wang, Q. Niu, M. Nagai-Tanima, and T. Aoyama. KokushiMD-10: Benchmark for evaluating large language models on ten Japanese national healthcare licensing examinations. InAI for Clinical Applications: First International Workshops, Agentic AI 2025, CREATE 2025, and Clinical MLLMs 2025, Held in Conjunction with MICCAI 2025, Lecture N...

  37. [37]

    Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. MMBench: Is your multi-modal model an all-around player? InComputer Vision – ECCV 2024, volume 15064 ofLecture Notes in Computer Science, pages 216–233, Cham, 2024. Springer. doi: 10.1007/978-3-031-72658-3_13

  38. [38]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  39. [39]

    Matos, S

    J. Matos, S. Chen, S. K. V . Placino, Y . Li, J. C. C. Pardo, D. Idan, T. Tohyama, D. Restrepo, L. F. Nakayama, J. M. M. Pascual-Leone, G. K. Savova, H. Aerts, L. A. Celi, A.-K. I. Wong, D. Bitterman, and J. Gallifant. WorldMedQA-V: A multilingual, multimodal medical exami- nation dataset for multimodal language models evaluation. InFindings of the Associ...

  40. [40]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, Apr

    Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, Apr. 2025. URLhttps://ai.meta.com/blog/llama-4-multimodal-intelligence/

  41. [41]

    M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y . Dalmia, J. Leskovec, C. Zakka, E. P. Reis, and P. Rajpurkar. Med-Flamingo: A multimodal medical few-shot learner. InProceedings of the 3rd Machine Learning for Health Symposium (ML4H), volume 225 ofProceedings of Machine Learning Research, pages 353–367. PMLR, 2023

  42. [42]

    H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz. Capabilities of GPT-4 on medical challenge problems, 2023. URLhttps://arxiv.org/abs/2303.13375

  43. [43]

    S. Ono, I. Sukeda, T. Fujii, K. Buma, and S. Sasaki. A Japanese language model and three new evaluation benchmarks for pharmaceutical NLP. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific 12 Chapter of the Association for Computational Linguistics (IJCNLP-AACL), pages 1316–1...

  44. [44]

    OpenAI GPT-5 System Card

    OpenAI. OpenAI GPT-5 system card, 2026. URL https://arxiv.org/abs/2601.03267. arXiv v1 December 2025; v2 May 2026. Official system card also available at https:// openai.com/index/gpt-5-system-card/

  45. [45]

    A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi- choice dataset for medical domain question answering. InProceedings of the Conference on Health, Inference, and Learning (CHIL), volume 174 ofProceedings of Machine Learning Research, pages 248–260. PMLR, 2022

  46. [46]

    Royer, B

    C. Royer, B. Menze, and A. Sekuboyina. MultiMedEval: A benchmark and a toolkit for evaluating medical vision-language models. InProceedings of the 7th International Conference on Medical Imaging with Deep Learning (MIDL), volume 250 ofProceedings of Machine Learning Research, pages 1310–1327. PMLR, 2024. URL https://proceedings.mlr. press/v250/royer24a.html

  47. [47]

    MedGemma Technical Report

    A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, et al. MedGemma technical report, 2026. URL https://arxiv.org/abs/2507.05201. arXiv v1 July 2025; latest revision April 2026

  48. [48]

    Singhal, S

    K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, A. Babiker, N. Schärli, A. Chowdhery, P. Mansfield, D. Demner-Fushman, B. Agüera y Arcas, D. Webster, G. S. Cor- rado, Y . Matias, K. Chou, J. Gottweis, N. Tomasev, Y . Liu, A. Rajkomar, J. Barral,...

  49. [49]

    Singhal, T

    K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewis, D. Neal, Q. M. Rashid, M. Schaekermann, A. Wang, D. Dash, J. H. Chen, N. H. Shah, S. Lachgar, P. A. Mansfield, S. Prakash, B. Green, E. Dominowska, B. Agüera y Arcas, N. Tomasev, Y . Liu, R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral, D. R. Web...

  50. [50]

    Y . Sun, H. Wu, C. Zhu, S. Zheng, Q. Chen, K. Zhang, Y . Zhang, D. Wan, X. Lan, M. Zheng, J. Li, X. Lyu, T. Lin, and L. Yang. PathMMU: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. InComputer Vision – ECCV 2024, volume 15094 ofLecture Notes in Computer Science, pages 56–73, Cham, 2024. Springer. doi: 10.1007/978...

  51. [51]

    S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal LLMs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9568–9578, 2024

  52. [52]

    B. Wang, C. Xu, X. Zhao, L. Ouyang, F. Wu, Z. Zhao, R. Xu, K. Liu, Y . Qu, F. Shang, B. Zhang, L. Wei, Z. Sui, W. Li, B. Shi, Y . Qiao, D. Lin, and C. He. MinerU: An open-source solution for precise document content extraction, 2024. URLhttps://arxiv.org/abs/2409.18839

  53. [53]

    C. Wu, X. Zhang, Y . Zhang, Y . Wang, and W. Xie. Towards generalist foundation model for radiology by leveraging web-scale 2D & 3D medical data, 2023. URL https://arxiv.org/ abs/2308.02463

  54. [54]

    Grok 4 model card

    xAI. Grok 4 model card. Technical report, xAI, Aug. 2025. URL https://data.x.ai/ 2025-08-20-grok-4-model-card.pdf

  55. [55]

    J. Xie, Y . Yu, Z. Zhang, S. Zeng, J. He, A. Vasireddy, X. Tang, C. Guo, L. Zhao, C. Jing, G. An, and D. Xu. TCM-Ladder: A benchmark for multimodal question answering on traditional Chinese medicine. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. 13

  56. [56]

    W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. InProceedings of the 41st International Conference on Machine Learning (ICML), volume 235 ofProceedings of Machine Learning Research, pages 57730–57754. PMLR, 2024

  57. [57]

    X. Yue, Y . Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y . Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y . Liu, W. Huang, H. Sun, Y . Su, and W. Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. InProceedings of the IEEE/CVF Conference on Computer Vision...

  58. [58]

    GLM-4.6V: Vision-language model

    Z.AI. GLM-4.6V: Vision-language model. Hugging Face Model Hub, 2025. URL https: //huggingface.co/zai-org/GLM-4.6V. Hugging Face model card

  59. [59]

    PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

    X. Zhang, C. Wu, Z. Zhao, W. Lin, Y . Zhang, Y . Wang, and W. Xie. PMC-VQA: Visual instruction tuning for medical visual question answering, 2023. URL https://arxiv.org/ abs/2305.10415

  60. [60]

    No.X (Y Problem Z)

    T. Zhou, Y . Xu, Y . Zhu, C. Xiao, H. Bian, L. Wei, and X. Zhang. DrVD-Bench: Do vision- language models reason like human doctors in medical image diagnosis? InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. 14 A Dataset Construction and Quality Control This appendix documents the construction of JMed48k f...

  61. [61]

    Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...