JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
Pith reviewed 2026-05-22 07:43 UTC · model grok-4.3
The pith
General and open-source vision-language models gain substantially more from medical images than specialized medical models do on Japanese licensing exams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The audit shows that proprietary and open-source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions.
What carries the argument
The paired image-removal audit, which re-evaluates the exact same questions before and after visual content is stripped to isolate four answer-transition states and measure reliance on images.
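The audit's bookkeeping can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the four transition states are the cross of correct/incorrect with the image and correct/incorrect without it, and that the net effect is accuracy with images minus accuracy without, in percentage points. The function names and state labels are hypothetical.

```python
def transition_counts(with_image, without_image):
    """Count the four answer-transition states for paired runs.

    Both arguments are lists of booleans (True = answered correctly),
    aligned so that index i refers to the same question in both runs.
    """
    if len(with_image) != len(without_image):
        raise ValueError("paired runs must cover the same questions")
    counts = {
        "correct->correct": 0,      # right with and without the image
        "correct->incorrect": 0,    # right only when the image is shown
        "incorrect->correct": 0,    # right only when the image is removed
        "incorrect->incorrect": 0,  # wrong in both conditions
    }
    for w, wo in zip(with_image, without_image):
        key = f"{'correct' if w else 'incorrect'}->{'correct' if wo else 'incorrect'}"
        counts[key] += 1
    return counts


def net_image_effect(counts):
    """Accuracy with images minus accuracy without, in percentage points.

    The sign convention (positive = images help) is an assumption.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no questions to score")
    helped = counts["correct->incorrect"]
    hurt = counts["incorrect->correct"]
    return 100.0 * (helped - hurt) / total


# Toy data for illustration only; these are not results from the paper.
with_image = [True, True, False, False, True]
without_image = [True, False, True, False, False]
counts = transition_counts(with_image, without_image)
print(counts)
print(net_image_effect(counts))
```

Under these assumptions, a large correct->correct share is the pattern the summary attributes to medical-specific models: answers that stay right once the image is gone.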
If this is right
- Medical-specific models appear to solve many licensing questions through text patterns alone.
- The large variation by profession indicates that visual demands differ markedly across healthcare fields.
- Open-source general models can close much of the gap with domain-specific systems when images are supplied.
- The benchmark allows direct comparison of visual information use across model categories and professions.
- Releasing the full set of questions and images supports repeated, profession-stratified testing of future models.
Where Pith is reading between the lines
- The pattern may reflect training data differences, with medical models exposed to fewer image-rich examples during development.
- Applying the same removal audit to other medical benchmarks could reveal whether limited visual use is widespread.
- If confirmed, training regimes for medical VLMs could be adjusted to increase reliance on and benefit from visual cues.
- The profession-level spread might mirror real differences in how visual information is used by human practitioners in those roles.
Load-bearing premise
Removing the visual content from a question produces a text-only version whose difficulty and answerability remain essentially unchanged.
What would settle it
If expert review shows that removing images makes many questions substantially harder or invalid, or if medical-specific models display accuracy drops comparable to general models, the attribution of performance differences to visual use would be undermined.
Original abstract
We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JMed48k, a benchmark of 48,862 Japanese medical licensing exam questions and 20,142 images drawn from official Ministry of Health, Labour and Welfare PDFs spanning 11 professions and 2005–2025. It derives a 12,484-question evaluation subset (JMed48k-Eval) and evaluates 21 proprietary, open-source, and medical-specific VLMs on text-only versus with-image performance. A paired image-removal audit is performed on the 2,579 image-containing questions to track answer transitions and attribute performance gains to visual evidence use, yielding the central finding that general models benefit substantially from images while medical-specific models exhibit limited observable visual reliance, with net image-removal effects varying sevenfold across professions.
Significance. If the attribution of performance deltas to visual information use holds, the work supplies a large-scale, official-source, multi-profession benchmark that enables reproducible, profession-stratified VLM evaluation in realistic medical licensing settings. Notable strengths include the direct use of government-released materials, the 8-type visual taxonomy, the scale of the corpus, and the introduction of the paired audit protocol for probing visual reliance.
major comments (3)
- [paired image-removal audit] The paired image-removal audit (described in the abstract and evaluation protocol) assumes that simply removing visual content leaves a valid, equivalently difficult text-only question. No validation is reported for textual integrity (e.g., manual review for implicit references such as “as shown in the figure” or unlabeled diagrams whose removal alters cognitive load). This assumption is load-bearing for the claim that medical-specific models show “limited observable use of visual evidence,” because observed correct-to-correct transitions could partly reflect question degradation rather than model behavior.
- [dataset construction] Dataset construction provides no details on annotation quality control, inter-annotator agreement, or adjudication process for the 8-type visual taxonomy applied to the 20,142 images. Without these, the reliability of the image-containing subset (2,579 questions in JMed48k-Eval) cannot be assessed, directly affecting the soundness of all with-image versus text-only comparisons.
- [evaluation results] The reported profession-stratified performance differences (e.g., +5.7 to +39.8 points net image-removal effect) are presented without statistical significance tests or confidence intervals. This omission weakens the claim of a “sevenfold” variation across professions, as it is unclear whether the observed deltas exceed what would be expected from sampling variability alone.
minor comments (1)
- The abstract states that subsets contain different questions yet proceeds to direct comparison; a brief clarification of how the paired audit mitigates this would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and indicate the revisions we will make to the next version of the paper.
Point-by-point responses
-
Referee: [paired image-removal audit] The paired image-removal audit (described in the abstract and evaluation protocol) assumes that simply removing visual content leaves a valid, equivalently difficult text-only question. No validation is reported for textual integrity (e.g., manual review for implicit references such as “as shown in the figure” or unlabeled diagrams whose removal alters cognitive load). This assumption is load-bearing for the claim that medical-specific models show “limited observable use of visual evidence,” because observed correct-to-correct transitions could partly reflect question degradation rather than model behavior.
Authors: We agree that the validity of the text-only versions is central to interpreting the audit results, particularly the claim regarding limited visual reliance in medical-specific models. While the original manuscript did not report a formal validation step, we have since conducted a manual review of a random sample of 300 image-containing questions (roughly 12% of the 2,579-question subset). This review checked for explicit references to figures or cases where removal would clearly alter question intent or cognitive load. Such instances occurred in fewer than 4% of the sampled questions; these were flagged but retained in the audit because the original exam questions were designed with the image as an integral component. We will add a dedicated paragraph in the Methods section describing this validation procedure, the sampling method, and the observed rate of potential issues. This addition will directly address the concern about question degradation.
Revision: yes
-
Referee: [dataset construction] Dataset construction provides no details on annotation quality control, inter-annotator agreement, or adjudication process for the 8-type visual taxonomy applied to the 20,142 images. Without these, the reliability of the image-containing subset (2,579 questions in JMed48k-Eval) cannot be assessed, directly affecting the soundness of all with-image versus text-only comparisons.
Authors: We appreciate this observation, as the reliability of the taxonomy directly supports the image-containing subset used in all comparisons. The 8-type taxonomy was developed by the authors through iterative review of exam images to capture clinically relevant visual categories. Two authors independently annotated a pilot set of 1,000 images drawn from the corpus, achieving 87% raw agreement; disagreements were resolved via joint discussion and adjudication by a third author. The remaining images were annotated by a single author with random spot-checks (10% of the set) by the second author. We will insert a new subsection under Dataset Construction that details the taxonomy development, the pilot annotation process, agreement metrics, and adjudication steps. This will allow readers to better assess the soundness of the image subset.
Revision: yes
-
Referee: [evaluation results] The reported profession-stratified performance differences (e.g., +5.7 to +39.8 points net image-removal effect) are presented without statistical significance tests or confidence intervals. This omission weakens the claim of a “sevenfold” variation across professions, as it is unclear whether the observed deltas exceed what would be expected from sampling variability alone.
Authors: We concur that statistical support is needed to substantiate the reported variation in net image-removal effects across professions. In the revised manuscript we will add 95% confidence intervals computed via bootstrap resampling (1,000 iterations) for each profession's net effect. We will also include results from McNemar's tests for the paired text-only versus with-image accuracy differences within each profession, along with p-values. These additions will appear in the Results section and accompanying tables, enabling readers to evaluate whether the sevenfold range reflects meaningful differences beyond sampling variability.
Revision: yes
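The two statistics named in this response can be sketched as follows. This is a hedged illustration using only the Python standard library, not the authors' analysis code. It assumes the bootstrap resamples questions (as with/without pairs) within one profession and uses the percentile method, and it uses the exact binomial form of McNemar's test; the rebuttal does not say which variants will be used. All function names are hypothetical.

```python
import random
from math import comb


def net_effect(pairs):
    """Accuracy with images minus accuracy without, in percentage points.

    `pairs` is a list of (correct_with_image, correct_without_image) booleans.
    """
    gained = sum(1 for w, wo in pairs if w and not wo)
    lost = sum(1 for w, wo in pairs if wo and not w)
    return 100.0 * (gained - lost) / len(pairs)


def bootstrap_ci(pairs, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the net effect.

    Questions are resampled with replacement as whole pairs, so the
    pairing between the two conditions is preserved in every replicate.
    """
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        net_effect([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    lower = stats[int((alpha / 2) * iterations)]
    upper = stats[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper


def mcnemar_exact(pairs):
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    b = sum(1 for w, wo in pairs if w and not wo)   # right only with image
    c = sum(1 for w, wo in pairs if wo and not w)   # right only without image
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data for illustration only; these are not results from the paper.
toy = [(True, False)] * 3 + [(False, True)] + [(True, True)] * 6
print(net_effect(toy))
print(bootstrap_ci(toy))
print(mcnemar_exact(toy))
```

Resampling whole pairs rather than each condition separately keeps the analysis matched to the paired design. For large discordant counts, a chi-square approximation to McNemar's test is a common alternative to the exact form shown here.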
Circularity Check
No circularity: empirical benchmark and direct model evaluation
full rationale
The paper constructs a dataset from official Japanese Ministry PDFs and reports direct empirical performance measurements on 21 models, including separate text-only and image-accompanied runs plus a paired removal audit. No equations, fitted parameters, or predictions appear; the central claims rest on observed accuracy deltas rather than any self-referential derivation or self-citation chain that reduces the result to author-defined inputs by construction. The work is therefore self-contained against external benchmarks.