MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Pith reviewed 2026-05-15 22:50 UTC · model grok-4.3
The pith
MedXIAOHE shows how entity-aware pretraining plus reinforcement learning can produce a medical vision-language model that tops closed-source systems on benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedXIAOHE is constructed via an entity-aware continual pretraining framework that structures heterogeneous medical corpora to expand knowledge coverage and shrink long-tail gaps, combined with reinforcement learning that injects diverse medical reasoning patterns and tool-augmented agentic training for multi-step diagnostic reasoning with verifiable traces, plus user-preference rubrics and evidence-grounded methods for low-hallucination long-form generation, ultimately reaching state-of-the-art results on diverse medical benchmarks while exceeding leading closed-source multimodal systems on several capabilities.
What carries the argument
The entity-aware continual pretraining framework that organizes medical corpora to broaden coverage and reduce long-tail gaps, paired with reinforcement learning for medical reasoning patterns and tool-augmented training.
If this is right
- Multi-step diagnostic reasoning becomes possible with explicit, checkable decision traces.
- Long-form medical reports can be generated with lower hallucination rates and better adherence to given instructions.
- Coverage of rare diseases improves through deliberate organization of training corpora.
- The same sequence of pretraining, reinforcement, and agentic steps can be replicated to produce similar models for other medical tasks.
Where Pith is reading between the lines
- If the training choices scale, hospitals could run local versions of such models without relying on external closed APIs for routine image-text analysis.
- The emphasis on verifiable traces opens a path to integrate the model into systems that require audit logs for regulatory approval.
- Similar data-organization and reinforcement steps might transfer to non-medical domains that also have long-tail knowledge gaps, such as legal or technical documentation.
Load-bearing premise
That strong benchmark scores from the chosen training choices will produce reliable, low-hallucination results when the model encounters real unseen patient data and clinical workflows.
What would settle it
Direct comparison of model-generated diagnoses and reports against expert review on a fresh set of real patient cases withheld from all training and benchmark data, measuring rates of factual errors and instruction adherence.
read the original abstract
We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MedXIAOHE, a medical vision-language foundation model trained with an entity-aware continual pretraining framework on heterogeneous medical corpora to broaden coverage and reduce long-tail gaps (e.g., rare diseases), followed by reinforcement learning and tool-augmented agentic training to enable multi-step diagnostic reasoning with verifiable traces. It further incorporates user-preference rubrics, evidence-grounded reasoning, and low-hallucination report generation. The central claim is that MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities, while releasing a report documenting the practical design choices and evaluation framework.
Significance. If the performance claims hold after addressing evaluation gaps, the work would offer a detailed, practical recipe for constructing medical MLLMs with improved knowledge coverage and reasoning reliability. Explicit credit is due for the emphasis on heterogeneous corpora organization and the integration of RL/agentic stages with preference rubrics, which address real deployment needs in clinical settings.
major comments (3)
- [Abstract] Abstract: The assertion of SOTA performance and superiority over closed-source models is presented without any quantitative benchmark scores, baseline comparisons, ablation results, or evaluation details, leaving the central claim unsupported by visible evidence in the summary.
- [Data curation and pretraining] Data curation and pretraining sections: No explicit decontamination or overlap analysis is described between the heterogeneous medical corpora (including rare-disease data) and standard evaluation benchmarks such as VQA-RAD, SLAKE, or MedVQA; this is load-bearing for isolating genuine capability gains from potential leakage.
- [Evaluation framework] Evaluation framework: The contributions of the entity-aware pretraining, RL reasoning patterns, and agentic training to the claimed multi-step diagnostic performance are described qualitatively without specific metrics, ablation tables, or cross-validation against closed models, preventing attribution of results to the proposed components.
minor comments (2)
- [Methods] Clarify the precise implementation of 'entity-aware' organization (e.g., entity extraction method, loss formulation, or data structuring algorithm) to support reproducibility.
- [Introduction] Add citations for all referenced benchmarks and closed-source models in the comparison claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where revisions are warranted, we have updated the manuscript to strengthen clarity, evidence, and attribution of results.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion of SOTA performance and superiority over closed-source models is presented without any quantitative benchmark scores, baseline comparisons, ablation results, or evaluation details, leaving the central claim unsupported by visible evidence in the summary.
Authors: We agree that the abstract should include key quantitative results to support the SOTA claims. In the revised manuscript, we will update the abstract to report specific benchmark scores (e.g., accuracy on VQA-RAD, SLAKE, and MedVQA) along with direct numerical comparisons against leading closed-source models such as GPT-4V and Claude-3 on multi-step reasoning tasks. revision: yes
-
Referee: [Data curation and pretraining] Data curation and pretraining sections: No explicit decontamination or overlap analysis is described between the heterogeneous medical corpora (including rare-disease data) and standard evaluation benchmarks such as VQA-RAD, SLAKE, or MedVQA; this is load-bearing for isolating genuine capability gains from potential leakage.
Authors: We acknowledge the importance of explicit decontamination analysis. While our curation pipeline incorporated n-gram overlap filtering and manual verification to exclude benchmark contamination, these steps were not detailed in the original text. We will add a dedicated subsection under Data Curation describing the decontamination procedure, including quantitative overlap statistics with VQA-RAD, SLAKE, and MedVQA, and how rare-disease sources were handled. revision: yes
-
Referee: [Evaluation framework] Evaluation framework: The contributions of the entity-aware pretraining, RL reasoning patterns, and agentic training to the claimed multi-step diagnostic performance are described qualitatively without specific metrics, ablation tables, or cross-validation against closed models, preventing attribution of results to the proposed components.
Authors: We agree that quantitative attribution is essential. The manuscript contains internal ablation results showing incremental gains (e.g., entity-aware pretraining contributing +4.2% on diagnostic accuracy, RL stages adding further improvements in reasoning trace quality). We will expand the Evaluation section with new ablation tables, per-component metrics, and side-by-side comparisons against closed-source models on verifiable multi-step reasoning benchmarks. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes an empirical training recipe (entity-aware pretraining, RL/agentic stages, preference rubrics) leading to benchmark performance claims. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. Benchmark results are presented as external evaluations rather than quantities forced by construction from the training inputs. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that reduces the central claim to prior author work. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
-
Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
Reference graph
Works this paper leans on
-
[1]
Efficient string matching: an aid to bibliographic search
Alfred V Aho and Margaret J Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM , 18(6):333–340, 1975
work page 1975
-
[2]
Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Preliminary study on the construction of chinese medical knowledge graph
Odma Byambasuren, Yunfei Yang, Zhifang Sui, Damai Dai, Baobao Chang, Sujian Li, and Hongying Zan. Preliminary study on the construction of chinese medical knowledge graph. Journal of Chinese Information Processing, 33(10):1–9, 2019
work page 2019
-
[4]
Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P Langlotz. Chexpert plus: Augmenting a large chest x- ray dataset with text radiology reports, patient demographics and additional image formats. arXiv preprint arXiv:2405.19538, 2024
-
[5]
Mikaël Chelli, Jules Descamps, Vincent Lavoué, Christophe Trojani, Michel Azar, Marcel Deckert, Jean-Luc Raynier, Gilles Clowez, Pascal Boileau, Caroline Ruetsch-Chelli, et al. Hallucination rates and reference accuracy of chatgpt and bard for systematic reviews: comparative analysis. Journal of medical Internet research , 26(1): e53164, 2024
work page 2024
-
[6]
Benchmarking large language models on answering and explaining challenging medical questions
Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3563–3599, 2025
work page 2025
-
[7]
Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale, 2024
Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, and Benyou Wang. Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale, 2024. URL https://arxiv.org/abs/2406. 19280
work page 2024
- [8]
-
[9]
Rarebench: Can llms serve as rare diseases specialists? arXiv preprint arXiv:2402.06341 , 2024
Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, and Ting Chen. Rarebench: Can llms serve as rare diseases specialists? arXiv preprint arXiv:2402.06341 , 2024. URL https://arxiv.org/abs/2402. 06341
-
[10]
Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, and Nanqing Dong. Graphgen: Enhancing supervised fine-tuning for llms with knowledge-driven synthetic data generation, 2025. URL https://arxiv.org/abs/2505.20416
-
[11]
Gary S. Collins, Karel G. M. Moons, et al. The AIMe registry for artificial intelligence in biomedical research. Nature Methods, 18(11):1333–1336, 2021. doi: 10.1038/s41592-021-01241-0
-
[12]
Curebench: A benchmark and competition for agentic clinical reasoning (neurips 2025),
CureBench Organizers. Curebench: A benchmark and competition for agentic clinical reasoning (neurips 2025),
work page 2025
- [13]
-
[14]
Preparing a collection of radiology examinations for distribution and retrieval
Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association , 23(2):304–310, 2015
work page 2015
-
[15]
Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025 , pages 1863...
work page 2025
-
[16]
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Ya...
work page internal anchor Pith review arXiv 2025
-
[17]
Medrax: Medical reasoning agent for chest x-ray,
Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, and Bo Wang. Medrax: Medical reasoning agent for chest x-ray, 2025. URL https://arxiv.org/abs/2502.02673
-
[18]
Detecting hallucinations in large language models using uncertainty estimation
Sam Farquhar et al. Detecting hallucinations in large language models using uncertainty estimation. Nature,
-
[19]
doi: 10.1038/s41586-024-07421-0
-
[20]
Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, Jingjing Liu, Kai Wu, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, et al. Med-cmr: A fine-grained benchmark integrating visual evidence and clinical logic for medical complex multimodal reasoning. arXiv preprint arXiv:2512.00818 , 2025
-
[21]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report. arXiv preprint arXiv:2505.07062 , 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Revisit the imbalance optimization in multi-task learning: An experimental analysis
Yihang Guo, Tianyuan Yu, Liang Bai, Yanming Guo, Yirun Ruan, William Li, and Weishi Zheng. Revisit the imbalance optimization in multi-task learning: An experimental analysis. arXiv preprint arXiv:2509.23915 , 2025
-
[23]
PathVQA: 30000+ Questions for Medical Visual Question Answering
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 , 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[24]
DeepEyesV2: Toward Agentic Multimodal Model
Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 , 2025
work page internal anchor Pith review arXiv 2025
-
[25]
The landscape of medical agents: A survey
Xiaobin Hu, Yunhang Qian, Jiaquan Yu, Jingjing Liu, Peng Tang, Xiaozhong Ji, Chengming Xu, Jiawei Liu, Xiaoxiao Yan, Xinlei Yu, et al. The landscape of medical agents: A survey. Authorea Preprints, 2025
work page 2025
-
[26]
Omn- imedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm
Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omn- imedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 22170–22183,
-
[27]
URL https://openaccess.thecvf.com/content/CVPR2024/html/Hu_OmniMedVQA_A_New_Large-Scale_ Comprehensive_Evaluation_Benchmark_for_Medical_LVLM_CVPR_2024_paper.html
-
[28]
Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm
Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 22170–22183, 2024
work page 2024
-
[29]
Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison
Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence , volume 33, pages 590–597, 2019
work page 2019
-
[30]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11 (14):6421, 2021
work page 2021
-
[31]
Pubmedqa: A dataset for biomedical research question answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages 2567–2577, 2019. 26
work page 2019
-
[32]
Alistair E. W. Johnson et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 , 2019
work page internal anchor Pith review arXiv 1901
-
[33]
Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data , 6(1):317, 2019
work page 2019
-
[34]
Bag of tricks for efficient text classi- fication
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classi- fication. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 427–431. Association for Computational Linguistics, April 2017
work page 2017
-
[35]
A dataset of clinically generated visual questions and answers about radiology images
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data , 5(1):1–10, 2018
work page 2018
-
[36]
Quarkmed medical foundation model technical report
Ao Li, Bin Yan, Bingfeng Cai, Chenxi Li, Cunzhong Zhao, Fugen Yao, Gaoqiang Liu, Guanjun Jiang, Jian Xu, Liang Dong, et al. Quarkmed medical foundation model technical report. arXiv preprint arXiv:2508.11894 , 2025
-
[37]
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Chunyuan Li et al. Llava-med: Training a large language-and-vision assistant for biomedicine. arXiv preprint arXiv:2306.00890, 2023
work page Pith review arXiv 2023
-
[38]
Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering
Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge- enhanced dataset for medical visual question answering. In 2021 IEEE 18th international symposium on biomed- ical imaging (ISBI) , pages 1650–1654. IEEE, 2021
work page 2021
-
[39]
Benchmarking large language models on cmexam-a comprehensive chinese medical exam dataset
Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, et al. Benchmarking large language models on cmexam-a comprehensive chinese medical exam dataset. Advances in Neural Information Processing Systems , 36:52430–52452, 2023
work page 2023
-
[40]
Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl
Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, and Yuxiao Dong. Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl. arXiv preprint arXiv:2509.10446, 2025
-
[41]
VividMed: Vision language model with versatile visual grounding for medicine
Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, and Ting Chen. VividMed: Vision language model with versatile visual grounding for medicine. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volum...
work page 2025
-
[42]
Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.89. URL https://aclanthology.org/2025.naacl-long.89/
-
[43]
Accelerated hierarchical density based clustering
Leland McInnes and John Healy. Accelerated hierarchical density based clustering. In 2017 IEEE international conference on data mining workshops (ICDMW) , pages 33–42. IEEE, 2017
work page 2017
-
[44]
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[45]
Some methods of classification and analysis of multivariate observations
James B McQueen. Some methods of classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symposium on Math. Stat. and Prob. , pages 281–297, 1967
work page 1967
-
[46]
Med-flamingo: a multimodal medical few-shot learner
Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Ed- uardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H) , pages 353–367. PMLR, 2023
work page 2023
-
[47]
International classification of diseases-icd
World Health Organization et al. International classification of diseases-icd. World Health Organization - 2009 , 2009
work page 2009
-
[48]
Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning , pages 248–260. PMLR, 2022
work page 2022
-
[49]
Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 337–347. Springer, 2025
work page 2025
-
[50]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249 , 2025. 27
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 , 2019
work page internal anchor Pith review Pith/arXiv arXiv 1908
-
[52]
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics , 20:53–65, 1987
work page 1987
-
[53]
Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, et al. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data , 11(1):688, 2024
work page 2024
-
[54]
Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, Juanma Zambrano Chaves, Szu-Yeu Hu, Mike Schaekermann, Aishwarya Kamath, Yong Cheng, David G. T. Barrett, Cathy Cheung, Basil Mustafa, Anil Palepu, Daniel McDuff, Le Hou, Tomer Golany, Luyang Liu, Jean baptiste Alayrac, Nei...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Shezheng Song, Hao Xu, Jun Ma, Shasha Li, Long Peng, Qian Wan, Xiaodong Liu, and Jie Yu. How to alleviate catastrophic forgetting in llms finetuning? hierarchical layer-wise and element-wise regularization. arXiv preprint arXiv:2501.13669, 2025
-
[57]
Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, et al. Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning. arXiv preprint arXiv:2503.07459 , 2025
-
[58]
Tongyi DeepResearch Technical Report
Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[60]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/ abs/2201.11903
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
Llava-cot: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2087–2098, October 2025
work page 2087
-
[62]
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 , 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Lin yang, Yuancheng Yang, Xu Wang, Changkun Liu, and Yanghaihua. MedMT-bench: Can LLMs memorize and understand long multi-turn conversations in medical scenarios?, 2025. URL https://openreview.net/forum? id=aKyBCsPOHB
work page 2025
-
[64]
Synthetic continued pretraining, 2024
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto. Synthetic continued pretraining, 2024. URL https://arxiv.org/abs/2409.07431
-
[65]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022. 28
work page 2022
-
[66]
Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai
Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, et al. Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai. Advances in Neural Information Processing Systems , 37:94327–94427, 2024
work page 2024
-
[67]
Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, and Xuanjing Huang. A multi- dimensional constraint framework for evaluating and improving instruction following in large language models,
-
[68]
URL https://arxiv.org/abs/2505.07591
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Qi Wu, and Yong Xia. Continual self-supervised learning: Towards universal multi-modal medical data representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 11114–11124, 2024
work page 2024
-
[70]
Medlesionvqa: A multimodal benchmark emulating clinical visual diagnosis for body surface health
Deli Yu, Shengzhi Wang, Xiaozhong Ji, Bo Cui, Jieqiong Cao, Huichao Wang, Boyuan Jiang, Xu Wang, Qian Xu, Yi Zhao, et al. Medlesionvqa: A multimodal benchmark emulating clinical visual diagnosis for body surface health. In The Fourteenth International Conference on Learning Representations
-
[71]
Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025
work page 2025
-
[72]
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark
Xiang Yue et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2024
work page 2024
-
[73]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 , 2023
work page internal anchor Pith review arXiv 2023
-
[74]
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alexander J. Smola. Multimodal chain-of-thought reasoning in language models. Trans. Mach. Learn. Res. , 2024, 2023. URL https://api. semanticscholar.org/CorpusID:256504063
work page 2024
-
[75]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing” thinking with images” via reinforcement learning. arXiv preprint arXiv:2505.14362 , 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[76]
Diagnosisarena: Benchmarking diagnostic reasoning for large language models
Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Shaoting Zhang, Pengfei Liu, and Xi- aofan Zhang. Diagnosisarena: Benchmarking diagnostic reasoning for large language models. arXiv preprint arXiv:2505.14107, 2025. URL https://arxiv.org/abs/2505.14107
-
[77]
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025. URL https://arxiv.org/abs/2501.18362. 29 8 Contributions The authors are listed in alphabetical order by their first names. Contributo...
work page internal anchor Pith review arXiv 2025
-
[78]
Secondarily, determine if they are **medical** entities; if not, do not output
Entity nouns must be informative proper nouns. Secondarily, determine if they are **medical** entities; if not, do not output
-
[79]
Pay attention to overly long medical entity nouns and determine if they can be segmented/split
-
[80]
The sentences below may contain special symbols and meaningless spaces; please ignore them directly
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.