Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Andrew Liu; Chenyu You; Dading Chong; Helin Wang; Junling Liu; Lei Zhu; Michael Lingzhi Li; Peilin Zhou; Yining Hua; Zhenhua Guo

arxiv: 2306.03030 · v3 · pith:SW5TMW52new · submitted 2023-06-05 · 💻 cs.CL

Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li

classification 💻 cs.CL

keywords: medical, cmexam, llms, chinese, dataset, comprehensive, evaluation, accuracy

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.
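The two headline metrics in the abstract, accuracy and weighted F1, can be illustrated with a minimal sketch. It assumes single-answer multiple-choice questions and takes "weighted F1" to mean per-class F1 averaged by class support, which is a common convention; the abstract does not state the exact computation used by the authors. The function names and the toy answers below are hypothetical and are not taken from the CMExam repository.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of questions where the predicted option equals the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-class F1 averaged with weights proportional to each class's support.

    Assumption: classes are the answer options that occur in `gold`.
    """
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * n / total
    return score


# Toy data (hypothetical): gold answers and model predictions for four questions.
gold = ["A", "B", "A", "C"]
pred = ["A", "B", "C", "C"]

print(f"accuracy    = {accuracy(gold, pred):.3f}")
print(f"weighted F1 = {weighted_f1(gold, pred):.3f}")
```

Weighting by support keeps frequent answer options from being under-represented in the average, which is one reason accuracy and weighted F1 can land close together, as they do in the reported GPT-4 figures.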

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
cs.CL 2026-06 unverdicted novelty 6.0

BiRG-LoRA reaches 69.31% macro-average accuracy across CMB, CMExam, MedQA and MedMCQA, outperforming MoELoRA by 0.89 points with 28.1% fewer parameters under a matched single-seed protocol.

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
cs.CL 2026-06 unverdicted novelty 6.0

BiRG-LoRA achieves 69.31% macro-average accuracy across CMB, CMExam, MedQA and MedMCQA using a rank-gated LoRA with biaxial clinical gating, outperforming MoELoRA by 0.89 points with 28.1% fewer parameters.

Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
cs.CL 2026-06 unverdicted novelty 5.0

BiRG-LoRA achieves 69.31% macro-average accuracy across CMB, CMExam, MedQA, and MedMCQA, outperforming MoELoRA by 0.89 points with 28.1% fewer trainable parameters under a matched Qwen3-8B protocol.

MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
cs.CY 2026-04 unverdicted novelty 4.0

MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.