ClinConsensus: A Physician-Calibrated Benchmark for Evaluating Clinical Rubric Coverage in Chinese Medical LLMs

arXiv: 2603.02097 · v5 · pith:WQ4LAUPLnew · submitted 2026-03-02 · cs.CL

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Xue Yang, Kailuan Wu, Ruyi Xu, Tianyun Lu, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Sen Yang, Lin Qu, Bing Zhao, Hu Wei

keywords: coverage, medical, clinical, criteria, physician-calibrated, rubric, across, benchmark

Open-ended medical LLM evaluation remains weakly grounded in physician-calibrated coverage of clinically relevant response criteria, especially in localized clinical settings. We introduce ClinConsensus, a Chinese medical benchmark of 2,500 expert-curated cases spanning 36 specialties, 12 task themes, multiple difficulty levels, and lay-facing versus professional-facing settings. Each case is paired with 30 case-specific binary rubric criteria. To evaluate whether responses satisfy enough physician-authored criteria, we propose the Clinician-Anchored Coverage Score (CACS), a physician-calibrated threshold metric instantiated at k=10, and develop a dual-judge framework combining a GPT-5.1 grader with a physician-supervised Qwen3-8B judge. Evaluating 11 frontier LLMs, we find a persistent coverage gap: Rubric Accuracy ranges from 39.6% to 52.1%, whereas CACS@10 ranges from 17.8% to 32.9%, leaving a 19.2 to 21.9 point gap across models. Stratified analyses further reveal substantial variation across reasoning, evidence use, structured extraction, medication instructions, follow-up, and dialogue register. These results suggest that medical LLM evaluation should measure thresholded, rubric-grounded clinical coverage rather than average partial correctness.
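
The contrast the abstract draws between average partial correctness and thresholded coverage can be illustrated with a small Python sketch. The abstract does not give the exact definition of CACS or say how k enters it, so the pass rule below (a case passes when at least `min_met` of its binary criteria are satisfied) is a hypothetical stand-in, not the authors' metric. The function names, the `min_met` parameter, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def rubric_accuracy(cases: Sequence[Sequence[bool]]) -> float:
    """Share of all binary criteria satisfied, pooled over every case."""
    total = sum(len(case) for case in cases)
    met = sum(sum(case) for case in cases)
    return met / total


def thresholded_coverage(cases: Sequence[Sequence[bool]], min_met: int) -> float:
    """Share of cases that satisfy at least `min_met` criteria.

    Hypothetical pass rule: the real CACS threshold is physician-calibrated
    and is not specified in the abstract.
    """
    passed = sum(1 for case in cases if sum(case) >= min_met)
    return passed / len(cases)


def make_case(n_met: int, n_total: int = 30) -> List[bool]:
    """Toy case with `n_met` satisfied criteria out of `n_total`."""
    return [True] * n_met + [False] * (n_total - n_met)


# Four made-up cases with 30 criteria each (not data from the paper).
toy_cases = [make_case(n) for n in (24, 12, 9, 15)]

acc = rubric_accuracy(toy_cases)                    # 60 of 120 criteria met
cov = thresholded_coverage(toy_cases, min_met=20)   # only the first case passes

print(f"rubric accuracy: {acc:.2f}, thresholded coverage: {cov:.2f}")
```

On this toy data the averaged score is twice the thresholded one, which is the kind of gap the abstract reports: partial credit accumulates across cases even when few individual responses cover enough criteria to pass.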

Forward citations

Cited by 1 Pith paper

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions
cs.CY 2026-05 conditional novelty 6.0

Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing a...