LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Guoyi Li; Haiyang Geng; Jianhua Chen; Jiatong Ma; Ming Xiao; Shihao Xu; Tiancheng Zhou; Yafeng Deng; Yanli Ding; Yiming Yan

arxiv: 2602.09379 · v3 · pith:AGCZ5HCKnew · submitted 2026-02-10 · 💻 cs.MA · cs.CL

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Shihao Xu , Tiancheng Zhou , Jiatong Ma , Yanli Ding , Yiming Yan , Ming Xiao , Guoyi Li , Haiyang Geng

show 3 more authors

Yunyun Han Jianhua Chen Yafeng Deng

This is my paper

classification 💻 cs.MA cs.CL

keywords consultationdiagnosticdiagnosisllmspsychiatricdynamiclingxidiagbenchaccuracy

0 comments

read the original abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
cs.CL 2026-04 unverdicted novelty 6.0

MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.