CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hsing-Chi Hwang, Ruishan Liu

classification cs.CL

keywords health, mental, expert, llms, questions, answers, counselbench, patient

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs
cs.CL 2026-05 unverdicted novelty 7.0

DESG uses dynamic graphs of decoupled clinical states and asymmetric geometry to evaluate therapeutic dialogue quality, reaching 0.9353 macro-F1 on a 600-window held-out test set and outperforming LLM judges and text ...

Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs
cs.CL 2026-04 conditional novelty 7.0

Graph2Counsel creates 760 synthetic counseling sessions from 76 client psychological graphs, outperforming prior datasets in expert ratings on specificity, authenticity, and safety while improving fine-tuned model per...

Mental Health AI Safety Claims Must Preserve Temporal Evidence
cs.AI 2026-05 unverdicted novelty 5.0

Mental health AI safety evaluations that discard temporal sequence and accumulation produce invalid conclusions; the paper formalizes this as Temporal Safety Non-Identifiability and proposes SCOPE-MH as a reporting st...