MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Alice Oh; Chanbi Park; Hangyeol Yoo; Hoyun Song; Jihyun An; JiHyun Kim; Jinyoung Han; Jisu Shin; KyungTae Lim; Migyeong Kang

arxiv: 2602.12871 · v2 · pith:VQNJ5I43new · submitted 2026-02-13 · 💻 cs.CL

Pith reviewed 2026-05-21 12:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models · psychiatric diagnosis · DSM-5 · benchmark · knowledge graph · confidence calibration · diagnostic evaluation

The pith

Large language models apply DSM-5 criteria accurately in clear cases but fail to adjust their confidence when symptoms overlap across disorders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a benchmark to check whether LLMs can apply official psychiatric diagnostic rules to diagnose mental health conditions. The authors build a knowledge graph of the diagnostic criteria and rules for 23 disorders and then generate 24,750 synthetic patient cases that vary in how complete the information is and how complex the diagnosis is. Experiments show that leading models do well when the questions are straightforward but have difficulty judging how confident to be when multiple disorders fit the same symptoms. A sympathetic reader would care because real psychiatric diagnosis often involves incomplete information and symptom overlap, so any AI tool must handle that uncertainty to be useful.

Core claim

The paper establishes that state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, yet they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. This is shown through systematic testing on synthetic cases derived from a detailed knowledge graph of diagnostic criteria.

What carries the argument

MentalKG, a psychiatrist-validated knowledge graph that encodes DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders, used to generate synthetic clinical cases varying in information completeness and diagnostic complexity.
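As a rough illustration of how a criteria graph can drive case generation, the sketch below uses a toy graph with placeholder disorders and symptoms (`disorder_A`, `s1`, and so on). None of these names, the `min_count` threshold, or the `completeness` parameter come from the paper; the actual MentalKG schema and generation procedure are not described in this summary.

```python
import random

# Toy criteria graph: placeholder names, NOT actual DSM-5 content.
TOY_KG = {
    "disorder_A": {"symptoms": ["s1", "s2", "s3", "s4"], "min_count": 3},
    "disorder_B": {"symptoms": ["s3", "s4", "s5", "s6"], "min_count": 3},
}

def generate_case(kg, target, completeness, rng):
    """Build one synthetic case for `target`.

    completeness in (0, 1]: fraction of the target's symptoms revealed
    in the case (a hypothetical knob for information completeness).
    """
    symptoms = kg[target]["symptoms"]
    n_shown = max(1, round(completeness * len(symptoms)))
    shown = sorted(rng.sample(symptoms, n_shown))
    # Other disorders that share at least one revealed symptom make the
    # case harder to differentiate.
    overlapping = sorted(
        name for name, spec in kg.items()
        if name != target and set(spec["symptoms"]) & set(shown)
    )
    return {"label": target, "symptoms": shown, "overlapping": overlapping}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
full_case = generate_case(TOY_KG, "disorder_A", 1.0, rng)
partial_case = generate_case(TOY_KG, "disorder_A", 0.5, rng)
```

Lowering `completeness` hides part of the target's symptom set, while the `overlapping` field records which other disorders remain plausible given what is shown. Those are the two axes of variation the summary attributes to the benchmark.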

If this is right

LLMs are not yet reliable for use as psychiatric decision-support tools.
Diagnostic evaluations for AI must include tests with varying ambiguity rather than only clear-cut cases.
Future models will need better mechanisms for expressing and calibrating uncertainty in overlapping symptom scenarios.
Clinical adoption of LLMs in mental health requires safeguards against overconfident misdiagnoses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks could be developed for other areas of medicine where symptom overlap is common, such as internal medicine or neurology.
Training techniques that explicitly reward proper confidence levels might improve performance on this type of task.
Real clinical data could be used to validate whether the synthetic cases match the difficulty of actual patient presentations.

Load-bearing premise

The synthetic clinical cases created from the knowledge graph capture the same level of information completeness and diagnostic complexity found in real psychiatric assessments.

What would settle it

Running the same LLMs on a set of real, anonymized psychiatric case notes and checking if the pattern of strong performance on clear cases and weak confidence calibration on ambiguous ones still holds.
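One way to check whether that pattern holds is to compare accuracy against mean stated confidence within each stratum of cases. The sketch below is only an illustration: the stratum labels, the record layout, and the numbers are all made up, and the paper's own analysis may differ.

```python
from collections import defaultdict

def overconfidence_by_stratum(records):
    """Map each stratum to (accuracy, mean confidence, confidence - accuracy)."""
    groups = defaultdict(list)
    for stratum, confidence, is_correct in records:
        groups[stratum].append((confidence, is_correct))
    summary = {}
    for stratum, rows in groups.items():
        accuracy = sum(ok for _, ok in rows) / len(rows)
        mean_conf = sum(c for c, _ in rows) / len(rows)
        summary[stratum] = (accuracy, mean_conf, mean_conf - accuracy)
    return summary

# Made-up records: (stratum, model confidence, 1 if the diagnosis was right).
records = [
    ("clear", 0.9, 1), ("clear", 0.9, 1),
    ("overlapping", 0.9, 1), ("overlapping", 0.9, 0),
]
summary = overconfidence_by_stratum(records)
```

A large positive gap in the ambiguous stratum alongside a small gap in the clear stratum would match the reported pattern; similar gaps in both would not.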

Original abstract

Large language models (LLMs) have attracted growing interest as supportive tools for psychiatric assessment and clinical decision support. However, existing mental health benchmarks largely rely on social media data or supportive dialogue settings, limiting their ability to assess whether models can apply formal diagnostic criteria and differential diagnostic rules. In this paper, we introduce MentalBench, a benchmark for evaluating whether LLMs can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as an expert-curated logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling DSM-grounded evaluation. Our experiments show that although state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. These findings raise concerns about the reliability of LLMs as psychiatric decision-support tools and highlight the need for more evaluation that reflects the diverse challenges in real-world psychiatric diagnosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MentalBench gives a clean way to test LLMs on DSM-5 differential diagnosis with controlled ambiguity, but the synthetic cases leave open whether the calibration failures are real or setup-specific.

The main thing to know is that this paper builds MentalKG, a psychiatrist-curated knowledge graph of DSM-5 criteria and rules for 23 disorders, then uses it to generate 24,750 synthetic cases that vary in information completeness and diagnostic complexity. The experiments find that current LLMs do fine on straightforward queries but lose calibration when symptoms overlap across disorders. That controlled variation is the concrete step forward from earlier mental-health benchmarks that leaned on social media text or open-ended dialogue.

Referee Report

2 major / 2 minor

Summary. The paper introduces MentalBench, a DSM-5-grounded benchmark for evaluating LLMs on psychiatric diagnostic decisions. It centers on MentalKG, a psychiatrist-constructed knowledge graph encoding criteria and differential rules for 23 disorders, from which 24,750 synthetic clinical cases are generated by systematically varying information completeness and diagnostic complexity. Experiments indicate that state-of-the-art LLMs perform adequately on noise-free DSM-5 probes but exhibit poor confidence calibration when distinguishing disorders with overlapping symptoms.

Significance. If the synthetic cases faithfully encode real diagnostic ambiguity and incompleteness, the benchmark offers a structured, reproducible way to probe LLM reliability in clinical decision support. The knowledge-graph backbone provides a clear advantage over purely data-driven or dialogue-based mental-health evaluations by enabling explicit control over diagnostic rules and ambiguity levels.

major comments (2)

[Abstract / §3] Abstract and §3 (case generation): The central empirical claim—that LLMs struggle to calibrate confidence on overlapping-symptom distinctions—depends on the 24,750 synthetic cases accurately reflecting the information incompleteness and differential complexity of actual psychiatric encounters. The description states that MentalKG encodes only DSM-5 criteria and rules and that cases are generated via systematic variation of completeness and complexity; however, no external validation (psychiatrist realism ratings, comparison to de-identified clinical notes, or inter-rater agreement on diagnostic difficulty) is reported. This omission is load-bearing because unencoded factors such as longitudinal history, cultural context, or medical comorbidities could drive the observed calibration failures as artifacts of the synthetic distribution rather than general LLM properties.
[§4] §4 (experimental results): The reported finding that LLMs 'struggle to calibrate their confidence' is presented without explicit metrics or baselines for calibration (e.g., expected calibration error, Brier score, or comparison against a simple rule-based DSM matcher). Without these, it is difficult to quantify how much worse the models are relative to the benchmark's own logical backbone.
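For reference, the two calibration metrics named in the second comment can be computed as in the minimal sketch below. It assumes each prediction carries a single confidence for the predicted diagnosis and a 0/1 correctness flag, uses equal-width bins that are closed on the left (some implementations close them on the right), and runs on made-up numbers rather than results from the paper.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(correct)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Made-up example: the model is 90% confident every time but right half the time.
conf = [0.9, 0.9, 0.9, 0.9]
hit = [1, 1, 0, 0]
ece = expected_calibration_error(conf, hit)  # about 0.4
brier = brier_score(conf, hit)               # about 0.41
```

Both metrics are 0 for a model that is always certain and always right, and both grow as stated confidence drifts away from observed accuracy.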

minor comments (2)

[Abstract] Abstract: The phrase 'psychiatrist-built and validated knowledge graph' is used without a citation or subsection reference to the validation procedure or inter-expert agreement statistics.
[Methods / Results] Throughout: The total of 24,750 cases is stated without a breakdown by disorder, completeness level, or complexity tier; a supplementary table would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and note planned revisions to the manuscript.

Referee: [Abstract / §3] The central empirical claim—that LLMs struggle to calibrate confidence on overlapping-symptom distinctions—depends on the 24,750 synthetic cases accurately reflecting the information incompleteness and differential complexity of actual psychiatric encounters. ... no external validation (psychiatrist realism ratings, comparison to de-identified clinical notes, or inter-rater agreement on diagnostic difficulty) is reported. This omission is load-bearing because unencoded factors such as longitudinal history, cultural context, or medical comorbidities could drive the observed calibration failures as artifacts of the synthetic distribution rather than general LLM properties.

Authors: We appreciate the referee's emphasis on this point. MentalKG was constructed and internally validated by psychiatrists to faithfully encode DSM-5 criteria and differential rules, as stated in §3. We agree that external validation against real clinical encounters would strengthen claims of ecological validity. Such validation was not performed owing to ethical and logistical barriers in accessing de-identified patient records. In the revision we will expand the description of the KG construction process in §3, add an explicit limitations paragraph discussing unencoded factors (e.g., cultural context, comorbidities), and outline future work on external realism ratings. revision: partial
Referee: [§4] The reported finding that LLMs 'struggle to calibrate their confidence' is presented without explicit metrics or baselines for calibration (e.g., expected calibration error, Brier score, or comparison against a simple rule-based DSM matcher). Without these, it is difficult to quantify how much worse the models are relative to the benchmark's own logical backbone.

Authors: We agree that quantitative calibration metrics and an explicit baseline would improve clarity. In the revised §4 we will report Expected Calibration Error (ECE) and Brier scores for model confidence estimates across the ambiguity levels. We will also add a deterministic rule-based DSM matcher that applies the logical rules encoded in MentalKG as a baseline, allowing direct comparison of LLM calibration against the benchmark's expert-defined structure. revision: yes
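A deterministic rule-based matcher of the kind proposed as a baseline could look roughly like the sketch below. The rule table uses placeholder disorders, symptoms, and thresholds rather than actual DSM-5 content, and the uniform split of confidence across tied matches is an assumption made for illustration; the rebuttal does not say how the baseline would assign confidence.

```python
# Toy rules: placeholder names and thresholds, NOT actual DSM-5 content.
TOY_RULES = {
    "disorder_A": {"symptoms": {"s1", "s2", "s3", "s4"}, "min_count": 3},
    "disorder_B": {"symptoms": {"s3", "s4", "s5", "s6"}, "min_count": 3},
}

def rule_based_match(rules, observed):
    """Return every disorder whose symptom-count threshold is met, sorted by name."""
    observed = set(observed)
    return sorted(
        name for name, rule in rules.items()
        if len(rule["symptoms"] & observed) >= rule["min_count"]
    )

def matcher_confidence(rules, observed):
    """Spread confidence uniformly over the matching disorders.

    Assumption for illustration: a deterministic matcher has no native
    confidence, so ties are split evenly. Returns {} when nothing matches.
    """
    matches = rule_based_match(rules, observed)
    return {name: 1.0 / len(matches) for name in matches}

clear = matcher_confidence(TOY_RULES, ["s1", "s2", "s3"])
ambiguous = matcher_confidence(TOY_RULES, ["s1", "s3", "s4", "s5"])
```

In this toy setup the clear case yields full confidence in one disorder, while the overlapping case splits confidence between two. That is the behavior an LLM's confidence would be compared against.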

Circularity Check

0 steps flagged

No circularity: benchmark derives from independent expert KG and generated cases

The paper builds MentalKG as a psychiatrist-validated knowledge graph encoding DSM-5 criteria for 23 disorders, then uses it to generate 24,750 synthetic cases that vary completeness and complexity. LLM evaluations and the claim about poor calibration on overlapping symptoms are obtained by testing models against these externally generated cases. No equations, fitted parameters, or self-citation chains reduce the reported results back to the inputs by construction. The derivation chain remains self-contained and empirical.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that a knowledge graph can faithfully encode DSM-5 criteria and differential rules and that the generated synthetic cases reflect real clinical ambiguity.

axioms (1)

domain assumption: DSM-5 diagnostic criteria and differential diagnostic rules can be accurately represented in a structured knowledge graph.
This premise is required to generate the 24,750 synthetic cases used for evaluation.

invented entities (2)

MentalKG · no independent evidence
purpose: Expert-curated knowledge graph encoding DSM-5 criteria for 23 disorders.
Newly constructed backbone for case generation; validation details not provided in abstract.
MentalBench · no independent evidence
purpose: Benchmark dataset and evaluation framework for LLM psychiatric diagnosis.
Newly introduced evaluation resource.

pith-pipeline@v0.9.0 · 5766 in / 1311 out tokens · 44022 ms · 2026-05-21T12:39:26.739170+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

At the core of MENTALBENCH is MENTALKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders... generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Responsible Evaluation of AI for Mental Health
cs.CY · 2026-01 · unverdicted · novelty 6.0

Proposes an interdisciplinary framework and taxonomy for responsible evaluation of AI mental health tools based on analysis of 135 publications identifying gaps in metrics, expert involvement, safety, and equity.