MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
Pith reviewed 2026-05-21 12:39 UTC · model grok-4.3
The pith
Large language models apply DSM-5 criteria accurately in clear cases but fail to adjust their confidence when symptoms overlap across disorders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reports that state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, yet struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. The evidence comes from systematic testing on 24,750 synthetic cases generated from a knowledge graph of DSM-5 diagnostic criteria and differential diagnostic rules.
What carries the argument
MentalKG, a psychiatrist-validated knowledge graph that encodes DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders, used to generate synthetic clinical cases varying in information completeness and diagnostic complexity.
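A minimal sketch of how a criteria graph could drive case generation, assuming a toy dictionary in place of MentalKG. The names `TOY_KG`, `overlap`, and `generate_case` are hypothetical, the criterion labels are placeholders rather than DSM-5 content, and the paper's actual schema and sampling procedure are not described in this summary.

```python
import random

# Hypothetical toy graph: disorder -> set of criterion labels.
# Placeholder labels only; not DSM-5 content and not the MentalKG schema.
TOY_KG = {
    "disorder_A": {"c1", "c2", "c3", "c4"},
    "disorder_B": {"c3", "c4", "c5", "c6"},
    "disorder_C": {"c7", "c8", "c9"},
}


def overlap(kg, a, b):
    """Criteria shared by two disorders, a crude proxy for diagnostic complexity."""
    return kg[a] & kg[b]


def generate_case(kg, target, completeness, rng):
    """Build a case that presents a fraction `completeness` of the target's criteria."""
    criteria = sorted(kg[target])
    k = max(1, round(completeness * len(criteria)))
    presented = set(rng.sample(criteria, k))
    # Competitors: other disorders that share at least one presented criterion.
    competitors = sorted(d for d in kg if d != target and kg[d] & presented)
    return {"target": target, "presented": presented, "competitors": competitors}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
full_case = generate_case(TOY_KG, "disorder_A", 1.0, rng)
partial_case = generate_case(TOY_KG, "disorder_A", 0.5, rng)
```

Lowering `completeness` removes evidence for the target, and a non-empty `competitors` list marks the overlapping-symptom situations where the paper reports calibration problems.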
If this is right
- LLMs are not yet reliable for use as psychiatric decision-support tools.
- Diagnostic evaluations for AI must include tests with varying ambiguity rather than only clear-cut cases.
- Future models will need better mechanisms for expressing and calibrating uncertainty in overlapping symptom scenarios.
- Clinical adoption of LLMs in mental health requires safeguards against overconfident misdiagnoses.
Where Pith is reading between the lines
- Similar benchmarks could be developed for other areas of medicine where symptom overlap is common, such as internal medicine or neurology.
- Training techniques that explicitly reward proper confidence levels might improve performance on this type of task.
- Real clinical data could be used to validate whether the synthetic cases match the difficulty of actual patient presentations.
Load-bearing premise
The synthetic clinical cases created from the knowledge graph capture the same level of information completeness and diagnostic complexity found in real psychiatric assessments.
What would settle it
Running the same LLMs on a set of real, anonymized psychiatric case notes and checking if the pattern of strong performance on clear cases and weak confidence calibration on ambiguous ones still holds.
Original abstract
Large language models (LLMs) have attracted growing interest as supportive tools for psychiatric assessment and clinical decision support. However, existing mental health benchmarks largely rely on social media data or supportive dialogue settings, limiting their ability to assess whether models can apply formal diagnostic criteria and differential diagnostic rules. In this paper, we introduce MentalBench, a benchmark for evaluating whether LLMs can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as an expert-curated logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling DSM-grounded evaluation. Our experiments show that although state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. These findings raise concerns about the reliability of LLMs as psychiatric decision-support tools and highlight the need for more evaluation that reflects the diverse challenges in real-world psychiatric diagnosis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MentalBench, a DSM-5-grounded benchmark for evaluating LLMs on psychiatric diagnostic decisions. It centers on MentalKG, a psychiatrist-constructed knowledge graph encoding criteria and differential rules for 23 disorders, from which 24,750 synthetic clinical cases are generated by systematically varying information completeness and diagnostic complexity. Experiments indicate that state-of-the-art LLMs perform well on noise-free DSM-5 probes but calibrate their confidence poorly when distinguishing disorders with overlapping symptoms.
Significance. If the synthetic cases faithfully encode real diagnostic ambiguity and incompleteness, the benchmark offers a structured, reproducible way to probe LLM reliability in clinical decision support. The knowledge-graph backbone provides a clear advantage over purely data-driven or dialogue-based mental-health evaluations by enabling explicit control over diagnostic rules and ambiguity levels.
major comments (2)
- [Abstract / §3] Abstract and §3 (case generation): The central empirical claim—that LLMs struggle to calibrate confidence on overlapping-symptom distinctions—depends on the 24,750 synthetic cases accurately reflecting the information incompleteness and differential complexity of actual psychiatric encounters. The description states that MentalKG encodes only DSM-5 criteria and rules and that cases are generated via systematic variation of completeness and complexity; however, no external validation (psychiatrist realism ratings, comparison to de-identified clinical notes, or inter-rater agreement on diagnostic difficulty) is reported. This omission is load-bearing because unencoded factors such as longitudinal history, cultural context, or medical comorbidities could drive the observed calibration failures as artifacts of the synthetic distribution rather than general LLM properties.
- [§4] §4 (experimental results): The reported finding that LLMs 'struggle to calibrate their confidence' is presented without explicit metrics or baselines for calibration (e.g., expected calibration error, Brier score, or comparison against a simple rule-based DSM matcher). Without these, it is difficult to quantify how much worse the models are relative to the benchmark's own logical backbone.
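To make the suggested rule-based baseline concrete, here is a minimal sketch of a deterministic matcher, assuming each disorder is reduced to a set of criteria and a minimum count. `TOY_RULES`, `TOY_THRESHOLDS`, and `rule_based_match` are hypothetical names with placeholder values; real DSM-5 rules also involve duration, exclusion, and impairment conditions that this sketch ignores.

```python
# Hypothetical placeholder rules; not DSM-5 content.
TOY_RULES = {
    "disorder_A": {"c1", "c2", "c3", "c4"},
    "disorder_B": {"c3", "c4", "c5", "c6"},
    "disorder_C": {"c7", "c8", "c9"},
}
# Minimum number of criteria that must be present for each disorder (assumed).
TOY_THRESHOLDS = {"disorder_A": 3, "disorder_B": 3, "disorder_C": 2}


def rule_based_match(rules, thresholds, presented):
    """Return every disorder whose count of presented criteria meets its threshold."""
    matches = []
    for disorder, criteria in rules.items():
        met = len(criteria & presented)
        if met >= thresholds[disorder]:
            matches.append(disorder)
    return sorted(matches)


clear_case = rule_based_match(TOY_RULES, TOY_THRESHOLDS, {"c3", "c4", "c5"})
ambiguous_case = rule_based_match(
    TOY_RULES, TOY_THRESHOLDS, {"c1", "c2", "c3", "c4", "c5"}
)
```

A matcher of this kind returns more than one disorder when the evidence does not separate them, which gives a reference point for how much confidence an LLM should spread across candidates on the same case.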
minor comments (2)
- [Abstract] Abstract: The phrase 'psychiatrist-built and validated knowledge graph' is used without a citation or subsection reference to the validation procedure or inter-expert agreement statistics.
- [Methods / Results] Throughout: The total of 24,750 cases is stated without a breakdown by disorder, completeness level, or complexity tier; a supplementary table would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and note planned revisions to the manuscript.
Referee: [Abstract / §3] The central empirical claim—that LLMs struggle to calibrate confidence on overlapping-symptom distinctions—depends on the 24,750 synthetic cases accurately reflecting the information incompleteness and differential complexity of actual psychiatric encounters. ... no external validation (psychiatrist realism ratings, comparison to de-identified clinical notes, or inter-rater agreement on diagnostic difficulty) is reported. This omission is load-bearing because unencoded factors such as longitudinal history, cultural context, or medical comorbidities could drive the observed calibration failures as artifacts of the synthetic distribution rather than general LLM properties.
Authors: We appreciate the referee's emphasis on this point. MentalKG was constructed and internally validated by psychiatrists to faithfully encode DSM-5 criteria and differential rules, as stated in §3. We agree that external validation against real clinical encounters would strengthen claims of ecological validity. Such validation was not performed owing to ethical and logistical barriers in accessing de-identified patient records. In the revision we will expand the description of the KG construction process in §3, add an explicit limitations paragraph discussing unencoded factors (e.g., cultural context, comorbidities), and outline future work on external realism ratings.
Revision: partial
Referee: [§4] The reported finding that LLMs 'struggle to calibrate their confidence' is presented without explicit metrics or baselines for calibration (e.g., expected calibration error, Brier score, or comparison against a simple rule-based DSM matcher). Without these, it is difficult to quantify how much worse the models are relative to the benchmark's own logical backbone.
Authors: We agree that quantitative calibration metrics and an explicit baseline would improve clarity. In the revised §4 we will report Expected Calibration Error (ECE) and Brier scores for model confidence estimates across the ambiguity levels. We will also add a deterministic rule-based DSM matcher that applies the logical rules encoded in MentalKG as a baseline, allowing direct comparison of LLM calibration against the benchmark's expert-defined structure.
Revision: yes
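The two calibration metrics named in this exchange can be sketched as follows. This is a minimal illustration, assuming each prediction carries one confidence value in [0, 1] for the model's top diagnosis and a 0/1 correctness flag; the function names, the ten equal-width bins, and the binary form of the Brier score are assumptions, not the authors' reported setup.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    pairs = list(zip(confidences, correct))
    return sum((c - y) ** 2 for c, y in pairs) / len(pairs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Overconfident toy predictions: 90% stated confidence, 50% actually correct.
conf = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 0, 0]
toy_brier = brier_score(conf, hits)
toy_ece = expected_calibration_error(conf, hits)
```

Computing both metrics separately for each ambiguity level, rather than once over all cases, is what would show whether miscalibration concentrates in the overlapping-symptom cases.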
Circularity Check
No circularity: benchmark derives from independent expert KG and generated cases
The paper builds MentalKG as a psychiatrist-validated knowledge graph encoding DSM-5 criteria for 23 disorders, then uses it to generate 24,750 synthetic cases that vary completeness and complexity. LLM evaluations and the claim about poor calibration on overlapping symptoms are obtained by testing models against these externally generated cases. No equations, fitted parameters, or self-citation chains reduce the reported results back to the inputs by construction. The derivation chain remains self-contained and empirical.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: DSM-5 diagnostic criteria and differential diagnostic rules can be accurately represented in a structured knowledge graph.
invented entities (2)
- MentalKG: no independent evidence
- MentalBench: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "At the core of MENTALBENCH is MENTALKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders... generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Responsible Evaluation of AI for Mental Health: Proposes an interdisciplinary framework and taxonomy for responsible evaluation of AI mental health tools based on analysis of 135 publications identifying gaps in metrics, expert involvement, safety, and equity.