arxiv: 2605.15589 · v1 · pith:VKKO2XHTnew · submitted 2026-05-15 · 💻 cs.CL

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

Pith reviewed 2026-05-20 19:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords mental health · large language models · knowledge graphs · benchmark · entity recognition · relation prediction · two-hop reasoning

The pith

Large language models perform well on recognizing mental health entities but struggle with predicting relations and two-hop reasoning in a new knowledge-graph benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MHGraphBench to evaluate how well LLMs capture mental health knowledge from a knowledge graph and apply it to structured judgments. Experiments on 15 models show that they excel at entity typing but have trouble with relation prediction and two-hop reasoning. This matters because LLMs are being used in mental health contexts where accurate knowledge application is critical. The work also finds that providing graph snippets has mixed effects and that how answers are formatted can skew results. Overall, it positions the benchmark as measuring agreement with a specific KG slice rather than real clinical capability.

Core claim

The central discovery is a persistent recognition-to-judgment gap: leading models reach near-ceiling performance on entity typing and on the small relation-typing subset but struggle with relation prediction and two-hop reasoning when evaluated on MHGraphBench, which is derived from PrimeKG.

What carries the argument

MHGraphBench, a benchmark with nine task families derived from PrimeKG, using KG-supported answers and controlled negative options to test entity recognition, relation judgment, and two-hop reasoning.
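
Two-hop reasoning over a knowledge graph amounts to chaining two triples through a shared intermediate entity. The following is a minimal sketch of that idea, not the paper's construction pipeline: the placeholder entities, the relation names, and the helper `two_hop_paths` are all assumptions made for illustration.

```python
from collections import defaultdict

# Placeholder triples (head, relation, tail); hypothetical, not taken from PrimeKG.
TRIPLES = [
    ("disease_1", "associated_with", "gene_x"),
    ("gene_x", "target_of", "drug_a"),
    ("disease_1", "associated_with", "gene_y"),
    ("gene_y", "target_of", "drug_b"),
    ("disease_2", "associated_with", "gene_x"),
]


def two_hop_paths(triples, start):
    """Enumerate paths start -r1-> mid -r2-> end that do not loop back to start."""
    outgoing = defaultdict(list)
    for head, relation, tail in triples:
        outgoing[head].append((relation, tail))

    paths = []
    for r1, mid in outgoing.get(start, []):
        for r2, end in outgoing.get(mid, []):
            if end != start:
                paths.append((start, r1, mid, r2, end))
    return paths


paths = two_hop_paths(TRIPLES, "disease_1")
for path in paths:
    print(" -> ".join(path))
```

Each enumerated path could back one question whose answer is the final entity, with the intermediate entity left for the model to infer.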

If this is right

  • Short snippets from the knowledge graph improve performance for some models but decrease it for others.
  • Output format reliability can substantially affect measured performance in multiple-choice evaluations.
  • The benchmark should be seen as testing agreement with a curated mental health KG slice under constrained settings, not direct clinical safety.
  • Models' limitations in complex reasoning highlight the need for careful use in mental health applications.
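
The second point, that output-format reliability can shift measured performance, can be illustrated with a small sketch. It assumes a hypothetical parser `extract_choice` that accepts only a bare option letter from A to D and treats everything else as invalid; the parser, the sample responses, and the metric names are illustrative and do not come from the paper.

```python
import re

# Accepts "A", "(B)", "D." and similar; anything longer counts as invalid.
VALID = re.compile(r"^\s*\(?([A-D])\)?[.:]?\s*$")


def extract_choice(response):
    """Return the option letter, or None if the response is not a bare choice."""
    match = VALID.match(response)
    return match.group(1) if match else None


def score(responses, gold):
    """Compare accuracy over all items with accuracy over parseable items only."""
    parsed = [extract_choice(r) for r in responses]
    valid = [(p, g) for p, g in zip(parsed, gold) if p is not None]
    correct = sum(p == g for p, g in valid)
    return {
        "validity_rate": len(valid) / len(gold),
        "strict_accuracy": correct / len(gold),  # invalid outputs count as wrong
        "valid_only_accuracy": correct / len(valid) if valid else 0.0,
    }


# Made-up responses: the third is right in substance but fails the format check.
responses = ["A", "(B)", "The answer is C because ...", "D.", "I cannot answer"]
gold = ["A", "B", "C", "A", "B"]
result = score(responses, gold)
print(result)
```

The gap between the strict and valid-only figures shows how a model that answers in free text can look weaker than one that simply follows the requested format.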

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might need to fine-tune models specifically on relation-heavy mental health tasks to close the gap.
  • Integrating this benchmark with real-world clinical data could provide a more robust test of LLM reliability.
  • Similar gaps may exist in other specialized domains, suggesting a general pattern in how LLMs handle structured knowledge.

Load-bearing premise

The mental health portion of the PrimeKG knowledge graph serves as a complete and unbiased representation of clinically important facts and relations.

What would settle it

Finding that models achieve high accuracy on relation prediction and two-hop reasoning tasks when the ground truth is independently validated by mental health experts.

Figures

Figures reproduced from arXiv: 2605.15589 by Bradley A. Malin, Congning Ni, Murat Kantarcioglu, Shelagh A. Mulvaney, Susannah L. Rose, Weixin Liu, Zhijun Yin.

Figure 1: Overview of the KG-grounded mental-health benchmark framework. Starting from 42 final psychiatric …

Abstract

Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MHGraphBench, a benchmark derived from the mental-health slice of PrimeKG comprising nine task families for entity recognition, relation judgment, and two-hop reasoning. Experiments on 15 LLMs identify a recognition-to-judgment gap with near-ceiling entity typing but weaker relation prediction and reasoning performance, plus sensitivity to output format and KG snippets. The work explicitly caveats that results measure agreement with the curated KG slice under multiple-choice constraints rather than clinical safety.

Significance. If the PrimeKG mental-health slice can be shown to be a sufficiently validated proxy for clinically relevant knowledge, the benchmark would provide a useful, reproducible tool for quantifying structured knowledge gaps in LLMs. The use of an external public KG, controlled negative options, and explicit non-clinical interpretation are strengths that support cautious adoption for model evaluation.

major comments (2)
  1. [Abstract and benchmark construction] Abstract and benchmark construction section: The central recognition-to-judgment gap claim requires that performance differences reflect model limitations rather than uneven fidelity or coverage in the PrimeKG mental-health slice. No clinician review, inter-rater reliability metrics, or coverage audit of extracted triples (especially probabilistic relations such as symptom comorbidity) is reported, leaving open the possibility that lower relation and two-hop scores arise from KG artifacts instead of LLM capability.
  2. [Task construction and evaluation] Task construction and evaluation sections: The paper notes format sensitivity and the value of short KG snippets but provides no details on negative sampling strategy, statistical controls for task difficulty, or balancing across the nine task families. Without these, the reported performance gaps (near-ceiling entity typing vs. lower relation prediction) cannot be confidently attributed to a general recognition-to-judgment disparity.
minor comments (2)
  1. [Results] Clarify the exact number of examples per task family and the size of the relation-typing subset in the main text or a table for reproducibility.
  2. [Introduction] The abstract's final sentence on interpretation is helpful; consider moving a concise version of this caveat to the introduction as well.
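
One way to produce the per-family counts requested in the first minor comment is a simple tally over scored items. The sketch below uses made-up records and hypothetical family names; the paper's nine task families and their sizes are not listed in this review.

```python
from collections import Counter

# Made-up scored items; family names are placeholders, not the paper's.
records = [
    {"family": "entity_typing", "correct": True},
    {"family": "entity_typing", "correct": True},
    {"family": "entity_typing", "correct": True},
    {"family": "relation_prediction", "correct": True},
    {"family": "relation_prediction", "correct": False},
    {"family": "relation_prediction", "correct": True},
    {"family": "relation_prediction", "correct": False},
    {"family": "two_hop", "correct": False},
    {"family": "two_hop", "correct": True},
]


def per_family_summary(items):
    """Return instance count and accuracy for each task family."""
    counts = Counter(item["family"] for item in items)
    hits = Counter(item["family"] for item in items if item["correct"])
    return {
        family: {"n": n, "accuracy": hits[family] / n}
        for family, n in sorted(counts.items())
    }


summary = per_family_summary(records)
for family, stats in summary.items():
    print(f"{family:22s} n={stats['n']:3d} acc={stats['accuracy']:.2f}")
```

Reporting `n` next to each accuracy makes it visible when a near-ceiling score rests on a small subset.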

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed review, which highlights important considerations for strengthening the presentation of MHGraphBench. We value the recognition of the benchmark's use of a public KG and its explicit non-clinical framing. We respond point-by-point to the major comments below and have revised the manuscript to improve transparency and address the raised concerns where feasible.

point-by-point responses
  1. Referee: [Abstract and benchmark construction] Abstract and benchmark construction section: The central recognition-to-judgment gap claim requires that performance differences reflect model limitations rather than uneven fidelity or coverage in the PrimeKG mental-health slice. No clinician review, inter-rater reliability metrics, or coverage audit of extracted triples (especially probabilistic relations such as symptom comorbidity) is reported, leaving open the possibility that lower relation and two-hop scores arise from KG artifacts instead of LLM capability.

    Authors: We appreciate the referee's point that the recognition-to-judgment gap must be attributable to model behavior rather than artifacts in the underlying KG slice. We acknowledge that the manuscript does not report clinician review, inter-rater reliability, or a dedicated coverage audit of the extracted mental-health triples from PrimeKG. PrimeKG is an established, multi-source biomedical KG, and our extraction followed its existing mental-health annotations, but we did not add independent clinical validation. In the revised version, we expand the benchmark construction section with a clearer description of the triple extraction criteria and sources. We also strengthen the limitations and interpretation sections to reiterate that results reflect agreement with the curated KG under multiple-choice constraints, not clinical validity. We note that the gap appears consistently across 15 models and is most evident in multi-hop reasoning, which is consistent with broader LLM literature on structured inference rather than isolated KG noise in probabilistic relations. These textual revisions clarify the scope without overclaiming. revision: yes

  2. Referee: [Task construction and evaluation] Task construction and evaluation sections: The paper notes format sensitivity and the value of short KG snippets but provides no details on negative sampling strategy, statistical controls for task difficulty, or balancing across the nine task families. Without these, the reported performance gaps (near-ceiling entity typing vs. lower relation prediction) cannot be confidently attributed to a general recognition-to-judgment disparity.

    Authors: We agree that additional details on task construction would improve confidence in attributing the gaps. The original manuscript describes controlled negative options drawn from the KG to ensure they are incorrect yet plausible, but we will add an explicit description of the negative sampling procedure (sampling non-matching relations from entities of the same type while excluding ground-truth answers). For task difficulty, we performed informal checks during construction to avoid trivial or overly hard items, and we will now report summary statistics such as average option count and per-family accuracy baselines in the evaluation section. Regarding balancing, the nine task families reflect the natural distribution of relations in the PrimeKG mental-health slice rather than artificial equalization; we already disaggregate results by family. In revision we add a table with instance counts per family and note the design rationale. These changes support clearer attribution of the observed recognition-to-judgment pattern while preserving the benchmark's grounding in the source KG. revision: partial
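
The negative sampling procedure mentioned in the second response could look roughly like the sketch below. It is one possible reading of that description (distractors share the gold answer's entity type and exclude every KG-supported answer), and the placeholder triples, the type map, and the function `sample_negatives` are assumptions rather than the authors' code.

```python
import random

# Placeholder triples (head, relation, tail); hypothetical, not taken from PrimeKG.
TRIPLES = [
    ("disease_1", "indication", "drug_a"),
    ("disease_1", "indication", "drug_b"),
    ("disease_2", "indication", "drug_c"),
    ("disease_2", "contraindication", "drug_a"),
    ("disease_3", "indication", "drug_d"),
    ("disease_3", "indication", "drug_e"),
]

# Entity types are derived from the placeholder names for this toy example.
ENTITY_TYPE = {
    entity: entity.split("_")[0]
    for head, _, tail in TRIPLES
    for entity in (head, tail)
}


def sample_negatives(head, relation, gold_tail, k, rng):
    """Pick k distractors of the gold tail's type that the KG does not support."""
    true_tails = {t for h, r, t in TRIPLES if h == head and r == relation}
    target_type = ENTITY_TYPE[gold_tail]
    candidates = sorted(
        entity
        for entity, entity_type in ENTITY_TYPE.items()
        if entity_type == target_type and entity not in true_tails
    )
    return rng.sample(candidates, k)


rng = random.Random(0)  # fixed seed so the item is reproducible
negatives = sample_negatives("disease_1", "indication", "drug_a", k=3, rng=rng)
options = negatives + ["drug_a"]
rng.shuffle(options)
print(options)
```

Excluding all supported tails, not only the gold one, keeps `drug_b` out of the options so that no distractor is itself a correct answer.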

standing simulated objections not resolved
  • Independent clinician review, inter-rater reliability metrics, or coverage audit of the PrimeKG mental-health slice triples

Circularity Check

0 steps flagged

Benchmark derived from external PrimeKG with self-contained evaluation

full rationale

The paper constructs MHGraphBench directly from the public PrimeKG resource and measures LLM agreement with its curated mental-health slice under multiple-choice constraints. No equations, fitted parameters, or self-referential definitions appear in the provided text; the recognition-to-judgment gap is an empirical observation against this independent KG rather than a quantity that reduces to the authors' inputs by construction. The abstract explicitly limits interpretation to agreement with the curated slice and disclaims clinical safety claims, confirming that the derivation chain does not rely on load-bearing self-citation or a smuggled-in ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on PrimeKG being a high-quality source of mental-health facts and on the multiple-choice format being a valid probe of knowledge rather than format compliance. No free parameters are introduced; the main domain assumption is the fidelity of the KG slice.

axioms (2)
  • domain assumption PrimeKG supplies accurate, unbiased, and sufficiently complete mental-health entities and relations for benchmarking LLM knowledge.
    All task answers and negative options are derived from PrimeKG; any systematic error or coverage gap in that graph directly affects measured performance.
  • domain assumption Constrained multiple-choice responses with controlled negatives validly isolate knowledge from output-format artifacts.
    The paper itself notes that output-format reliability substantially influences scores, yet the benchmark still uses this interface as the primary measurement vehicle.

pith-pipeline@v0.9.0 · 5756 in / 1713 out tokens · 69163 ms · 2026-05-20T19:13:54.820883+00:00 · methodology

