MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
Pith reviewed 2026-05-20 19:13 UTC · model grok-4.3
The pith
Large language models perform well on recognizing mental health entities but struggle with predicting relations and two-hop reasoning in a new knowledge-graph benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is a persistent recognition-to-judgment gap: leading models reach near-ceiling performance on entity typing and small relation-typing tasks but struggle with relation prediction and two-hop reasoning when evaluated on the MHGraphBench derived from PrimeKG.
What carries the argument
MHGraphBench, a benchmark with nine task families derived from PrimeKG, using KG-supported answers and controlled negative options to test entity recognition, relation judgment, and two-hop reasoning.
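To make the two-hop idea concrete, here is a minimal sketch of how two-hop chains could be enumerated from knowledge-graph triples. The triples, entity names, and the `two_hop_paths` helper are hypothetical illustrations, not taken from PrimeKG or from the paper's construction code.

```python
from collections import defaultdict

# Toy (head, relation, tail) triples; illustrative only, not real PrimeKG content.
TRIPLES = [
    ("drug_A", "indication", "disorder_X"),
    ("disorder_X", "phenotype_present", "symptom_P"),
    ("drug_B", "indication", "disorder_Y"),
    ("disorder_Y", "phenotype_present", "symptom_Q"),
]


def two_hop_paths(triples):
    """Return (head, r1, mid, r2, tail) chains where one triple's tail is the next triple's head."""
    by_head = defaultdict(list)
    for h, r, t in triples:
        by_head[h].append((r, t))
    paths = []
    for h, r1, mid in triples:
        # .get avoids adding new keys to the index while we read from it
        for r2, t in by_head.get(mid, []):
            if t != h:  # skip chains that loop straight back to the start
                paths.append((h, r1, mid, r2, t))
    return paths


PATHS = two_hop_paths(TRIPLES)
```

Each chain could then be turned into a question whose answer is the final entity, with the intermediate entity left for the model to infer. Whether the benchmark builds its two-hop items this way is an assumption.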
If this is right
- Short snippets from the knowledge graph improve performance for some models but decrease it for others.
- Output format reliability can substantially affect measured performance in multiple-choice evaluations.
- The benchmark should be seen as testing agreement with a curated mental health KG slice under constrained settings, not direct clinical safety.
- Models' limitations in complex reasoning highlight the need for careful use in mental health applications.
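The point about output-format reliability can be illustrated with a small scoring sketch. It assumes a four-option (A to D) interface and a strict parser; the `parse_choice` and `score` helpers and the sample responses are hypothetical and are not the paper's evaluation code.

```python
import re

# Accept only a bare option letter, optionally wrapped as "(B)", "B.", or "Answer: B".
_PATTERN = re.compile(r"^(?:answer\s*[:\-]?\s*)?\(?([A-D])\)?[.)]?$", re.IGNORECASE)


def parse_choice(response):
    """Return the option letter in upper case, or None if the response is not a valid choice."""
    match = _PATTERN.match(response.strip())
    return match.group(1).upper() if match else None


def score(responses, gold):
    """Report validity and accuracy with invalid responses counted as wrong or excluded."""
    parsed = [parse_choice(r) for r in responses]
    valid = [(p, g) for p, g in zip(parsed, gold) if p is not None]
    correct = sum(p == g for p, g in valid)
    return {
        "validity_rate": len(valid) / len(gold),
        "strict_accuracy": correct / len(gold),  # invalid responses count as wrong
        "valid_only_accuracy": correct / len(valid) if valid else 0.0,
    }


RESULT = score(
    ["B", "Answer: c", "I think it is B or C", "(A)"],
    ["B", "C", "B", "D"],
)
```

In this toy run one of four responses is unparseable, so accuracy differs depending on whether invalid responses are counted as wrong or dropped. This is the kind of divergence that makes response validity worth reporting alongside accuracy.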
Where Pith is reading between the lines
- Developers might need to fine-tune models specifically on relation-heavy mental health tasks to close the gap.
- Integrating this benchmark with real-world clinical data could provide a more robust test of LLM reliability.
- Similar gaps may exist in other specialized domains, suggesting a general pattern in how LLMs handle structured knowledge.
Load-bearing premise
The mental health portion of the PrimeKG knowledge graph serves as a complete and unbiased representation of clinically important facts and relations.
What would settle it
Finding that models achieve high accuracy on relation prediction and two-hop reasoning tasks when the ground truth is independently validated by mental health experts.
Original abstract
Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MHGraphBench, a benchmark derived from the mental-health slice of PrimeKG comprising nine task families for entity recognition, relation judgment, and two-hop reasoning. Experiments on 15 LLMs identify a recognition-to-judgment gap with near-ceiling entity typing but weaker relation prediction and reasoning performance, plus sensitivity to output format and KG snippets. The work explicitly caveats that results measure agreement with the curated KG slice under multiple-choice constraints rather than clinical safety.
Significance. If the PrimeKG mental-health slice can be shown to be a sufficiently validated proxy for clinically relevant knowledge, the benchmark would provide a useful, reproducible tool for quantifying structured knowledge gaps in LLMs. The use of an external public KG, controlled negative options, and explicit non-clinical interpretation are strengths that support cautious adoption for model evaluation.
major comments (2)
- [Abstract and benchmark construction] The central recognition-to-judgment gap claim requires that performance differences reflect model limitations rather than uneven fidelity or coverage in the PrimeKG mental-health slice. No clinician review, inter-rater reliability metrics, or coverage audit of extracted triples (especially probabilistic relations such as symptom comorbidity) is reported, leaving open the possibility that lower relation and two-hop scores arise from KG artifacts instead of LLM capability.
- [Task construction and evaluation] The paper notes format sensitivity and the value of short KG snippets but provides no details on negative sampling strategy, statistical controls for task difficulty, or balancing across the nine task families. Without these, the reported performance gaps (near-ceiling entity typing vs. lower relation prediction) cannot be confidently attributed to a general recognition-to-judgment disparity.
minor comments (2)
- [Results] Clarify the exact number of examples per task family and the size of the relation-typing subset in the main text or a table for reproducibility.
- [Introduction] The abstract's final sentence on interpretation is helpful; consider moving a concise version of this caveat to the introduction as well.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which highlights important considerations for strengthening the presentation of MHGraphBench. We value the recognition of the benchmark's use of a public KG and its explicit non-clinical framing. We respond point-by-point to the major comments below and have revised the manuscript to improve transparency and address the raised concerns where feasible.
- Referee: [Abstract and benchmark construction] The central recognition-to-judgment gap claim requires that performance differences reflect model limitations rather than uneven fidelity or coverage in the PrimeKG mental-health slice. No clinician review, inter-rater reliability metrics, or coverage audit of extracted triples (especially probabilistic relations such as symptom comorbidity) is reported, leaving open the possibility that lower relation and two-hop scores arise from KG artifacts instead of LLM capability.
  Authors: We appreciate the referee's point that the recognition-to-judgment gap must be attributable to model behavior rather than artifacts in the underlying KG slice. We acknowledge that the manuscript does not report clinician review, inter-rater reliability, or a dedicated coverage audit of the extracted mental-health triples from PrimeKG. PrimeKG is an established, multi-source biomedical KG, and our extraction followed its existing mental-health annotations, but we did not add independent clinical validation. In the revised version, we expand the benchmark construction section with a clearer description of the triple extraction criteria and sources. We also strengthen the limitations and interpretation sections to reiterate that results reflect agreement with the curated KG under multiple-choice constraints, not clinical validity. We note that the gap appears consistently across 15 models and is most evident in multi-hop reasoning, which is consistent with broader LLM literature on structured inference rather than isolated KG noise in probabilistic relations. These textual revisions clarify the scope without overclaiming.
  revision: yes
- Referee: [Task construction and evaluation] The paper notes format sensitivity and the value of short KG snippets but provides no details on negative sampling strategy, statistical controls for task difficulty, or balancing across the nine task families. Without these, the reported performance gaps (near-ceiling entity typing vs. lower relation prediction) cannot be confidently attributed to a general recognition-to-judgment disparity.
  Authors: We agree that additional details on task construction would improve confidence in attributing the gaps. The original manuscript describes controlled negative options drawn from the KG to ensure they are incorrect yet plausible, but we will add an explicit description of the negative sampling procedure (sampling non-matching relations from entities of the same type while excluding ground-truth answers). For task difficulty, we performed informal checks during construction to avoid trivial or overly hard items, and we will now report summary statistics such as average option count and per-family accuracy baselines in the evaluation section. Regarding balancing, the nine task families reflect the natural distribution of relations in the PrimeKG mental-health slice rather than artificial equalization; we already disaggregate results by family. In revision we add a table with instance counts per family and note the design rationale. These changes support clearer attribution of the observed recognition-to-judgment pattern while preserving the benchmark's grounding in the source KG.
  revision: partial
- Independent clinician review, inter-rater reliability metrics, or coverage audit of the PrimeKG mental-health slice triples
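The negative sampling procedure that the authors describe in their second response (same-type candidates, with ground-truth answers excluded) can be sketched as follows. This is one possible reading, not the authors' implementation: the toy entities, the `ENTITY_TYPE` and `TRIPLES` tables, and the `sample_negatives` helper are all hypothetical.

```python
import random

# Toy entity types and triples; illustrative only, not real PrimeKG content.
ENTITY_TYPE = {
    "drug_A": "drug", "drug_B": "drug",
    "disorder_X": "disease", "disorder_Y": "disease",
    "disorder_Z": "disease", "disorder_W": "disease",
}
TRIPLES = {
    ("drug_A", "indication", "disorder_X"),
    ("drug_A", "indication", "disorder_Y"),
    ("drug_B", "indication", "disorder_Z"),
}


def sample_negatives(head, relation, answer, k, rng):
    """Pick k distractors of the same type as the answer that the KG does not support."""
    # Every tail the KG supports for (head, relation) is a valid answer, so exclude all of them.
    supported = {t for (h, r, t) in TRIPLES if h == head and r == relation}
    pool = sorted(
        e for e, typ in ENTITY_TYPE.items()
        if typ == ENTITY_TYPE[answer] and e not in supported
    )
    if len(pool) < k:
        raise ValueError("not enough same-type negatives")
    return rng.sample(pool, k)


NEGATIVES = sample_negatives("drug_A", "indication", "disorder_X", 2, random.Random(0))
```

Excluding every KG-supported tail, not only the chosen answer, keeps a second correct option such as `disorder_Y` from appearing as a distractor. The seeded `random.Random` makes the sampled options reproducible.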
Circularity Check
Benchmark derived from external PrimeKG with self-contained evaluation
The paper constructs MHGraphBench directly from the public PrimeKG resource and measures LLM agreement with its curated mental-health slice under multiple-choice constraints. No equations, fitted parameters, or self-referential definitions appear in the provided text; the recognition-to-judgment gap is an empirical observation against this independent KG rather than a quantity that reduces to the authors' inputs by construction. The abstract explicitly limits interpretation to agreement with the curated slice and disclaims clinical safety claims, confirming that the derivation chain does not rely on load-bearing self-citations or a smuggled-in ansatz.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: PrimeKG supplies accurate, unbiased, and sufficiently complete mental-health entities and relations for benchmarking LLM knowledge.
- domain assumption: Constrained multiple-choice responses with controlled negatives validly isolate knowledge from output-format artifacts.