pith. machine review for the scientific record.

arxiv: 2604.02359 · v1 · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis


Pith reviewed 2026-05-15 09:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM safety evaluation · psychosis · mental health AI · LLM-as-a-Judge · clinical validation · automated assessment · Cohen's kappa

The pith

LLM judges match human clinicians with up to 0.75 kappa on safety of responses to psychosis users

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops seven clinician-informed safety criteria to identify when LLM answers might reinforce delusions or other risks for users showing psychosis. It assembles a dataset of model responses labeled by multiple humans to create a consensus standard. Automated evaluation is then tested by using individual LLMs as judges or groups as juries that score responses against the criteria. The strongest single LLM judge reaches substantial agreement with the human consensus, indicating that automated methods can approximate clinical review at scale.

Core claim

Seven clinician-informed safety criteria were defined to assess LLM responses to prompts indicating psychosis. A human-consensus dataset was created from multiple clinician ratings. Testing LLM-as-a-Judge and LLM-as-a-Jury setups showed the best single model (Gemini) achieving Cohen's kappa of 0.75 with human consensus, slightly above the jury at 0.74, with Qwen at 0.68 and Kimi at 0.56. This demonstrates that automated LLM evaluators can serve as reliable, scalable proxies for clinical safety assessment in this setting.

What carries the argument

The seven clinician-informed safety criteria that operationalize risks such as reinforcing delusions or hallucinations, scored via single LLM judges or majority-vote LLM juries against human consensus labels using Cohen's kappa.
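Cohen's kappa, the agreement statistic that carries the paper's results, can be computed directly from paired labels. A minimal sketch with invented toy labels (the paper's actual data and label names are not reproduced here):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (any hashable labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap if each rater labeled independently
    # according to their own label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1:
        return 1.0  # degenerate case: both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Toy example: a judge matches human consensus on 9 of 10 binary safety labels.
human = ["safe"] * 5 + ["unsafe"] * 5
judge = ["safe"] * 5 + ["unsafe"] * 4 + ["safe"]
print(round(cohens_kappa(human, judge), 2))  # 0.8
```

Note that kappa discounts agreement expected by chance, which is why it is preferred over raw percent agreement for the imbalanced label distributions typical of safety ratings.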

If this is right

  • Safety testing of additional LLMs can proceed at scale without repeated large-scale clinician panels.
  • High-risk responses can be automatically flagged before reaching users in production systems.
  • The criteria enable standardized, comparable safety benchmarks across different models.
  • Human clinical review effort can shift from routine scoring to resolving only ambiguous cases.
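The automatic-flagging bullet above amounts to a majority-vote jury over per-criterion judge labels. A minimal sketch, assuming hypothetical criterion names and a conservative tie-break toward human review (neither detail is stated in the paper):

```python
from collections import Counter

def jury_verdict(votes):
    """Majority vote over judge labels; ties resolved as 'unsafe' (assumption)."""
    top = Counter(votes).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "unsafe"  # break ties toward flagging for human review
    return top[0][0]

# Hypothetical per-criterion votes from three judge models on one response.
votes_by_criterion = {
    "reinforces_delusion": ["unsafe", "unsafe", "safe"],
    "redirects_to_care":   ["safe", "safe", "safe"],
}
flags = {c: jury_verdict(v) for c, v in votes_by_criterion.items()}
print(flags)  # {'reinforces_delusion': 'unsafe', 'redirects_to_care': 'safe'}
```

In a production gate, any criterion voted "unsafe" would block or escalate the response before it reaches the user.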

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar criteria and judge setups could be developed for LLM safety in other mental health conditions such as severe depression or anxiety.
  • Fine-tuning the judge models on more clinician-labeled data might raise agreement above the reported 0.75 kappa.
  • Embedding these evaluators in live mental health chat tools could reduce the frequency of harmful reinforcement of psychotic symptoms.

Load-bearing premise

The seven clinician-informed safety criteria comprehensively and reliably capture the clinically relevant risks of LLM interactions with users demonstrating psychosis.

What would settle it

A new set of LLM responses to psychosis-indicative prompts rated independently by clinicians shows the LLM judge achieving kappa below 0.5 with that fresh consensus.

Figures

Figures reproduced from arXiv: 2604.02359 by Andreea Damien, Elizabeth Stade, Jacob Haimes, Markela Zeneli, May Lynn Reese, Mindy Ng.

Figure 1. Criterion-specific reliability (Cohen's Kappa) between human consensus and Gemini, Qwen
Figure 2. Criterion-specific reliability (Cohen's Kappa) between human consensus and Jury of 3
read the original abstract

General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using an LLM as an evaluator (LLM-as-a-Judge) or taking the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human} \times \text{gemini}} = 0.75$, $\kappa_{\text{human} \times \text{qwen}} = 0.68$, $\kappa_{\text{human} \times \text{kimi}} = 0.56$) and that the best judge slightly outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops and validates seven clinician-informed safety criteria for LLM responses to users with psychosis, constructs a human-consensus labeled dataset, and tests LLM-as-a-Judge (single models) versus LLM-as-a-Jury (majority vote), reporting alignment with human labels via Cohen's kappa (0.75 for Gemini, 0.68 for Qwen, 0.56 for Kimi) and noting that the best single judge slightly outperforms the jury (kappa 0.74).

Significance. If the criteria are shown to be comprehensive, the concrete kappa values and independent human consensus labels would support a scalable alternative to purely manual clinical review for LLM safety in mental health contexts, addressing a gap in validated, automatable evaluation methods.

major comments (2)
  1. [Criteria development and validation] Criteria development section: the claim of 'clinically-validated' evaluations depends on the seven criteria comprehensively capturing relevant psychosis-related risks (e.g., delusion reinforcement). The manuscript reports clinician input and consensus on these criteria but provides no literature mapping, missed-case analysis, or external review demonstrating exhaustiveness versus a convenient subset; high kappas therefore only validate agreement on the chosen subset.
  2. [Methods and dataset construction] Methods and dataset construction: the abstract and results report kappas without stating dataset size, response sampling procedure, or inter-clinician agreement during criteria validation. These omissions prevent assessment of label reliability and generalizability of the reported alignment (e.g., whether the 0.75 kappa holds on a representative sample).
minor comments (2)
  1. [Abstract] Abstract: explicitly state dataset size and sampling details to support evaluation of the kappa results.
  2. [Results] Results: clarify whether LLM judges received identical criteria definitions and examples as the clinicians.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Criteria development and validation] Criteria development section: the claim of 'clinically-validated' evaluations depends on the seven criteria comprehensively capturing relevant psychosis-related risks (e.g., delusion reinforcement). The manuscript reports clinician input and consensus on these criteria but provides no literature mapping, missed-case analysis, or external review demonstrating exhaustiveness versus a convenient subset; high kappas therefore only validate agreement on the chosen subset.

    Authors: We appreciate the referee's emphasis on distinguishing between clinician-informed criteria and fully exhaustive clinical validation. The seven criteria were derived through iterative discussions with two clinicians specializing in psychosis treatment, targeting core risks including delusion reinforcement, hallucination encouragement, and inadequate redirection to professional care. While the manuscript does not include a systematic literature mapping or missed-case analysis, the criteria reflect consensus on the most immediate safety concerns for this population. The reported alignment metrics demonstrate that LLMs can reliably apply these specific criteria in line with human experts. We will revise the criteria development section and abstract to clarify that the criteria are clinician-informed rather than claiming comprehensive exhaustiveness, and we will add an explicit limitations paragraph acknowledging the absence of a full literature review or external validation study. revision: partial

  2. Referee: [Methods and dataset construction] Methods and dataset construction: the abstract and results report kappas without stating dataset size, response sampling procedure, or inter-clinician agreement during criteria validation. These omissions prevent assessment of label reliability and generalizability of the reported alignment (e.g., whether the 0.75 kappa holds on a representative sample).

    Authors: We agree that these methodological details are essential for evaluating reliability and should have been included. We will expand the methods section to report the dataset size, the response sampling procedure (including how prompts were generated and selected), and the inter-clinician agreement statistics obtained during the consensus labeling process. These additions will also be reflected in the abstract and results to support assessment of generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct empirical agreement metrics against independent human consensus labels

full rationale

The paper's core results consist of Cohen's kappa values measuring alignment between LLM-as-a-Judge outputs and a separately constructed human-consensus dataset on seven clinician-informed criteria. These kappas are computed as straightforward inter-rater agreement statistics and do not involve any parameter fitting, self-referential definitions, or predictions that reduce by construction to the paper's own inputs. Criteria development draws on external clinician input, but the validation chain remains open to independent human labels rather than closing on internal consistency or self-citation. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the clinical validity of the seven new safety criteria and the assumption that the constructed human-consensus dataset is representative of real user interactions.

axioms (1)
  • standard math Cohen's kappa is an appropriate and sufficient measure of agreement between LLM judges and human consensus labels
    Used throughout to quantify alignment; standard in inter-rater reliability studies.
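For reference, the statistic named in the axiom corrects raw agreement for chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where $p_o$ is the observed fraction of items on which judge and human consensus agree, and $p_e$ is the agreement expected if the two raters labeled independently according to their own label marginals. On the commonly used Landis–Koch scale, the reported $\kappa = 0.75$ falls in the "substantial" band (0.61–0.80), just below "almost perfect" (0.81–1.00).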

pith-pipeline@v0.9.0 · 5586 in / 1184 out tokens · 43339 ms · 2026-05-15T09:02:39.237104+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

    cs.LG 2026-04 unverdicted novelty 6.0

    A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Leveraging llms for mental health: Detection and recommendations from social discussions, March 2025

    Vaishali Aggarwal, Sachin Thukral, Krushil Patel, and Arnab Chatterjee. Leveraging llms for mental health: Detection and recommendations from social discussions, March 2025. URL http://arxiv.org/ abs/2503.01442. arXiv:2503.01442 [cs]

  2. [2]

    Lawsuit: A chatbot hinted a kid should kill his parents over screen time limits, December

    Bobby Allyn. Lawsuit: A chatbot hinted a kid should kill his parents over screen time limits, December

  3. [3]

    URL https://www.npr.org/2024/12/10/nx-s1-5222574/kids-character-ai-lawsuit . NPR

  4. [4]

    American Psychiatric Publishing, Washington, DC, 5th edition, 2013

    American Psychiatric Association.Diagnostic and Statistical Manual of Mental Disorders (DSM -5). American Psychiatric Publishing, Washington, DC, 5th edition, 2013. ISBN 978-0890425558. URLhttps: //doi.org/10.1176/appi.books.9780890425596. Standard reference for psychiatric diagnoses

  5. [5]

    Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, May

  6. [6]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    URLhttp://arxiv.org/abs/2505.08775. arXiv:2505.08775 [cs]

  7. [7]

    Reference-guided verdict: Llms-as-judges in automatic evaluation of free-form text.arXiv preprint arXiv:2408.09235, 2024

    Sher Badshah and Hassan Sajjad. Reference-guided verdict: Llms-as-judges in automatic evaluation of free-form text.arXiv preprint arXiv:2408.09235, 2024

  8. [8]

    Bosch, and Emiel Krahmer

    Erkan Basar, Xin Sun, Iris Hendrickx, Jan de Wit, Tibor Bosse, Gert-Jan De Bruijn, Jos A. Bosch, and Emiel Krahmer. How well can large language models reflect? a human evaluation of LLM-generated reflections for motivational interviewing dialogues. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert,...

  9. [9]

    Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi, Asad Aali, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, ...

  10. [10]

    Exploring the efficacy of robotic assistants with chatgpt and claude in enhancing adhd therapy: Innovating treatment paradigms

    Santiago Berrezueta-Guzman, Mohanad Kandil, María-Luisa Martín-Ruiz, Iván Pau de la Cruz, and Stephan Krusche. Exploring the efficacy of robotic assistants with chatgpt and claude in enhancing adhd therapy: Innovating treatment paradigms. In2024 International Conference on Intelligent Environments (IE), pages 25–32, 2024. doi: 10.1109/IE61493.2024.10599903

  11. [11]

    Springer Nature, 2018

    Lisa Bortolotti.Delusions in context. Springer Nature, 2018

  12. [12]

    Ai chatbots for mental health: A scoping review of effectiveness, feasibility, and applications.Applied Sciences, 14 (13):5889, July 2024

    Mirko Casu, Sergio Triscari, Sebastiano Battiato, Luca Guarnera, and Pasquale Caponnetto. Ai chatbots for mental health: A scoping review of effectiveness, feasibility, and applications.Applied Sciences, 14 (13):5889, July 2024. ISSN 2076-3417. doi: 10.3390/app14135889

  13. [13]

    Correll, Corine Sau Man Wong, Ryan Sai Ting Chu, Vivian Shi Cheng Fung, Gabbie Hou Sem Wong, Janet Hiu Ching Lei, and Wing Chung Chang

    Joe Kwun Nam Chan, Christoph U. Correll, Corine Sau Man Wong, Ryan Sai Ting Chu, Vivian Shi Cheng Fung, Gabbie Hou Sem Wong, Janet Hiu Ching Lei, and Wing Chung Chang. Life expectancy and years of potential life lost in people with mental disorders: a systematic review and meta-analysis. eClinicalMedicine, 65, November 2023. ISSN 2589-5370. doi: 10.1016/j...

  14. [14]

    Humans or llms as the judge? a study on judgement biases, September 2024

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases, September 2024. URL http://arxiv.org/abs/2402.10669. arXiv:2402.10669 [cs]

  15. [15]

    To chat or bot to chat: Ethical issues with using chatbots in mental health.DIGITAL HEALTH, 9: 20552076231183542, January 2023

    Simon Coghlan, Kobi Leins, Susie Sheldrick, Marc Cheong, Piers Gooding, and Simon D’Alfonso. To chat or bot to chat: Ethical issues with using chatbots in mental health.DIGITAL HEALTH, 9: 20552076231183542, January 2023. ISSN 2055-2076, 2055-2076. doi: 10.1177/20552076231183542

  16. [16]

    Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge.medRxiv, pages 2025–04, 2025

    Emma Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, John Caskey, Madeline Oguss, Graham Wills, Guanhua Chen, Dmitriy Dligach, et al. Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge.medRxiv, pages 2025–04, 2025

  17. [17]

    Corey Curran, Nafis Neehal, Keerthiram Murugesan, and Kristin P. Bennett. Examining trustworthiness of llm-as-a-judge systems in a clinical trial design benchmark. In2024 IEEE International Confer- ence on Big Data (BigData), page 4627–4631, Washington, DC, USA, December 2024. IEEE. ISBN 9798350362480. doi: 10.1109/BigData62323.2024.10825592. URL https://...

  18. [18]

    LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA

    Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Megha Shah, and Joyce Ho. LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors,Proceedings of the 24th Workshop on Biomedic...

  19. [19]

    Attacks, defenses and evaluations for llm conversation safety: A survey.arXiv preprint arXiv:2402.09283, 2024

    Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for llm conversation safety: A survey.arXiv preprint arXiv:2402.09283, 2024

  20. [20]

    Cong Doanh Duong, Thanh Tung Dao, Trong Nghia Vu, Thi Viet Nga Ngo, and Quang Yen Tran. Compulsive chatgpt usage, anxiety, burnout, and sleep disturbance: A serial mediation model based on stimulus-organism-response perspective.Acta Psychologica, 251:104622, November 2024. ISSN 00016918. doi: 10.1016/j.actpsy.2024.104622

  21. [21]

    Liu, Valdemar Danry, Eunhae Lee, Samantha W

    Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranuta- porn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, and Sandhini Agarwal. How ai and human behaviors shape psychosocial effects of chatbot use: A longitudinal randomized controlled study, March

  22. [22]

    Liu, Valdemar Danry, Eunhae Lee, Samantha W

    URLhttp://arxiv.org/abs/2503.17473. arXiv:2503.17473 [cs]

  23. [23]

    Appraising the performance of chatgpt in psychiatry using 100 clinical case vignettes.Asian Journal of Psychiatry, 89: 103770, November 2023

    Russell Franco D’Souza, Shabbir Amanullah, Mary Mathew, and Krishna Mohan Surapaneni. Appraising the performance of chatgpt in psychiatry using 100 clinical case vignettes.Asian Journal of Psychiatry, 89: 103770, November 2023. ISSN 18762018. doi: 10.1016/j.ajp.2023.103770

  24. [24]

    Evaluating generative ai responses to real-world drug-related questions.Psychiatry research, 339:116058, 2024

    Salvatore Giorgi, Kelsey Isman, Tingting Liu, Zachary Fried, Joao Sedoc, and Brenda Curtis. Evaluating generative ai responses to real-world drug-related questions.Psychiatry research, 339:116058, 2024. 11

  25. [25]

    The framework for ai tool assessment in mental health (faita- mental health): a scale for evaluating ai-powered mental health tools.World Psychiatry, 23(3):444, 2024

    Ashleigh Golden and Elias Aboujaoude. The framework for ai tool assessment in mental health (faita- mental health): a scale for evaluating ai-powered mental health tools.World Psychiatry, 23(3):444, 2024

  26. [26]

    Risks from language models for automated mental healthcare: Ethics and structure for implementation.medRxiv, 2024

    Declan Grabb, Max Lamparth, and Nina Vasan. Risks from language models for automated mental healthcare: Ethics and structure for implementation.medRxiv, 2024. doi: 10.1101/2024.04.07.24305462. URLhttps://www.medrxiv.org/content/early/2024/04/08/2024.04.07.24305462

  27. [27]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, March 2025. URL http://arxiv.org/abs/2411.15594. arXiv:2411.15594 [cs]

  28. [28]

    Soullmate: An adaptive llm-driven system for advanced mental health support and assessment, based on a systematic application survey, October 2024

    Qiming Guo, Jinwen Tang, Wenbo Sun, Haoteng Tang, Yi Shang, and Wenlu Wang. Soullmate: An adaptive llm-driven system for advanced mental health support and assessment, based on a systematic application survey, October 2024. URL http://arxiv.org/abs/2410.11859. arXiv:2410.11859 [cs]

  29. [29]

    it listens better than my therapist

    Anna-Carolina Haensch. “it listens better than my therapist”: Exploring social media discourse on llms as mental health tool, April 2025. URLhttp://arxiv.org/abs/2504.12337. arXiv:2504.12337 [cs]

  30. [30]

    Medsafetybench: Evaluating and improving the medical safety of large language models

    Tessa Han, Aounon Kumar, Chirag Agarwal, and Himabindu Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Glober- son, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems, volume 37, page 33423–33454. Curran Asso- ciates, Inc., ...

  31. [31]

    A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, April 2025

    Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, April 2025. ISSN 2398-6352. doi: 10.1038/s41746-025-01611-4

  32. [32]

    Shunsen Huang, Xiaoxiong Lai, Li Ke, Yajun Li, Huanlei Wang, Xinmei Zhao, Xinran Dai, and Yun Wang. Ai technology panic—is ai dependence bad for mental health? a cross-lagged panel model and the mediating roles of motivations for ai use among adolescents.Psychology Research and Behavior Management, V olume 17:1087–1102, March 2024. ISSN 1179-1578. doi: 10...

  33. [33]

    Intima: A benchmark for human-ai companionship behavior, August 2025

    Lucie-Aimée Kaffee, Giada Pistilli, and Yacine Jernite. Intima: A benchmark for human-ai companionship behavior, August 2025. URLhttp://arxiv.org/abs/2508.09998. arXiv:2508.09998 [cs]

  34. [34]

    Mental health app evaluation: updating the american psychiatric association’s framework through a stakeholder-engaged workshop.Psychiatric Services, 72(9):1095–1098, 2021

    Sarah Lagan, Margaret R Emerson, Darlene King, Sonia Matwin, Steven R Chan, Stephon Proctor, Julia Tartaglia, Karen L Fortuna, Patrick Aquino, Robert Walker, et al. Mental health app evaluation: updating the american psychiatric association’s framework through a stakeholder-engaged workshop.Psychiatric Services, 72(9):1095–1098, 2021

  35. [35]

    Psy-llm: Scaling up global mental health psychological services with ai-based large language models, September 2023

    Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, and Ziqi Wang. Psy-llm: Scaling up global mental health psychological services with ai-based large language models, September 2023. URL http://arxiv.org/abs/2307.11991. arXiv:2307.11991 [cs]

  36. [36]

    Cognitive behavioral therapy for psychosis (cbtp): An introductory manual for clinicians

    Yulia Landa. Cognitive behavioral therapy for psychosis (cbtp): An introductory manual for clinicians. Technical report, Mental Illness Research, Education and Clinical Centers (MIRECC) at the James J. Peters V A Medical Center, 2017. URL https://www.mirecc.va.gov/visn2/docs/CBTp_Manual_ VA_Yulia_Landa_2017.pdf. V A Medical Center

  37. [37]

    J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, March 1977. ISSN 0006-341X. Research Support, U.S. Gov’t, Non -P.H.S.; Research Support, U.S. Gov’t, P.H.S

  38. [38]

    Improving automatic evaluation of large language models (LLMs) in biomedical relation extraction via LLMs-as-the-judge

    Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, and Jimmy Huang. Improving automatic evaluation of large language models (LLMs) in biomedical relation extraction via LLMs-as-the-judge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Assoc...

  39. [39]

    The opportunities and risks of large language models in mental health.JMIR Mental Health, 11:e59479–e59479, July 2024

    Hannah R Lawrence, Renee A Schneider, Susan B Rubin, Maja J Matari´c, Daniel J McDuff, and Megan Jones Bell. The opportunities and risks of large language models in mental health.JMIR Mental Health, 11:e59479–e59479, July 2024. ISSN 2368-7959. doi: 10.2196/59479. 12

  40. [40]

    Chain of risks evaluation (core): A framework for safer large language models in public mental health.Psychiatry and Clinical Neurosciences, 79(6):299–305, 2025

    Lingyu Li, Shuqi Kong, Haiquan Zhao, Chunbo Li, Yan Teng, and Yingchun Wang. Chain of risks evaluation (core): A framework for safer large language models in public mental health.Psychiatry and Clinical Neurosciences, 79(6):299–305, 2025

  41. [41]

    Leveraging llms as meta-judges: A multi-agent framework for evaluating llm judgments.arXiv preprint arXiv:2504.17087, 2025

    Yuran Li, Jama Hussein Mohamud, Chongren Sun, Di Wu, and Benoit Boulet. Leveraging llms as meta-judges: A multi-agent framework for evaluating llm judgments.arXiv preprint arXiv:2504.17087, 2025

  42. [42]

    Liu, Pat Pataranutaporn, and Pattie Maes

    Auren R. Liu, Pat Pataranutaporn, and Pattie Maes. Chatbot companionship: A mixed-methods study of companion chatbot usage patterns and their relationship to loneliness in active users, August 2025. URL http://arxiv.org/abs/2410.21596. arXiv:2410.21596 [cs]

  43. [43]

    Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support

    Zilin Ma, Yiyang Mei, and Zhaoyuan Su. Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. InAMIA Annual Symposium Proceedings, volume 2023, page 1105, 2024

  44. [44]

    McGrath, Sukanta Saha, Ali Al-Hamzawi, Jordi Alonso, Evelyn J

    John J. McGrath, Sukanta Saha, Ali Al-Hamzawi, Jordi Alonso, Evelyn J. Bromet, Ronny Bruffaerts, José Miguel Caldas-de Almeida, Wai Tat Chiu, Peter De Jonge, John Fayyad, Silvia Florescu, Oye Gureje, Josep Maria Haro, Chiyi Hu, Viviane Kovess-Masfety, Jean Pierre Lepine, Carmen C. W. Lim, Maria Elena Medina Mora, Fernando Navarro-Mateu, Susana Ochoa, Nanc...

  45. [45]

    Ong, and Nick Haber

    Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber. Expressing stigma and inappropriate responses prevents llms from safely replacing mental health providers. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, page 599–627, Athens Greece, June 2025. ACM. ISBN 97984...

  46. [46]

    Prevalence of psychotic disorders and its association with methodological issues

    Berta Moreno-Küstner, Carlos Martín, and Loly Pastor. Prevalence of psychotic disorders and its association with methodological issues. a systematic review and meta-analyses.PLOS ONE, 13(4):e0195687, April

  47. [47]

    doi: 10.1371/journal.pone.0195687

    ISSN 1932-6203. doi: 10.1371/journal.pone.0195687

  48. [48]

    Delusions by design? how everyday ais might be fuelling psychosis (and what can be done about it)

    Hamilton Morrin, Luke Nicholls, Michael Levin, Jenny Yiend, Udita Iyengar, Francesca DelGuidice, Sagnik Bhattacharya, Stefania Tognin, James MacCabe, Ricardo Twumasi, Ben Alderson-Day, and {Thomas A.} Pollak. Delusions by design? how everyday ais might be fuelling psychosis (and what can be done about it). Workingpaper, PsyArXiv, July 2025

  49. [49]

    A. G. Nevarez-Flores, K. Sanderson, M. Breslin, V . J. Carr, V . A. Morgan, and A. L. Neil. Systematic review of global functioning and quality of life in people with psychotic disorders.Epidemiology and Psychiatric Sciences, 28(1):31–44, 2019. doi: 10.1017/S2045796018000549

  50. [50]

    Sycophancy in gpt-4o: what happened and what we’re doing about it, apr 2025

    OpenAI. Sycophancy in gpt-4o: what happened and what we’re doing about it, apr 2025. URL https: //openai.com/index/sycophancy-in-gpt-4o/. OpenAI Blog

  51. [51]

    Bounds, Angela Jun, Jaesu Han, Robert M

    Jung In Park, Mahyar Abbasian, Iman Azimi, Dawn T. Bounds, Angela Jun, Jaesu Han, Robert M. McCarron, Jessica Borelli, Parmida Safavi, Sanaz Mirbaha, Jia Li, Mona Mahmoudi, Carmen Wiedenhoeft, and Amir M. Rahmani. Building trust in mental health chatbots: Safety metrics and llm-based evaluation tools, 2025. URLhttps://arxiv.org/abs/2408.04650

  52. [52]

    An ai chatbot pushed a teen to kill himself, a lawsuit against its creator alleges, oct 2024

    Kate Payne. An ai chatbot pushed a teen to kill himself, a lawsuit against its creator alleges, oct 2024. URL https://apnews.com/article/ chatbot-ai-lawsuit-suicide-teen-artificial-intelligence-9d48adc572100822fdbc3c90d1456bd0 . AP News

  53. [53]

    Perlis, Joseph F

    Roy H. Perlis, Joseph F. Goldberg, Michael J. Ostacher, and Christopher D. Schneck. Clinical decision support for bipolar depression using large language models.Neuropsychopharmacology, 49(9):1412–1416, August 2024. ISSN 1740-634X. doi: 10.1038/s41386-024-01841-2

  54. [54]

    Tony Rousmaniere, Yimeng Zhang, Xu Li, and Siddharth Shah. Large language models as mental health resources: Patterns of use in the united states, March 2025. URL https://osf.io/q8m7g_v1

  55. [55]

    Marcin Rządeczka, Anna Sterna, Julia Stolińska, Paulina Kaczyńska, and Marcin Moskalewicz. The efficacy of conversational ai in rectifying the theory-of-mind and autonomy biases: Comparative analysis. JMIR Ment Health, 12:e64396, February 2025. ISSN 2368-7959. doi: 10.2196/64396

  56. [56]

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023

  57. [57]

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge, April 2025. URL http://arxiv.org/abs/2406.07791. arXiv:2406.07791 [cs]

  58. [58]

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas,...

  59. [59]

    Sophia Spallek, Louise Birrell, Stephanie Kershaw, Emma Krogh Devine, and Louise Thornton. Can we use chatgpt for mental health and substance use education? examining its quality and potential harms. JMIR Medical Education, 9:e51243, November 2023. ISSN 2369-3762. doi: 10.2196/51243

  60. [60]

    Elizabeth C. Stade, Johannes C. Eichstaedt, Jane P. Kim, and Shannon Wiltsey Stirman. Readiness evaluation for artificial intelligence-mental health deployment and implementation (readi): A review and proposed framework. Technology, Mind, and Behavior, 6(2), April 2025. ISSN 2689-0208. doi: 10.1037/tmb0000163. URL https://tmb.apaopen.org/pub/8gyddorx

  61. [61]

    Elizabeth C Stade, Zoe M Tait, Samuel T Campione, and Shannon Wiltsey Stirman. Current real-world use of large language models for mental health, 2025

  62. [62]

    Melanie Subbiah, Sean Zhang, Lydia B. Chilton, and Kathleen McKeown. Reading subtext: Evaluating large language models on short story summarization with writers. Transactions of the Association for Computational Linguistics, 12:1290–1310, October 2024. ISSN 2307-387X. doi: 10.1162/tacl_a_00702

  63. [63]

    Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A. Metoyer. Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks. In Proceedings of the 30th International Conference on Intelligent User Interfaces, page 952–966, Cagliari Italy, March 2025. ACM. ISBN 9798400713064. ...

  64. [64]

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. In Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, Eliya Habba, Itay Itzhak, Simon Mille, Yotam Perlitz, Enrico Santus, João Sedoc, Michal Shm...

  65. [65]

    Xiaoyu Tong, Rochelle Choenni, Martha Lewis, and Ekaterina Shutova. Metaphor understanding challenge dataset for LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3517–3536, Bangkok, Thailand, August 2024. Association for C...

  66. [66]

    Octavian Vasiliu. Therapeutic management of schizophrenia and substance use disorders dual diagnosis - clinical vignettes. Romanian Journal of Military Medicine, 121(2):26–34, 2018

  67. [67]

    Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

  68. [68]

    Ni Wayan Surya Wardhani, Masithoh Yessi Rochayani, Atiek Iriany, Agus Dwi Sulistyono, and Prayudi Lestantyo. Cross-validation metrics for evaluating classification performance on imbalanced data. In 2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA), pages 14–18, 2019. doi: 10.1109/IC3INA48034.2019.8949568

  70. [70]

    Barry Wright, Subodh Dave, and Nisha Dogra. 100 cases in psychiatry. CRC Press, 2017

  71. [71]

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V. Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge, October 2024. URL http://arxiv.org/abs/2410.02736. arXiv:2410.02736 [cs]

  72. [72]

    Sen-Chi Yu, Hong-Ren Chen, and Yu-Wen Yang. Development and validation the problematic chatgpt use scale: a preliminary report. Current Psychology, 43(31):26080–26092, August 2024. ISSN 1046-1310, 1936-4733. doi: 10.1007/s12144-024-06259-z

  73. [73]

    Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, Kui Ren, and Jingyi Wang. S-eval: Towards automated and comprehensive safety evaluation for large language models. Proc. ACM Softw. Eng., 2(ISSTA), June 2025. doi: 10.1145/3728971. URL https://doi.org/10.1145/3728971