pith. machine review for the scientific record.

arxiv: 2604.02359 · v1 · submitted 2026-03-20 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis


Pith reviewed 2026-05-15 09:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM safety evaluation · psychosis · mental health AI · LLM-as-a-Judge · clinical validation · automated assessment · Cohen's kappa

The pith

LLM judges match human clinicians with up to 0.75 kappa on safety of responses to psychosis users

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops seven clinician-informed safety criteria to identify when LLM answers might reinforce delusions or other risks for users showing psychosis. It assembles a dataset of model responses labeled by multiple humans to create a consensus standard. Automated evaluation is then tested by using individual LLMs as judges or groups as juries that score responses against the criteria. The strongest single LLM judge reaches substantial agreement with the human consensus, indicating that automated methods can approximate clinical review at scale.

Core claim

Seven clinician-informed safety criteria were defined to assess LLM responses to prompts indicating psychosis. A human-consensus dataset was created from multiple clinician ratings. Testing LLM-as-a-Judge and LLM-as-a-Jury setups showed the best single model (Gemini) achieving Cohen's kappa of 0.75 with human consensus, slightly above the jury at 0.74, with Qwen at 0.68 and Kimi at 0.56. This demonstrates that automated LLM evaluators can serve as reliable, scalable proxies for clinical safety assessment in this setting.

What carries the argument

The seven clinician-informed safety criteria that operationalize risks such as reinforcing delusions or hallucinations, scored via single LLM judges or majority-vote LLM juries against human consensus labels using Cohen's kappa.
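Cohen's kappa, the agreement statistic that carries the paper's results, can be computed directly from paired labels. A minimal sketch with invented toy labels (the paper's actual data and label names are not reproduced here):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (any hashable labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap if each rater labeled independently
    # according to their own label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1:
        return 1.0  # degenerate case: both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Toy example: a judge matches human consensus on 9 of 10 binary safety labels.
human = ["safe"] * 5 + ["unsafe"] * 5
judge = ["safe"] * 5 + ["unsafe"] * 4 + ["safe"]
print(round(cohens_kappa(human, judge), 2))  # 0.8
```

Note that kappa discounts agreement expected by chance, which is why it is preferred over raw percent agreement for the imbalanced label distributions typical of safety ratings.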

If this is right

  • Safety testing of additional LLMs can proceed at scale without repeated large-scale clinician panels.
  • High-risk responses can be automatically flagged before reaching users in production systems.
  • The criteria enable standardized, comparable safety benchmarks across different models.
  • Human clinical review effort can shift from routine scoring to resolving only ambiguous cases.
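The automatic-flagging bullet above amounts to a majority-vote jury over per-criterion judge labels. A minimal sketch, assuming hypothetical criterion names and a conservative tie-break toward human review (neither detail is stated in the paper):

```python
from collections import Counter

def jury_verdict(votes):
    """Majority vote over judge labels; ties resolved as 'unsafe' (assumption)."""
    top = Counter(votes).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "unsafe"  # break ties toward flagging for human review
    return top[0][0]

# Hypothetical per-criterion votes from three judge models on one response.
votes_by_criterion = {
    "reinforces_delusion": ["unsafe", "unsafe", "safe"],
    "redirects_to_care":   ["safe", "safe", "safe"],
}
flags = {c: jury_verdict(v) for c, v in votes_by_criterion.items()}
print(flags)  # {'reinforces_delusion': 'unsafe', 'redirects_to_care': 'safe'}
```

In a production gate, any criterion voted "unsafe" would block or escalate the response before it reaches the user.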

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar criteria and judge setups could be developed for LLM safety in other mental health conditions such as severe depression or anxiety.
  • Fine-tuning the judge models on more clinician-labeled data might raise agreement above the reported 0.75 kappa.
  • Embedding these evaluators in live mental health chat tools could reduce the frequency of harmful reinforcement of psychotic symptoms.

Load-bearing premise

The seven clinician-informed safety criteria comprehensively and reliably capture the clinically relevant risks of LLM interactions with users demonstrating psychosis.

What would settle it

A new set of LLM responses to psychosis-indicative prompts rated independently by clinicians shows the LLM judge achieving kappa below 0.5 with that fresh consensus.

Figures

Figures reproduced from arXiv: 2604.02359 by Andreea Damien, Elizabeth Stade, Jacob Haimes, Markela Zeneli, May Lynn Reese, Mindy Ng.

Figure 1. Criterion-specific reliability (Cohen's Kappa) between human consensus and Gemini, Qwen
Figure 2. Criterion-specific reliability (Cohen's Kappa) between human consensus and Jury of 3
read the original abstract

General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using an LLM as an evaluator (LLM-as-a-Judge) or taking the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human} \times \text{gemini}} = 0.75$, $\kappa_{\text{human} \times \text{qwen}} = 0.68$, $\kappa_{\text{human} \times \text{kimi}} = 0.56$) and that the best judge slightly outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops and validates seven clinician-informed safety criteria for LLM responses to users with psychosis, constructs a human-consensus labeled dataset, and tests LLM-as-a-Judge (single models) versus LLM-as-a-Jury (majority vote), reporting alignment with human labels via Cohen's kappa (0.75 for Gemini, 0.68 for Qwen, 0.56 for Kimi) and noting that the best single judge slightly outperforms the jury (kappa 0.74).

Significance. If the criteria are shown to be comprehensive, the concrete kappa values and independent human consensus labels would support a scalable alternative to purely manual clinical review for LLM safety in mental health contexts, addressing a gap in validated, automatable evaluation methods.

major comments (2)
  1. [Criteria development and validation] Criteria development section: the claim of 'clinically-validated' evaluations depends on the seven criteria comprehensively capturing relevant psychosis-related risks (e.g., delusion reinforcement). The manuscript reports clinician input and consensus on these criteria but provides no literature mapping, missed-case analysis, or external review demonstrating exhaustiveness versus a convenient subset; high kappas therefore only validate agreement on the chosen subset.
  2. [Methods and dataset construction] Methods and dataset construction: the abstract and results report kappas without stating dataset size, response sampling procedure, or inter-clinician agreement during criteria validation. These omissions prevent assessment of label reliability and generalizability of the reported alignment (e.g., whether the 0.75 kappa holds on a representative sample).
minor comments (2)
  1. [Abstract] Abstract: explicitly state dataset size and sampling details to support evaluation of the kappa results.
  2. [Results] Results: clarify whether LLM judges received identical criteria definitions and examples as the clinicians.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Criteria development and validation] Criteria development section: the claim of 'clinically-validated' evaluations depends on the seven criteria comprehensively capturing relevant psychosis-related risks (e.g., delusion reinforcement). The manuscript reports clinician input and consensus on these criteria but provides no literature mapping, missed-case analysis, or external review demonstrating exhaustiveness versus a convenient subset; high kappas therefore only validate agreement on the chosen subset.

    Authors: We appreciate the referee's emphasis on distinguishing between clinician-informed criteria and fully exhaustive clinical validation. The seven criteria were derived through iterative discussions with two clinicians specializing in psychosis treatment, targeting core risks including delusion reinforcement, hallucination encouragement, and inadequate redirection to professional care. While the manuscript does not include a systematic literature mapping or missed-case analysis, the criteria reflect consensus on the most immediate safety concerns for this population. The reported alignment metrics demonstrate that LLMs can reliably apply these specific criteria in line with human experts. We will revise the criteria development section and abstract to clarify that the criteria are clinician-informed rather than claiming comprehensive exhaustiveness, and we will add an explicit limitations paragraph acknowledging the absence of a full literature review or external validation study. revision: partial

  2. Referee: [Methods and dataset construction] Methods and dataset construction: the abstract and results report kappas without stating dataset size, response sampling procedure, or inter-clinician agreement during criteria validation. These omissions prevent assessment of label reliability and generalizability of the reported alignment (e.g., whether the 0.75 kappa holds on a representative sample).

    Authors: We agree that these methodological details are essential for evaluating reliability and should have been included. We will expand the methods section to report the dataset size, the response sampling procedure (including how prompts were generated and selected), and the inter-clinician agreement statistics obtained during the consensus labeling process. These additions will also be reflected in the abstract and results to support assessment of generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct empirical agreement metrics against independent human consensus labels

full rationale

The paper's core results consist of Cohen's kappa values measuring alignment between LLM-as-a-Judge outputs and a separately constructed human-consensus dataset on seven clinician-informed criteria. These kappas are computed as straightforward inter-rater agreement statistics and do not involve any parameter fitting, self-referential definitions, or predictions that reduce by construction to the paper's own inputs. Criteria development draws on external clinician input, but the validation chain remains open to independent human labels rather than closing on internal consistency or self-citation. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the clinical validity of the seven new safety criteria and the assumption that the constructed human-consensus dataset is representative of real user interactions.

axioms (1)
  • standard math Cohen's kappa is an appropriate and sufficient measure of agreement between LLM judges and human consensus labels
    Used throughout to quantify alignment; standard in inter-rater reliability studies.
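For reference, the statistic named in the axiom corrects raw agreement for chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where $p_o$ is the observed fraction of items on which judge and human consensus agree, and $p_e$ is the agreement expected if the two raters labeled independently according to their own label marginals. On the commonly used Landis–Koch scale, the reported $\kappa = 0.75$ falls in the "substantial" band (0.61–0.80), just below "almost perfect" (0.81–1.00).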

pith-pipeline@v0.9.0 · 5586 in / 1184 out tokens · 43339 ms · 2026-05-15T09:02:39.237104+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

    cs.LG 2026-04 unverdicted novelty 6.0

    A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Leveraging llms for mental health: Detection and recommendations from social discussions, March 2025

    Vaishali Aggarwal, Sachin Thukral, Krushil Patel, and Arnab Chatterjee. Leveraging llms for mental health: Detection and recommendations from social discussions, March 2025. URL http://arxiv.org/ abs/2503.01442. arXiv:2503.01442 [cs]

  2. [2]

    Lawsuit: A chatbot hinted a kid should kill his parents over screen time limits, December

    Bobby Allyn. Lawsuit: A chatbot hinted a kid should kill his parents over screen time limits, December

  3. [3]

    URL https://www.npr.org/2024/12/10/nx-s1-5222574/kids-character-ai-lawsuit . NPR

  4. [4]

    American Psychiatric Publishing, Washington, DC, 5th edition, 2013

    American Psychiatric Association.Diagnostic and Statistical Manual of Mental Disorders (DSM -5). American Psychiatric Publishing, Washington, DC, 5th edition, 2013. ISBN 978-0890425558. URLhttps: //doi.org/10.1176/appi.books.9780890425596. Standard reference for psychiatric diagnoses

  5. [5]

    Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, May

  6. [6]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    URLhttp://arxiv.org/abs/2505.08775. arXiv:2505.08775 [cs]

  7. [7]

    Reference-guided verdict: Llms-as-judges in automatic evaluation of free-form text.arXiv preprint arXiv:2408.09235, 2024

    Sher Badshah and Hassan Sajjad. Reference-guided verdict: Llms-as-judges in automatic evaluation of free-form text.arXiv preprint arXiv:2408.09235, 2024

  8. [8]

    Bosch, and Emiel Krahmer

    Erkan Basar, Xin Sun, Iris Hendrickx, Jan de Wit, Tibor Bosse, Gert-Jan De Bruijn, Jos A. Bosch, and Emiel Krahmer. How well can large language models reflect? a human evaluation of LLM-generated reflections for motivational interviewing dialogues. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert,...

  9. [9]

    Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi, Asad Aali, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, ...

  10. [10]

    Exploring the efficacy of robotic assistants with chatgpt and claude in enhancing adhd therapy: Innovating treatment paradigms

    Santiago Berrezueta-Guzman, Mohanad Kandil, María-Luisa Martín-Ruiz, Iván Pau de la Cruz, and Stephan Krusche. Exploring the efficacy of robotic assistants with chatgpt and claude in enhancing adhd therapy: Innovating treatment paradigms. In2024 International Conference on Intelligent Environments (IE), pages 25–32, 2024. doi: 10.1109/IE61493.2024.10599903

  11. [11]

    Springer Nature, 2018

    Lisa Bortolotti.Delusions in context. Springer Nature, 2018

  12. [12]

    Ai chatbots for mental health: A scoping review of effectiveness, feasibility, and applications.Applied Sciences, 14 (13):5889, July 2024

    Mirko Casu, Sergio Triscari, Sebastiano Battiato, Luca Guarnera, and Pasquale Caponnetto. Ai chatbots for mental health: A scoping review of effectiveness, feasibility, and applications.Applied Sciences, 14 (13):5889, July 2024. ISSN 2076-3417. doi: 10.3390/app14135889

  13. [13]

    Correll, Corine Sau Man Wong, Ryan Sai Ting Chu, Vivian Shi Cheng Fung, Gabbie Hou Sem Wong, Janet Hiu Ching Lei, and Wing Chung Chang

    Joe Kwun Nam Chan, Christoph U. Correll, Corine Sau Man Wong, Ryan Sai Ting Chu, Vivian Shi Cheng Fung, Gabbie Hou Sem Wong, Janet Hiu Ching Lei, and Wing Chung Chang. Life expectancy and years of potential life lost in people with mental disorders: a systematic review and meta-analysis. eClinicalMedicine, 65, November 2023. ISSN 2589-5370. doi: 10.1016/j...

  14. [14]

    Humans or llms as the judge? a study on judgement biases, September 2024

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases, September 2024. URL http://arxiv.org/abs/2402.10669. arXiv:2402.10669 [cs]

  15. [15]

    To chat or bot to chat: Ethical issues with using chatbots in mental health.DIGITAL HEALTH, 9: 20552076231183542, January 2023

    Simon Coghlan, Kobi Leins, Susie Sheldrick, Marc Cheong, Piers Gooding, and Simon D’Alfonso. To chat or bot to chat: Ethical issues with using chatbots in mental health.DIGITAL HEALTH, 9: 20552076231183542, January 2023. ISSN 2055-2076, 2055-2076. doi: 10.1177/20552076231183542

  16. [16]

    Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge.medRxiv, pages 2025–04, 2025

    Emma Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, John Caskey, Madeline Oguss, Graham Wills, Guanhua Chen, Dmitriy Dligach, et al. Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge.medRxiv, pages 2025–04, 2025

  17. [17]

    Corey Curran, Nafis Neehal, Keerthiram Murugesan, and Kristin P. Bennett. Examining trustworthiness of llm-as-a-judge systems in a clinical trial design benchmark. In2024 IEEE International Confer- ence on Big Data (BigData), page 4627–4631, Washington, DC, USA, December 2024. IEEE. ISBN 9798350362480. doi: 10.1109/BigData62323.2024.10825592. URL https://...

  18. [18]

    LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA

    Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Megha Shah, and Joyce Ho. LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors,Proceedings of the 24th Workshop on Biomedic...

  19. [19]

    Attacks, defenses and evaluations for llm conversation safety: A survey.arXiv preprint arXiv:2402.09283, 2024

    Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for llm conversation safety: A survey.arXiv preprint arXiv:2402.09283, 2024

  20. [20]

    Cong Doanh Duong, Thanh Tung Dao, Trong Nghia Vu, Thi Viet Nga Ngo, and Quang Yen Tran. Compulsive chatgpt usage, anxiety, burnout, and sleep disturbance: A serial mediation model based on stimulus-organism-response perspective.Acta Psychologica, 251:104622, November 2024. ISSN 00016918. doi: 10.1016/j.actpsy.2024.104622

  21. [21]

    Liu, Valdemar Danry, Eunhae Lee, Samantha W

    Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranuta- porn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, and Sandhini Agarwal. How ai and human behaviors shape psychosocial effects of chatbot use: A longitudinal randomized controlled study, March

  22. [22]

    Liu, Valdemar Danry, Eunhae Lee, Samantha W

    URLhttp://arxiv.org/abs/2503.17473. arXiv:2503.17473 [cs]

  23. [23]

    Appraising the performance of chatgpt in psychiatry using 100 clinical case vignettes.Asian Journal of Psychiatry, 89: 103770, November 2023

    Russell Franco D’Souza, Shabbir Amanullah, Mary Mathew, and Krishna Mohan Surapaneni. Appraising the performance of chatgpt in psychiatry using 100 clinical case vignettes.Asian Journal of Psychiatry, 89: 103770, November 2023. ISSN 18762018. doi: 10.1016/j.ajp.2023.103770

  24. [24]

    Evaluating generative ai responses to real-world drug-related questions.Psychiatry research, 339:116058, 2024

    Salvatore Giorgi, Kelsey Isman, Tingting Liu, Zachary Fried, Joao Sedoc, and Brenda Curtis. Evaluating generative ai responses to real-world drug-related questions.Psychiatry research, 339:116058, 2024. 11

  25. [25]

    The framework for ai tool assessment in mental health (faita- mental health): a scale for evaluating ai-powered mental health tools.World Psychiatry, 23(3):444, 2024

    Ashleigh Golden and Elias Aboujaoude. The framework for ai tool assessment in mental health (faita- mental health): a scale for evaluating ai-powered mental health tools.World Psychiatry, 23(3):444, 2024

  26. [26]

    Risks from language models for automated mental healthcare: Ethics and structure for implementation.medRxiv, 2024

    Declan Grabb, Max Lamparth, and Nina Vasan. Risks from language models for automated mental healthcare: Ethics and structure for implementation.medRxiv, 2024. doi: 10.1101/2024.04.07.24305462. URLhttps://www.medrxiv.org/content/early/2024/04/08/2024.04.07.24305462

  27. [27]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, March 2025. URL http://arxiv.org/abs/2411.15594. arXiv:2411.15594 [cs]

  28. [28]

    Soullmate: An adaptive llm-driven system for advanced mental health support and assessment, based on a systematic application survey, October 2024

    Qiming Guo, Jinwen Tang, Wenbo Sun, Haoteng Tang, Yi Shang, and Wenlu Wang. Soullmate: An adaptive llm-driven system for advanced mental health support and assessment, based on a systematic application survey, October 2024. URL http://arxiv.org/abs/2410.11859. arXiv:2410.11859 [cs]

  29. [29]

    it listens better than my therapist

    Anna-Carolina Haensch. “it listens better than my therapist”: Exploring social media discourse on llms as mental health tool, April 2025. URLhttp://arxiv.org/abs/2504.12337. arXiv:2504.12337 [cs]

  30. [30]

    Medsafetybench: Evaluating and improving the medical safety of large language models

    Tessa Han, Aounon Kumar, Chirag Agarwal, and Himabindu Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Glober- son, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems, volume 37, page 33423–33454. Curran Asso- ciates, Inc., ...

  31. [31]

    A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, April 2025

    Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, April 2025. ISSN 2398-6352. doi: 10.1038/s41746-025-01611-4

  32. [32]

    Shunsen Huang, Xiaoxiong Lai, Li Ke, Yajun Li, Huanlei Wang, Xinmei Zhao, Xinran Dai, and Yun Wang. Ai technology panic—is ai dependence bad for mental health? a cross-lagged panel model and the mediating roles of motivations for ai use among adolescents.Psychology Research and Behavior Management, V olume 17:1087–1102, March 2024. ISSN 1179-1578. doi: 10...

  33. [33]

    Intima: A benchmark for human-ai companionship behavior, August 2025

    Lucie-Aimée Kaffee, Giada Pistilli, and Yacine Jernite. Intima: A benchmark for human-ai companionship behavior, August 2025. URLhttp://arxiv.org/abs/2508.09998. arXiv:2508.09998 [cs]

  34. [34]

    Mental health app evaluation: updating the american psychiatric association’s framework through a stakeholder-engaged workshop.Psychiatric Services, 72(9):1095–1098, 2021

    Sarah Lagan, Margaret R Emerson, Darlene King, Sonia Matwin, Steven R Chan, Stephon Proctor, Julia Tartaglia, Karen L Fortuna, Patrick Aquino, Robert Walker, et al. Mental health app evaluation: updating the american psychiatric association’s framework through a stakeholder-engaged workshop.Psychiatric Services, 72(9):1095–1098, 2021

  35. [35]

    Psy-llm: Scaling up global mental health psychological services with ai-based large language models, September 2023

    Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, and Ziqi Wang. Psy-llm: Scaling up global mental health psychological services with ai-based large language models, September 2023. URL http://arxiv.org/abs/2307.11991. arXiv:2307.11991 [cs]

  36. [36]

    Cognitive behavioral therapy for psychosis (cbtp): An introductory manual for clinicians

    Yulia Landa. Cognitive behavioral therapy for psychosis (cbtp): An introductory manual for clinicians. Technical report, Mental Illness Research, Education and Clinical Centers (MIRECC) at the James J. Peters V A Medical Center, 2017. URL https://www.mirecc.va.gov/visn2/docs/CBTp_Manual_ VA_Yulia_Landa_2017.pdf. V A Medical Center

  37. [37]

    J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, March 1977. ISSN 0006-341X. Research Support, U.S. Gov’t, Non -P.H.S.; Research Support, U.S. Gov’t, P.H.S

  38. [38]

    Improving automatic evaluation of large language models (LLMs) in biomedical relation extraction via LLMs-as-the-judge

    Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, and Jimmy Huang. Improving automatic evaluation of large language models (LLMs) in biomedical relation extraction via LLMs-as-the-judge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Assoc...

  39. [39]

    The opportunities and risks of large language models in mental health.JMIR Mental Health, 11:e59479–e59479, July 2024

    Hannah R Lawrence, Renee A Schneider, Susan B Rubin, Maja J Matari´c, Daniel J McDuff, and Megan Jones Bell. The opportunities and risks of large language models in mental health.JMIR Mental Health, 11:e59479–e59479, July 2024. ISSN 2368-7959. doi: 10.2196/59479. 12

  40. [40]

    Chain of risks evaluation (core): A framework for safer large language models in public mental health.Psychiatry and Clinical Neurosciences, 79(6):299–305, 2025

    Lingyu Li, Shuqi Kong, Haiquan Zhao, Chunbo Li, Yan Teng, and Yingchun Wang. Chain of risks evaluation (core): A framework for safer large language models in public mental health.Psychiatry and Clinical Neurosciences, 79(6):299–305, 2025

  41. [41]

    Leveraging llms as meta-judges: A multi-agent framework for evaluating llm judgments.arXiv preprint arXiv:2504.17087, 2025

    Yuran Li, Jama Hussein Mohamud, Chongren Sun, Di Wu, and Benoit Boulet. Leveraging llms as meta-judges: A multi-agent framework for evaluating llm judgments.arXiv preprint arXiv:2504.17087, 2025

  42. [42]

    Liu, Pat Pataranutaporn, and Pattie Maes

    Auren R. Liu, Pat Pataranutaporn, and Pattie Maes. Chatbot companionship: A mixed-methods study of companion chatbot usage patterns and their relationship to loneliness in active users, August 2025. URL http://arxiv.org/abs/2410.21596. arXiv:2410.21596 [cs]

  43. [43]

    Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support

    Zilin Ma, Yiyang Mei, and Zhaoyuan Su. Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. InAMIA Annual Symposium Proceedings, volume 2023, page 1105, 2024

  44. [44]

    McGrath, Sukanta Saha, Ali Al-Hamzawi, Jordi Alonso, Evelyn J

    John J. McGrath, Sukanta Saha, Ali Al-Hamzawi, Jordi Alonso, Evelyn J. Bromet, Ronny Bruffaerts, José Miguel Caldas-de Almeida, Wai Tat Chiu, Peter De Jonge, John Fayyad, Silvia Florescu, Oye Gureje, Josep Maria Haro, Chiyi Hu, Viviane Kovess-Masfety, Jean Pierre Lepine, Carmen C. W. Lim, Maria Elena Medina Mora, Fernando Navarro-Mateu, Susana Ochoa, Nanc...

  45. [45]

    Ong, and Nick Haber

    Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber. Expressing stigma and inappropriate responses prevents llms from safely replacing mental health providers. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, page 599–627, Athens Greece, June 2025. ACM. ISBN 97984...

  46. [46]

    Prevalence of psychotic disorders and its association with methodological issues

    Berta Moreno-Küstner, Carlos Martín, and Loly Pastor. Prevalence of psychotic disorders and its association with methodological issues. a systematic review and meta-analyses.PLOS ONE, 13(4):e0195687, April

  47. [47]

    doi: 10.1371/journal.pone.0195687

    ISSN 1932-6203. doi: 10.1371/journal.pone.0195687

  48. [48]

    Delusions by design? how everyday ais might be fuelling psychosis (and what can be done about it)

    Hamilton Morrin, Luke Nicholls, Michael Levin, Jenny Yiend, Udita Iyengar, Francesca DelGuidice, Sagnik Bhattacharya, Stefania Tognin, James MacCabe, Ricardo Twumasi, Ben Alderson-Day, and {Thomas A.} Pollak. Delusions by design? how everyday ais might be fuelling psychosis (and what can be done about it). Workingpaper, PsyArXiv, July 2025

  49. [49]

    A. G. Nevarez-Flores, K. Sanderson, M. Breslin, V . J. Carr, V . A. Morgan, and A. L. Neil. Systematic review of global functioning and quality of life in people with psychotic disorders.Epidemiology and Psychiatric Sciences, 28(1):31–44, 2019. doi: 10.1017/S2045796018000549

  50. [50]

    Sycophancy in gpt-4o: what happened and what we’re doing about it, apr 2025

    OpenAI. Sycophancy in gpt-4o: what happened and what we’re doing about it, apr 2025. URL https: //openai.com/index/sycophancy-in-gpt-4o/. OpenAI Blog

  51. [51]

    Bounds, Angela Jun, Jaesu Han, Robert M

    Jung In Park, Mahyar Abbasian, Iman Azimi, Dawn T. Bounds, Angela Jun, Jaesu Han, Robert M. McCarron, Jessica Borelli, Parmida Safavi, Sanaz Mirbaha, Jia Li, Mona Mahmoudi, Carmen Wiedenhoeft, and Amir M. Rahmani. Building trust in mental health chatbots: Safety metrics and llm-based evaluation tools, 2025. URLhttps://arxiv.org/abs/2408.04650

  52. [52]

    An ai chatbot pushed a teen to kill himself, a lawsuit against its creator alleges, oct 2024

    Kate Payne. An ai chatbot pushed a teen to kill himself, a lawsuit against its creator alleges, oct 2024. URL https://apnews.com/article/ chatbot-ai-lawsuit-suicide-teen-artificial-intelligence-9d48adc572100822fdbc3c90d1456bd0 . AP News

  53. [53]

    Perlis, Joseph F

    Roy H. Perlis, Joseph F. Goldberg, Michael J. Ostacher, and Christopher D. Schneck. Clinical decision support for bipolar depression using large language models.Neuropsychopharmacology, 49(9):1412–1416, August 2024. ISSN 1740-634X. doi: 10.1038/s41386-024-01841-2

  54. [54]

    Tony Rousmaniere, Yimeng Zhang, Xu Li, and Siddharth Shah. Large language models as mental health resources: Patterns of use in the united states, March 2025. URL https://osf.io/q8m7g_v1

  55. [55]

    Marcin Rządeczka, Anna Sterna, Julia Stolińska, Paulina Kaczyńska, and Marcin Moskalewicz. The efficacy of conversational ai in rectifying the theory-of-mind and autonomy biases: Comparative analysis. JMIR Ment Health, 12:e64396, February 2025. ISSN 2368-7959. doi: 10.2196/64396

  56. [56]

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023

  57. [57]

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge, April 2025. URL http://arxiv.org/abs/2406.07791. arXiv:2406.07791 [cs]

  58. [58]

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas,...

  59. [59]

    Sophia Spallek, Louise Birrell, Stephanie Kershaw, Emma Krogh Devine, and Louise Thornton. Can we use chatgpt for mental health and substance use education? examining its quality and potential harms. JMIR Medical Education, 9:e51243, November 2023. ISSN 2369-3762. doi: 10.2196/51243

  60. [60]

    Elizabeth C. Stade, Johannes C. Eichstaedt, Jane P. Kim, and Shannon Wiltsey Stirman. Readiness evaluation for artificial intelligence-mental health deployment and implementation (readi): A review and proposed framework. Technology, Mind, and Behavior, 6(2), April 2025. ISSN 2689-0208. doi: 10.1037/tmb0000163. URL https://tmb.apaopen.org/pub/8gyddorx

  61. [61]

    Elizabeth C Stade, Zoe M Tait, Samuel T Campione, and Shannon Wiltsey Stirman. Current real-world use of large language models for mental health, 2025

  62. [62]

    Melanie Subbiah, Sean Zhang, Lydia B. Chilton, and Kathleen McKeown. Reading subtext: Evaluating large language models on short story summarization with writers. Transactions of the Association for Computational Linguistics, 12:1290–1310, October 2024. ISSN 2307-387X. doi: 10.1162/tacl_a_00702

  63. [63]

    Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A. Metoyer. Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks. In Proceedings of the 30th International Conference on Intelligent User Interfaces, page 952–966, Cagliari Italy, March 2025. ACM. ISBN 9798400713064. ...

  64. [64]

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. In Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, Eliya Habba, Itay Itzhak, Simon Mille, Yotam Perlitz, Enrico Santus, João Sedoc, Michal Shm...

  65. [65]

    Xiaoyu Tong, Rochelle Choenni, Martha Lewis, and Ekaterina Shutova. Metaphor understanding challenge dataset for LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3517–3536, Bangkok, Thailand, August 2024. Association for C...

  66. [66]

    Octavian Vasiliu. Therapeutic management of schizophrenia and substance use disorders dual diagnosis - clinical vignettes. Romanian Journal of Military Medicine, 121(2):26–34, 2018

  67. [67]

    Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

  68. [68]

    Ni Wayan Surya Wardhani, Masithoh Yessi Rochayani, Atiek Iriany, Agus Dwi Sulistyono, and Prayudi Lestantyo. Cross-validation metrics for evaluating classification performance on imbalanced data. In 2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA), pages 14–18, 2019. doi: 10.1109/IC3INA48034.2019.8949568

  70. [70]

    Barry Wright, Subodh Dave, and Nisha Dogra. 100 cases in psychiatry. CRC Press, 2017

  71. [71]

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V. Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge, October 2024. URL http://arxiv.org/abs/2410.02736. arXiv:2410.02736 [cs]

  72. [72]

    Sen-Chi Yu, Hong-Ren Chen, and Yu-Wen Yang. Development and validation the problematic chatgpt use scale: a preliminary report. Current Psychology, 43(31):26080–26092, August 2024. ISSN 1046-1310, 1936-4733. doi: 10.1007/s12144-024-06259-z

  73. [73]

    Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, Kui Ren, and Jingyi Wang. S-eval: Towards automated and comprehensive safety evaluation for large language models. Proc. ACM Softw. Eng., 2(ISSTA), June 2025. doi: 10.1145/3728971. URL https://doi.org/10.1145/3728971