Recognition: 2 theorem links
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
Pith reviewed 2026-05-15 09:02 UTC · model grok-4.3
The pith
LLM judges match human clinicians at up to 0.75 kappa when rating the safety of responses to users demonstrating psychosis
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seven clinician-informed safety criteria were defined to assess LLM responses to prompts indicating psychosis, and a human-consensus dataset was created from multiple clinician ratings. Testing LLM-as-a-Judge and LLM-as-a-Jury setups, the best single judge (Gemini) achieved a Cohen's kappa of 0.75 with the human consensus, slightly above the jury at 0.74, with Qwen at 0.68 and Kimi at 0.56. This demonstrates that automated LLM evaluators can serve as reliable, scalable proxies for clinical safety assessment in this setting.
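For reference, Cohen's kappa is chance-corrected agreement between two raters: $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed proportion of items on which the raters agree and $p_e = \sum_c p_{1,c}\,p_{2,c}$ is the agreement expected by chance from the raters' marginal label frequencies. On the commonly used Landis-Koch scale, the reported 0.75 sits in the substantial-agreement band (0.61 to 0.80).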
What carries the argument
The seven clinician-informed safety criteria that operationalize risks such as reinforcing delusions or hallucinations; responses are scored against these criteria by single LLM judges or majority-vote LLM juries and compared with human-consensus labels via Cohen's kappa.
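A minimal sketch of this scoring loop in Python, assuming binary per-criterion safety labels; the label lists and agreement numbers below are illustrative placeholders, not the paper's data or judge outputs:

from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-response safety labels (1 = safe, 0 = unsafe) on one criterion.
human_consensus = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = {
    "gemini": [1, 0, 1, 1, 0, 1, 0, 1, 1, 1],
    "qwen":   [1, 0, 1, 0, 0, 1, 0, 1, 1, 1],
    "kimi":   [1, 1, 1, 0, 0, 1, 0, 1, 0, 1],
}

# LLM-as-a-Judge: chance-corrected agreement of each single judge with humans.
for name, labels in judge_labels.items():
    print(name, round(cohen_kappa_score(human_consensus, labels), 2))

# LLM-as-a-Jury: majority vote across the judges, then the same kappa.
jury = [Counter(votes).most_common(1)[0][0] for votes in zip(*judge_labels.values())]
print("jury", round(cohen_kappa_score(human_consensus, jury), 2))

With ordinal severity scales instead of binary labels, a weighted kappa would be the natural substitute for the unweighted statistic shown here.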
If this is right
- Safety testing of additional LLMs can proceed at scale without repeated large-scale clinician panels.
- High-risk responses can be automatically flagged before reaching users in production systems (a gating sketch follows this list).
- The criteria enable standardized, comparable safety benchmarks across different models.
- Human clinical review effort can shift from routine scoring to resolving only ambiguous cases.
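A hypothetical gating sketch for the flagging point above; judge_response is a stub standing in for a real judge-model call, and the criteria names are placeholders for the paper's seven:

from dataclasses import dataclass

# Illustrative subset of criteria names; the paper defines seven clinician-informed ones.
CRITERIA = ["delusion_reinforcement", "hallucination_encouragement", "care_redirection"]

@dataclass
class Verdict:
    criterion: str
    safe: bool
    rationale: str

def judge_response(response: str) -> list[Verdict]:
    # Stub standing in for a real LLM judge call; a deployment would prompt the
    # judge once per criterion and parse its verdict and rationale.
    return [Verdict(c, True, "stub rationale") for c in CRITERIA]

def release_or_escalate(response: str) -> str:
    # Release only if every criterion is judged safe; otherwise hold the
    # response and route it to human clinical review.
    verdicts = judge_response(response)
    if all(v.safe for v in verdicts):
        return response
    return "ESCALATED_TO_HUMAN_REVIEW"

print(release_or_escalate("I understand that feels very real and frightening."))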
Where Pith is reading between the lines
- Similar criteria and judge setups could be developed for LLM safety in other mental health conditions such as severe depression or anxiety.
- Fine-tuning the judge models on more clinician-labeled data might raise agreement above the reported 0.75 kappa.
- Embedding these evaluators in live mental health chat tools could reduce the frequency of harmful reinforcement of psychotic symptoms.
Load-bearing premise
The seven clinician-informed safety criteria comprehensively and reliably capture the clinically relevant risks of LLM interactions with users demonstrating psychosis.
What would settle it
A new set of LLM responses to psychosis-indicative prompts rated independently by clinicians shows the LLM judge achieving kappa below 0.5 with that fresh consensus.
Figures
Original abstract
General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using an LLM as an evaluator (LLM-as-a-Judge) or taking the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human} \times \text{gemini}} = 0.75$, $\kappa_{\text{human} \times \text{qwen}} = 0.68$, $\kappa_{\text{human} \times \text{kimi}} = 0.56$) and that the best judge slightly outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops and validates seven clinician-informed safety criteria for LLM responses to users with psychosis, constructs a human-consensus labeled dataset, and tests LLM-as-a-Judge (single models) versus LLM-as-a-Jury (majority vote), reporting alignment with human labels via Cohen's kappa (0.75 for Gemini, 0.68 for Qwen, 0.56 for Kimi) and noting that the best single judge slightly outperforms the jury (kappa 0.74).
Significance. If the criteria are shown to be comprehensive, the concrete kappa values and independent human consensus labels would support a scalable alternative to purely manual clinical review for LLM safety in mental health contexts, addressing a gap in validated, automatable evaluation methods.
major comments (2)
- [Criteria development and validation] The claim of 'clinically-validated' evaluations depends on the seven criteria comprehensively capturing relevant psychosis-related risks (e.g., delusion reinforcement). The manuscript reports clinician input and consensus on these criteria but provides no literature mapping, missed-case analysis, or external review demonstrating exhaustiveness versus a convenient subset; high kappas therefore only validate agreement on the chosen subset.
- [Methods and dataset construction] The abstract and results report kappas without stating dataset size, response sampling procedure, or inter-clinician agreement during criteria validation. These omissions prevent assessment of label reliability and generalizability of the reported alignment (e.g., whether the 0.75 kappa holds on a representative sample).
minor comments (2)
- [Abstract] Explicitly state dataset size and sampling details to support evaluation of the kappa results.
- [Results] Clarify whether the LLM judges received the same criteria definitions and examples as the clinicians (one way to make this auditable is sketched below).
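One way to settle that minor point, sketched with a hypothetical rubric entry: render both the clinician instructions and the judge prompt from a single criteria document, so no rater sees a different definition. The rubric text and prompt format here are placeholders, not the paper's:

# Single source of truth for one hypothetical criterion; the paper's actual
# definitions and anchor examples would live here instead.
RUBRIC = {
    "delusion_reinforcement": {
        "definition": "The response affirms or elaborates a delusional belief.",
        "safe_example": "Gently questions the belief and suggests professional help.",
        "unsafe_example": "Agrees the user is being surveilled and offers advice.",
    }
}

def judge_prompt(criterion: str, response: str) -> str:
    # Render the judge prompt from the same rubric text the clinicians rated with.
    entry = RUBRIC[criterion]
    return (
        f"Criterion: {criterion}\n"
        f"Definition: {entry['definition']}\n"
        f"Safe example: {entry['safe_example']}\n"
        f"Unsafe example: {entry['unsafe_example']}\n"
        f"Response to rate:\n{response}\n"
        "Answer 'safe' or 'unsafe'."
    )

print(judge_prompt("delusion_reinforcement", "You might be right about the implant."))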
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [Criteria development and validation] The claim of 'clinically-validated' evaluations depends on the seven criteria comprehensively capturing relevant psychosis-related risks (e.g., delusion reinforcement). The manuscript reports clinician input and consensus on these criteria but provides no literature mapping, missed-case analysis, or external review demonstrating exhaustiveness versus a convenient subset; high kappas therefore only validate agreement on the chosen subset.
Authors: We appreciate the referee's emphasis on distinguishing between clinician-informed criteria and fully exhaustive clinical validation. The seven criteria were derived through iterative discussions with two clinicians specializing in psychosis treatment, targeting core risks including delusion reinforcement, hallucination encouragement, and inadequate redirection to professional care. While the manuscript does not include a systematic literature mapping or missed-case analysis, the criteria reflect consensus on the most immediate safety concerns for this population. The reported alignment metrics demonstrate that LLMs can reliably apply these specific criteria in line with human experts. We will revise the criteria development section and abstract to clarify that the criteria are clinician-informed rather than claiming comprehensive exhaustiveness, and we will add an explicit limitations paragraph acknowledging the absence of a full literature review or external validation study. Revision: partial.
- Referee: [Methods and dataset construction] The abstract and results report kappas without stating dataset size, response sampling procedure, or inter-clinician agreement during criteria validation. These omissions prevent assessment of label reliability and generalizability of the reported alignment (e.g., whether the 0.75 kappa holds on a representative sample).
Authors: We agree that these methodological details are essential for evaluating reliability and should have been included. We will expand the methods section to report the dataset size, the response sampling procedure (including how prompts were generated and selected), and the inter-clinician agreement statistics obtained during the consensus labeling process. These additions will also be reflected in the abstract and results to support assessment of generalizability. Revision: yes.
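For the inter-clinician agreement statistics promised here, a minimal sketch with statsmodels; the rating matrix is illustrative, not the paper's data:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are responses, columns are three clinicians,
# values are binary safety labels (1 = safe, 0 = unsafe) on one criterion.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

# aggregate_raters converts per-rater labels into per-category counts per item,
# the table format fleiss_kappa expects for more than two raters.
counts, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(counts))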
Circularity Check
No circularity; results are direct empirical agreement metrics against independent human consensus labels
Full rationale
The paper's core results consist of Cohen's kappa values measuring alignment between LLM-as-a-Judge outputs and a separately constructed human-consensus dataset on seven clinician-informed criteria. These kappas are computed as straightforward inter-rater agreement statistics and do not involve any parameter fitting, self-referential definitions, or predictions that reduce by construction to the paper's own inputs. Criteria development draws on external clinician input, but the validation chain remains open to independent human labels rather than closing on internal consistency or self-citation. No load-bearing step matches any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Cohen's kappa is an appropriate and sufficient measure of agreement between LLM judges and human consensus labels
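A quick worked check on why this axiom carries weight: with imbalanced labels, raw percent agreement can look high while kappa stays low, so kappa is the more informative statistic for mostly-safe response sets. The numbers are illustrative:

from sklearn.metrics import cohen_kappa_score

# 90% of responses are safe; a judge that almost always says "safe" agrees with
# humans 91% of the time yet adds little information beyond the base rate.
human = [1] * 90 + [0] * 10
lazy_judge = [1] * 99 + [0] * 1

raw_agreement = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
print(raw_agreement)                         # 0.91 raw agreement
print(cohen_kappa_score(human, lazy_judge))  # kappa only about 0.17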
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "developing and validating seven clinician-informed safety criteria... Cohen's κ_human×gemini = 0.75"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery theorem (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "operationalize evaluation criteria for assessing the safety of LLM responses"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
  A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.