
arxiv: 2605.13318 · v2 · pith:GXOPKCXNnew · submitted 2026-05-13 · 💻 cs.AI · cs.ET

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

Pith reviewed 2026-05-20 21:49 UTC · model grok-4.3

classification: cs.AI, cs.ET
keywords: mental health AI, chatbot safety, suicidal ideation, evaluation framework, LLM judge, clinical validation, responsible AI, crisis response

The pith

VERA-MH introduces a clinically-validated evaluation to assess the safety of mental health chatbots around suicidal ideation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop VERA-MH to fill the gap in testing AI chatbots that users might turn to for mental health help. The framework first creates simulated conversations by role-playing users with personas that incorporate clinical insights on risk factors, demographics, and disclosure styles. Next, it employs an LLM as judge guided by a flow-based rubric of yes-no questions to score the chatbot's replies consistently. Finally, it aggregates scores across conversations to rate the overall safety of the model. If this approach works as intended, it gives a concrete way to spot and address unsafe responses before they reach real users.
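To make the simulation step concrete, here is a minimal sketch of a persona-driven turn-taking loop. It is an illustration under stated assumptions, not the paper's implementation: the `Persona` fields, the `simulate` function, and both callables are hypothetical names, and in practice each callable would wrap a call to a language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record; field names are assumptions."""
    name: str
    risk_level: str
    disclosure_style: str


UserModel = Callable[[Persona, List[Turn]], str]  # role-plays the user
Chatbot = Callable[[List[Turn]], str]             # system under evaluation


def simulate(persona: Persona, user_model: UserModel, chatbot: Chatbot,
             max_turns: int = 3) -> List[Turn]:
    """Alternate user and chatbot turns and return the full transcript."""
    history: List[Turn] = []
    for _ in range(max_turns):
        history.append(("user", user_model(persona, history)))
        history.append(("chatbot", chatbot(history)))
    return history


# Stub callables stand in for the two language models in this sketch.
persona = Persona("persona-a", "moderate", "indirect")
transcript = simulate(
    persona,
    user_model=lambda p, hist: f"user turn {len(hist) // 2 + 1}",
    chatbot=lambda hist: "reply to " + hist[-1][1],
    max_turns=2,
)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Keeping the user model and the chatbot as plain callables means the same loop can be reused for any chatbot under evaluation.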

Core claim

VERA-MH evaluates chatbot safety in mental health support by simulating conversations with clinically developed user personas, judging responses using an LLM-as-a-Judge and a flow-structured clinical rubric, and aggregating results to produce model ratings, with results provided for four leading LLM providers.

What carries the argument

The three-step VERA-MH process of conversation simulation using clinical personas, judging with a flow-based rubric, and result aggregation.
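A flow-based rubric of this kind can be pictured as a small decision graph in which each node asks one Yes/No question. The sketch below is purely illustrative: the three questions, the verdict labels, and the names `Node`, `RUBRIC`, and `walk_rubric` are assumptions made for the example and are not taken from the paper's clinical rubric. A canned answer table stands in for the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    """One Yes/No question plus where each answer leads."""
    question: str
    if_yes: str  # next node id, or "verdict:<label>" to stop
    if_no: str


# Hypothetical three-question flow, for illustration only.
RUBRIC: Dict[str, Node] = {
    "risk_present": Node(
        "Does the user express any suicide risk?",
        "risk_acknowledged", "verdict:not_relevant"),
    "risk_acknowledged": Node(
        "Does the chatbot acknowledge that risk?",
        "resources_offered", "verdict:unsafe"),
    "resources_offered": Node(
        "Does the chatbot offer crisis resources?",
        "verdict:safe", "verdict:needs_improvement"),
}

Judge = Callable[[str, str], bool]  # (question, transcript) -> True for Yes


def walk_rubric(transcript: str, judge: Judge,
                start: str = "risk_present"
                ) -> Tuple[str, List[Tuple[str, bool]]]:
    """Ask one question at a time; return the verdict and the path taken."""
    path: List[Tuple[str, bool]] = []
    node_id = start
    while True:
        node = RUBRIC[node_id]
        answer = judge(node.question, transcript)
        path.append((node_id, answer))
        target = node.if_yes if answer else node.if_no
        if target.startswith("verdict:"):
            return target[len("verdict:"):], path
        node_id = target


# Canned answers stand in for the LLM judge in this sketch.
canned = {
    "Does the user express any suicide risk?": True,
    "Does the chatbot acknowledge that risk?": False,
}
verdict, path = walk_rubric("(transcript text)", lambda q, _t: canned[q])
print(verdict, path)
```

Because the walk records every question and answer, the returned path shows which single question produced the verdict, which is one way a flow structure can expose failure modes.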

Load-bearing premise

Clinically developed user personas and the flow-based rubric accurately capture real-world crisis disclosure patterns and the ways chatbots fail to respond safely.

What would settle it

A direct comparison of VERA-MH's LLM judge scores with ratings given by human mental health experts on identical conversation transcripts.
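One common way to quantify such a comparison is a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between two label sequences; the choice of kappa, the function name, and the example labels are assumptions for illustration, since the page does not say which statistic the authors would use.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four transcripts, not data from the paper.
judge_labels = ["safe", "safe", "unsafe", "unsafe"]
clinician_labels = ["safe", "unsafe", "unsafe", "unsafe"]
print(cohens_kappa(judge_labels, clinician_labels))  # 0.5
```

A kappa near 1 would indicate that the LLM judge and the experts agree well beyond chance, while a value near 0 would indicate agreement no better than chance.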

Figures

Figures reproduced from arXiv: 2605.13318 by Adam M. Chekroud, Emily Van Ark, Josh Gieringer, Kate H. Bentley, Luca Belli, Matt Hawrilenko, Millard Brown, Nilu Zhao, Pradip Thachile.

Figure 1: Results of the experiments. For each dimension, the Non Relevant column is computed as a ...
Figure 2: Results of the experiments focused on Gemini models.
Figure 3: Results of the experiments focused on GPT5.X family of models.
Figure 4: Results of the experiments focused on Grok models.
Figure 5: Results of the experiments focused on Claude Opus models.
Figure 6: Results of the experiments focused on Claude Sonnet models.
Figure 7: Distribution of the conversational length of both user- and chatbot model. Users' responses ...
Original abstract

Chatbot usage has increased, including in fields for which they were never developed for--notably mental health support. To that end, we introduce Validations of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation for safety of chatbots in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, by assessing how well chatbots can responds to users that might be in crisis. VERA-MH is comprised of three steps: conversation simulation, conversation judging and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. Such user personas have been developed under clinical guidance, to make sure that, among others, multiple risk factors, demographic characteristics and disclosure factors were represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve answers' consistency and highlight models' failure modes. In the last stage, results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the result of the evaluations for four leading LLM providers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VERA-MH, a three-step framework for evaluating chatbot safety in mental health contexts with a focus on suicidal ideation risks. The steps are conversation simulation via role-playing user personas developed under clinical guidance (incorporating risk factors, demographics, and disclosure patterns), conversation judging using an LLM-as-a-Judge paired with a flow-based rubric of sequential Yes/No questions, and aggregation of results to produce model ratings. The authors apply the framework to four leading LLM providers and present the resulting evaluations.

Significance. If the clinical grounding of the personas and rubric can be substantiated with reliability and validity evidence, VERA-MH would offer a structured, reproducible method for surfacing failure modes in AI systems handling crisis disclosures, which could inform safer deployment practices and regulatory guidance in high-stakes domains.

major comments (3)
  1. [Abstract] The manuscript claims VERA-MH is 'clinically-validated' and that personas were 'developed under clinical guidance' to represent real crisis patterns, yet no inter-rater agreement statistics for the rubric, no comparison against real crisis transcripts or clinician annotations, and no external validation that aggregated LLM-as-a-Judge scores predict actual safety failures are reported. This evidence is load-bearing for the central claim that the framework reliably identifies unsafe chatbot behavior.
  2. [Conversation simulation and judging steps] The flow-based rubric is presented as improving consistency via sequential Yes/No questions, but without reported agreement metrics between the LLM judge and human clinicians or ablation tests showing that the rubric distinguishes safe from unsafe responses better than simpler alternatives, the mapping from simulated conversations to real-world risk remains unverified.
  3. [Results for the four LLM providers] The evaluations are described at a high level with no quantitative metrics (e.g., failure rates per persona category), error analysis, statistical comparisons across models, or sensitivity checks on persona variations, making it impossible to assess whether the framework produces actionable or reproducible safety signals.
minor comments (2)
  1. [Abstract] Typo in 'how well chatbots can responds to users' (should be 'respond').
  2. [General] The paper would benefit from explicit discussion of how VERA-MH relates to or improves upon prior AI safety benchmarks for conversational agents in healthcare.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed review of our manuscript on VERA-MH. We address each major comment point by point below, indicating where revisions will be made to improve clarity, evidence, and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] The manuscript claims VERA-MH is 'clinically-validated' and that personas were 'developed under clinical guidance' to represent real crisis patterns, yet no inter-rater agreement statistics for the rubric, no comparison against real crisis transcripts or clinician annotations, and no external validation that aggregated LLM-as-a-Judge scores predict actual safety failures are reported. This evidence is load-bearing for the central claim that the framework reliably identifies unsafe chatbot behavior.

    Authors: We acknowledge that the phrasing 'clinically-validated' in the abstract and introduction may overstate the empirical validation provided. The personas and rubric were developed through iterative consultation with mental health clinicians to incorporate risk factors, demographics, and disclosure patterns, but the manuscript does not include quantitative inter-rater agreement statistics or direct comparisons to real crisis transcripts. We will revise the abstract and relevant sections to use more precise language (e.g., 'developed under clinical guidance') and add a limitations subsection explicitly discussing the absence of external validation against real-world data and the ethical barriers to such comparisons.
    Revision: partial

  2. Referee: [Conversation simulation and judging steps] The flow-based rubric is presented as improving consistency via sequential Yes/No questions, but without reported agreement metrics between the LLM judge and human clinicians or ablation tests showing that the rubric distinguishes safe from unsafe responses better than simpler alternatives, the mapping from simulated conversations to real-world risk remains unverified.

    Authors: The flow-based structure was chosen to promote consistency by decomposing judgments into sequential binary decisions aligned with clinical risk assessment practices. We agree that additional evidence would strengthen this. In the revision, we will include any pilot agreement metrics between the LLM judge and clinician annotations where available, along with an ablation comparing the sequential rubric to a holistic single-prompt alternative to demonstrate its advantages in distinguishing response safety.
    Revision: yes

  3. Referee: [Results for the four LLM providers] The evaluations are described at a high level with no quantitative metrics (e.g., failure rates per persona category), error analysis, statistical comparisons across models, or sensitivity checks on persona variations, making it impossible to assess whether the framework produces actionable or reproducible safety signals.

    Authors: We recognize that the current results presentation is high-level and would benefit from greater granularity to allow readers to evaluate the framework's outputs. We will expand the results section to report quantitative failure rates broken down by persona categories, include systematic error analysis of common failure modes, add statistical comparisons across the four models, and incorporate sensitivity checks on variations in persona parameters.
    Revision: yes
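The per-persona breakdown described in the third response could be tabulated along the following lines. This is a sketch under stated assumptions: the record format, the category names, the "unsafe" label, and the use of a Wilson score interval are illustrative choices, not details reported in the paper, and the sample records are made up.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wilson_interval(failures: int, total: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def failure_rates(records: Iterable[Tuple[str, str]]
                  ) -> Dict[str, Tuple[int, int, float, Tuple[float, float]]]:
    """Map category -> (failures, total, rate, interval)."""
    counts: Dict[str, list] = defaultdict(lambda: [0, 0])
    for category, verdict in records:
        counts[category][1] += 1
        if verdict == "unsafe":  # hypothetical failure label
            counts[category][0] += 1
    return {c: (f, n, f / n, wilson_interval(f, n))
            for c, (f, n) in counts.items()}


# Made-up (persona category, verdict) pairs, not results from the paper.
records = [
    ("indirect_disclosure", "unsafe"),
    ("indirect_disclosure", "safe"),
    ("direct_disclosure", "safe"),
    ("direct_disclosure", "safe"),
]
for category, (f, n, rate, (lo, hi)) in failure_rates(records).items():
    print(f"{category}: {f}/{n} = {rate:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Reporting an interval alongside each rate makes it visible when a persona category has too few conversations to support a comparison across models.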

standing simulated objections not resolved
  • Direct comparisons against real crisis transcripts or clinician annotations on actual patient data cannot be provided due to ethical, privacy, and regulatory restrictions on accessing and using such sensitive mental health information.

Circularity Check

0 steps flagged

VERA-MH is an independent evaluation framework with no circular derivation

The paper presents VERA-MH as a three-step evaluation process (conversation simulation via personas, LLM-as-Judge with flow-based rubric, and aggregation) developed under clinical guidance. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described structure that would reduce any result to its own inputs by construction. The framework is offered as a standalone tool for assessing chatbot responses to suicidal ideation scenarios rather than a derivation whose central claim loops back to unverified assumptions within the same work. This is the expected non-finding for a methods paper that does not claim first-principles derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes clinical guidance produces representative personas and that the rubric flow improves consistency, but these are not quantified.

axioms (1)
  • Domain assumption: clinically developed personas and rubric accurately capture real crisis scenarios and failure modes.
    Stated in the abstract as the basis for the simulation and judging steps.

pith-pipeline@v0.9.0 · 5803 in / 1197 out tokens · 54956 ms · 2026-05-20T21:49:42.354967+00:00 · methodology


