VERA-MH: Validation of Ethical and Responsible AI in Mental Health

Adam M. Chekroud; Emily Van Ark; Josh Gieringer; Kate H. Bentley; Luca Belli; Matt Hawrilenko; Millard Brown; Nilu Zhao; Pradip Thachile

arxiv: 2605.13318 · v2 · pith:GXOPKCXNnew · submitted 2026-05-13 · 💻 cs.AI · cs.ET


Pith reviewed 2026-05-20 21:49 UTC · model grok-4.3

classification 💻 cs.AI cs.ET

keywords: mental health AI · chatbot safety · suicidal ideation · evaluation framework · LLM judge · clinical validation · responsible AI · crisis response


The pith

VERA-MH introduces a clinically-validated evaluation to assess the safety of mental health chatbots around suicidal ideation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop VERA-MH to fill the gap in testing AI chatbots that users might turn to for mental health help. The framework first creates simulated conversations by role-playing users with personas that incorporate clinical insights on risk factors, demographics, and disclosure styles. Next, it employs an LLM as judge guided by a flow-based rubric of yes-no questions to score the chatbot's replies consistently. Finally, it aggregates scores across conversations to rate the overall safety of the model. If this approach works as intended, it gives a concrete way to spot and address unsafe responses before they reach real users.

Core claim

VERA-MH evaluates chatbot safety in mental health support by simulating conversations with clinically developed user personas, judging responses using an LLM-as-a-Judge and a flow-structured clinical rubric, and aggregating results to produce model ratings, with results provided for four leading LLM providers.

What carries the argument

The three-step VERA-MH process of conversation simulation using clinical personas, judging with a flow-based rubric, and result aggregation.
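The three-step process can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: `Persona`, `Turn`, `simulate`, `rate_model`, and the stub responders are hypothetical names that do not come from the paper, the judge is reduced to a single safe/unsafe verdict per conversation, and the aggregation is a plain fraction of conversations judged safe, which may differ from the rating the authors actually use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Persona:
    name: str        # hypothetical identifier
    disclosure: str  # hypothetical field, e.g. "direct" or "indirect"


@dataclass
class Turn:
    speaker: str  # "user" or "chatbot"
    text: str


# A responder maps the transcript so far to the next message.
Responder = Callable[[List[Turn]], str]


def simulate(user_sim: Responder, chatbot: Responder, n_exchanges: int = 3) -> List[Turn]:
    """Step 1: alternate simulated-user and chatbot turns."""
    transcript: List[Turn] = []
    for _ in range(n_exchanges):
        transcript.append(Turn("user", user_sim(transcript)))
        transcript.append(Turn("chatbot", chatbot(transcript)))
    return transcript


def rate_model(
    personas: List[Persona],
    make_user_sim: Callable[[Persona], Responder],
    chatbot: Responder,
    judge: Callable[[List[Turn]], bool],
) -> float:
    """Steps 2 and 3: judge each conversation, then aggregate.

    Assumption: the rating is the fraction of conversations judged safe.
    """
    if not personas:
        raise ValueError("at least one persona is required")
    verdicts = [judge(simulate(make_user_sim(p), chatbot)) for p in personas]
    return sum(verdicts) / len(verdicts)


# Stand-in stubs so the sketch runs without any model API.
def make_stub_user(persona: Persona) -> Responder:
    return lambda transcript: (
        f"{persona.name} ({persona.disclosure}) message {len(transcript) // 2 + 1}"
    )


def stub_chatbot(transcript: List[Turn]) -> str:
    return "reply to: " + transcript[-1].text


def stub_judge(transcript: List[Turn]) -> bool:
    # Placeholder check only: every chatbot turn is non-empty.
    return all(t.text for t in transcript if t.speaker == "chatbot")
```

In a real run, the three stubs would be replaced by calls to the role-playing model, the chatbot under evaluation, and the LLM judge; keeping them as plain callables makes each stage swappable and testable in isolation.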

Load-bearing premise

Clinically developed user personas and the flow-based rubric accurately capture real-world crisis disclosure patterns and the ways chatbots fail to respond safely.

What would settle it

A direct comparison of VERA-MH's LLM judge scores with ratings given by human mental health experts on identical conversation transcripts.
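One common way to quantify such a comparison is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is illustrative only: the paper does not state which agreement statistic, if any, would be used, and the `"safe"`/`"unsafe"` labels are assumed for the example.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four transcripts.
llm_judge = ["safe", "safe", "unsafe", "unsafe"]
clinician = ["safe", "unsafe", "unsafe", "unsafe"]
kappa = cohens_kappa(llm_judge, clinician)
```

For these made-up labels, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5. A value near 1 would indicate that the LLM judge tracks the human expert closely; a value near 0 would indicate agreement no better than chance.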

Figures

Figures reproduced from arXiv: 2605.13318 by Adam M. Chekroud, Emily Van Ark, Josh Gieringer, Kate H. Bentley, Luca Belli, Matt Hawrilenko, Millard Brown, Nilu Zhao, Pradip Thachile.

**Figure 2.** Results of the experiments focused on Gemini models.

**Figure 3.** Results of the experiments focused on GPT5.X family of models.

**Figure 4.** Results of the experiments focused on Grok models.

**Figure 5.** Results of the experiments focused on Claude Opus models.

**Figure 6.** Results of the experiments focused on Claude Sonnet models.

**Figure 7.** Distribution of the conversational length of both user- and chatbot model. Users’ responses

Original abstract

Chatbot usage has increased, including in fields for which they were never developed for--notably mental health support. To that end, we introduce Validations of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation for safety of chatbots in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, by assessing how well chatbots can responds to users that might be in crisis. VERA-MH is comprised of three steps: conversation simulation, conversation judging and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. Such user personas have been developed under clinical guidance, to make sure that, among others, multiple risk factors, demographic characteristics and disclosure factors were represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve answers' consistency and highlight models' failure modes. In the last stage, results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the result of the evaluations for four leading LLM providers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VERA-MH gives a clear three-stage pipeline for testing mental health chatbots on suicidal ideation risks, but the clinical personas and flow rubric lack any shown validation against real crisis data or human judgments.


The main thing to know is that this paper lays out VERA-MH, a repeatable framework for checking how chatbots handle users who might disclose suicidal thoughts. It runs simulated conversations with clinically guided personas, judges them via an LLM using a sequential yes/no rubric, and aggregates the scores. The authors apply the whole pipeline to four major providers and report the outcomes.

That pipeline is the concrete new piece, and the flow rubric is a sensible attempt to make judging more consistent and to surface specific failure points rather than just an overall score. Clinical input on the personas, covering risk factors and demographics, is also a reasonable step beyond purely synthetic setups. The paper does a straightforward job of describing the method and giving comparative results, which at least moves the conversation past ad-hoc testing in a high-stakes area. The structure is easy to follow and could serve as a template for others working on safety evaluations.

The soft spots sit in the missing checks. The abstract and methods description give no inter-rater reliability numbers for the rubric, no comparison of the simulated personas to actual crisis transcripts or clinician notes, and no test of whether the LLM judge's decisions line up with human experts or predict real unsafe behavior. Without those, the scores for the four providers are hard to interpret as reliable signals of safety, and the results read as preliminary.

This is the kind of paper that would interest people building or regulating AI tools for mental health support. Readers who need a practical starting point for structured testing in crisis scenarios could adapt the pipeline, though they would probably have to add their own validation layers. It is worth sending for peer review because the problem is timely, the proposal is specific, and referees can ask for the empirical grounding that is currently absent.

I would recommend review rather than desk rejection, with the expectation that the authors strengthen the validation sections.

Referee Report

3 major / 2 minor

Summary. The paper introduces VERA-MH, a three-step framework for evaluating chatbot safety in mental health contexts with a focus on suicidal ideation risks. The steps are conversation simulation via role-playing user personas developed under clinical guidance (incorporating risk factors, demographics, and disclosure patterns), conversation judging using an LLM-as-a-Judge paired with a flow-based rubric of sequential Yes/No questions, and aggregation of results to produce model ratings. The authors apply the framework to four leading LLM providers and present the resulting evaluations.

Significance. If the clinical grounding of the personas and rubric can be substantiated with reliability and validity evidence, VERA-MH would offer a structured, reproducible method for surfacing failure modes in AI systems handling crisis disclosures, which could inform safer deployment practices and regulatory guidance in high-stakes domains.

major comments (3)

[Abstract] The manuscript claims VERA-MH is 'clinically-validated' and that personas were 'developed under clinical guidance' to represent real crisis patterns, yet no inter-rater agreement statistics for the rubric, no comparison against real crisis transcripts or clinician annotations, and no external validation that aggregated LLM-as-a-Judge scores predict actual safety failures are reported. This evidence is load-bearing for the central claim that the framework reliably identifies unsafe chatbot behavior.
[Conversation simulation and judging steps] The flow-based rubric is presented as improving consistency via sequential Yes/No questions, but without reported agreement metrics between the LLM judge and human clinicians or ablation tests showing that the rubric distinguishes safe from unsafe responses better than simpler alternatives, the mapping from simulated conversations to real-world risk remains unverified.
[Results] Results for the four LLM providers: The evaluations are described at a high level with no quantitative metrics (e.g., failure rates per persona category), error analysis, statistical comparisons across models, or sensitivity checks on persona variations, making it impossible to assess whether the framework produces actionable or reproducible safety signals.
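The kind of per-category reporting requested in the third comment could look like the sketch below. It is a hedged illustration, not the authors' analysis: the record format, the category names, and the choice of a Wilson score interval (used here because it behaves reasonably for small samples and rates near 0 or 1) are all assumptions, and the data are invented.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def failure_rates(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, Tuple[float, Tuple[float, float]]]:
    """Map each persona category to (failure rate, Wilson interval).

    Each record is (category, failed), where `failed` marks a conversation
    the judge rated unsafe.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [failures, total]
    for category, failed in records:
        counts[category][0] += int(failed)
        counts[category][1] += 1
    return {c: (f / n, wilson_interval(f, n)) for c, (f, n) in counts.items()}


# Invented example records: (hypothetical persona category, judged unsafe?).
records = [
    ("indirect_disclosure", True),
    ("indirect_disclosure", False),
    ("direct_disclosure", False),
    ("direct_disclosure", False),
]
rates = failure_rates(records)
```

Reporting an interval alongside each rate makes it visible when a category has too few conversations to support a comparison between models.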

minor comments (2)

[Abstract] Typo in 'how well chatbots can responds to users' (should be 'respond').
[General] The paper would benefit from explicit discussion of how VERA-MH relates to or improves upon prior AI safety benchmarks for conversational agents in healthcare.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed review of our manuscript on VERA-MH. We address each major comment point by point below, indicating where revisions will be made to improve clarity, evidence, and reproducibility.


Referee: [Abstract] The manuscript claims VERA-MH is 'clinically-validated' and that personas were 'developed under clinical guidance' to represent real crisis patterns, yet no inter-rater agreement statistics for the rubric, no comparison against real crisis transcripts or clinician annotations, and no external validation that aggregated LLM-as-a-Judge scores predict actual safety failures are reported. This evidence is load-bearing for the central claim that the framework reliably identifies unsafe chatbot behavior.

Authors: We acknowledge that the phrasing 'clinically-validated' in the abstract and introduction may overstate the empirical validation provided. The personas and rubric were developed through iterative consultation with mental health clinicians to incorporate risk factors, demographics, and disclosure patterns, but the manuscript does not include quantitative inter-rater agreement statistics or direct comparisons to real crisis transcripts. We will revise the abstract and relevant sections to use more precise language (e.g., 'developed under clinical guidance') and add a limitations subsection explicitly discussing the absence of external validation against real-world data and the ethical barriers to such comparisons.

revision: partial

Referee: [Conversation simulation and judging steps] The flow-based rubric is presented as improving consistency via sequential Yes/No questions, but without reported agreement metrics between the LLM judge and human clinicians or ablation tests showing that the rubric distinguishes safe from unsafe responses better than simpler alternatives, the mapping from simulated conversations to real-world risk remains unverified.

Authors: The flow-based structure was chosen to promote consistency by decomposing judgments into sequential binary decisions aligned with clinical risk assessment practices. We agree that additional evidence would strengthen this. In the revision, we will include any pilot agreement metrics between the LLM judge and clinician annotations where available, along with an ablation comparing the sequential rubric to a holistic single-prompt alternative to demonstrate its advantages in distinguishing response safety.

revision: yes

Referee: [Results] Results for the four LLM providers: The evaluations are described at a high level with no quantitative metrics (e.g., failure rates per persona category), error analysis, statistical comparisons across models, or sensitivity checks on persona variations, making it impossible to assess whether the framework produces actionable or reproducible safety signals.

Authors: We recognize that the current results presentation is high-level and would benefit from greater granularity to allow readers to evaluate the framework's outputs. We will expand the results section to report quantitative failure rates broken down by persona categories, include systematic error analysis of common failure modes, add statistical comparisons across the four models, and incorporate sensitivity checks on variations in persona parameters.

revision: yes
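The "sequential binary decisions" mentioned in the second response can be pictured as a walk through a small decision flow, where each node poses one Yes/No question and the path taken records where a response failed. The sketch below is purely illustrative: the questions, node names, and outcome labels in `EXAMPLE_FLOW` are invented placeholders and are not the paper's clinical rubric, and `answer` stands in for a call to the LLM judge.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical flow: each node has one Yes/No question and two successors.
# Successors prefixed with "OUT:" are terminal outcomes.
EXAMPLE_FLOW: Dict[str, Dict[str, str]] = {
    "q_risk": {
        "question": "Does the user disclose or hint at suicidal thoughts?",
        "yes": "q_ack",
        "no": "OUT:not_applicable",
    },
    "q_ack": {
        "question": "Does the chatbot acknowledge the disclosure?",
        "yes": "q_support",
        "no": "OUT:fail_missed_disclosure",
    },
    "q_support": {
        "question": "Does the chatbot point the user to crisis support?",
        "yes": "OUT:pass",
        "no": "OUT:fail_no_crisis_support",
    },
}


def walk_flow(
    flow: Dict[str, Dict[str, str]],
    start: str,
    answer: Callable[[str], bool],
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Ask one question per node until a terminal outcome is reached.

    Returns the outcome label and the (node, answer) path, so a failing
    conversation can be traced to the question where it went wrong.
    """
    node = start
    path: List[Tuple[str, bool]] = []
    while not node.startswith("OUT:"):
        step = flow[node]
        reply = answer(step["question"])
        path.append((node, reply))
        node = step["yes"] if reply else step["no"]
    return node[len("OUT:"):], path


# Canned judge for illustration: "yes" to everything except the last question.
def canned_answer(question: str) -> bool:
    return "crisis support" not in question


outcome, path = walk_flow(EXAMPLE_FLOW, "q_risk", canned_answer)
```

With the canned answers, the walk ends at `fail_no_crisis_support` after three questions. Because each outcome label names the failing step, counting outcome labels across conversations gives the failure-mode breakdown that a single holistic score would hide.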

standing simulated objections not resolved

Direct comparisons against real crisis transcripts or clinician annotations on actual patient data cannot be provided due to ethical, privacy, and regulatory restrictions on accessing and using such sensitive mental health information.

Circularity Check

0 steps flagged

VERA-MH is an independent evaluation framework with no circular derivation


The paper presents VERA-MH as a three-step evaluation process (conversation simulation via personas, LLM-as-Judge with flow-based rubric, and aggregation) developed under clinical guidance. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described structure that would reduce any result to its own inputs by construction. The framework is offered as a standalone tool for assessing chatbot responses to suicidal ideation scenarios rather than a derivation whose central claim loops back to unverified assumptions within the same work. This is the expected non-finding for a methods paper that does not claim first-principles derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes clinical guidance produces representative personas and that the rubric flow improves consistency, but these are not quantified.

axioms (1)

domain assumption: Clinically developed personas and rubric accurately capture real crisis scenarios and failure modes. Stated in the abstract as the basis for the simulation and judging steps.

pith-pipeline@v0.9.0 · 5803 in / 1197 out tokens · 54956 ms · 2026-05-20T21:49:42.354967+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, Cost/FunctionalEquation.lean, Foundation/AlexanderDuality.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

VERA-MH is comprised of three steps: conversation simulation, conversation judging and model rating... personas... clinically-developed rubric... flow... single Yes/No question

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R. McKee, The illusion of artificial inclusion, Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (New York, NY, USA), CHI ’24, Association for Computing Machinery, 2024

[2] Ahmed Alaa, Thomas Hartvigsen, Niloufar Golchini, Shiladitya Dutta, Frances Dean, Inioluwa Deborah Raji, and Travis Zack, Position: Medical large language model benchmarks should prioritize construct validity, Forty-second International Conference on Machine Learning Position Paper Track, 2025

[3] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal, Healthbench: Evaluating large language models towards improved human health, 2025

[4] Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Prathiba Dhanesh, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi, When can we trust LLMs in mental health? large-scale benchmarks for reliable LLM evaluation, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Li..., p. 3873–3896

[5] Nadeem Badshah, Teenager died after asking chatgpt for ‘most successful’ way to take his life, inquest told, 2026

[6] Jan Batzner, Leshem Choshen, Avijit Ghosh, Sree Harsha Nelaturu, Anastassia Kornilova, Damian Stachura, Yifan Mai, Asaf Yehudai, Anka Reuel, Irene Solaiman, and Stella Biderman, Every eval ever: Toward a common language for ai eval reporting, February 2026, Blog Post, EvalEval Coalition

[7] Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May...

[8] Luca Belli, Kate Bentley, Will Alexander, Emily Ward, Matt Hawrilenko, Kelly Johnston, Mill Brown, and Adam Chekroud, Vera-mh concept paper, 2026

[9] Kate H. Bentley, Luca Belli, Adam M. Chekroud, Emily J. Ward, Emily R. Dworkin, Emily Van Ark, Kelly M. Johnston, Will Alexander, Millard Brown, and Matt Hawrilenko, Vera-mh: Reliability and validity of an open-source ai safety evaluation in mental health, 2026

[10] Charlotte R Blease and John B. Torous, Chatgpt and mental healthcare: balancing benefits with risks of harms, BMJ Mental Health 26 (2023)

[11] Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman, Nuanced metrics for measuring unintended bias with real data for text classification, Companion Proceedings of The 2019 World Wide Web Conference (New York, NY, USA), WWW ’19, Association for Computing Machinery, 2019, p. 491–500

[12] Danah Boyd and Kate Crawford, Critical questions for big data, Information, Communication & Society 15 (2012), 662–679

[13] Rhitu Chatterjee, Their teenage sons died by suicide. now, they are sounding an alarm about ai chatbots, 2025

[14] Kimberlé Williams Crenshaw, Mapping the margins: intersectionality, identity politics, and violence against women of color, Stanford Law Review 43 (1991), 1241–1299

[15] Lee Joseph Cronbach and Paul E. Meehl, Construct validity in psychological tests, Psychological bulletin 52 (1955), no. 4, 281–302

[16] Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang, The participatory turn in ai design: Theoretical foundations and the current state of practice, Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (2023)

[17] Bridget Dwyer, Matthew Flathers, Akane Sano, Allison Dempsey, Andrea Cipriani, Asim H. Gazi, Bryce Hill, Carla Gorban, Carolyn I. Rodriguez, Charles Stromeyer, Darlene King, Eden Rozenblit, Gillian Strudwick, Jake Linardon, Jiaee Cheong, Joe Firth, Julian Herpertz, Julian Schwarz, Khai The Truong, Margaret Emerson, Martin P. Paulus, Michelle Patriquin, Yi...

[18] Maria Eriksson, Erasmo Purificato, Arman Noroozian, João Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca, Can we trust ai benchmarks? an interdisciplinary review of current issues in ai evaluation, Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 8 (2025), no. 1, 850–864

[19] Center for AI Standards and Innovation/NIST, Practices for automated benchmark evaluations of language models, 2026

[20] The European Center for Not-for-Profit Law Stichting (ECNL) and SocietyInside, Framework for meaningful engagement 2.0, 2025

[21] American Foundation for Suicide Prevention, Suicide statistics, 2024

[22] Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam, Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, J. Artif. Int. Res. 77 (2023)

[23] Charles A. E. Goodhart, Problems of monetary management: The uk experience, 1984

[24] Gabriel Grill, Constructing capabilities: The politics of testing infrastructures for generative ai, Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAccT ’24, Association for Computing Machinery, 2024, p. 1838–1849

[25] Amelia Hardy, Anka Reuel, Kiana Jafari Meimandi, Lisa Soder, Allie Griffith, Dylan M Asmar, Sanmi Koyejo, Michael S. Bernstein, and Mykel John Kochenderfer, More than marketing? on the information value of ai benchmarks for practitioners, Proceedings of the 30th International Conference on Intelligent User Interfaces (New York, NY, USA), IUI ’25, Associat...

[26] Matthew Holmes, Thiago Lacerda, and Reva Schwartz, Making ai evaluation deployment relevant through context specification, 2026

[27] Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David A. Clifton, and John B. Torous, A scoping review of large language models for generative tasks in mental health care, NPJ Digital Medicine 8 (2025)

[28] Amnesty International, The social atrocity: Meta and the right to remedy for the rohingya, 2022

[29] Abigail Z. Jacobs and Hanna Wallach, Measurement and fairness, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAccT ’21, Association for Computing Machinery, 2021, p. 375–385

[30] Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, and Shuhao Fu, Synthetic data generation with llm for improved depression prediction, ArXiv abs/2411.17672 (2024)

[31] Anjali Kantharuban, Jeremiah Milbauer, Emma Strubell, and Graham Neubig, Stereotype or personalization? user identity biases chatbot recommendations, ArXiv abs/2410.05613 (2024)

[32] Robert A Kleinman, John B. Torous, and Marlon Danilewitz, Use of large-language models for therapy: Promise and perils, Annals of internal medicine (2026)

[33] Ryan K. McBain, Robert Bozick, Melissa Diliberti, Li Ang Zhang, Fang Zhang, Alyssa Burnett, Aaron Kofner, Benjamin Rader, Joshua Breslau, Bradley D. Stein, Ateev Mehrotra, Lori Uscher Pines, Jonathan Cantor, and Hao Yu, Use of generative ai for mental health advice among us adolescents and young adults, JAMA Network Open 8 (2025), no. 11, e2542281–e2542281

[34] Common Sense Media, Social ai companions, 2024

[35] Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber, Expressing stigma and inappropriate responses prevents llms from safely replacing mental health providers, Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAccT ’25, Association for Computing...

[36] Adrian O’Dowd, Chatgpt: More than a million users show signs of mental health distress and mania each week, internal data suggest, BMJ 391 (2025)

[37] Will Orr and Edward B. Kang, Ai as a sport: On the competitive epistemologies of benchmarking, Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAccT ’24, Association for Computing Machinery, 2024, p. 1875–1884

[38] Ruby Ostrow and Adam Lopez, Llms reproduce stereotypes of sexual and gender minorities, 2025

[39] Vedanta S P and Madhav Rao, Psychsynth: Advancing mental health ai through synthetic data generation and curriculum training, 2024 9th International Conference on Computer Science and Engineering (UBMK), 2024, pp. 1–6

[40] José Pombal, Maya D’Eon, Nuno M. Guerreiro, Pedro Henrique Martins, António Farinhas, and Ricardo Rei, Mindeval: Benchmarking language models on multi-turn mental health support, 2025

[41] Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan, Towards a science of ai agent reliability, 2026

[42] Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, and Amandalynne Paullada, Ai and the everything in the whole wide world benchmark, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (J. Vanschoren and S. Yeung, eds.), vol. 1, 2021

[43] Inioluwa Deborah Raji, Roxana Daneshjou, and Emily Alsentzer, It’s time to bench the medical exam benchmark, NEJM AI (2025)

[44] Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Ramona Comanescu, Canfer Akbulut, Tom Stepleton, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, William Isaac, and Laura Weidinger, Gaps in the safety evaluation of generative ai, Proceedings of the AAAI/ACM Conference on AI, Ethi..., no. 1, 1200–1217

[45] Rob Kuznia, Allison Gordon, and Ed Lavandera, ‘you’re not rushing. you’re just ready:’ parents say chatgpt encouraged son to kill himself, 2025

[46] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi, Fairness and abstraction in sociotechnical systems, Proceedings of the Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAT* ’19, Association for Computing Machinery, 2019, p. 59–68

[47] Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker, The leaderboard illusion, The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026

[48] Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, and KyungTae Lim, Mentalbench: A benchmark for evaluating psychiatric diagnostic capability of large language models, 2026

[49] Rachel L. Thomas and David Uminsky, Reliance on metrics is a fundamental challenge for ai, Patterns 3 (2022), no. 5, 100476

[50] Pranav Narayanan Venkit, Jiayi Li, Yingfan Zhou, Sarah Michele Rajtmajer, and Shomir Wilson, A tale of two identities: An ethical audit of ai-crafted synthetic personas, AAAI Conference on Artificial Intelligence, 2026

[51] Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Jiayin Zhi, Shaun M. Eack, Travis Labrum, Samuel M Murphy, Nev Jones, Kate V Hardy, Hong Shen, Fei Fang, and Zhiyu Chen, PATIENT-ψ: Using large language models to simulate patients for training mental health professionals, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (M...

[52] Aleyna Warner, Jeffrey LeDue, Yutong Cao, Joseph Tham, and Timothy H. Murphy, Synthetic patient and interview transcript creator: an essential tool for llms in mental health, Frontiers in Digital Health 7 (2025)

[53] Nicole Davis Weaver, Gregory J. Bertolacci, Emily Rosenblad, Sama Ghoba, Matthew Cunningham, Kevin Shunji Ikuta, Madeline E Moberg, Vincent Mougin, Chieh Han, Eve E. Wool, Yohannes Abate, Habeeb Omoponle Adewuyi, Qorinah Estiningtyas Sakilah Adnani, Leticia Akua Adzigbli, Aanuoluwapo Adeyimika Afolabi, Suneth Buddhika Agampodi, Bright Opoku Ahinkorah,...

[54] Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, and William S. Isaac, Sociotechnical safety evaluation of generative ai systems, ArXiv abs/2310.11986 (2023)

[55] Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, and Li Shen, Mentalchat16k: A benchmark dataset for conversational mental health assistance, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (New York, NY, USA), KDD ’25, Association for Com..., p. 5367–5378

[56] Nadine Yousif, Parents of teenager who took his own life sue openai, 2025

[57] Aliah Zewail, Alexandra Figueroa, Jesse Graham, and Mohammad Atari, Moral stereotyping in large language models, Proceedings of the National Academy of Sciences 123 (2026), no. 10, e2519941123

[58] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, Proceedings of the 37th International Conference on Neural Information Processing Systems (Red Hook, NY, USA), NIPS ’23,...