VERA-MH Concept Paper

Adam M. Chekroud; Emily Ward; Kate H. Bentley; Kelly Johnston; Luca Belli; Matt Hawrilenko; Mill Brown; Will Alexander

arXiv: 2510.15297 · v4 · pith:IRIHYWCDnew · submitted 2025-10-17 · cs.CY · cs.AI · cs.HC · cs.SI

Luca Belli, Kate H. Bentley, Will Alexander, Emily Ward, Matt Hawrilenko, Kelly Johnston, Mill Brown, Adam M. Chekroud

Pith reviewed 2026-05-18 06:57 UTC · model grok-4.3

classification cs.CY · cs.AI · cs.HC · cs.SI

keywords AI safety evaluation · mental health chatbots · suicide risk · automated assessment · clinician rubric · user-agent simulation · judge-agent scoring

The pith

VERA-MH automates safety checks for mental health AI chatbots by simulating patient conversations and scoring responses against a clinician rubric focused on suicide risk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VERA-MH as a way to test how safely AI chatbots handle mental health discussions, beginning with suicide risk. Practicing clinicians built a rubric from established best practices, and the system relies on an AI user-agent to role-play patients with varying risk levels during conversations with the chatbot under review. A separate judge-agent then applies the rubric to score each exchange, after which the scores are combined into an overall safety rating. This setup is meant to allow faster and more consistent testing than manual review alone. The authors note that the method remains under clinical validation to confirm the simulations and scores track real clinician judgments.

Core claim

VERA-MH automates the safety evaluation of AI chatbots in mental health contexts by using a user-agent to simulate conversations with personas at defined risk levels, passing those dialogues to a judge-agent that scores them against a clinician-developed rubric for suicide risk management, and aggregating the individual scores into a final assessment of the chatbot.

What carries the argument

The combination of a user-agent for persona-driven conversation simulation, a judge-agent for rubric-based scoring, and score aggregation across multiple simulated sessions.
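The three parts named above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the agents are modelled as plain Python callables, and `Persona`, `simulate_conversation`, `evaluate`, the stub agents, and the `dimension_1` key are all hypothetical names introduced here. A real run would replace the stubs with calls to the chatbot under review and to the user-agent and judge-agent models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]                          # (speaker, message)
Agent = Callable[[List[Turn]], str]             # dialogue so far -> next message
Judge = Callable[[List[Turn]], Dict[str, str]]  # transcript -> rating per rubric dimension


@dataclass(frozen=True)
class Persona:
    name: str
    risk_level: str  # hypothetical labels such as "low" or "high"


def simulate_conversation(user_agent: Agent, chatbot: Agent, n_turns: int) -> List[Turn]:
    """Alternate user-agent and chatbot messages for n_turns exchanges."""
    history: List[Turn] = []
    for _ in range(n_turns):
        history.append(("user", user_agent(history)))
        history.append(("chatbot", chatbot(history)))
    return history


def evaluate(personas, make_user_agent, chatbot, judge, n_turns=3):
    """Run one simulated conversation per persona and have the judge rate it."""
    results = []
    for persona in personas:
        transcript = simulate_conversation(make_user_agent(persona), chatbot, n_turns)
        results.append((persona, transcript, judge(transcript)))
    return results


# Stand-in agents so the sketch runs without any model API.
def make_stub_user_agent(persona: Persona) -> Agent:
    def agent(history: List[Turn]) -> str:
        return f"({persona.name}, {persona.risk_level} risk) message {len(history) // 2 + 1}"
    return agent


def stub_chatbot(history: List[Turn]) -> str:
    return "reply to: " + history[-1][1]


def stub_judge(transcript: List[Turn]) -> Dict[str, str]:
    return {"dimension_1": "Not relevant"}  # placeholder dimension name and rating


results = evaluate(
    [Persona("persona_a", "high")],
    make_user_agent=make_stub_user_agent,
    chatbot=stub_chatbot,
    judge=stub_judge,
    n_turns=2,
)
```

Keeping the agents behind a single callable signature means the chatbot under review can be swapped without touching the simulation or scoring code, which is the property an evaluation of several models would need.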

If this is right

Preliminary runs on models such as GPT-5, Claude Opus, and Claude Sonnet can already surface safety gaps using the initial rubric.
Refined scoring outputs can directly inform design changes to reduce unsafe responses in mental health chatbots.
Expanded clinical validation will test whether user-agents produce believable patient behavior and whether judge-agent scores match expert assessment.
Community input on both the technical pipeline and the clinical rubric will shape further iterations of the evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Once validated, the same simulation-plus-rubric approach could be applied to other high-stakes domains where AI systems interact with vulnerable users.
Aggregated scores might eventually serve as one data point in regulatory or procurement decisions for mental health AI tools.
Extending the rubric to additional clinical risks, such as self-harm or crisis escalation, would test the framework's broader applicability.

Load-bearing premise

The AI user-agent and judge-agent can generate realistic patient behaviors and produce rubric scores that align with how practicing clinicians would judge the same conversations.

What would settle it

A head-to-head study in which practicing clinicians independently score the same set of simulated conversations and their judgments differ substantially and consistently from the judge-agent outputs.
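One way such a head-to-head comparison could be quantified is sketched below. It assumes paired ratings drawn from the four response options quoted elsewhere on this page, with rows for the judge-agent and columns for clinicians as in the Figure 6 caption. Cohen's kappa is chosen here purely for illustration; nothing on this page says which agreement statistic the authors use, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence

LABELS = ["Best practice", "Missed opportunity", "Actively damaging", "Not relevant"]


def confusion_matrix(judge: Sequence[str], clinician: Sequence[str]) -> List[List[int]]:
    """Rows are judge-agent ratings, columns are clinician ratings."""
    if len(judge) != len(clinician):
        raise ValueError("ratings must be paired one-to-one")
    counts = Counter(zip(judge, clinician))
    return [[counts[(j, c)] for c in LABELS] for j in LABELS]


def raw_agreement(matrix: List[List[int]]) -> float:
    """Share of rating pairs on the diagonal."""
    n = sum(sum(row) for row in matrix)
    if n == 0:
        raise ValueError("no rating pairs")
    return sum(matrix[i][i] for i in range(len(matrix))) / n


def cohen_kappa(matrix: List[List[int]]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = sum(sum(row) for row in matrix)
    observed = raw_agreement(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


judge_ratings = ["Best practice", "Best practice", "Missed opportunity", "Not relevant"]
clinician_ratings = ["Best practice", "Missed opportunity", "Missed opportunity", "Not relevant"]
matrix = confusion_matrix(judge_ratings, clinician_ratings)
kappa = cohen_kappa(matrix)
```

A kappa near zero alongside a consistently skewed off-diagonal pattern (for example, the judge-agent rating more leniently than clinicians) would be the kind of substantial and consistent divergence described above.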

Figures

Figures reproduced from arXiv: 2510.15297 by Adam M. Chekroud, Emily Ward, Kate H. Bentley, Kelly Johnston, Luca Belli, Matt Hawrilenko, Mill Brown, Will Alexander.

**Figure 1.** VERA-MH overall design. In mental health, however, this approach proves insufficient. Therapeutic interactions are dynamic and, therefore, meaning and context evolve over multiple turns. These nuances pose a significant challenge to static, pre-scripted evaluations. As a result, evaluating mental health LLMs based on static datasets and single-turn conversations can lead to an incomplete or even misleadin…

**Figure 3.** Evaluation of Claude Opus as provider.

**Figure 4.** Evaluation of ChatGPT-5 as provider. To determine credibility of the user-agents, we calculated the average score (across all clinicians) of how realistic (on a 5-point scale ranging from Not at All Realistic to Very Realistic) the simulated users were. Clinicians rated the user-agent as highly realistic overall (average = 4.1). This is promising, but there’s still some room for improvement as we continue…

**Figure 6.** Confusion matrix showing agreement/disagreement between clinician and judge-agent ratings across the four response options. Rows represent the judge-agent’s ratings, and columns represent clinicians’ ratings. Each cell shows the number and percentage of all judge-agent vs. clinician rating pairs that fell into that specific combination. … the earlier version of our judge-agent was a more lenient evaluator o…

Original abstract

We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces VERA-MH, an automated evaluation framework for assessing the safety of AI chatbots in mental health contexts, with initial focus on suicide risk. It centers on a clinician-developed rubric, a user-agent that simulates patient personas with predefined risk levels to generate conversations, a judge-agent that applies the rubric to score those conversations, and aggregation of scores to produce an overall evaluation of the target chatbot. The work is presented as a concept paper noting that user-agent realism and judge-agent accuracy are still undergoing clinical validation, with only preliminary runs on models such as GPT-5, Claude Opus, and Claude Sonnet used for design iteration.

Significance. If the ongoing validation establishes that simulated conversations are realistic and judge-agent scores align with practicing clinicians, VERA-MH would supply a scalable, reproducible pipeline for safety assessment in a high-stakes domain. The explicit clinician involvement in rubric creation and the paper's transparency about its preliminary status and validation needs are clear strengths that position the framework as a constructive contribution to responsible AI development in healthcare.

major comments (1)

Abstract and framework description: the aggregation step that converts per-conversation rubric scores into a final chatbot evaluation is described only at a high level; without specifying the aggregation rule (e.g., mean, weighted sum, threshold logic), the central claim that VERA-MH yields an actionable safety assessment remains difficult to evaluate or reproduce.

minor comments (2)

The manuscript would benefit from a concise diagram or pseudocode outlining the full pipeline (user-agent generation, conversation flow, judge-agent prompting, aggregation) to improve clarity for readers outside the immediate team.
Add a short related-work paragraph situating VERA-MH against existing AI safety benchmarks or rubric-based evaluations in clinical AI to help readers assess novelty.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of VERA-MH as a scalable evaluation framework. We appreciate the recommendation for minor revision and address the single major comment below.

Point-by-point responses

Referee: Abstract and framework description: the aggregation step that converts per-conversation rubric scores into a final chatbot evaluation is described only at a high level; without specifying the aggregation rule (e.g., mean, weighted sum, threshold logic), the central claim that VERA-MH yields an actionable safety assessment remains difficult to evaluate or reproduce.

Authors: We agree that greater specificity on the aggregation step would improve clarity and reproducibility. As noted in the manuscript, VERA-MH remains a concept paper with actionable scoring still under refinement through clinical validation. In the revised version we will expand the framework description to state that the current preliminary aggregation computes the mean rubric score across conversations for each simulated persona and risk level, while explicitly noting that weighted or threshold-based rules are planned for later iterations once validation data are available.

Revision: yes
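The aggregation rule described in this simulated rebuttal, a mean rubric score per persona and risk level, can be sketched as follows. It is an illustration only: the rebuttal is itself simulated, the page does not say how rubric ratings map to numbers, and the `aggregate` function and the numeric scores below are assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One record per judged conversation: (persona, risk_level, numeric score).
# How a rubric rating becomes a number is not specified on this page,
# so the scores here are placeholder values.
Record = Tuple[str, str, float]


def aggregate(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Mean score across conversations for each (persona, risk_level) pair."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for persona, risk_level, score in records:
        groups[(persona, risk_level)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


records = [
    ("persona_a", "high", 1.0),
    ("persona_a", "high", 0.0),
    ("persona_b", "low", 1.0),
]
summary = aggregate(records)
```

Grouping before averaging keeps low-risk and high-risk results separate, so a strong average on low-risk personas cannot mask failures on high-risk ones. A weighted or threshold-based rule, which the rebuttal mentions as planned, would replace the `mean` call.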

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper is a concept paper that introduces the VERA-MH framework as a methodological proposal: a clinician-developed rubric, user-agent simulation of pre-defined personas, judge-agent rubric scoring, and aggregation of results. No mathematical derivations, equations, fitted parameters, or predictions appear in the text. The authors explicitly state that user-agent realism and judge-agent accuracy remain under clinical validation, with only preliminary runs used for design iteration. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claim is the description of the automated pipeline itself, which does not reduce to its own inputs by construction and stands as an independent framework proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about the fidelity of AI simulation and scoring rather than new mathematical constructs or fitted parameters.

axioms (2)

Domain assumption: AI agents can realistically role-play patients with pre-defined suicide risk levels and other features in a way that matches real user behavior.
Invoked in the description of the user-agent model and noted as requiring clinical validation.
Domain assumption: The judge-agent can accurately apply the clinician-developed rubric to score conversations without systematic bias relative to human experts.
Central to the automated scoring step and stated as undergoing rigorous validation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We introduce VERA-MH … rubric … user-agent … judge-agent … aggregating the scoring of each conversation.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: five dimensions … Best practice / Missed opportunity / Actively damaging / Not relevant

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
