VERA-MH Concept Paper
Pith reviewed 2026-05-18 06:57 UTC · model grok-4.3
The pith
VERA-MH automates safety checks for mental health AI chatbots by simulating patient conversations and scoring responses against a clinician rubric focused on suicide risk.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VERA-MH automates the safety evaluation of AI chatbots in mental health contexts by using a user-agent to simulate conversations with personas at defined risk levels, passing those dialogues to a judge-agent that scores them against a clinician-developed rubric for suicide risk management, and aggregating the individual scores into a final assessment of the chatbot.
What carries the argument
The combination of a user-agent for persona-driven conversation simulation, a judge-agent for rubric-based scoring, and score aggregation across multiple simulated sessions.
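The three-part mechanism above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Agent` interface, the `Persona` fields, the stub agents, and the toy judging rule are all hypothetical stand-ins for LLM calls and for the clinician rubric, and the final mean is only one possible aggregation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# An agent maps the transcript so far to its next message (hypothetical interface).
Agent = Callable[[List[str]], str]


@dataclass
class Persona:
    name: str
    risk_level: str  # placeholder labels, not the paper's risk taxonomy


def simulate(user_agent: Agent, chatbot: Agent, turns: int) -> List[str]:
    """Alternate user-agent and chatbot messages for a fixed number of turns."""
    transcript: List[str] = []
    for _ in range(turns):
        transcript.append("user: " + user_agent(transcript))
        transcript.append("bot: " + chatbot(transcript))
    return transcript


def evaluate(personas, make_user_agent, chatbot, judge, turns: int = 3) -> float:
    """Score one simulated conversation per persona and average the scores."""
    scores = [judge(simulate(make_user_agent(p), chatbot, turns)) for p in personas]
    return mean(scores)


# Stubs standing in for LLM calls so the sketch runs offline.
def make_stub_user(persona: Persona) -> Agent:
    def agent(transcript: List[str]) -> str:
        return f"({persona.risk_level} risk) message {len(transcript) // 2 + 1}"
    return agent


def stub_chatbot(transcript: List[str]) -> str:
    return "Thank you for telling me. Are you safe right now?"


def stub_judge(transcript: List[str]) -> float:
    # Toy rule: 1.0 if any chatbot turn asks about safety, else 0.0.
    bot_turns = [t for t in transcript if t.startswith("bot: ")]
    return 1.0 if any("safe" in t for t in bot_turns) else 0.0


personas = [Persona("persona_a", "low"), Persona("persona_b", "high")]
overall = evaluate(personas, make_stub_user, stub_chatbot, stub_judge)
```

Passing the agents in as plain callables keeps the simulation loop independent of any particular model provider, so the chatbot under test can be swapped without touching the user-agent or the judge.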
If this is right
- Preliminary runs on models such as GPT-5, Claude Opus, and Claude Sonnet can already surface safety gaps using the initial rubric.
- Refined scoring outputs can directly inform design changes to reduce unsafe responses in mental health chatbots.
- Expanded clinical validation will test whether user-agents produce believable patient behavior and whether judge-agent scores match expert assessment.
- Community input on both the technical pipeline and the clinical rubric will shape further iterations of the evaluation.
Where Pith is reading between the lines
- Once validated, the same simulation-plus-rubric approach could be applied to other high-stakes domains where AI systems interact with vulnerable users.
- Aggregated scores might eventually serve as one data point in regulatory or procurement decisions for mental health AI tools.
- Extending the rubric to additional clinical risks, such as self-harm or crisis escalation, would test the framework's broader applicability.
Load-bearing premise
The AI user-agent and judge-agent can generate realistic patient behaviors and produce rubric scores that align with how practicing clinicians would judge the same conversations.
What would settle it
A head-to-head study in which practicing clinicians independently score the same set of simulated conversations and their judgments differ substantially and consistently from the judge-agent outputs.
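One way such a comparison could be quantified is with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between a clinician's labels and the judge-agent's labels for the same conversations. Choosing kappa is an assumption made here for illustration, as the paper does not name a metric, and the labels and data are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: four conversations, one disagreement.
clinician = ["best", "best", "missed", "damaging"]
judge_agent = ["best", "missed", "missed", "damaging"]
kappa = cohens_kappa(clinician, judge_agent)
```

A kappa near 1 would indicate close agreement, while values near 0 mean the judge-agent agrees with the clinician no more often than chance would predict. A real study would also need several clinicians and a check for systematic, directional disagreement, which this statistic alone does not capture.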
Original abstract
We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VERA-MH, an automated evaluation framework for assessing the safety of AI chatbots in mental health contexts, with an initial focus on suicide risk. It centers on a clinician-developed rubric, a user-agent that simulates patient personas with predefined risk levels to generate conversations, a judge-agent that applies the rubric to score those conversations, and aggregation of scores to produce an overall evaluation of the target chatbot. The work is presented as a concept paper noting that user-agent realism and judge-agent accuracy are still undergoing clinical validation, with only preliminary runs on models such as GPT-5, Claude Opus, and Claude Sonnet used for design iteration.
Significance. If the ongoing validation establishes that simulated conversations are realistic and judge-agent scores align with practicing clinicians, VERA-MH would supply a scalable, reproducible pipeline for safety assessment in a high-stakes domain. The explicit clinician involvement in rubric creation and the paper's transparency about its preliminary status and validation needs are clear strengths that position the framework as a constructive contribution to responsible AI development in healthcare.
major comments (1)
- Abstract and framework description: the aggregation step that converts per-conversation rubric scores into a final chatbot evaluation is described only at a high level; without specifying the aggregation rule (e.g., mean, weighted sum, threshold logic), the central claim that VERA-MH yields an actionable safety assessment remains difficult to evaluate or reproduce.
minor comments (2)
- The manuscript would benefit from a concise diagram or pseudocode outlining the full pipeline (user-agent generation, conversation flow, judge-agent prompting, aggregation) to improve clarity for readers outside the immediate team.
- Add a short related-work paragraph situating VERA-MH against existing AI safety benchmarks or rubric-based evaluations in clinical AI to help readers assess novelty.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential value of VERA-MH as a scalable evaluation framework. We appreciate the recommendation for minor revision and address the single major comment below.
Point-by-point responses
- Referee: Abstract and framework description: the aggregation step that converts per-conversation rubric scores into a final chatbot evaluation is described only at a high level; without specifying the aggregation rule (e.g., mean, weighted sum, threshold logic), the central claim that VERA-MH yields an actionable safety assessment remains difficult to evaluate or reproduce.
  Authors: We agree that greater specificity on the aggregation step would improve clarity and reproducibility. As noted in the manuscript, VERA-MH remains a concept paper with actionable scoring still under refinement through clinical validation. In the revised version we will expand the framework description to state that the current preliminary aggregation computes the mean rubric score across conversations for each simulated persona and risk level, while explicitly noting that weighted or threshold-based rules are planned for later iterations once validation data are available.
  Revision: yes
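The preliminary aggregation described in the response, a mean rubric score per simulated persona and risk level, could look roughly like the sketch below. The record layout, the persona names, and the numeric score scale are assumptions made for illustration; the manuscript does not specify them.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One record per judged conversation: (persona, risk_level, rubric_score).
Record = Tuple[str, str, float]


def aggregate_by_group(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Mean rubric score for each (persona, risk_level) pair."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for persona, risk_level, score in records:
        groups[(persona, risk_level)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


# Invented scores for three judged conversations.
records = [
    ("persona_a", "high", 0.5),
    ("persona_a", "high", 1.0),
    ("persona_b", "low", 1.0),
]
summary = aggregate_by_group(records)
```

Keeping the grouping separate from the averaging step means a weighted or threshold-based rule, which the authors say is planned for later iterations, could replace `mean` without changing how conversations are grouped.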
Circularity Check
No significant circularity detected
The paper is a concept paper that introduces the VERA-MH framework as a methodological proposal: a clinician-developed rubric, user-agent simulation of pre-defined personas, judge-agent rubric scoring, and aggregation of results. No mathematical derivations, equations, fitted parameters, or predictions appear in the text. The authors explicitly state that user-agent realism and judge-agent accuracy remain under clinical validation, with only preliminary runs used for design iteration. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claim is the description of the automated pipeline itself, which does not reduce to its own inputs by construction and stands as an independent framework proposal.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: AI agents can realistically role-play patients with pre-defined suicide risk levels and other features in a way that matches real user behavior.
- Domain assumption: The judge-agent can accurately apply the clinician-developed rubric to score conversations without systematic bias relative to human experts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce VERA-MH … rubric … user-agent … judge-agent … aggregating the scoring of each conversation.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: five dimensions … Best practice / Missed opportunity / Actively damaging / Not relevant
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.