pith. machine review for the scientific record.

arxiv: 2604.17730 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI · cs.HC

Recognition: unknown

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:41 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC
keywords mental health safety · LLM evaluation · role-aware taxonomy · multi-turn interactions · adversarial testing · harm detection · AI counseling

The pith

LLMs exhibit substantial role-dependent and cumulative safety failures in mental health interactions that static benchmarks miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly used for mental health counseling, but safety checks have relied on isolated responses and broad categories that overlook how issues arise and build across conversations. This paper defines a taxonomy of harms based on the specific roles an AI counselor plays in an interaction, such as perpetrator or enabler, tied to clinical categories. It pairs the taxonomy with a closed-loop evaluation system that runs adversarial multi-turn simulations to trace harm trajectories. When applied to current models, the method uncovers failures that vary by role and accumulate over turns, patterns that single-response tests consistently overlook. This matters because better detection of these interactive risks can guide safer design before models are deployed in real counseling.

Core claim

Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

What carries the argument

R-MHSafe role-aware taxonomy, which classifies harm by counselor roles (perpetrator, instigator, facilitator, enabler) plus clinical categories, together with MHSafeEval, a closed-loop agent-based framework that discovers harms through simulated multi-turn trajectories.
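The pairing of a role-by-category taxonomy with a closed-loop simulator can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the role names and harm-label structure follow the paper's taxonomy, but `run_trajectory`, its stubbed client/counselor/judge callables, and all parameters are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple

# Role names follow the R-MHSafe taxonomy; everything else is illustrative.
class Role(Enum):
    PERPETRATOR = "perpetrator"
    INSTIGATOR = "instigator"
    FACILITATOR = "facilitator"
    ENABLER = "enabler"

@dataclass(frozen=True)
class HarmLabel:
    role: Role       # interactional role the counselor model adopts
    category: str    # clinically grounded harm category, e.g. "self-harm"

Turn = Tuple[str, str]  # (speaker, utterance)

def run_trajectory(client: Callable[[List[Turn]], str],
                   counselor: Callable[[List[Turn]], str],
                   judge: Callable[[List[Turn]], Optional[HarmLabel]],
                   max_turns: int = 8) -> List[HarmLabel]:
    """Closed loop: an adversarial client probes, the counselor responds,
    and a judge labels harm at the trajectory level after every turn."""
    history: List[Turn] = []
    labels: List[HarmLabel] = []
    for _ in range(max_turns):
        history.append(("client", client(history)))
        history.append(("counselor", counselor(history)))
        label = judge(history)  # None means no harm detected so far
        if label is not None:
            labels.append(label)
    return labels
```

The point of the shape is that the judge sees the whole `history`, not a single response, so role-dependent and cumulative failures are in principle observable.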

If this is right

  • Safety assessment for mental health LLMs must move beyond isolated responses to track how harms emerge and add up over multiple turns.
  • Failures appear differently depending on which interactional role the model adopts, requiring role-specific safeguards.
  • Improved coverage of failure modes enables more precise identification of where models need additional alignment or filtering.
  • Diagnostic granularity from trajectory-level analysis supports targeted fixes rather than blanket safety rules.
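The last two points turn on cumulative, trajectory-level measurement. A toy illustration of why a static single-response check can pass while the trajectory fails; the severity scores, thresholds, and function names here are all made up for the sketch and do not come from the paper:

```python
# Hypothetical 1-5 severity rubric. Each turn looks acceptable in
# isolation, but severity drifts upward across the conversation.
SINGLE_TURN_THRESHOLD = 4   # flag a turn only if its severity >= 4
TRAJECTORY_DRIFT_LIMIT = 2  # flag if severity rises by >= 2 over the run

def static_check(severities):
    """Single-response style: does any individual turn cross the bar?"""
    return any(s >= SINGLE_TURN_THRESHOLD for s in severities)

def trajectory_check(severities):
    """Trajectory-level: does harm accumulate even if no turn crosses?"""
    return max(severities) - severities[0] >= TRAJECTORY_DRIFT_LIMIT

turn_severities = [1, 2, 2, 3, 3]  # hypothetical per-turn judge scores
```

Here `static_check(turn_severities)` is `False` while `trajectory_check(turn_severities)` is `True`: exactly the class of failure the paper argues isolated-response benchmarks overlook.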

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Deployment of LLMs in therapy settings may require ongoing conversation monitoring rather than one-time static approval.
  • The role-based lens could transfer to safety evaluation in other high-stakes conversational domains such as legal or financial advice.
  • Training objectives for counseling models may need explicit penalties for adopting harmful interactional roles.

Load-bearing premise

The R-MHSafe taxonomy accurately and comprehensively captures clinically significant harm in AI counseling interactions.

What would settle it

The claim would fail if existing static single-response benchmarks, applied to the same LLMs, detected the same role-dependent and cumulative failures at comparable rates; that would undercut the assertion that static methods systematically miss them.

Figures

Figures reproduced from arXiv: 2604.17730 by Ee-Peng Lim, Neemesh Yadav, Palakorn Achananuparp, Suhyun Lee, Yang Deng.

Figure 1. Existing mental health safety benchmarks …
Figure 2. A structured taxonomy of harmful behaviors in mental health counseling, defined by the combination of …
Figure 3. A qualitative example showing how an initial client utterance evolves through iterative mutation and …
Figure 4. (a) Ablation study demonstrates the contribu…
Figure 5. Attack distribution by roles and harm category
Figure 6. Distribution of successful attacks across two LLMs by adversarial role and harm category. The inner …
read the original abstract

Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces R-MHSafe, a role-aware taxonomy for mental health safety in LLMs that defines harms based on the AI's interactional role (perpetrator, instigator, facilitator, or enabler) combined with clinically grounded categories. It presents MHSafeEval as a closed-loop agent-based framework for trajectory-level safety evaluation via adversarial multi-turn interactions. Large-scale evaluations across state-of-the-art LLMs are used to demonstrate substantial role-dependent and cumulative safety failures missed by existing static benchmarks, claiming improved failure-mode coverage and diagnostic granularity.

Significance. Should the taxonomy prove to accurately capture clinically significant harms and the evaluation framework be reliable, this contribution would be significant for the field of AI safety in mental health applications. It moves beyond the limitations of static, single-turn benchmarks to address the dynamic, cumulative nature of potential harms in counseling interactions. The large-scale empirical evaluation provides concrete evidence of the framework's ability to uncover previously undetected issues, which could inform safer LLM design and regulatory considerations. Strengths include the novel role-aware approach and the focus on interaction-level analysis.

major comments (1)
  1. [R-MHSafe taxonomy] The R-MHSafe taxonomy is introduced as combining roles with clinically grounded harm categories, but the manuscript does not report any independent clinical validation, expert review process, inter-rater reliability, or mapping to standards like APA guidelines. This is a load-bearing issue for the central claim, as the detection of role-dependent harms and the assertion of improved coverage over static benchmarks depend on the taxonomy correctly identifying real mental health risks rather than author-defined ones.
minor comments (1)
  1. The paper could benefit from more explicit discussion of potential limitations, such as the generalizability of the adversarial interactions to real user scenarios.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for acknowledging the potential significance of our role-aware approach to mental health safety evaluation. We address the major comment on the R-MHSafe taxonomy below, outlining how we will strengthen the manuscript in revision.

read point-by-point responses
  1. Referee: [R-MHSafe taxonomy] The R-MHSafe taxonomy is introduced as combining roles with clinically grounded harm categories, but the manuscript does not report any independent clinical validation, expert review process, inter-rater reliability, or mapping to standards like APA guidelines. This is a load-bearing issue for the central claim, as the detection of role-dependent harms and the assertion of improved coverage over static benchmarks depend on the taxonomy correctly identifying real mental health risks rather than author-defined ones.

    Authors: We agree that formal clinical validation would substantially strengthen the taxonomy's credibility and directly support our claims of improved failure-mode coverage. The R-MHSafe taxonomy was developed by synthesizing interactional role concepts from therapeutic communication literature with harm categories drawn from established clinical sources, including DSM-5 criteria and related mental health safety frameworks. However, the current manuscript does not include an independent expert review process, inter-rater reliability assessment, or explicit mapping to APA guidelines. In the revised version, we will expand the taxonomy section with a detailed development subsection that provides explicit mappings to APA guidelines and other clinical standards, along with full citations to the grounding literature. We will also add a limitations paragraph acknowledging the absence of formal validation and identifying it as a key area for future work. These changes will improve transparency and address the load-bearing concern without altering the core empirical findings. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces a new taxonomy (R-MHSafe) and an agent-based evaluation framework (MHSafeEval) for assessing LLM safety in multi-turn mental health interactions, then applies them empirically to state-of-the-art models. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations appear in the provided text. The central results (role-dependent and cumulative failures) are obtained by labeling simulated trajectories according to the explicitly defined taxonomy, which constitutes a standard empirical measurement rather than any reduction of outputs to inputs by construction. The framework is self-contained against external benchmarks in the sense that its claims rest on observable interaction outcomes under the stated taxonomy, with no hidden circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central contribution rests on the validity of the newly proposed taxonomy and the premise that adversarial multi-turn simulations can surface clinically relevant harms.

axioms (1)
  • domain assumption Clinically significant harm in AI counseling interactions can be characterized by the interactional roles an AI counselor adopts (perpetrator, instigator, facilitator, or enabler) combined with clinically grounded harm categories.
    This is the foundational premise of the R-MHSafe taxonomy introduced in the abstract.
invented entities (2)
  • R-MHSafe taxonomy no independent evidence
    purpose: To characterize clinically significant harm via AI counselor roles in interactions.
    Newly defined in the paper; no independent evidence provided in abstract.
  • MHSafeEval framework no independent evidence
    purpose: Closed-loop agent-based evaluation for discovering harm through adversarial multi-turn trajectories.
    Newly proposed in the paper; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5516 in / 1341 out tokens · 34433 ms · 2026-05-10T04:41:25.768169+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mental Health AI Safety Claims Must Preserve Temporal Evidence

    cs.AI 2026-05 unverdicted novelty 5.0

    Mental health AI safety evaluations that discard temporal sequence and accumulation produce invalid conclusions; the paper formalizes this as Temporal Safety Non-Identifiability and proposes SCOPE-MH as a reporting st...

Reference graph

Works this paper leans on

19 extracted references · 4 canonical work pages · cited by 1 Pith paper

  1. [1] Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE.

  2. [2] Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509.

  3. [3] Care-bench: A benchmark of diverse client simulations guided by expert principles for evaluating LLMs in psychological counseling. Preprint, arXiv:2511.09407.

  4. [4] DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In NeurIPS.
