Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

2026 · cs.CL · arXiv 2604.04842

2 Pith papers cite this work.

abstract

The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.

representative citing papers

One Year Later...The Harms Persist, But So Do We!

cs.CL · 2026-06-22 · unverdicted · novelty 5.0 · 2 refs

LLM safety guardrails fail for most mental health conditions with up to 100% failure rates for eating disorders, substance use disorder, and major depressive disorder, while holding only for suicide and self-harm.

Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

cs.CL · 2026-05-24 · unverdicted · novelty 5.0

PsyDefDetect shared task introduces PsyDefConv corpus and benchmarks NLP systems on 9-class DMRS defense mechanism classification, with top macro F1 of 0.42.

citing papers explorer

One Year Later...The Harms Persist, But So Do We!
cs.CL · 2026-06-22 · unverdicted · none · ref 29 · 2 links · internal anchor
Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
cs.CL · 2026-05-24 · unverdicted · none · ref 6 · internal anchor
