LLM safety guardrails fail for most mental health conditions with up to 100% failure rates for eating disorders, substance use disorder, and major depressive disorder, while holding only for suicide and self-harm.
Jailbroken: How does LLM safety training fail?
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
One Year Later...The Harms Persist, But So Do We!
LLM safety guardrails fail for most mental health conditions with up to 100% failure rates for eating disorders, substance use disorder, and major depressive disorder, while holding only for suicide and self-harm.