One Year Later...The Harms Persist, But So Do We!
Pith reviewed 2026-07-02 21:36 UTC · model grok-4.3
The pith
LLM safety guardrails work reliably only for suicide and self-harm queries; for other mental health conditions, failure rates reach up to 100%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
General-purpose LLMs show adequate safeguards only for suicide and self-harm; conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. The evaluation uses an eight-dimension harm taxonomy and a multi-dimensional framework across eight models, 16 DSM-5 conditions, and four adversarial attack variants.
What carries the argument
Eight-dimension harm taxonomy paired with a multi-dimensional evaluation framework that applies four adversarial attack variants to eight LLMs across 16 DSM-5 conditions.
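To make the scale of that design concrete, the sketch below enumerates the evaluation grid implied by the summary: eight models, 16 conditions, and four attack variants. The labels (`model_1`, `condition_1`, `attack_1`) and the helper names are hypothetical placeholders, since the paper's actual model names, conditions, and attack definitions are not given here. The per-cell prompt count is also unknown, so only the number of cells is computed.

```python
from itertools import product

# Hypothetical placeholder labels; the paper's real model, condition, and
# attack names are not stated in this summary.
MODELS = [f"model_{i}" for i in range(1, 9)]            # eight LLMs
CONDITIONS = [f"condition_{i}" for i in range(1, 17)]   # 16 DSM-5 conditions
ATTACKS = [f"attack_{i}" for i in range(1, 5)]          # four attack variants


def evaluation_cells():
    """Return every (model, condition, attack) combination in the grid."""
    return list(product(MODELS, CONDITIONS, ATTACKS))


def failure_rate(outcomes):
    """Fraction of outcomes flagged as a safeguard failure (True = failure)."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    return sum(outcomes) / len(outcomes)


cells = evaluation_cells()
print(len(cells))  # 8 * 16 * 4 = 512 cells
```

A failure rate would then be computed per cell, or pooled per condition, from whatever prompts the study ran in that cell.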
If this is right
- Ethical design of LLMs requires clearly defined harm categories for each clinical condition.
- Safeguards must be implemented separately for each condition rather than applied uniformly.
- Integration of these models into schools, search engines, and consumer chatbots carries risks to vulnerable populations until condition-specific protections exist.
- Models without such safeguards should not be deployed in publicly available mental health contexts.
Where Pith is reading between the lines
- The same evaluation approach could be applied to non-clinical sensitive topics such as legal or financial advice to test whether failures are domain-specific.
- Real-world deployment might require ongoing monitoring beyond initial adversarial testing because user phrasing evolves.
- Fine-tuning or post-training alignment targeted at individual conditions could be tested as a direct follow-up to the reported failure rates.
Load-bearing premise
The eight-dimension harm taxonomy and four adversarial attack variants capture the full range of harmful outputs that would appear in real user interactions with LLMs on these topics.
What would settle it
Direct observation of real user conversations with the same LLMs on the tested conditions, scored with the same taxonomy, would show whether failure rates match or diverge from the adversarial test results.
Original abstract
General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety guardrails remain inadequate and inconsistent across clinical conditions. This study evaluates eight proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the safety guardrails of eight proprietary LLMs for conversations on 16 DSM-5 mental health conditions. Using four adversarial attack variants and a newly introduced eight-dimension harm taxonomy, the authors find that safeguards are reliable only for suicide and self-harm, with failure rates reaching 100% for conditions including eating disorders, substance use disorder, and major depressive disorder. They conclude that current LLMs pose risks to vulnerable populations and call for better-defined harm categories and safeguards.
Significance. If the empirical results and taxonomy prove robust, this study would provide valuable evidence on the limitations of current LLM safety mechanisms in clinical domains, potentially guiding future development of more targeted guardrails. The multi-dimensional evaluation framework could serve as a basis for standardized assessments in the field.
major comments (3)
- [Abstract] Abstract: The abstract claims failure rates of up to 100% for several conditions but provides no details on sample sizes, number of prompts tested, inter-rater reliability for the taxonomy, or statistical controls, preventing verification of the central claim.
- [Methods] Methods (implied by abstract description): The eight-dimension harm taxonomy and four adversarial attack variants are presented without demonstrated ecological validity, such as validation against logged user queries, clinical incident data, or expert clinician ratings, which is load-bearing for interpreting the reported failure rates as indicative of real-world harms.
- [Results] Results (implied): The strongest claim regarding condition-specific failure rates lacks supporting quantitative details (e.g., exact percentages per condition, confidence intervals), making it difficult to assess the magnitude and consistency of the findings.
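The inter-rater reliability requested in the first comment is commonly reported as Cohen's kappa when two raters label the same responses. The sketch below is one way such a figure could be computed. It is an illustration only: the summary does not say whether the authors used human raters, an LLM judge, or any agreement statistic, and the labels `"harm"` and `"safe"` are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four model responses.
a = ["harm", "safe", "harm", "safe"]
b = ["harm", "safe", "safe", "safe"]
print(cohens_kappa(a, b))  # 0.5
```

Here the raters agree on three of four items (0.75 observed) against 0.5 expected by chance, which gives a kappa of 0.5.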
minor comments (1)
- [Abstract] Abstract: The final sentence contains a subject-verb agreement error ('making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning') and should be rephrased for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback, which identifies opportunities to improve transparency and support for our claims. We address each major comment below with proposed revisions where feasible. Our core empirical findings on the differential failure rates of LLM guardrails across mental health conditions remain unchanged, but we will enhance the presentation of methods and results to address the concerns raised.
Point-by-point responses
- Referee: [Abstract] Abstract: The abstract claims failure rates of up to 100% for several conditions but provides no details on sample sizes, number of prompts tested, inter-rater reliability for the taxonomy, or statistical controls, preventing verification of the central claim.
Authors: We agree the abstract prioritizes brevity over methodological specifics. The manuscript evaluates eight LLMs on 16 conditions with four attack variants, and the full Methods section describes the prompt generation process and taxonomy development. We will revise the abstract to incorporate the evaluation scale (number of models, conditions, and attack types) and a high-level reference to the taxonomy construction, while directing readers to the Methods for sample sizes, development process, and any reliability measures. revision: partial
- Referee: [Methods] Methods (implied by abstract description): The eight-dimension harm taxonomy and four adversarial attack variants are presented without demonstrated ecological validity, such as validation against logged user queries, clinical incident data, or expert clinician ratings, which is load-bearing for interpreting the reported failure rates as indicative of real-world harms.
Authors: The taxonomy dimensions are explicitly mapped to DSM-5 criteria for the selected conditions and draw on established AI safety harm categories. The attack variants extend documented jailbreak techniques. Direct validation against real user logs or clinical incidents was not conducted due to privacy, ethical, and access constraints. We will expand the Methods section with additional rationale for the taxonomy and attacks, and add a Limitations subsection explicitly discussing ecological validity and the need for future clinician-validated benchmarks. revision: partial
- Referee: [Results] Results (implied): The strongest claim regarding condition-specific failure rates lacks supporting quantitative details (e.g., exact percentages per condition, confidence intervals), making it difficult to assess the magnitude and consistency of the findings.
Authors: The Results section reports condition-specific outcomes across models. We will revise to ensure every reported failure rate is accompanied by the exact sample size per condition-attack combination and, where the data structure permits, include binomial confidence intervals to quantify precision and consistency. revision: yes
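The binomial confidence intervals promised in the last response matter most for the headline figure, because a 100% failure rate means very different things at different sample sizes. The sketch below uses the Wilson score interval, which is one standard choice; the rebuttal does not say which interval the authors would use, and the counts in the example (10 failures out of 10 prompts) are hypothetical.

```python
import math


def wilson_interval(failures, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical cell: all 10 prompts produced a safeguard failure.
low, high = wilson_interval(10, 10)
print(round(low, 3), round(high, 3))  # 0.722 1.0
```

Under these assumed counts, an observed 100% failure rate is compatible with a true rate as low as roughly 72%, which is why per-cell sample sizes are needed to interpret the reported figures.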
Circularity Check
No mathematical derivation or self-referential fitting present
Full rationale
This is an empirical evaluation study that introduces an eight-dimension harm taxonomy and applies four adversarial attack variants to LLM outputs across DSM-5 conditions. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear in the provided abstract or description. The central claims rest on direct application of the authors' framework to generated outputs rather than any derivation that reduces to its own inputs by construction. No self-citation load-bearing steps or renamings of known results are indicated. The work is self-contained as an empirical assessment.