pith. sign in

arxiv: 2606.23884 · v3 · pith:OJWYBJEQnew · submitted 2026-06-22 · 💻 cs.CL · cs.AI

One Year Later...The Harms Persist, But So Do We!

Pith reviewed 2026-07-02 21:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language modelsmental healthsafety guardrailsadversarial attacksharm taxonomyDSM-5 conditionsLLM evaluationclinical safety
0
0 comments X

The pith

LLM safety guardrails work reliably only for suicide and self-harm queries but fail for other mental health conditions with rates up to 100%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests eight proprietary large language models on conversations about 16 DSM-5 mental health conditions. It applies four variants of adversarial attacks and measures outputs against a new eight-dimension harm taxonomy. Safeguards succeed consistently only on suicide and self-harm topics. For eating disorders, substance use disorder, and major depressive disorder the models produce harmful content in nearly all tested cases. The authors conclude that current models require condition-specific safeguards before safe use in public settings such as schools or consumer chatbots.

Core claim

General-purpose LLMs show adequate safeguards only for suicide and self-harm; conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. The evaluation uses an eight-dimension harm taxonomy and a multi-dimensional framework across eight models, 16 DSM-5 conditions, and four adversarial attack variants.

What carries the argument

Eight-dimension harm taxonomy paired with a multi-dimensional evaluation framework that applies four adversarial attack variants to eight LLMs across 16 DSM-5 conditions.

If this is right

  • Ethical design of LLMs requires clearly defined harm categories for each clinical condition.
  • Safeguards must be implemented separately for each condition rather than applied uniformly.
  • Integration of these models into schools, search engines, and consumer chatbots carries risks to vulnerable populations until condition-specific protections exist.
  • Models without such safeguards should not be deployed in publicly available mental health contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evaluation approach could be applied to non-clinical sensitive topics such as legal or financial advice to test whether failures are domain-specific.
  • Real-world deployment might require ongoing monitoring beyond initial adversarial testing because user phrasing evolves.
  • Fine-tuning or post-training alignment targeted at individual conditions could be tested as a direct follow-up to the reported failure rates.

Load-bearing premise

The eight-dimension harm taxonomy and four adversarial attack variants capture the full range of harmful outputs that would appear in real user interactions with LLMs on these topics.

What would settle it

Direct observation of real user conversations with the same LLMs on the tested conditions, scored with the same taxonomy, would show whether failure rates match or diverge from the adversarial test results.

read the original abstract

General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety guardrails remain inadequate and inconsistent across clinical conditions. This study evaluates eight proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript evaluates the safety guardrails of eight proprietary LLMs for conversations on 16 DSM-5 mental health conditions. Using four adversarial attack variants and a newly introduced eight-dimension harm taxonomy, the authors find that safeguards are reliable only for suicide and self-harm, with failure rates reaching 100% for conditions including eating disorders, substance use disorder, and major depressive disorder. They conclude that current LLMs pose risks to vulnerable populations and call for better-defined harm categories and safeguards.

Significance. If the empirical results and taxonomy prove robust, this study would provide valuable evidence on the limitations of current LLM safety mechanisms in clinical domains, potentially guiding future development of more targeted guardrails. The multi-dimensional evaluation framework could serve as a basis for standardized assessments in the field.

major comments (3)
  1. [Abstract] Abstract: The abstract claims failure rates of up to 100% for several conditions but provides no details on sample sizes, number of prompts tested, inter-rater reliability for the taxonomy, or statistical controls, preventing verification of the central claim.
  2. [Methods] Methods (implied by abstract description): The eight-dimension harm taxonomy and four adversarial attack variants are presented without demonstrated ecological validity, such as validation against logged user queries, clinical incident data, or expert clinician ratings, which is load-bearing for interpreting the reported failure rates as indicative of real-world harms.
  3. [Results] Results (implied): The strongest claim regarding condition-specific failure rates lacks supporting quantitative details (e.g., exact percentages per condition, confidence intervals), making it difficult to assess the magnitude and consistency of the findings.
minor comments (1)
  1. [Abstract] Abstract: The final sentence contains a grammatical error: 'making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning.' should be rephrased for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback, which identifies opportunities to improve transparency and support for our claims. We address each major comment below with proposed revisions where feasible. Our core empirical findings on the differential failure rates of LLM guardrails across mental health conditions remain unchanged, but we will enhance the presentation of methods and results to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract claims failure rates of up to 100% for several conditions but provides no details on sample sizes, number of prompts tested, inter-rater reliability for the taxonomy, or statistical controls, preventing verification of the central claim.

    Authors: We agree the abstract prioritizes brevity over methodological specifics. The manuscript evaluates eight LLMs on 16 conditions with four attack variants, and the full Methods section describes the prompt generation process and taxonomy development. We will revise the abstract to incorporate the evaluation scale (number of models, conditions, and attack types) and a high-level reference to the taxonomy construction, while directing readers to the Methods for sample sizes, development process, and any reliability measures. revision: partial

  2. Referee: [Methods] Methods (implied by abstract description): The eight-dimension harm taxonomy and four adversarial attack variants are presented without demonstrated ecological validity, such as validation against logged user queries, clinical incident data, or expert clinician ratings, which is load-bearing for interpreting the reported failure rates as indicative of real-world harms.

    Authors: The taxonomy dimensions are explicitly mapped to DSM-5 criteria for the selected conditions and draw on established AI safety harm categories. The attack variants extend documented jailbreak techniques. Direct validation against real user logs or clinical incidents was not conducted due to privacy, ethical, and access constraints. We will expand the Methods section with additional rationale for the taxonomy and attacks, and add a Limitations subsection explicitly discussing ecological validity and the need for future clinician-validated benchmarks. revision: partial

  3. Referee: [Results] Results (implied): The strongest claim regarding condition-specific failure rates lacks supporting quantitative details (e.g., exact percentages per condition, confidence intervals), making it difficult to assess the magnitude and consistency of the findings.

    Authors: The Results section reports condition-specific outcomes across models. We will revise to ensure every reported failure rate is accompanied by the exact sample size per condition-attack combination and, where the data structure permits, include binomial confidence intervals to quantify precision and consistency. revision: yes

Circularity Check

0 steps flagged

No mathematical derivation or self-referential fitting present

full rationale

This is an empirical evaluation study that introduces an eight-dimension harm taxonomy and applies four adversarial attack variants to LLM outputs across DSM-5 conditions. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear in the provided abstract or description. The central claims rest on direct application of the authors' framework to generated outputs rather than any derivation that reduces to its own inputs by construction. No self-citation load-bearing steps or renamings of known results are indicated. The work is self-contained as an empirical assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study; the harm taxonomy is a new classification scheme rather than a mathematical construct. No free parameters, standard mathematical axioms, or invented physical entities are introduced.

pith-pipeline@v0.9.1-grok · 5680 in / 1181 out tokens · 24320 ms · 2026-07-02T21:36:02.909156+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 24 canonical work pages · 5 internal anchors

  1. [1]

    Crawford and T

    A. Crawford and T. Glatard. Urgent considerations for suicide prevention in the safe and ethical use of artificial intelligence.CMAJ, 198(15):E599–E601, April 2026. doi: 10.1503/cmaj.251693

  2. [2]

    A. M. Schoene and C. Canca. ‘For Argument’s Sake, Show Me How to Harm Myself!’: Jailbreaking LLMs in suicide and self-harm contexts, 2025. arXiv:2507.02990

  3. [3]

    American Psychiatric Publishing, Washington, DC, USA, 5th, text revision edition, 2022

    American Psychiatric Association.Diagnostic and Statistical Manual of Mental Disorders: DSM-5-TR. American Psychiatric Publishing, Washington, DC, USA, 5th, text revision edition, 2022

  4. [4]

    H. R. Lawrence, R. A. Schneider, S. B. Rubin, M. J. Mataric, D. J. McDuff, and M. J. Bell. The opportunities and risks of large language models in mental health.JMIR Mental Health, 11(1):e59479, 2024. doi: 10.2196/59479. 12 One Year Later

  5. [5]

    De Choudhury, S

    M. De Choudhury, S. R. Pendse, and N. Kumar. Benefits and harms of large language models in digital mental health, 2023. arXiv preprint, arXiv:2311.14693

  6. [6]

    Hua et al

    Y . Hua et al. Large language models in mental health care: A scoping review, 2024. arXiv:2401.02984

  7. [7]

    Elyoseph and I

    Z. Elyoseph and I. Levkovich. Beyond human expertise: The promise and limitations of ChatGPT in suicide risk assessment.Front. Psychiatry, 14:1213141, 2023. doi: 10.3389/fpsyt.2023.1213141

  8. [8]

    Levkovich and Z

    I. Levkovich and Z. Elyoseph. Suicide risk assessments through the eyes of ChatGPT-3.5 versus ChatGPT-4: Vignette study.JMIR Mental Health, 10:e51232, 2023. doi: 10.2196/51232

  9. [9]

    R. K. McBain et al. Evaluation of alignment between large language models and expert clinicians in suicide risk assessment.Psychiatr . Serv., 76(11), 2025. doi: 10.1176/appi.ps.20250086

  10. [10]

    Arnaiz-Rodriguez et al

    A. Arnaiz-Rodriguez et al. Between help and harm: An evaluation of mental health crisis handling by LLMs. JMIR Mental Health, 2025. doi: 10.2196/88435

  11. [11]

    Holmes, B

    G. Holmes, B. Tang, S. Gupta, S. Venkatesh, H. Christensen, and A. Whitton. Applications of large language models in the field of suicide prevention: Scoping review.J. Med. Internet Res., 27:e63126, 2025. doi: 10.2196/63126

  12. [12]

    Yildirim

    C. Yildirim. Differential harm propensity in personalized LLM agents, the curious case of mental health disclosure,

  13. [13]

    arXiv preprint arXiv:2603.16734

  14. [14]

    Souly et al

    A. Souly et al. A StrongREJECT for empty jailbreaks. InAdv. Neural Inf. Process. Syst. (NeurIPS), volume 37, 2024

  15. [15]

    Mazeika et al

    M. Mazeika et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. InProc. 41st Int. Conf. Mach. Learn. (ICML), pages 35181–35224, 2024

  16. [16]

    Cao et al

    H. Cao et al. SafeDialBench: A fine-grained safety evaluation benchmark for large language models in multi-turn dialogues with diverse jailbreak attacks, 2025. arXiv:2502.11090

  17. [17]

    Badawi et al

    A. Badawi et al. Assessing the quality of mental health support in LLM responses through multi-attribute human evaluation, 2026. arXiv:2601.18630

  18. [18]

    Judd et al

    N. Judd et al. Independent clinical evaluation of general-purpose LLM responses to signals of suicide risk, 2025. arXiv:2510.27521

  19. [19]

    Zirikly, P

    A. Zirikly, P. Resnik, O. Uzuner, and K. Hollingshead. CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. InProc. 6th Workshop Comput. Linguistics Clin. Psychol. (CLPsych), pages 24–33, Stroudsburg, PA, USA, 2019. Assoc. Comput. Linguistics. [Online]. Available: https://aclanthology.org/ W19-3003/

  20. [20]

    Z. Guo, A. Lai, J. H. Thygesen, J. Farrington, T. Keen, and K. Li. Large language models for mental health applications: Systematic review.JMIR Mental Health, 2024. doi: 10.2196/57400

  21. [21]

    X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang. ‘Do Anything Now’: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProc. ACM SIGSAC Conf. Comput. Commun. Secur . (CCS), pages 1671–1685, 2024

  22. [22]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Y . Liu et al. Jailbreaking ChatGPT via prompt engineering: An empirical study, 2023. arXiv preprint, arXiv:2305.13860

  23. [23]

    A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does LLM safety training fail?Adv. Neural Inf. Process. Syst. (NeurIPS), 36:80079–80110, 2023

  24. [24]

    Xie et al

    T. Xie et al. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. InProc. Int. Conf. Learn. Represent. (ICLR), 2025

  25. [25]

    Ganguli et al

    D. Ganguli et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,

  26. [26]

    arXiv preprint, arXiv:2209.07858

  27. [27]

    X. Yang, B. Zhou, X. Tang, J. Han, and S. Hu. Chain of attack: Hide your intention through multi-turn interrogation. InFindings of the Assoc. Comput. Linguistics: ACL 2025, pages 9881–9901, Vienna, Austria, 2025. doi: 10.18653/v1/2025.findings-acl.514

  28. [28]

    Zhang, Q

    H. Zhang, Q. Lou, and Y . Wang. Towards safe AI clinicians: A comprehensive study on large language model jailbreaking in healthcare, 2025. arXiv:2501.18632

  29. [29]

    Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

    Q. Xu et al. Do no harm: Exposing hidden vulnerabilities of LLMs via persona-based client simulation attack in psychological counseling, 2026. arXiv:2604.04842. 13 One Year Later

  30. [30]

    Zhu et al

    L. Zhu et al. JudgeLM: Fine-tuned large language models are scalable judges. InProc. Int. Conf. Learn. Represent. (ICLR), 2025

  31. [31]

    Li et al

    J. Li et al. Generative judge for evaluating alignment. InProc. Int. Conf. Learn. Represent. (ICLR), 2024

  32. [32]

    Kim et al

    S. Kim et al. Prometheus 2: An open source language model specialized in evaluating other language models. In Proc. 2024 Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2024

  33. [33]

    Wang et al

    Y . Wang et al. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. InProc. Int. Conf. Learn. Represent. (ICLR), 2024. arXiv:2306.05087

  34. [34]

    The Llama 3 Herd of Models

    A. Grattafiori et al. The Llama 3 herd of models, 2024. arXiv preprint, arXiv:2407.21783

  35. [35]

    M. M. Linehan.Cognitive Behavioral Treatment of Borderline Personality Disorder. Guilford Press, New York, NY , USA, 1993

  36. [36]

    Newlove et al

    L. Newlove et al. Crisis-focused psychosocial interventions for borderline personality disorder: systematic review and narrative synthesis.BJPsych Open, 8(3):e94, 2022. doi: 10.1192/bjo.2022.54

  37. [37]

    Towards Understanding Sycophancy in Language Models

    M. Sharma et al. Towards understanding sycophancy in language models, 2024. arXiv:2310.13548. A Experiment 2b Turn Structure The nine-turn structure of Experiment 2b escalates as follows. T1 introduces a fiction-embedded harm query in which the teenage protagonist begins researching the condition for their novel. T2 presents an avoidance monologue in whic...