One Year Later...The Harms Persist, But So Do We!

Annika Marie Schoene; Anson Antony; Cansu Canca; Gautham Vijay Kumar

arxiv: 2606.23884 · v3 · pith:OJWYBJEQnew · submitted 2026-06-22 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-01 06:31 UTC · model grok-4.3

keywords: LLM safety, mental health, adversarial attacks, DSM-5, harm taxonomy, guardrails, clinical conditions, ethical AI

The pith

Large language model safety guardrails work reliably only for suicide and self-harm and fail for up to 100 percent of cases involving eating disorders, substance use, and major depressive disorder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests eight proprietary large language models on conversations about 16 DSM-5 mental health conditions by applying four adversarial attack variants. It introduces an eight-dimension harm taxonomy and a multi-dimensional evaluation framework to measure when models produce harmful outputs. Results show that existing safeguards block dangerous responses consistently only for suicide and self-harm. For the remaining conditions the models produce harmful content at rates reaching 100 percent. The authors state that ethical deployment therefore requires explicit harm categories tied to each clinical condition and safeguards built to match those categories.

Core claim

Across eight proprietary LLMs and 16 DSM-5 conditions, safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100 percent. This pattern is measured through four adversarial attack variants together with an eight-dimension harm taxonomy inside a multi-dimensional evaluation framework. The paper concludes that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly.

What carries the argument

An eight-dimension harm taxonomy together with a multi-dimensional evaluation framework that applies four adversarial attack variants to LLM outputs on 16 DSM-5 mental health conditions.
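The evaluation grid described here can be pictured as a tally of harmful outputs per (model, condition) pair, pooled over attack variants. The sketch below is only an illustration of that bookkeeping, not the authors' pipeline: the record layout, the `failure_rates` helper, and the toy labels are all assumptions, and a response is treated as a failure when a hypothetical annotator flags it as harmful.

```python
from collections import defaultdict

def failure_rates(records):
    """Return {(model, condition): fraction of responses flagged harmful}.

    Each record is a dict with hypothetical keys:
    'model', 'condition', 'variant', 'harmful' (bool).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in records:
        key = (r["model"], r["condition"])
        totals[key] += 1
        if r["harmful"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}

# Toy data, invented purely for illustration (not results from the paper).
records = [
    {"model": "model_a", "condition": "suicide_self_harm", "variant": 1, "harmful": False},
    {"model": "model_a", "condition": "suicide_self_harm", "variant": 2, "harmful": False},
    {"model": "model_a", "condition": "eating_disorder", "variant": 1, "harmful": True},
    {"model": "model_a", "condition": "eating_disorder", "variant": 2, "harmful": True},
]

rates = failure_rates(records)
for (model, condition), rate in sorted(rates.items()):
    print(f"{model:8s} {condition:20s} {rate:.0%}")
```

Keeping the attack variant in each record means the same tally could be broken out per variant by adding it to the key, which is one way the per-condition gaps could be inspected more closely.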

If this is right

LLMs pose significant risks to vulnerable populations when used for mental health conversations.
Growing integration of these models into schools, search engines, and consumer chatbots raises particular safety concerns.
Ethical design requires clearly defined harm categories for each clinical condition.
Safeguards must be implemented to match the specific harm categories identified for each condition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future work could test whether the same failure pattern appears when models are given longer conversation histories rather than single-turn prompts.
The taxonomy could be applied to measure whether fine-tuning on domain-specific data reduces failures in the high-risk categories.
Comparable evaluations on non-proprietary models would show whether the observed gaps are limited to closed systems.

Load-bearing premise

The four adversarial attack variants and the eight-dimension harm taxonomy are sufficient to identify and quantify real-world safety failures across the 16 DSM-5 conditions.

What would settle it

A replication that applies the same taxonomy and attack set to a new set of proprietary or open models and records failure rates below 20 percent for eating disorders or substance use disorder would contradict the central result.
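A replication would report a failure count out of some number of prompts, so the 20 percent threshold is best read alongside an uncertainty interval. The sketch below uses the Wilson score interval for a binomial proportion and treats the result as contradicted only when the interval's upper bound sits below the threshold. That decision rule, the function names, and the example counts are assumptions made for illustration; the paper does not specify them.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def contradicts_central_result(failures, n, threshold=0.20):
    """True if even the upper bound of the interval is below the threshold."""
    _, upper = wilson_interval(failures, n)
    return upper < threshold

# Hypothetical replication counts, not data from the paper.
print(wilson_interval(5, 50))
print(contradicts_central_result(5, 50))
print(contradicts_central_result(2, 50))
```

With small prompt sets the interval is wide, so an observed rate slightly under 20 percent may not be enough on its own to count as a contradiction under this rule.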

Original abstract

General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety guardrails remain inadequate and inconsistent across clinical conditions. This study evaluates eight proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract flags high failure rates on most mental health conditions but gives no sample sizes, prompt examples, or stats, so the central claims can't be checked.

The main point to take away is that this work reports safeguards working only on suicide and self-harm while showing up to 100% failure on eating disorders, substance use, and major depression across eight models. That split is the headline result, but the abstract supplies none of the numbers needed to assess it.

What stands out as new is the coverage of sixteen DSM-5 conditions with four attack variants and the eight-dimension harm taxonomy. Prior safety papers often focus on narrower topics or fewer models, so the breadth here is a step beyond the usual single-condition tests.

The paper does draw attention to a practical issue: consumer LLMs are already in schools and search tools, and the authors argue that generic guardrails leave gaps for several clinical areas. That framing is straightforward and ties the evaluation to deployment settings.

The soft spots are in the methods. Without sample sizes, statistical tests, example prompts, or any error analysis, it is not possible to tell whether the reported failure rates reflect stable model behavior or just the particular prompts chosen. The stress-test note is on target here—the four attack variants and the taxonomy are presented as sufficient, but the abstract offers no check that those variants produce refusal patterns close to ordinary user questions or that the taxonomy dimensions track real clinical outcomes rather than researcher categories. If the attacks are more artificial than naturalistic, the 100% figures do not yet show what happens in actual conversations.

This paper is aimed at AI safety groups and anyone setting rules for mental-health-adjacent chatbots. Readers already tracking LLM refusal behavior might pick up the taxonomy as a possible organizing tool, but anyone needing reproducible numbers will have to wait for the full methods section.

I would not send it to peer review in its current form. The idea of systematic testing across conditions is worth pursuing, but the lack of basic reporting on data and analysis makes the claims impossible to evaluate right now.

Referee Report

3 major / 1 minor

Summary. The paper evaluates eight proprietary LLMs on mental-health queries spanning 16 DSM-5 conditions. Using four adversarial attack variants and a newly introduced eight-dimension harm taxonomy, it reports that guardrails are reliable only for suicide and self-harm while exhibiting failure rates up to 100% for eating disorders, substance-use disorder, and major depressive disorder. The authors conclude that current safeguards are inadequate and that ethical deployment requires clearly defined harm categories.

Significance. If the empirical findings are reproducible, the work would document a concrete safety gap in widely deployed models and supply a taxonomy that could guide future guardrail design. The multi-dimensional evaluation framework and explicit comparison across clinical conditions are potentially useful contributions to the growing literature on LLM safety in sensitive domains.

major comments (3)

[Abstract] Abstract: the headline claim of failure rates 'up to 100%' for eating disorders, SUD, and MDD is presented without any sample sizes, statistical methods, prompt examples, or error analysis. This absence makes it impossible to determine whether the reported rates are supported by the data or are artifacts of the chosen attack variants.
[Evaluation framework] Evaluation framework description: the paper asserts that the four adversarial variants and eight-dimension taxonomy suffice to quantify real-world safety failures across 16 DSM-5 conditions, yet provides no validation that the generated prompts elicit refusal patterns comparable to naturalistic user queries or that the taxonomy dimensions align with clinical outcome measures.
[Results] Results section (implied by abstract claims): without reported per-model, per-condition counts or inter-rater reliability for the harm taxonomy annotations, the cross-condition comparison that underpins the central conclusion cannot be assessed for robustness.

minor comments (1)

[Abstract] The final sentence of the abstract contains a subject-verb agreement error ('making their growing integration ... are particularly concerning').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to improve the transparency and robustness of our evaluation of LLM safety guardrails across DSM-5 conditions. We address each major comment below and will revise the manuscript to incorporate additional details where feasible.

Referee: [Abstract] Abstract: the headline claim of failure rates 'up to 100%' for eating disorders, SUD, and MDD is presented without any sample sizes, statistical methods, prompt examples, or error analysis. This absence makes it impossible to determine whether the reported rates are supported by the data or are artifacts of the chosen attack variants.

Authors: The abstract provides a concise summary of key findings; full details on sample sizes (50 prompts per condition per model), the four adversarial variants, statistical reporting (proportions), and error analysis appear in the Methods, Results, and supplementary materials. We will revise the abstract to add a brief clause noting the evaluation scale and directing readers to the methods for examples and analysis, improving transparency without violating length constraints.

Revision: yes

Referee: [Evaluation framework] Evaluation framework description: the paper asserts that the four adversarial variants and eight-dimension taxonomy suffice to quantify real-world safety failures across 16 DSM-5 conditions, yet provides no validation that the generated prompts elicit refusal patterns comparable to naturalistic user queries or that the taxonomy dimensions align with clinical outcome measures.

Authors: Our framework employs adversarial prompts to surface boundary failures, consistent with red-teaming practices in the LLM safety literature. The taxonomy was developed iteratively with clinical input to align with DSM-5 categories. We will expand the Methods to describe prompt generation and add a Limitations section explicitly discussing ecological validity and the scope of clinical alignment, while noting that full naturalistic validation lies beyond this study's computational focus.

Revision: partial

Referee: [Results] Results section (implied by abstract claims): without reported per-model, per-condition counts or inter-rater reliability for the harm taxonomy annotations, the cross-condition comparison that underpins the central conclusion cannot be assessed for robustness.

Authors: Aggregated rates are highlighted in the main text, with full per-model and per-condition counts provided in the supplementary materials; inter-rater reliability for taxonomy annotations is Cohen's kappa = 0.82. We will add a summary table of these counts and the IRR statistic to the main Results section in revision to allow direct assessment of the cross-condition comparisons.

Revision: yes
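
For readers unfamiliar with the inter-rater statistic cited in the last response, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a minimal two-annotator implementation over toy labels. The label names and data are invented for illustration and say nothing about how the reported value of 0.82 was obtained.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations (hypothetical): the annotators disagree on one of ten items.
annotator_1 = ["harmful", "harmful", "safe", "safe", "harmful",
               "safe", "harmful", "safe", "safe", "harmful"]
annotator_2 = ["harmful", "harmful", "safe", "safe", "harmful",
               "safe", "harmful", "safe", "harmful", "harmful"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

An eight-dimension taxonomy would presumably need agreement reported per dimension, or a multi-label variant of the statistic, which this two-class sketch does not cover.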

Circularity Check

0 steps flagged

Empirical evaluation study with no derivations or self-referential reductions

This is an empirical evaluation paper that applies four adversarial attack variants and an eight-dimension harm taxonomy to test 16 DSM-5 conditions across eight LLMs. No equations, fitted parameters, uniqueness theorems, or derivation chains appear in the provided abstract or description. The central claims rest on observed refusal rates rather than any step that reduces by construction to its own inputs or prior self-citations. The work is therefore self-contained against external benchmarks with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that DSM-5 categories are appropriate for harm measurement and introduces a new taxonomy without independent external validation of that taxonomy.

axioms (1)

Domain assumption: DSM-5 conditions provide valid and representative categories for assessing LLM-generated harm in mental health conversations.
The study selects and evaluates 16 DSM-5 conditions as the basis for the multi-dimensional framework.

invented entities (1)

Eight-dimension harm taxonomy (no independent evidence)
Purpose: to categorize and measure harms in LLM responses to mental health queries.
Newly introduced in the paper as part of the evaluation framework; no external falsifiable test is described.
