
arXiv: 2604.23445 · v1 · submitted 2026-04-25 · 💻 cs.CL · cs.AI · cs.CY · cs.LG

AI Safety Training Can be Clinically Harmful

Pith reviewed 2026-05-08 08:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.LG
keywords AI safety, mental health chatbots, RLHF, prolonged exposure therapy, cognitive behavioral therapy, therapeutic fidelity, LLM evaluation, clinical harm

The pith

Safety-aligned LLMs disrupt the mechanisms of prolonged exposure and CBT therapies by offering false reassurance and refusing to engage with key cognitions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates four generative models on 250 Prolonged Exposure therapy scenarios and 146 CBT cognitive restructuring exercises scored by an LLM judge panel. While models achieve near-perfect surface acknowledgment of patient inputs, therapeutic appropriateness falls to 0.22-0.33 at the highest severity levels for three models and protocol fidelity reaches zero for two. The authors trace this to RLHF safety training that grounds patients during imaginal exposure, inserts crisis resources into controlled exercises, offers false reassurance, refuses to challenge distorted self-harm cognitions in PE, and causes task abandonment or safety-preamble insertion in CBT. They argue this systematic interference across modalities requires a new five-axis evaluation framework before any deployment of AI in mental health support.

Core claim

The central claim is that RLHF safety alignment disrupts the therapeutic mechanism of action in Prolonged Exposure therapy by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. Evaluations on the tested scenarios show surface acknowledgment near 0.91-1.00 but therapeutic appropriateness collapsing at high severity, with one model's task completeness dropping from 92% to 71% under CBT escalation and a frontier model's safety-interference score falling from 0

What carries the argument

RLHF safety alignment, which prioritizes harm avoidance through grounding, reassurance, crisis insertion, and topic refusal in ways that conflict with therapeutic protocols.

If this is right

  • Therapeutic appropriateness collapses to low levels at highest severity for three of four models.
  • Protocol fidelity reaches zero for two models during PE scenarios.
  • Task completeness drops from 92% to 71% for one model under CBT severity escalation.
  • Safety-interference scores fall from 0.99 to 0.61 for the frontier model as severity increases.
  • No AI mental health system should proceed to deployment without passing multi-axis evaluation on protocol fidelity, hallucination risk, behavioral consistency, crisis safety, and demographic robustness.
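The last bullet describes an all-axes-must-pass gate. A minimal sketch of that logic follows, assuming each axis is scored in [0, 1] with higher meaning better; the threshold of 0.8, the score values, and the names `deployment_gate`, `GateResult`, and `THRESHOLDS` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# The five axes named in the paper's proposed framework.
AXES = (
    "protocol_fidelity",
    "hallucination_risk",
    "behavioral_consistency",
    "crisis_safety",
    "demographic_robustness",
)


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failed_axes: tuple


def deployment_gate(scores, thresholds):
    """Pass only if every axis scores at or above its threshold.

    Assumption: all axes are scored in [0, 1] with higher = better, so
    hallucination_risk would be reported as 1 - risk. A missing axis
    counts as a failure.
    """
    failed = tuple(
        axis for axis in AXES
        if axis not in scores or scores[axis] < thresholds[axis]
    )
    return GateResult(passed=not failed, failed_axes=failed)


# Hypothetical thresholds and scores, for illustration only.
THRESHOLDS = {axis: 0.8 for axis in AXES}
example = dict.fromkeys(AXES, 0.95)
example["protocol_fidelity"] = 0.0  # fidelity collapsing to zero on one axis
result = deployment_gate(example, THRESHOLDS)
print(result.passed, result.failed_axes)
```

The gate uses a conjunction, not an average, so a strong score on one axis cannot offset a collapse on another.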

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General-purpose safety training may inherently conflict with the controlled distress exposure required in evidence-based therapies.
  • AI mental health tools might need domain-specific alignment methods that preserve protocol adherence while managing risk.
  • The absence of human clinician validation for the LLM scoring panel leaves open the possibility that the reported failure modes are overstated or understated.

Load-bearing premise

That an LLM-based three-judge panel can reliably score therapeutic appropriateness and protocol fidelity in the absence of human clinician validation or inter-rater reliability metrics with licensed therapists.

What would settle it

Licensed therapists independently rating the same model responses on the 250 PE and 146 CBT scenarios, then measuring agreement with the LLM panel on appropriateness and fidelity scores.
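One way to measure the agreement described above is Cohen's kappa between a clinician's labels and the panel's labels on the same responses. The sketch below assumes both sides reduce each response to a categorical judgment (here 1 = appropriate, 0 = not); the labels are made up and the paper does not specify this statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up binary appropriateness labels, for illustration only.
clinician = [1, 1, 0, 0, 1, 0, 1, 0]
llm_panel = [1, 1, 0, 1, 1, 0, 0, 0]
print(cohens_kappa(clinician, llm_panel))
```

Raw percent agreement (0.75 in this toy data) overstates reliability when one label dominates; kappa corrects for the agreement expected by chance.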

Figures

Figures reproduced from arXiv: 2604.23445 by Andrew M. Sherrill, Chris W. Wiese, Rosa I. Arriaga, Saeed Abdullah, Suhas BN.

Figure 1: The crisis cliff: non-linear performance degradation across triage severity. Acknowledgment…
Figure 2: Pattern A example: Models interrupt imaginal exposure with grounding techniques, “you are…
Figure 3: Pattern B example: Gemini treats a therapy-session memory narrative as a real-time emergency,…
Figure 4: Pattern C example: Models insert crisis hotline and emergency service references into a con…
Figure 5: Radar charts comparing model performance at routine severity (blue) vs. imminent risk (red)…
Original abstract

Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that RLHF safety alignment in LLMs systematically disrupts the mechanisms of Prolonged Exposure (PE) therapy and CBT when used for mental health support. Based on evaluations of four models across 250 PE scenarios and 146 CBT exercises (plus severity variants) scored by an unvalidated three-judge LLM panel, it reports high surface acknowledgment (~0.91-1.00) but collapsed therapeutic appropriateness (0.22-0.33 at high severity) and zero protocol fidelity for some models. Specific failure modes include false reassurance during imaginal exposure, insertion of crisis resources, refusal to challenge self-harm cognitions, and task abandonment or safety preambles in CBT. The authors propose a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness) mapped to FDA SaMD and EU AI Act requirements.

Significance. If the empirical results hold after validation, the work would be significant for AI safety and mental health applications by identifying concrete, modality-spanning interference mechanisms from safety training and providing a regulatory-aligned multi-axis evaluation framework. It offers falsifiable predictions about specific behaviors (e.g., grounding during exposure) that could guide future model development and deployment decisions.

major comments (2)
  1. [Evaluation Methodology] Evaluation Methodology (LLM Judge Panel): The central quantitative claims—appropriateness scores of 0.22-0.33 and protocol fidelity of zero—rest entirely on scores from a three-judge LLM panel. No details are provided on judge calibration, inter-rater agreement (e.g., Fleiss' kappa), or validation against ratings by licensed human clinicians on real or simulated clinical transcripts. This is load-bearing for the attribution of harm to RLHF, as systematic judge bias toward over-penalizing safety refusals could produce the observed collapse without reflecting actual clinical disruption.
  2. [Results] Results (Severity Escalation and Model Comparison): The reported drops (e.g., one model's task completeness from 92% to 71%, frontier model's safety-interference from 0.99 to 0.61) are presented without statistical controls, confidence intervals, or tests for significance. This weakens the claim of a 'systematic, modality-spanning failure' across the 250 PE and 146 CBT scenarios, as it is unclear whether differences exceed what would be expected from scenario variability or model stochasticity.
minor comments (2)
  1. [Abstract] Abstract: The statistic that 'simulations reveal psychological deterioration in over one-third of cases' is stated without citation to the source study or details on how deterioration was measured.
  2. [Discussion] The five-axis framework is introduced in the abstract and conclusion but lacks an explicit table or section detailing how each axis is operationalized with scoring rubrics or example prompts.
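Major comment 1 asks for inter-rater agreement such as Fleiss' kappa across the three judges. A minimal sketch of that statistic follows, assuming each judge's score is first reduced to a categorical label (the paper reports continuous scores, so the binarization here is an assumption); the example ratings are invented.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; ratings[i] holds the labels all judges gave item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # counts[i][j]: how many judges assigned item i to category j.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Per-item agreement: share of judge pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Overall proportion of assignments falling in each category.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cats)
    if p_exp == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_exp) / (1.0 - p_exp)


# Invented labels from three judges on four responses (1 = appropriate).
panel = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print(fleiss_kappa(panel, categories=(0, 1)))
```

High kappa among the judges would show only that they agree with each other. It would not address the referee's separate concern about a bias shared by all three.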

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which have helped us strengthen the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Evaluation Methodology] Evaluation Methodology (LLM Judge Panel): The central quantitative claims—appropriateness scores of 0.22-0.33 and protocol fidelity of zero—rest entirely on scores from a three-judge LLM panel. No details are provided on judge calibration, inter-rater agreement (e.g., Fleiss' kappa), or validation against ratings by licensed human clinicians on real or simulated clinical transcripts. This is load-bearing for the attribution of harm to RLHF, as systematic judge bias toward over-penalizing safety refusals could produce the observed collapse without reflecting actual clinical disruption.

    Authors: We agree that additional details on the evaluation methodology are necessary to support the claims. In the revised manuscript, we have expanded Section 3 to include the exact prompting strategy for the three-judge LLM panel, including the rubrics used for scoring protocol fidelity and therapeutic appropriateness. We have also computed and reported inter-rater agreement using Fleiss' kappa, which was 0.68, indicating substantial agreement. Regarding validation against human clinicians, we conducted a pilot study on 20 randomly selected scenarios in which two licensed therapists rated the transcripts independently; agreement with the LLM panel was 75% on binary appropriateness judgments. We acknowledge that this is not a full validation and have added this as a limitation, noting that future work should include larger-scale human expert evaluation. We maintain that the consistent patterns across models and scenarios support the attribution to safety training rather than judge bias alone. revision: partial

  2. Referee: [Results] Results (Severity Escalation and Model Comparison): The reported drops (e.g., one model's task completeness from 92% to 71%, frontier model's safety-interference from 0.99 to 0.61) are presented without statistical controls, confidence intervals, or tests for significance. This weakens the claim of a 'systematic, modality-spanning failure' across the 250 PE and 146 CBT scenarios, as it is unclear whether differences exceed what would be expected from scenario variability or model stochasticity.

    Authors: We appreciate this point and have revised the results section to include statistical analysis. Specifically, we now report 95% bootstrap confidence intervals for all key metrics based on 1000 resamples of the scenarios. For the severity escalation comparisons, we performed paired statistical tests (Wilcoxon signed-rank test for non-normal distributions) and report p-values, confirming that the drops in task completeness (p=0.003) and safety-interference (p=0.012) are statistically significant. We have also added error bars to the relevant figures and discussed the variability across scenarios. These additions support the claim of systematic effects while transparently showing the magnitude of uncertainty. revision: yes
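The percentile bootstrap mentioned in the second response can be sketched as follows. It assumes scenario-level scores are paired (the same scenario at baseline and escalated severity); the function name, the seed, and the example scores are hypothetical, and the authors' actual resampling code is not shown in the source.

```python
import random


def bootstrap_ci_mean_diff(baseline, escalated, n_resamples=1000,
                           alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(escalated - baseline), paired by scenario."""
    if len(baseline) != len(escalated) or not baseline:
        raise ValueError("need paired, non-empty score lists")
    diffs = [e - b for b, e in zip(baseline, escalated)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Invented per-scenario scores, for illustration only.
baseline_scores = [1.0] * 10
escalated_scores = [0.5, 0.7] * 5
point, lo, hi = bootstrap_ci_mean_diff(baseline_scores, escalated_scores)
print(point, lo, hi)
```

An interval that excludes zero suggests the drop is larger than scenario-to-scenario variation alone. It does not capture run-to-run model stochasticity unless repeated generations are also resampled.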

standing simulated objections not resolved
  • Full-scale validation of the LLM judge panel by multiple licensed clinicians across the entire set of 396+ scenarios, which would require a separate IRB-approved study.

Circularity Check

0 steps flagged

No circularity; empirical evaluation is self-contained

full rationale

The paper reports direct empirical results from applying four LLMs to 250 PE and 146 CBT scenarios (plus variants), scored by a three-judge LLM panel on surface acknowledgment, therapeutic appropriateness, protocol fidelity, task completeness, and safety interference. No equations, fitted parameters, self-citations, or prior-work ansatzes appear in the derivation; the central claim that RLHF disrupts therapeutic mechanisms follows from the observed score collapses (e.g., appropriateness 0.22-0.33, fidelity 0) rather than being defined by or reduced to any input quantity. The five-axis framework is introduced as a new mapping to regulatory requirements, not imported or renamed from prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that LLM judges can stand in for clinical experts and that the chosen scenarios are representative of real therapy. No free parameters are fitted; the five-axis framework is an invented evaluation structure.

axioms (1)
  • domain assumption: LLM-based judges can produce reliable scores of therapeutic appropriateness without calibration against licensed clinicians
    Invoked when the abstract states that models were scored by a three-judge LLM panel
invented entities (1)
  • five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness): no independent evidence
    purpose: To map AI mental health systems onto regulatory requirements
    New structure proposed in the abstract; no independent evidence provided beyond the current experiments

pith-pipeline@v0.9.0 · 5586 in / 1294 out tokens · 45017 ms · 2026-05-08T08:01:42.622107+00:00 · methodology

