One Year Later...The Harms Persist, But So Do We!

Annika Marie Schoene; Anson Antony; Cansu Canca; Gautham Vijay Kumar

arxiv: 2606.23884 · v3 · pith:OJWYBJEQnew · submitted 2026-06-22 · 💻 cs.CL · cs.AI


Pith reviewed 2026-07-02 21:36 UTC · model grok-4.3

classification: cs.CL, cs.AI

keywords: large language models, mental health, safety guardrails, adversarial attacks, harm taxonomy, DSM-5 conditions, LLM evaluation, clinical safety

The pith

LLM safety guardrails hold reliably only for suicide and self-harm queries; for other mental health conditions they fail at rates of up to 100%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests eight proprietary large language models on conversations about 16 DSM-5 mental health conditions. It applies four variants of adversarial attacks and measures outputs against a new eight-dimension harm taxonomy. Safeguards succeed consistently only on suicide and self-harm topics. For eating disorders, substance use disorder, and major depressive disorder the models produce harmful content in nearly all tested cases. The authors conclude that current models require condition-specific safeguards before safe use in public settings such as schools or consumer chatbots.

Core claim

General-purpose LLMs show adequate safeguards only for suicide and self-harm; conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. The evaluation uses an eight-dimension harm taxonomy and a multi-dimensional framework across eight models, 16 DSM-5 conditions, and four adversarial attack variants.

What carries the argument

Eight-dimension harm taxonomy paired with a multi-dimensional evaluation framework that applies four adversarial attack variants to eight LLMs across 16 DSM-5 conditions.
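As a rough illustration of the scale of that framework, the grid of eight models, 16 conditions, and four attack variants can be enumerated and scored as below. This is a sketch under stated assumptions, not the authors' pipeline: the model, condition, and attack names are hypothetical placeholders (the abstract gives only the counts), and each cell is reduced to a single boolean "harmful" judgment rather than scores on the eight taxonomy dimensions.

```python
from collections import defaultdict
from itertools import product

# Hypothetical placeholder names: the source states only the counts
# (8 models, 16 DSM-5 conditions, 4 attack variants), not the identities.
MODELS = [f"model_{i}" for i in range(1, 9)]
CONDITIONS = [f"condition_{i}" for i in range(1, 17)]
ATTACKS = [f"attack_variant_{i}" for i in range(1, 5)]


def evaluation_grid():
    """Return every (model, condition, attack) cell of the evaluation grid."""
    return list(product(MODELS, CONDITIONS, ATTACKS))


def failure_rate_by_condition(results):
    """Aggregate per-cell judgments into a failure rate per condition.

    results maps (model, condition, attack) -> bool, where True means the
    output was judged harmful (an assumed simplification of the taxonomy).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for (_model, condition, _attack), harmful in results.items():
        totals[condition] += 1
        failures[condition] += int(harmful)
    return {c: failures[c] / totals[c] for c in totals}
```

With one judgment per cell the grid has 8 × 16 × 4 = 512 cells, so each condition's rate rests on 32 judgments; the actual number of prompts per cell is not stated in the abstract.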

If this is right

Ethical design of LLMs requires clearly defined harm categories for each clinical condition.
Safeguards must be implemented separately for each condition rather than applied uniformly.
Integration of these models into schools, search engines, and consumer chatbots carries risks to vulnerable populations until condition-specific protections exist.
Models without such safeguards should not be deployed in publicly available mental health contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evaluation approach could be applied to non-clinical sensitive topics such as legal or financial advice to test whether failures are domain-specific.
Real-world deployment might require ongoing monitoring beyond initial adversarial testing because user phrasing evolves.
Fine-tuning or post-training alignment targeted at individual conditions could be tested as a direct follow-up to the reported failure rates.

Load-bearing premise

The eight-dimension harm taxonomy and four adversarial attack variants capture the full range of harmful outputs that would appear in real user interactions with LLMs on these topics.

What would settle it

Direct observation of real user conversations with the same LLMs on the tested conditions, scored with the same taxonomy, would show whether failure rates match or diverge from the adversarial test results.

Original abstract

General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety guardrails remain inadequate and inconsistent across clinical conditions. This study evaluates eight proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new harm taxonomy and condition-by-condition results are the main addition, but the high failure rates rest on prompts and classifications whose link to actual user interactions is not shown.

The paper's main addition is an eight-dimension harm taxonomy applied to 16 DSM-5 conditions across eight proprietary LLMs, using four adversarial attack variants. It reports that safeguards hold reliably only for suicide and self-harm, while other conditions reach failure rates of up to 100%. That structured breakdown across clinical topics is the concrete piece others could use or test.

The work is straightforward in its empirical setup and flags a real deployment issue for consumer-facing models. The taxonomy gives a clear way to categorize outputs, which is useful even if later work revises it.

The soft spots are in the methods and validation. No sample sizes, prompt counts, or inter-rater numbers appear in the abstract, and the stress-test concern holds: the attack variants and taxonomy have no demonstrated match to logged user queries or clinician-rated incidents. If the chosen prompts overstate harms that rarely occur in normal conversation, the 100% figures do not support the safety conclusion. The full text may add these details, but the current evidence leaves the central claim unverified.

This is for AI safety researchers who work on mental-health-adjacent applications. A reader in that group gets a usable taxonomy and a set of conditions to compare against their own tests. It deserves peer review because the topic is timely and the empirical framing is clear, even though the methods will need tightening before the results can be taken as reliable.

Referee Report

3 major / 1 minor

Summary. The manuscript evaluates the safety guardrails of eight proprietary LLMs for conversations on 16 DSM-5 mental health conditions. Using four adversarial attack variants and a newly introduced eight-dimension harm taxonomy, the authors find that safeguards are reliable only for suicide and self-harm, with failure rates reaching 100% for conditions including eating disorders, substance use disorder, and major depressive disorder. They conclude that current LLMs pose risks to vulnerable populations and call for better-defined harm categories and safeguards.

Significance. If the empirical results and taxonomy prove robust, this study would provide valuable evidence on the limitations of current LLM safety mechanisms in clinical domains, potentially guiding future development of more targeted guardrails. The multi-dimensional evaluation framework could serve as a basis for standardized assessments in the field.

major comments (3)

[Abstract] Abstract: The abstract claims failure rates of up to 100% for several conditions but provides no details on sample sizes, number of prompts tested, inter-rater reliability for the taxonomy, or statistical controls, preventing verification of the central claim.
[Methods] Methods (implied by abstract description): The eight-dimension harm taxonomy and four adversarial attack variants are presented without demonstrated ecological validity, such as validation against logged user queries, clinical incident data, or expert clinician ratings, which is load-bearing for interpreting the reported failure rates as indicative of real-world harms.
[Results] Results (implied): The strongest claim regarding condition-specific failure rates lacks supporting quantitative details (e.g., exact percentages per condition, confidence intervals), making it difficult to assess the magnitude and consistency of the findings.
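The inter-rater reliability requested in the first major comment is commonly reported as Cohen's kappa. The sketch below computes it for two raters who label the same outputs; it is offered as one plausible way to produce that number, not as the statistic the authors used. The function name and the "harmful"/"safe" labels are assumptions made for illustration.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return math.nan
    return (observed - expected) / (1 - expected)
```

For example, if two raters agree on three of four outputs and chance agreement is 0.5, kappa is 0.5, noticeably lower than the raw 75% agreement.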

minor comments (1)

[Abstract] Abstract: The final sentence contains a grammatical error: 'making their growing integration into publicly available settings (e.g., schools, search engines, and consumer chatbots) are particularly concerning.' should be rephrased for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback, which identifies opportunities to improve transparency and support for our claims. We address each major comment below with proposed revisions where feasible. Our core empirical findings on the differential failure rates of LLM guardrails across mental health conditions remain unchanged, but we will enhance the presentation of methods and results to address the concerns raised.

Referee: [Abstract] Abstract: The abstract claims failure rates of up to 100% for several conditions but provides no details on sample sizes, number of prompts tested, inter-rater reliability for the taxonomy, or statistical controls, preventing verification of the central claim.

Authors: We agree the abstract prioritizes brevity over methodological specifics. The manuscript evaluates eight LLMs on 16 conditions with four attack variants, and the full Methods section describes the prompt generation process and taxonomy development. We will revise the abstract to incorporate the evaluation scale (number of models, conditions, and attack types) and a high-level reference to the taxonomy construction, while directing readers to the Methods for sample sizes, development process, and any reliability measures.

revision: partial

Referee: [Methods] Methods (implied by abstract description): The eight-dimension harm taxonomy and four adversarial attack variants are presented without demonstrated ecological validity, such as validation against logged user queries, clinical incident data, or expert clinician ratings, which is load-bearing for interpreting the reported failure rates as indicative of real-world harms.

Authors: The taxonomy dimensions are explicitly mapped to DSM-5 criteria for the selected conditions and draw on established AI safety harm categories. The attack variants extend documented jailbreak techniques. Direct validation against real user logs or clinical incidents was not conducted due to privacy, ethical, and access constraints. We will expand the Methods section with additional rationale for the taxonomy and attacks, and add a Limitations subsection explicitly discussing ecological validity and the need for future clinician-validated benchmarks.

revision: partial

Referee: [Results] Results (implied): The strongest claim regarding condition-specific failure rates lacks supporting quantitative details (e.g., exact percentages per condition, confidence intervals), making it difficult to assess the magnitude and consistency of the findings.

Authors: The Results section reports condition-specific outcomes across models. We will revise to ensure every reported failure rate is accompanied by the exact sample size per condition-attack combination and, where the data structure permits, include binomial confidence intervals to quantify precision and consistency.

revision: yes
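The binomial confidence intervals promised in the last response matter most at the extremes: when every trial fails, the usual normal-approximation interval collapses to zero width, whereas the Wilson score interval stays informative. A minimal sketch follows, assuming a Wilson interval is an acceptable choice (the rebuttal does not name a method) and using a hypothetical sample size, since the source gives none.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    failures: number of harmful outputs observed; n: number of trials;
    z: normal quantile (1.96 corresponds to roughly 95% coverage).
    """
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 10 of 10 prompts produced harmful output.
low, high = wilson_interval(10, 10)
```

In that hypothetical case the lower bound equals n / (n + z²), about 0.72, so a reported "100%" from ten prompts is compatible with a true failure rate well below 100%. This is why the per-cell sample sizes requested by the referee are needed to interpret the headline figure.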

Circularity Check

0 steps flagged

No mathematical derivation or self-referential fitting present

This is an empirical evaluation study that introduces an eight-dimension harm taxonomy and applies four adversarial attack variants to LLM outputs across DSM-5 conditions. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear in the provided abstract or description. The central claims rest on direct application of the authors' framework to generated outputs rather than any derivation that reduces to its own inputs by construction. No self-citation load-bearing steps or renamings of known results are indicated. The work is self-contained as an empirical assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study; the harm taxonomy is a new classification scheme rather than a mathematical construct. No free parameters, standard mathematical axioms, or invented physical entities are introduced.

pith-pipeline@v0.9.1-grok · 5680 in / 1181 out tokens · 24320 ms · 2026-07-02T21:36:02.909156+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 24 canonical work pages · 5 internal anchors

[1] A. Crawford and T. Glatard. Urgent considerations for suicide prevention in the safe and ethical use of artificial intelligence. CMAJ, 198(15):E599–E601, April 2026. doi: 10.1503/cmaj.251693
[2] A. M. Schoene and C. Canca. ‘For Argument’s Sake, Show Me How to Harm Myself!’: Jailbreaking LLMs in suicide and self-harm contexts, 2025. arXiv:2507.02990
[3] American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders: DSM-5-TR. American Psychiatric Publishing, Washington, DC, USA, 5th, text revision edition, 2022
[4] H. R. Lawrence, R. A. Schneider, S. B. Rubin, M. J. Mataric, D. J. McDuff, and M. J. Bell. The opportunities and risks of large language models in mental health. JMIR Mental Health, 11(1):e59479, 2024. doi: 10.2196/59479
[5] M. De Choudhury, S. R. Pendse, and N. Kumar. Benefits and harms of large language models in digital mental health, 2023. arXiv preprint, arXiv:2311.14693
[6] Y. Hua et al. Large language models in mental health care: A scoping review, 2024. arXiv:2401.02984
[7] Z. Elyoseph and I. Levkovich. Beyond human expertise: The promise and limitations of ChatGPT in suicide risk assessment. Front. Psychiatry, 14:1213141, 2023. doi: 10.3389/fpsyt.2023.1213141
[8] I. Levkovich and Z. Elyoseph. Suicide risk assessments through the eyes of ChatGPT-3.5 versus ChatGPT-4: Vignette study. JMIR Mental Health, 10:e51232, 2023. doi: 10.2196/51232
[9] R. K. McBain et al. Evaluation of alignment between large language models and expert clinicians in suicide risk assessment. Psychiatr. Serv., 76(11), 2025. doi: 10.1176/appi.ps.20250086
[10] A. Arnaiz-Rodriguez et al. Between help and harm: An evaluation of mental health crisis handling by LLMs. JMIR Mental Health, 2025. doi: 10.2196/88435
[11] G. Holmes, B. Tang, S. Gupta, S. Venkatesh, H. Christensen, and A. Whitton. Applications of large language models in the field of suicide prevention: Scoping review. J. Med. Internet Res., 27:e63126, 2025. doi: 10.2196/63126
[12] C. Yildirim. Differential harm propensity in personalized LLM agents, the curious case of mental health disclosure. arXiv preprint arXiv:2603.16734
[14] A. Souly et al. A StrongREJECT for empty jailbreaks. In Adv. Neural Inf. Process. Syst. (NeurIPS), volume 37, 2024
[15] M. Mazeika et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proc. 41st Int. Conf. Mach. Learn. (ICML), pages 35181–35224, 2024
[16] H. Cao et al. SafeDialBench: A fine-grained safety evaluation benchmark for large language models in multi-turn dialogues with diverse jailbreak attacks, 2025. arXiv:2502.11090
[17] A. Badawi et al. Assessing the quality of mental health support in LLM responses through multi-attribute human evaluation, 2026. arXiv:2601.18630
[18] N. Judd et al. Independent clinical evaluation of general-purpose LLM responses to signals of suicide risk, 2025. arXiv:2510.27521
[19] A. Zirikly, P. Resnik, O. Uzuner, and K. Hollingshead. CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. In Proc. 6th Workshop Comput. Linguistics Clin. Psychol. (CLPsych), pages 24–33, Stroudsburg, PA, USA, 2019. Assoc. Comput. Linguistics. [Online]. Available: https://aclanthology.org/W19-3003/
[20] Z. Guo, A. Lai, J. H. Thygesen, J. Farrington, T. Keen, and K. Li. Large language models for mental health applications: Systematic review. JMIR Mental Health, 2024. doi: 10.2196/57400
[21] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. ‘Do Anything Now’: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), pages 1671–1685, 2024
[22] Y. Liu et al. Jailbreaking ChatGPT via prompt engineering: An empirical study, 2023. arXiv preprint, arXiv:2305.13860
[23] A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does LLM safety training fail? Adv. Neural Inf. Process. Syst. (NeurIPS), 36:80079–80110, 2023
[24] T. Xie et al. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. In Proc. Int. Conf. Learn. Represent. (ICLR), 2025
[25] D. Ganguli et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint, arXiv:2209.07858
[27] X. Yang, B. Zhou, X. Tang, J. Han, and S. Hu. Chain of attack: Hide your intention through multi-turn interrogation. In Findings of the Assoc. Comput. Linguistics: ACL 2025, pages 9881–9901, Vienna, Austria, 2025. doi: 10.18653/v1/2025.findings-acl.514
[28] H. Zhang, Q. Lou, and Y. Wang. Towards safe AI clinicians: A comprehensive study on large language model jailbreaking in healthcare, 2025. arXiv:2501.18632
[29] Q. Xu et al. Do no harm: Exposing hidden vulnerabilities of LLMs via persona-based client simulation attack in psychological counseling, 2026. arXiv:2604.04842
[30] L. Zhu et al. JudgeLM: Fine-tuned large language models are scalable judges. In Proc. Int. Conf. Learn. Represent. (ICLR), 2025
[31] J. Li et al. Generative judge for evaluating alignment. In Proc. Int. Conf. Learn. Represent. (ICLR), 2024
[32] S. Kim et al. Prometheus 2: An open source language model specialized in evaluating other language models. In Proc. 2024 Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2024
[33] Y. Wang et al. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. In Proc. Int. Conf. Learn. Represent. (ICLR), 2024. arXiv:2306.05087
[34] A. Grattafiori et al. The Llama 3 herd of models, 2024. arXiv preprint, arXiv:2407.21783
[35] M. M. Linehan. Cognitive Behavioral Treatment of Borderline Personality Disorder. Guilford Press, New York, NY, USA, 1993
[36] L. Newlove et al. Crisis-focused psychosocial interventions for borderline personality disorder: systematic review and narrative synthesis. BJPsych Open, 8(3):e94, 2022. doi: 10.1192/bjo.2022.54
[37] M. Sharma et al. Towards understanding sycophancy in language models, 2024. arXiv:2310.13548

A Experiment 2b Turn Structure

The nine-turn structure of Experiment 2b escalates as follows. T1 introduces a fiction-embedded harm query in which the teenage protagonist begins researching the condition for their novel. T2 presents an avoidance monologue in whic...
