Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
Pith reviewed 2026-05-18 12:49 UTC · model grok-4.3
The pith
LLMs often produce unsafe responses to self-harm and suicidal crises despite handling some explicit cases reliably.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs respond reliably to some explicit mental health crises, but significant risks remain: many outputs, especially in the self-harm and suicidal ideation categories, are inappropriate or unsafe. Performance varies across models, with some showing low harm rates and others generating more unsafe replies. All models struggle with indirect signals, default replies, and context misalignment.
What carries the argument
A six-category clinical crisis taxonomy, used to classify user inputs, paired with a 5-point Likert-scale response assessment protocol, used to audit the safety and appropriateness of LLM responses.
If this is right
- Alignment and safety practices beyond model scale are crucial for reliable crisis support.
- The taxonomy, datasets, and evaluation protocol can be used to guide further development of safer AI mental health tools.
- All tested models require improved detection of indirect crisis signals to reduce potential harm.
- Better context-aware response mechanisms are needed to avoid default or misaligned replies in crisis situations.
Where Pith is reading between the lines
- Real-world users turning to chatbots during distress may encounter inconsistent safety levels depending on which model they reach.
- Specialized fine-tuning or layered safety filters could be tested to improve handling of the indirect signals where current models fail.
- Deployment of LLMs for mental health queries may benefit from mandatory third-party audits using similar taxonomies before public release.
Load-bearing premise
The 5-point Likert scale ratings produced under the clinical response assessment protocol accurately reflect clinical safety and appropriateness without systematic evaluator bias or incomplete context.
What would settle it
Independent clinical experts re-rating the same set of model responses on the identical 5-point scale and arriving at substantially different harm rates would undermine the reported model comparisons.
Original abstract
Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinical-informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Different models perform variably; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.
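To make the abstract's notion of a "harm rate" concrete, here is a minimal Python sketch of how per-model rates could be derived from 1-5 Likert ratings. The cut-off (`UNSAFE_MAX = 2`), the function name `harm_rates`, the placeholder model names, the `other` category label, and the toy ratings are all assumptions for illustration; this summary does not state the threshold or data format the authors used.

```python
from collections import defaultdict

# Hypothetical threshold: treat ratings of 1 or 2 as "unsafe".
# The paper's own cut-off is not stated in this summary.
UNSAFE_MAX = 2


def harm_rates(ratings):
    """Return {model: fraction of responses rated <= UNSAFE_MAX}.

    `ratings` is an iterable of (model, category, score) tuples,
    where score is an integer on the 1-5 Likert scale.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for model, _category, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        totals[model] += 1
        if score <= UNSAFE_MAX:
            unsafe[model] += 1
    return {model: unsafe[model] / totals[model] for model in totals}


# Toy, made-up ratings for two placeholder models (not data from the paper).
toy_ratings = [
    ("model_a", "self_harm", 5),
    ("model_a", "suicidal_ideation", 4),
    ("model_a", "self_harm", 2),
    ("model_a", "other", 5),
    ("model_b", "self_harm", 1),
    ("model_b", "suicidal_ideation", 2),
    ("model_b", "self_harm", 4),
    ("model_b", "other", 3),
]

rates = harm_rates(toy_ratings)
print(rates)
```

Because the rate depends entirely on where the unsafe cut-off is placed, any comparison between models should state that threshold explicitly.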
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a six-category crisis taxonomy for mental health issues, curates a dataset of 2,252 relevant inputs from over 239,000 examples across 12 source datasets, develops a clinical response assessment protocol, and evaluates LLMs both for automatic crisis classification (three models) and for response safety/appropriateness (five models) via 5-point Likert-scale grading from harmful (1) to appropriate (5). It reports that models such as gpt-5-nano and deepseek-v3.2-exp exhibit low harm rates while gpt-4o-mini and grok-4-fast produce more unsafe outputs, that all models struggle with indirect signals and context misalignment, and that better safeguards are needed.
Significance. If the grading protocol proves reliable, the work supplies a reusable taxonomy, dataset, and evaluation framework that directly supports empirical research on LLM safety in high-stakes mental-health contexts. The concrete model comparisons and emphasis on indirect-signal failures provide actionable evidence that alignment practices matter beyond scale.
major comments (2)
- [Clinical response assessment protocol and results sections] The central quantitative claims (low harm rates for gpt-5-nano and deepseek-v3.2-exp; higher unsafe rates for gpt-4o-mini and grok-4-fast; universal difficulty with indirect signals) rest entirely on 5-point Likert scores assigned under the clinical response assessment protocol. No inter-rater reliability statistics, rater qualifications, blinding procedures, or exclusion criteria are reported, leaving the validity of all model comparisons open to systematic bias or inconsistent thresholds.
- [Dataset curation and taxonomy sections] The dataset curation step (reduction from 239k inputs to 2,252 classified examples) is described at a high level but lacks explicit details on how the six-category taxonomy was applied, inter-annotator agreement for the initial classification, or precise exclusion rules. These omissions directly affect reproducibility of the evaluation set that underpins every reported performance difference.
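As an illustration of the inter-rater reliability statistic the first major comment asks for, the sketch below computes a quadratic-weighted Cohen's kappa for two raters scoring the same responses on the 1-5 scale. The choice of weighted kappa is an assumption; the manuscript does not say which statistic, if any, the authors would report, and the function name and toy ratings are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, k=5):
    """Quadratic-weighted Cohen's kappa for two raters on a 1..k ordinal scale.

    Returns NaN when expected disagreement is zero (both raters used the
    same single category), because kappa is undefined in that case.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if any(not 1 <= s <= k for s in list(rater_a) + list(rater_b)):
        raise ValueError("ratings must lie in 1..k")

    n = len(rater_a)
    # Observed joint proportions: observed[i][j] is the share of items
    # that rater A scored i+1 and rater B scored j+1.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement_observed += weight * observed[i][j]
            disagreement_expected += weight * row[i] * col[j]

    if disagreement_expected == 0:
        return float("nan")
    return 1.0 - disagreement_observed / disagreement_expected


# Toy, made-up ratings from two hypothetical raters (not data from the paper).
rater_1 = [1, 2, 3, 4, 5, 5, 4, 2]
rater_2 = [1, 2, 3, 4, 5, 4, 4, 3]
print(quadratic_weighted_kappa(rater_1, rater_2))
```

Quadratic weights penalize a 1-versus-5 disagreement far more than a 4-versus-5 disagreement, which suits an ordinal scale whose endpoints are "harmful" and "appropriate".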
minor comments (2)
- [Abstract] The abstract states that three LLMs were tested for automatic classification but does not name the models or report their accuracy/F1 scores; adding these numbers would clarify the classification results.
- [Results] Consider reporting confidence intervals or statistical tests on the Likert-score differences between models to strengthen the comparative claims.
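One way to act on the suggestion about confidence intervals is a paired percentile bootstrap over inputs, sketched below. It assumes both models were scored on the same inputs, so scores can be paired by position, and it treats Likert scores as numeric for the purpose of averaging, which is itself a modelling assumption. The function name, parameters, and toy scores are hypothetical and not taken from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    scores_a[i] and scores_b[i] must be the ratings two models received
    on the same input i. Returns (mean_difference, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    boot_means = []
    for _ in range(n_boot):
        # Resample inputs with replacement, keeping each pair together.
        sample = rng.choices(diffs, k=n)
        boot_means.append(sum(sample) / n)
    boot_means.sort()

    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper


# Toy, made-up paired scores for two placeholder models (not from the paper).
model_a_scores = [5, 4, 4, 5, 3, 2, 5, 4]
model_b_scores = [4, 4, 3, 5, 1, 2, 4, 3]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(model_a_scores, model_b_scores)
print(mean_diff, ci_low, ci_high)
```

If the interval for a pair of models excludes zero, the difference in mean rating is unlikely to be resampling noise alone; an ordinal test such as the Wilcoxon signed-rank test would be a reasonable alternative when averaging Likert scores is considered inappropriate.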
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve transparency and reproducibility.
- Referee: [Clinical response assessment protocol and results sections] The central quantitative claims (low harm rates for gpt-5-nano and deepseek-v3.2-exp; higher unsafe rates for gpt-4o-mini and grok-4-fast; universal difficulty with indirect signals) rest entirely on 5-point Likert scores assigned under the clinical response assessment protocol. No inter-rater reliability statistics, rater qualifications, blinding procedures, or exclusion criteria are reported, leaving the validity of all model comparisons open to systematic bias or inconsistent thresholds.
  Authors: We agree that the manuscript would benefit from greater transparency in the clinical response assessment protocol. We will revise the relevant section to describe rater qualifications, report inter-rater reliability statistics, detail blinding procedures, and specify exclusion criteria. These additions will allow readers to evaluate the reliability of the Likert-scale results and the model comparisons more rigorously.
  Revision: yes
- Referee: [Dataset curation and taxonomy sections] The dataset curation step (reduction from 239k inputs to 2,252 classified examples) is described at a high level but lacks explicit details on how the six-category taxonomy was applied, inter-annotator agreement for the initial classification, or precise exclusion rules. These omissions directly affect reproducibility of the evaluation set that underpins every reported performance difference.
  Authors: We acknowledge the need for more explicit methodological details on dataset curation. We will expand the taxonomy and curation sections to include concrete examples of taxonomy application, inter-annotator agreement metrics for the classification step, and the precise exclusion rules applied when reducing the initial pool to 2,252 examples. These changes will directly support reproducibility of the evaluation set.
  Revision: yes
Circularity Check
No circularity: empirical evaluation rests on external data and independent ratings
The paper performs a direct empirical study: it defines a crisis taxonomy, curates 2,252 examples from prior public mental-health datasets, applies LLMs for classification, and grades model responses on a 5-point Likert scale via a clinical protocol. No equations, fitted parameters, predictions, or derivations appear. All quantitative claims (model harm rates, struggles with indirect signals) are computed from the curated inputs and fresh human ratings rather than reducing to any self-citation chain or self-definitional loop. The protocol and taxonomy are presented as new contributions, not as outputs derived from the evaluation results themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A clinically informed taxonomy of six crisis categories provides a valid and sufficient classification scheme for mental health crisis inputs.
invented entities (1)
- Six-category crisis taxonomy: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We address these gaps by introducing: (1) a unified taxonomy of six clinically informed mental health crisis categories; (2) a curated... dataset... and (3) an expert-designed protocol for assessing response appropriateness... graded on a 5-point Likert scale"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Responses are rated on a 1-5 scale, ranging from harmful (1) to fully appropriate (5)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
  PCSA is the first persona-based client simulation attack that exposes LLMs' vulnerabilities in counseling by generating natural dialogues where models give bad advice, reinforce delusions, and encourage risky actions.