Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez; Elvira Perez Vallejos; Erik Derner; Jenn Layton Annable; Mark Ball; Mark Ince; Miguel Baidal; Nuria Oliver

arxiv: 2509.24857 · v3 · submitted 2025-09-29 · 💻 cs.CL · cs.CY

Pith reviewed 2026-05-18 12:49 UTC · model grok-4.3

keywords mental health crises · LLM safety evaluation · suicidal ideation · self-harm response · crisis taxonomy · AI appropriateness · clinical assessment protocol

The pith

LLMs often produce unsafe responses to self-harm and suicidal crises despite handling some explicit cases reliably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a six-category crisis taxonomy and assembles a dataset of 2,252 examples drawn from existing mental health collections to test how large language models detect and reply to crisis inputs. It introduces a clinical response assessment protocol that grades model outputs on a 5-point Likert scale from harmful to appropriate and applies it to five different LLMs. Results show that while certain models maintain lower rates of harmful replies, many outputs in self-harm and suicidal categories remain inappropriate or unsafe, and all models falter on indirect signals, default replies, and context misalignment. These findings point to the need for stronger safeguards and context-aware handling in AI systems used for mental health support.

Core claim

LLMs can respond reliably to some explicit mental health crises, yet significant risks remain because many outputs, especially in self-harm and suicidal ideation categories, are inappropriate or unsafe; performance varies across models with some showing low harm rates while others generate more unsafe replies, and all models struggle with indirect signals, default replies, and context misalignment.

What carries the argument

A six-category clinical crisis taxonomy paired with a 5-point Likert scale response assessment protocol used to classify inputs and audit LLM safety and appropriateness.
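As a rough illustration of how scores on the 1 (harmful) to 5 (appropriate) scale could be turned into the per-model, per-category harm rates discussed here, the sketch below aggregates a handful of made-up ratings. The record layout, the model and category names, and the choice to count only a score of 1 as harmful are all assumptions for illustration, not the paper's protocol.

```python
from collections import defaultdict

# Hypothetical record layout: (model, crisis_category, likert_score).
# Scores follow the 1 (harmful) to 5 (appropriate) scale; the rows are toy data.
ratings = [
    ("model_a", "suicidal_ideation", 1),
    ("model_a", "suicidal_ideation", 5),
    ("model_a", "self_harm", 4),
    ("model_b", "suicidal_ideation", 2),
    ("model_b", "self_harm", 1),
    ("model_b", "self_harm", 1),
]

HARMFUL = 1  # assumption: only a score of 1 counts as "harmful"


def harm_rates(rows, threshold=HARMFUL):
    """Return {(model, category): share of scores <= threshold}."""
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for model, category, score in rows:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        key = (model, category)
        totals[key] += 1
        if score <= threshold:
            harmful[key] += 1
    return {key: harmful[key] / totals[key] for key in totals}


rates = harm_rates(ratings)
for (model, category), rate in sorted(rates.items()):
    print(f"{model:8s} {category:18s} harm rate = {rate:.2f}")
```

Raising `threshold` widens the definition from strictly harmful replies to any low-scoring reply, which is one way the same ratings can yield quite different headline rates.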

If this is right

Alignment and safety practices beyond model scale are crucial for reliable crisis support.
The taxonomy, datasets, and evaluation protocol can be used to guide further development of safer AI mental health tools.
All tested models require improved detection of indirect crisis signals to reduce potential harm.
Better context-aware response mechanisms are needed to avoid default or misaligned replies in crisis situations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world users turning to chatbots during distress may encounter inconsistent safety levels depending on which model they reach.
Specialized fine-tuning or layered safety filters could be tested to improve handling of the indirect signals where current models fail.
Deployment of LLMs for mental health queries may benefit from mandatory third-party audits using similar taxonomies before public release.

Load-bearing premise

The 5-point Likert scale ratings produced under the clinical response assessment protocol accurately reflect clinical safety and appropriateness without systematic evaluator bias or incomplete context.

What would settle it

Independent clinical experts re-rating the same set of model responses on the identical 5-point scale and arriving at substantially different harm rates would undermine the reported model comparisons.

Figures

Figures reproduced from arXiv: 2509.24857 by Adrian Arnaiz-Rodriguez, Elvira Perez Vallejos, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Miguel Baidal, Nuria Oliver.

**Figure 1.** Methodology. 1. Dataset Curation (left): From an aggregation of n≈239k user textual inputs from 12 publicly available datasets for mental health research, 206 and 2,046 examples are selected as validation and test set examples, respectively. 2. Crisis Category Classification Validation: The validation set (n=206) is labeled by three state-of-the-art LLMs and four domain experts according to a taxonomy with…

**Figure 2.** Left: Pipeline applied to each user input and LLM. The Crisis Category Classification module leverages the LLM-as-a-judge technique to assign a mental health crisis category to the user input. In parallel, the evaluated LLM provides a Response to the same user input. Each response is scored for appropriateness (according to a 5-point Likert scale) by the Crisis Response Evaluation module, using the LLM-as-…

**Figure 3.** Crisis category classification pipeline. Left: In the validation stage, three LLMs (each run three times) and four human experts independently labeled the validation set of 206 user inputs. Agreement between each pair of LLM and human annotations was quantified using Cohen’s Kappa, and the model with the highest mean agreement was selected for the second stage. In the second stage, the best-performing mode…

**Figure 4.** Aggregate evaluation results for the three runs per LLM (…

**Figure 5.** Distribution of Low Safety Scores (≤ 3.6) per LLM and Mental Health Crisis Category. Bars show the combined percentage of responses scoring between 1 and 3.6. The overall low-score distribution is split into: Score = 1 (hatched area), [1, 2.3] (orange), and (2.3, 3.6] (blue). categories, such as self-harm (3.73%), risk-taking behaviors (3.81%), and violent thoughts (4.22%). While the proportion of harmful …
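The validation step in the Figure 3 caption (pairwise Cohen's Kappa between LLM and human labels, then selecting the model with the highest mean agreement) can be sketched in a few lines. This is a minimal standard-library version under stated assumptions: the annotator names and labels are toy data, and each LLM is represented by a single label sequence rather than the three runs mentioned in the caption.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention used here: both used one identical label throughout
    return (observed - expected) / (1 - expected)


def best_judge(llm_labels, human_labels):
    """Pick the LLM whose mean kappa against all human annotators is highest."""
    scores = {
        name: mean(cohen_kappa(labels, h) for h in human_labels.values())
        for name, labels in llm_labels.items()
    }
    return max(scores, key=scores.get), scores


# Toy labels for six inputs; names and categories are hypothetical.
human_labels = {
    "expert_1": ["crisis", "none", "crisis", "none", "crisis", "none"],
    "expert_2": ["crisis", "none", "crisis", "none", "none", "none"],
}
llm_labels = {
    "llm_x": ["crisis", "none", "crisis", "none", "crisis", "none"],
    "llm_y": ["none", "none", "crisis", "crisis", "crisis", "none"],
}

best, scores = best_judge(llm_labels, human_labels)
print(best, {name: round(value, 3) for name, value in scores.items()})
```

Averaging over human annotators treats every expert as equally authoritative; a weighted mean or a majority-vote reference label would be alternative design choices.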

Original abstract

Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinical-informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Different models perform variably; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies a useful new taxonomy and dataset for auditing LLMs on mental health crises, though the response grading protocol needs more detail to support the model comparisons.

The main things to know about this paper are the new six-category crisis taxonomy and the 2,252-example dataset they put together from twelve mental health sources. They use these to audit how five different LLMs respond to crisis inputs, grading the outputs on a 5-point Likert scale for safety and appropriateness.

They do a solid job filling the gap in unified taxonomies that the abstract mentions. Curating down from 239,000 inputs to the relevant 2,252 and testing LLMs on classification shows some practical work. The performance differences come through clearly: models like gpt-5-nano and deepseek-v3.2-exp show lower harm rates, while gpt-4o-mini and grok-4-fast produce more unsafe responses in self-harm and suicidal categories. The universal struggle with indirect signals and context misalignment is a point worth noting for anyone deploying these systems.

The softer part is the clinical response assessment protocol. All the quantitative claims rest on those Likert ratings, but the description lacks inter-rater reliability figures, details on exclusion criteria, or who the evaluators were. That leaves room for the kind of bias or incomplete context issue raised in the stress-test note. If the raters aren't clinically trained or if context is limited, the comparisons between models could shift.

This paper is for people working on AI safety and responsible deployment in mental health contexts. Readers who need benchmarks or datasets for crisis detection will find it relevant. It deserves a serious referee because the artifacts are new and the topic has clear stakes, though the methods will need close checking on the grading process. I'd send it for peer review but flag the need for more protocol transparency and data release status in the revisions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a six-category crisis taxonomy for mental health issues, curates a dataset of 2,252 relevant inputs from over 239,000 examples across 12 source datasets, develops a clinical response assessment protocol, and evaluates LLMs both for automatic crisis classification (three models) and for response safety/appropriateness (five models) via 5-point Likert-scale grading from harmful (1) to appropriate (5). It reports that models such as gpt-5-nano and deepseek-v3.2-exp exhibit low harm rates while gpt-4o-mini and grok-4-fast produce more unsafe outputs, that all models struggle with indirect signals and context misalignment, and that better safeguards are needed.

Significance. If the grading protocol proves reliable, the work supplies a reusable taxonomy, dataset, and evaluation framework that directly supports empirical research on LLM safety in high-stakes mental-health contexts. The concrete model comparisons and emphasis on indirect-signal failures provide actionable evidence that alignment practices matter beyond scale.

major comments (2)

[Clinical response assessment protocol and results sections] The central quantitative claims (low harm rates for gpt-5-nano and deepseek-v3.2-exp; higher unsafe rates for gpt-4o-mini and grok-4-fast; universal difficulty with indirect signals) rest entirely on 5-point Likert scores assigned under the clinical response assessment protocol. No inter-rater reliability statistics, rater qualifications, blinding procedures, or exclusion criteria are reported, leaving the validity of all model comparisons open to systematic bias or inconsistent thresholds.
[Dataset curation and taxonomy sections] The dataset curation step (reduction from 239k inputs to 2,252 classified examples) is described at a high level but lacks explicit details on how the six-category taxonomy was applied, inter-annotator agreement for the initial classification, or precise exclusion rules. These omissions directly affect reproducibility of the evaluation set that underpins every reported performance difference.

minor comments (2)

[Abstract] The abstract states that three LLMs were tested for automatic classification but does not name the models or report their accuracy/F1 scores; adding these numbers would clarify the classification results.
[Results] Consider reporting confidence intervals or statistical tests on the Likert-score differences between models to strengthen the comparative claims.
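One way to act on the second minor comment is a paired percentile bootstrap over user inputs, sketched below. It assumes both models were scored on the same inputs, treats Likert scores as numeric (a simplification, since the scale is ordinal), and uses made-up scores; the function name and parameters are hypothetical.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Scores are paired by user input, so inputs are resampled jointly.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean(scores_a[i] - scores_b[i] for i in idx))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores_a) - mean(scores_b), lower, upper


# Toy Likert scores (1-5) for two hypothetical models on the same ten inputs.
scores_a = [5, 4, 5, 5, 3, 4, 5, 5, 4, 5]
scores_b = [3, 4, 2, 5, 1, 4, 3, 5, 2, 4]

point, lower, upper = paired_bootstrap_ci(scores_a, scores_b)
print(f"mean difference = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero would suggest the gap between the two models is unlikely to be a resampling artifact; a rank-based test such as the Wilcoxon signed-rank test is a common alternative when the ordinal nature of the scale matters.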

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve transparency and reproducibility.

Referee: [Clinical response assessment protocol and results sections] The central quantitative claims (low harm rates for gpt-5-nano and deepseek-v3.2-exp; higher unsafe rates for gpt-4o-mini and grok-4-fast; universal difficulty with indirect signals) rest entirely on 5-point Likert scores assigned under the clinical response assessment protocol. No inter-rater reliability statistics, rater qualifications, blinding procedures, or exclusion criteria are reported, leaving the validity of all model comparisons open to systematic bias or inconsistent thresholds.

Authors: We agree that the manuscript would benefit from greater transparency in the clinical response assessment protocol. We will revise the relevant section to describe rater qualifications, report inter-rater reliability statistics, detail blinding procedures, and specify exclusion criteria. These additions will allow readers to evaluate the reliability of the Likert-scale results and the model comparisons more rigorously.

Revision: yes

Referee: [Dataset curation and taxonomy sections] The dataset curation step (reduction from 239k inputs to 2,252 classified examples) is described at a high level but lacks explicit details on how the six-category taxonomy was applied, inter-annotator agreement for the initial classification, or precise exclusion rules. These omissions directly affect reproducibility of the evaluation set that underpins every reported performance difference.

Authors: We acknowledge the need for more explicit methodological details on dataset curation. We will expand the taxonomy and curation sections to include concrete examples of taxonomy application, inter-annotator agreement metrics for the classification step, and the precise exclusion rules applied when reducing the initial pool to 2,252 examples. These changes will directly support reproducibility of the evaluation set.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation rests on external data and independent ratings

The paper performs a direct empirical study: it defines a crisis taxonomy, curates 2,252 examples from prior public mental-health datasets, applies LLMs for classification, and grades model responses on a 5-point Likert scale via a clinical protocol. No equations, fitted parameters, predictions, or derivations appear. All quantitative claims (model harm rates, struggles with indirect signals) are computed from the curated inputs and freshly assigned ratings rather than reducing to any self-citation chain or self-definitional loop. The protocol and taxonomy are presented as new contributions, not as outputs derived from the evaluation results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the clinical validity of the newly introduced taxonomy and the reliability of the Likert-scale protocol; these are domain assumptions rather than derived quantities.

axioms (1)

Domain assumption: A clinically informed taxonomy of six crisis categories provides a valid and sufficient classification scheme for mental health crisis inputs.
Invoked when building the taxonomy and classifying the 2,252 examples; no external validation metrics supplied in abstract.

invented entities (1)

Six-category crisis taxonomy (no independent evidence)
purpose: To enable consistent classification and evaluation of crisis inputs across datasets and models
Newly constructed taxonomy introduced to address lack of unified standards; independent clinical validation not described in abstract.

pith-pipeline@v0.9.0 · 5900 in / 1474 out tokens · 51670 ms · 2026-05-18T12:49:49.460663+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We address these gaps by introducing: (1) a unified taxonomy of six clinically informed mental health crisis categories; (2) a curated... dataset... and (3) an expert-designed protocol for assessing response appropriateness... graded on a 5-point Likert scale

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: Responses are rated on a 1-5 scale, ranging from harmful (1) to fully appropriate (5).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
cs.CL · 2026-04 · unverdicted · novelty 7.0

PCSA is the first persona-based client simulation attack that exposes LLMs' vulnerabilities in counseling by generating natural dialogues where models give bad advice, reinforce delusions, and encourage risky actions.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 1 Pith paper

[1]

Nlp mental health conversations: Datasets at hugging face.https://huggingface.co/dat asets/Amod/mental_health_counseling_conversations, 2024

Amod. Nlp mental health conversations: Datasets at hugging face.https://huggingface.co/dat asets/Amod/mental_health_counseling_conversations, 2024. Accessed: 2025-06-19

work page 2024
[2]

self-harm-synthetic-eval: Datasets at hugging face.https://huggingface.co/dat asets/arianaazarbal/self-harm-synthetic-eval, 2024

Arianaazarbal. self-harm-synthetic-eval: Datasets at hugging face.https://huggingface.co/dat asets/arianaazarbal/self-harm-synthetic-eval, 2024. Accessed: 2025-06-19

work page 2024
[3]

Responsible design, integration, and use of generative ai in mental health.JMIR Mental Health, 12(1):e70439, 2025

Oren Asman, John Torous, Amir Tal, et al. Responsible design, integration, and use of generative ai in mental health.JMIR Mental Health, 12(1):e70439, 2025

work page 2025
[4]

Baidal, E

M. Baidal, E. Derner, and N. Oliver. Guardians of trust: Risks and opportunities for llms in mental health. InProceedings of the 4th Workshop on NLP for Positive Impact, ACL 2025, 2025

work page 2025
[5]

Trauma, mental health workforce shortages, and health equity: A crisis in public health.International Journal of Environmental Research and Public Health, 22(4):620, 2025

Suha Ballout. Trauma, mental health workforce shortages, and health equity: A crisis in public health.International Journal of Environmental Research and Public Health, 22(4):620, 2025

work page 2025
[6]

Article on chatbot safety concerns, 2025

BBC News. Article on chatbot safety concerns, 2025. URLhttps://www.bbc.com/news/article s/cgerwp7rdlvo. Accessed: 2025-06-14

work page 2025
[7]

it’s not only attention we need

Andreas Bucher, Sarah Egger, Inna Vashkite, Wenyuan Wu, and Gerhard Schwabe. “it’s not only attention we need”’: Systematic review of large language models in mental health care.JMIR Mental Health, 12(1):e78410, 2025

work page 2025
[8]

Classifying unstructured text in electronic health records for mental health prediction models: large language model evaluation study.JMIR Medical Informatics, 13(1):e65454, 2025

Nicholas C Cardamone, Mark Olfson, Timothy Schmutte, Lyle Ungar, Tony Liu, Sara W Cullen, Nathaniel J Williams, and Steven C Marcus. Classifying unstructured text in electronic health records for mental health prediction models: large language model evaluation study.JMIR Medical Informatics, 13(1):e65454, 2025

work page 2025
[9]

Challenges of large language models for mental health counseling.arXiv preprint arXiv:2311.13857, 2023

Neo Christopher Chung, George Dyer, and Lennart Brocki. Challenges of large language models for mental health counseling.arXiv preprint arXiv:2311.13857, 2023. 22

work page arXiv 2023
[10]

transformed suicidal ideation: Datasets at hugging face.https://huggingface.co/d atasets/cypsiSAS/transformed_Suicidal_ideation, 2024

CypsiSAS. transformed suicidal ideation: Datasets at hugging face.https://huggingface.co/d atasets/cypsiSAS/transformed_Suicidal_ideation, 2024. Accessed: 2025-06-19

work page 2024
[11]

The global prevalence of nonsuicidal self-injury among adoles- cents.JAMA network open, 7(6):e2415406–e2415406, 2024

Ellen-ge Denton and Kiara ´Alvarez. The global prevalence of nonsuicidal self-injury among adoles- cents.JAMA network open, 7(6):e2415406–e2415406, 2024

work page 2024
[12]

Peer contagion in child and adolescent social and emo- tional development.Annual review of psychology, 62(1):189–214, 2011

Thomas J Dishion and Jessica M Tipsord. Peer contagion in child and adolescent social and emo- tional development.Annual review of psychology, 62(1):189–214, 2011

work page 2011
[13]

Technological folie\a deux: Feedback loops between ai chatbots and mental illness.arXiv preprint arXiv:2507.19218, 2025

Sebastian Dohn´ any, Zeb Kurth-Nelson, Eleanor Spens, Lennart Luettgau, Alastair Reid, Christo- pher Summerfield, Murray Shanahan, and Matthew M Nour. Technological folie\a deux: Feedback loops between ai chatbots and mental illness.arXiv preprint arXiv:2507.19218, 2025

work page arXiv 2025
[14]

Comparing the perspectives of generative ai, mental health experts, and the general public on schizophrenia recovery: case vignette study.JMIR Mental Health, 11(1):e53043, 2024

Zohar Elyoseph, Inbar Levkovich, et al. Comparing the perspectives of generative ai, mental health experts, and the general public on schizophrenia recovery: case vignette study.JMIR Mental Health, 11(1):e53043, 2024

work page 2024
[15]

Framework for Responsible Research and Innovation, 2013

Engineering and Physical Sciences Research Council (EPSRC) and UK Research and Innovation (UKRI). Framework for Responsible Research and Innovation, 2013. URLhttps://www.ukri.org /who-we-are/epsrc/our-policies-and-standards/framework-for-responsible-innovatio n/

work page 2013
[16]

mental health dataset: Datasets at hugging face.https://huggingface.co/datasets/ fadodr/mental_health_dataset, 2024

Fadodr. mental health dataset: Datasets at hugging face.https://huggingface.co/datasets/ fadodr/mental_health_dataset, 2024. Accessed: 2025-06-19

work page 2024
[17]

mental health therapy: Datasets at hugging face.https://huggingface.co/datasets/ fadodr/mental_health_therapy, 2024

Fadodr. mental health therapy: Datasets at hugging face.https://huggingface.co/datasets/ fadodr/mental_health_therapy, 2024. Accessed: 2025-06-19

work page 2024
[18]

test test self harm all levels: Datasets at hugging face.https://huggingface.co/d atasets/fanyin3639/test_test_self_harm_all_levels, 2024

Fanyin3639. test test self harm all levels: Datasets at hugging face.https://huggingface.co/d atasets/fanyin3639/test_test_self_harm_all_levels, 2024. Accessed: 2025-06-19

work page 2024
[19]

Can ai relate: Testing large language model response for mental health support.arXiv preprint arXiv:2405.12021, 2024

Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, and Marzyeh Ghassemi. Can ai relate: Testing large language model response for mental health support.arXiv preprint arXiv:2405.12021, 2024

work page arXiv 2024
[20]

Social media contagion of high-risk behaviors in youth.Pediatric Clinics, 72(2):213–224, 2025

Meredith Gansner, Casey Berson, and Zainub Javed. Social media contagion of high-risk behaviors in youth.Pediatric Clinics, 72(2):213–224, 2025

work page 2025
[21]

Large language models for mental health applications: Systematic review.JMIR mental health, 11 (1):e57400, 2024

Zhijun Guo, Alvina Lai, Johan H Thygesen, Joseph Farrington, Thomas Keen, Kezhi Li, et al. Large language models for mental health applications: Systematic review.JMIR mental health, 11 (1):e57400, 2024

work page 2024
[22]

Self-harm and suicide in adolescents

Keith Hawton, Kate EA Saunders, and Rory C O’Connor. Self-harm and suicide in adolescents. The lancet, 379(9834):2373–2382, 2012

work page 2012
[23]

Safety of large language models in addressing depression.Cureus, 15(12), 2023

Thomas F Heston. Safety of large language models in addressing depression.Cureus, 15(12), 2023

work page 2023
[24]

A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, 2025

Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, 2025

work page 2025
[25]

Huggingface datasets hub.https://huggingface.co/datasets, 2023

HuggingFace. Huggingface datasets hub.https://huggingface.co/datasets, 2023. Accessed: 2024-06-19

work page 2023
[26]

Mohamed Hussain

H. Mohamed Hussain. Psych8k: Counseling conversations dataset.https://www.kaggle.com/dat asets/hmohamedhussain/psych8k, 2024. Accessed: 2025-06-26

work page 2024
[27]

nart-100k-synthetic: Datasets at hugging face.https://huggingface.co/dataset s/jerryjalapeno/nart-100k-synthetic, 2024

Jerryjalapeno. nart-100k-synthetic: Datasets at hugging face.https://huggingface.co/dataset s/jerryjalapeno/nart-100k-synthetic, 2024. Accessed: 2025-06-19

work page 2024
[28]

The applications of large language models in mental health: Scoping review

Yu Jin, Jiayi Liu, Pan Li, Baosen Wang, Yangxinyu Yan, Huilin Zhang, Chenhao Ni, Jing Wang, Yi Li, Yajun Bu, et al. The applications of large language models in mental health: Scoping review. Journal of Medical Internet Research, 27:e69284, 2025. 23

work page 2025
[29]

Responsible research and innovation in the digital age.Communications of the ACM, 60(5):62–68, 2017

Marina Jirotka, Barbara Grimpe, Bernd Stahl, Grace Eden, and Mark Hartswood. Responsible research and innovation in the digital age.Communications of the ACM, 60(5):62–68, 2017

work page 2017
[30]

Chatcounselor: A large language models for mental health support.arXiv preprint arXiv:2309.15461, 2023

June M Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. Chatcounselor: A large language models for mental health support.arXiv preprint arXiv:2309.15461, 2023

work page arXiv 2023
[31]

Special re- port from the cdc: Suicide rates, sodium nitrite-related suicides, and online content, united states

Karin A Mack, Wojciech Kaczkowski, Steven Sumner, Royal Law, and Amy Wolkin. Special re- port from the cdc: Suicide rates, sodium nitrite-related suicides, and online content, united states. Journal of safety research, 89:361–368, 2024

work page 2024
[32]

Masab A Mansoor and Kashif H Ansari. Early detection of mental health crises through artifical- intelligence-powered social media analysis: A prospective observational study.Journal of Personal- ized Medicine, 14(9):958, 2024

work page 2024
[33]

Competency of large language models in evaluating appropriate responses to suicidal ideation: Comparative study.Journal of Medical Internet Research, 27:e67891, 2025

Ryan K McBain, Jonathan H Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Halbisen, Aaron Kofner, Joshua Breslau, Bradley Stein, Ateev Mehrotra, et al. Competency of large language models in evaluating appropriate responses to suicidal ideation: Comparative study.Journal of Medical Internet Research, 27:e67891, 2025

work page 2025
[34]

Talk, trust, and trade-offs: How and why teens use ai companions

Common Sense Media. Talk, trust, and trade-offs: How and why teens use ai companions. Technical report, Common Sense Media, July 2025. URLhttps://www.commonsensemedia.org/researc h/talk-trust-and-trade-offs-how-and-why-teens-use-ai-companions. Accessed October 2025

work page 2025
[35]

Expert and interdisciplinary analysis of ai-driven chatbots for mental health support: Mixed methods study.Journal of Medical Internet Research, 27:e67114, 2025

Kayley Moylan and Kevin Doherty. Expert and interdisciplinary analysis of ai-driven chatbots for mental health support: Mixed methods study.Journal of Medical Internet Research, 27:e67114, 2025

work page 2025
[36]

Neimeyer and Kathleen Bonnelle

Robert A. Neimeyer and Kathleen Bonnelle. The suicide intervention response inventory: A revision and validation.Death Studies, 21(1):59–81, 1997

work page 1997
[37]

Social media use and self-injurious thoughts and behaviors: A systematic review and meta- analysis.Clinical psychology review, 87:102038, 2021

Jacqueline Nesi, Taylor A Burke, Alexandra H Bettis, Anastacia Y Kudinova, Elizabeth C Thomp- son, Heather A MacPherson, Kara A Fox, Hannah R Lawrence, Sarah A Thomas, Jennifer C Wolff, et al. Social media use and self-injurious thoughts and behaviors: A systematic review and meta- analysis.Clinical psychology review, 87:102038, 2021

work page 2021
[38]

Governance in the era of data-driven decision-making algorithms.Women Shaping Global Economic Governance, 171, 2019

Nuria Oliver. Governance in the era of data-driven decision-making algorithms.Women Shaping Global Economic Governance, 171, 2019

work page 2019
[39]

Helping people when they need it most, 2025

OpenAI. Helping people when they need it most, 2025. URLhttps://openai.com/index/helpi ng-people-when-they-need-it-most/. Accessed: 2025-02-14

work page 2025
[40]

World Health Organization, 2025

World Health Organization.Suicide worldwide in 2021: global health estimates. World Health Organization, 2025

work page 2021
[41]

mental-health: Datasets at hugging face.https://huggingface.co/datasets/ marmikpandya/mental-health, 2024

Marmik Pandya. mental-health: Datasets at hugging face.https://huggingface.co/datasets/ marmikpandya/mental-health, 2024. Accessed: 2025-06-19

work page 2024
[42]

Building trust in mental health chatbots: safety metrics and llm-based evaluation tools.arXiv preprint arXiv:2408.04650, 2024

Jung In Park, Mahyar Abbasian, Iman Azimi, Dawn T Bounds, Angela Jun, Jaesu Han, Robert M McCarron, Jessica Borelli, Parmida Safavi, Sanaz Mirbaha, et al. Building trust in mental health chatbots: safety metrics and llm-based evaluation tools.arXiv preprint arXiv:2408.04650, 2024

work page arXiv 2024
[43]

A binary question answering system for diagnosing mental health syndromes powered by large language model with custom-built dataset

Dipti Pawar and Shraddha Phansalkar. A binary question answering system for diagnosing mental health syndromes powered by large language model with custom-built dataset. In2024 IEEE 4th International Conference on ICT in Business Industry & Government (ICTBIG), pages 1–8. IEEE, 2024

[44] Psycode1. psyset: Datasets at hugging face. https://huggingface.co/datasets/psycode1/psyset, 2024. Accessed: 2025-06-19

[45] Richie-ghost. suicidal finetune: Datasets at hugging face. https://huggingface.co/datasets/richie-ghost/suicidal_finetune, 2024. Accessed: 2025-06-19

[46] Sajjadhadi. Mental-disorder-detection-data: Datasets at hugging face. https://huggingface.co/datasets/sajjadhadi/Mental-Disorder-Detection-Data, 2024. Accessed: 2025-06-19

[47] ShenLab. Mentalchat16k: Datasets at hugging face. https://huggingface.co/datasets/ShenLab/MentalChat16K, 2024. Accessed: 2025-06-19

[48] Vera Sorin, Dana Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, and Eyal Klang. Large language models and empathy: Systematic review. Journal of medical Internet research, 26:e52597, 2024

[49] Aseem Srivastava, Tharun Suresh, Sarah P Lord, Md Shad Akhtar, and Tanmoy Chakraborty. Counseling summarization using mental health knowledge guided utterance filtering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3920–3930, 2022

[50] The Guardian. Chatgpt under scrutiny after family of teen who killed himself sue openai, 2025. URL https://www.theguardian.com/technology/2025/aug/27/chatgpt-scrutiny-family-teen-killed-himself-sue-open-ai. Accessed: 2025-06-14

[51] Xuena Wang, Xueting Li, Zi Yin, Yue Wu, and Jia Liu. Emotional intelligence of large language models. Journal of Pacific Rim Psychology, 17:18344909231213958, 2023

