arxiv: 2509.24857 · v3 · submitted 2025-09-29 · 💻 cs.CL · cs.CY

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Pith reviewed 2026-05-18 12:49 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords: mental health crises, LLM safety evaluation, suicidal ideation, self-harm response, crisis taxonomy, AI appropriateness, clinical assessment protocol

The pith

LLMs often produce unsafe responses to self-harm and suicidal crises despite handling some explicit cases reliably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a six-category crisis taxonomy and assembles a dataset of 2,252 examples drawn from existing mental health collections to test how large language models detect and reply to crisis inputs. It introduces a clinical response assessment protocol that grades model outputs on a 5-point Likert scale from harmful to appropriate and applies it to five different LLMs. Results show that while certain models maintain lower rates of harmful replies, many outputs in self-harm and suicidal categories remain inappropriate or unsafe, and all models falter on indirect signals, default replies, and context misalignment. These findings point to the need for stronger safeguards and context-aware handling in AI systems used for mental health support.

Core claim

LLMs can respond reliably to some explicit mental health crises, yet significant risks remain: many outputs, especially in the self-harm and suicidal ideation categories, are inappropriate or unsafe. Performance varies across models, with some showing low harm rates and others generating more unsafe replies. All models struggle with indirect signals, default replies, and context misalignment.

What carries the argument

A six-category clinical crisis taxonomy paired with a 5-point Likert scale response assessment protocol used to classify inputs and audit LLM safety and appropriateness.
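
To make the pairing of taxonomy and rating scale concrete, here is a minimal sketch of how a per-model, per-category harm rate could be tabulated from Likert scores. It is not the paper's code: the `RatedResponse` record, the `harm_rates` helper, the model names, and the choice to count only a score of 1 as harmful are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedResponse:
    """One graded model reply (hypothetical record layout)."""
    model: str     # name of the evaluated LLM
    category: str  # crisis category assigned to the user input
    score: int     # Likert rating: 1 (harmful) to 5 (appropriate)


def harm_rates(ratings, harmful_max=1):
    """Return the share of replies per (model, category) scoring <= harmful_max.

    harmful_max=1 is an assumed threshold, not one taken from the paper.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for r in ratings:
        if not 1 <= r.score <= 5:
            raise ValueError(f"score out of range: {r.score}")
        key = (r.model, r.category)
        totals[key] += 1
        if r.score <= harmful_max:
            harmful[key] += 1
    return {key: harmful[key] / totals[key] for key in totals}


# Toy data; "model_a" and "model_b" are placeholders, not models from the paper.
ratings = [
    RatedResponse("model_a", "self-harm", 1),
    RatedResponse("model_a", "self-harm", 5),
    RatedResponse("model_a", "suicidal ideation", 4),
    RatedResponse("model_b", "self-harm", 5),
]
rates = harm_rates(ratings)
```

Keying the result by `(model, category)` keeps the two axes of the audit separate, so a model that looks safe overall can still be inspected category by category.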

If this is right

  • Alignment and safety practices beyond model scale are crucial for reliable crisis support.
  • The taxonomy, datasets, and evaluation protocol can be used to guide further development of safer AI mental health tools.
  • All tested models require improved detection of indirect crisis signals to reduce potential harm.
  • Better context-aware response mechanisms are needed to avoid default or misaligned replies in crisis situations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world users turning to chatbots during distress may encounter inconsistent safety levels depending on which model they reach.
  • Specialized fine-tuning or layered safety filters could be tested to improve handling of the indirect signals where current models fail.
  • Deployment of LLMs for mental health queries may benefit from mandatory third-party audits using similar taxonomies before public release.

Load-bearing premise

The 5-point Likert scale ratings produced under the clinical response assessment protocol accurately reflect clinical safety and appropriateness without systematic evaluator bias or incomplete context.

What would settle it

Independent clinical experts re-rating the same set of model responses on the identical 5-point scale and arriving at substantially different harm rates would undermine the reported model comparisons.

Figures

Figures reproduced from arXiv: 2509.24857 by Adrian Arnaiz-Rodriguez, Elvira Perez Vallejos, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Miguel Baidal, Nuria Oliver.

Figure 1. Methodology. 1. Dataset Curation (left): From an aggregation of n≈239k user textual inputs from 12 publicly available datasets for mental health research, 206 and 2,046 examples are selected as validation and test set examples, respectively. 2. Crisis Category Classification Validation: The validation set (n=206) is labeled by three state-of-the-art LLMs and four domain experts according to a taxonomy with…
Figure 2. Left: Pipeline applied to each user input and LLM. The Crisis Category Classification module leverages the LLM-as-a-judge technique to assign a mental health crisis category to the user input. In parallel, the evaluated LLM provides a Response to the same user input. Each response is scored for appropriateness (according to a 5-point Likert scale) by the Crisis Response Evaluation module, using the LLM-as-…
Figure 3. Crisis category classification pipeline. Left: In the validation stage, three LLMs (each run three times) and four human experts independently labeled the validation set of 206 user inputs. Agreement between each pair of LLM and human annotations was quantified using Cohen’s Kappa, and the model with the highest mean agreement was selected for the second stage. In the second stage, the best-performing mode…
Figure 4. Aggregate evaluation results for the three runs per LLM (…
Figure 5. Distribution of Low Safety Scores (≤ 3.6) per LLM and Mental Health Crisis Category. Bars show the combined percentage of responses scoring between 1 and 3.6. The overall low-score distribution is split into: Score = 1 (hatched area), [1, 2.3] (orange), and (2.3, 3.6] (blue).
… categories, such as self-harm (3.73%), risk-taking behaviors (3.81%), and violent thoughts (4.22%). While the proportion of harmful …
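
The score bands named in the Figure 5 caption can be expressed as a small binning function. This is a sketch under assumptions: it treats each score as an aggregate value between 1 and 5, and it reports Score = 1 as its own band even though the figure may draw it as a hatched subset of [1, 2.3]. The names `low_score_band` and `band_percentages` are hypothetical.

```python
from collections import Counter


def low_score_band(score):
    """Map an aggregate safety score to a Figure 5 band, or None if above 3.6."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    if score == 1.0:
        return "score = 1"
    if score <= 2.3:
        return "[1, 2.3]"
    if score <= 3.6:
        return "(2.3, 3.6]"
    return None  # not a low safety score


def band_percentages(scores):
    """Percentage of all scores that fall in each low-score band."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(low_score_band(s) for s in scores)
    return {
        band: 100.0 * count / len(scores)
        for band, count in counts.items()
        if band is not None
    }


# Toy scores, not values from the paper.
example = band_percentages([1.0, 2.0, 3.0, 4.0])
```

Percentages are taken over all responses, not only the low-scoring ones, which matches the caption's description of bars as a combined percentage of responses.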

Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinical-informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Different models perform variably; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a six-category crisis taxonomy for mental health issues, curates a dataset of 2,252 relevant inputs from over 239,000 examples across 12 source datasets, develops a clinical response assessment protocol, and evaluates LLMs both for automatic crisis classification (three models) and for response safety/appropriateness (five models) via 5-point Likert-scale grading from harmful (1) to appropriate (5). It reports that models such as gpt-5-nano and deepseek-v3.2-exp exhibit low harm rates while gpt-4o-mini and grok-4-fast produce more unsafe outputs, that all models struggle with indirect signals and context misalignment, and that better safeguards are needed.

Significance. If the grading protocol proves reliable, the work supplies a reusable taxonomy, dataset, and evaluation framework that directly supports empirical research on LLM safety in high-stakes mental-health contexts. The concrete model comparisons and emphasis on indirect-signal failures provide actionable evidence that alignment practices matter beyond scale.

major comments (2)
  1. [Clinical response assessment protocol and results sections] The central quantitative claims (low harm rates for gpt-5-nano and deepseek-v3.2-exp; higher unsafe rates for gpt-4o-mini and grok-4-fast; universal difficulty with indirect signals) rest entirely on 5-point Likert scores assigned under the clinical response assessment protocol. No inter-rater reliability statistics, rater qualifications, blinding procedures, or exclusion criteria are reported, leaving the validity of all model comparisons open to systematic bias or inconsistent thresholds.
  2. [Dataset curation and taxonomy sections] The dataset curation step (reduction from 239k inputs to 2,252 classified examples) is described at a high level but lacks explicit details on how the six-category taxonomy was applied, inter-annotator agreement for the initial classification, or precise exclusion rules. These omissions directly affect reproducibility of the evaluation set that underpins every reported performance difference.
minor comments (2)
  1. [Abstract] The abstract states that three LLMs were tested for automatic classification but does not name the models or report their accuracy/F1 scores; adding these numbers would clarify the classification results.
  2. [Results] Consider reporting confidence intervals or statistical tests on the Likert-score differences between models to strengthen the comparative claims.
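
One way to act on the second minor comment is a paired bootstrap over user inputs, since every model answers the same inputs. The sketch below is illustrative only: the function name, the 2,000 resamples, the fixed seed, and the toy scores are assumptions, and the percentile interval is one of several reasonable choices.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-input score difference (a - b).

    scores_a[i] and scores_b[i] must be ratings of two models on the same input.
    Returns (point_estimate, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equally long")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample inputs with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = min(max(round((alpha / 2) * n_resamples), 0), n_resamples - 1)
    hi_idx = min(max(round((1 - alpha / 2) * n_resamples) - 1, 0), n_resamples - 1)
    return mean(diffs), resampled[lo_idx], resampled[hi_idx]


# Toy Likert scores for two hypothetical models on the same eight inputs.
point, lower, upper = paired_bootstrap_ci(
    [5, 5, 4, 5, 4, 5, 5, 4],
    [3, 2, 3, 4, 2, 3, 3, 2],
)
```

An interval that excludes zero would support a claimed difference between two models. Because Likert scores are ordinal, a rank-based test such as the Wilcoxon signed-rank test is a common alternative.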

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve transparency and reproducibility.

  1. Referee: [Clinical response assessment protocol and results sections] The central quantitative claims (low harm rates for gpt-5-nano and deepseek-v3.2-exp; higher unsafe rates for gpt-4o-mini and grok-4-fast; universal difficulty with indirect signals) rest entirely on 5-point Likert scores assigned under the clinical response assessment protocol. No inter-rater reliability statistics, rater qualifications, blinding procedures, or exclusion criteria are reported, leaving the validity of all model comparisons open to systematic bias or inconsistent thresholds.

    Authors: We agree that the manuscript would benefit from greater transparency in the clinical response assessment protocol. We will revise the relevant section to describe rater qualifications, report inter-rater reliability statistics, detail blinding procedures, and specify exclusion criteria. These additions will allow readers to evaluate the reliability of the Likert-scale results and the model comparisons more rigorously. revision: yes

  2. Referee: [Dataset curation and taxonomy sections] The dataset curation step (reduction from 239k inputs to 2,252 classified examples) is described at a high level but lacks explicit details on how the six-category taxonomy was applied, inter-annotator agreement for the initial classification, or precise exclusion rules. These omissions directly affect reproducibility of the evaluation set that underpins every reported performance difference.

    Authors: We acknowledge the need for more explicit methodological details on dataset curation. We will expand the taxonomy and curation sections to include concrete examples of taxonomy application, inter-annotator agreement metrics for the classification step, and the precise exclusion rules applied when reducing the initial pool to 2,252 examples. These changes will directly support reproducibility of the evaluation set. revision: yes
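
The agreement metrics promised in both responses could be reported with Cohen's kappa, the statistic the Figure 3 caption already names for LLM-human agreement. The following is a minimal sketch of the standard two-rater formula, not the authors' implementation; the function name and the toy labels are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Toy category labels from two hypothetical annotators.
rater_1 = ["self-harm", "self-harm", "suicidal ideation", "suicidal ideation"]
rater_2 = ["self-harm", "suicidal ideation", "suicidal ideation", "suicidal ideation"]
kappa = cohens_kappa(rater_1, rater_2)
```

Unweighted kappa suits the categorical taxonomy labels. For the ordinal 5-point ratings, a weighted kappa that penalizes larger disagreements more heavily would usually be the better fit.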

Circularity Check

0 steps flagged

No circularity: empirical evaluation rests on external data and independent ratings


The paper performs a direct empirical study: it defines a crisis taxonomy, curates 2,252 examples from prior public mental-health datasets, applies LLMs for classification, and grades model responses on a 5-point Likert scale via a clinical protocol. No equations, fitted parameters, predictions, or derivations appear. All quantitative claims (model harm rates, struggles with indirect signals) are computed from the curated inputs and fresh ratings assigned under that protocol rather than reducing to any self-citation chain or self-definitional loop. The protocol and taxonomy are presented as new contributions, not as outputs derived from the evaluation results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the clinical validity of the newly introduced taxonomy and the reliability of the Likert-scale protocol; these are domain assumptions rather than derived quantities.

axioms (1)
  • domain assumption: A clinically informed taxonomy of six crisis categories provides a valid and sufficient classification scheme for mental health crisis inputs.
    Invoked when building the taxonomy and classifying the 2,252 examples; no external validation metrics supplied in abstract.
invented entities (1)
  • Six-category crisis taxonomy (no independent evidence)
    purpose: To enable consistent classification and evaluation of crisis inputs across datasets and models
    Newly constructed taxonomy introduced to address lack of unified standards; independent clinical validation not described in abstract.

pith-pipeline@v0.9.0 · 5900 in / 1474 out tokens · 51670 ms · 2026-05-18T12:49:49.460663+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    PCSA is the first persona-based client simulation attack that exposes LLMs' vulnerabilities in counseling by generating natural dialogues where models give bad advice, reinforce delusions, and encourage risky actions.
