pith. machine review for the scientific record. sign in

arxiv: 2605.10027 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Speech-based Psychological Crisis Assessment using LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLMcrisis classificationspeech transcriptsparalinguistic cuesreasoning chainsmental health assessmentdata augmentation
0
0 comments X

The pith

LLM classifies psychological crisis levels from speech by injecting non-verbal cues into transcripts and training on reasoning chains, reaching 80.5% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an automated framework using large language models to classify crisis severity in psychological hotline conversations. Human assessments vary with experience and face staffing limits, so the work targets consistent, scalable alternatives. It detects non-verbal emotional signals in speech and inserts them into text transcripts so the LLM can reason over acoustic details. An auxiliary training task requires the model to generate diagnostic reasoning chains, acting as a regularizer, and data augmentation is added for robustness. The resulting system reaches 0.802 macro F1 and 0.805 accuracy on a three-class task under 5-fold cross-validation.

Core claim

By injecting identified non-verbal emotional cues into speech transcripts and training the model to generate diagnostic reasoning chains as an auxiliary task, combined with data augmentation, the LLM-based system achieves a macro F1-score of 0.802 and accuracy of 0.805 on three-class crisis classification under 5-fold cross-validation.

What carries the argument

Paralinguistic injection method that inserts non-verbal emotional cues into speech transcripts, paired with reasoning-enhanced training that uses diagnostic reasoning chain generation as an auxiliary task to regularize the classification.

If this is right

  • The system supplies consistent crisis-level judgments to support variable human operators.
  • It reduces dependence on limited staffing in psychological hotlines.
  • The framework raises overall service quality by handling more calls with stable assessment.
  • Data augmentation compensates for scarce labeled hotline examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The injection technique might transfer to other spoken domains where tone signals intent, such as therapy transcripts or emergency dispatch.
  • End-to-end speech pipelines could become feasible once cue detection is fully automated rather than separate.
  • Performance gains may vary across languages or different base LLMs if cue reliability differs.

Load-bearing premise

Non-verbal emotional cues can be reliably identified from speech and inserting them into text transcripts plus auxiliary reasoning training meaningfully improves LLM classification without introducing new errors or biases.

What would settle it

Ablating the paralinguistic cue injection and the reasoning-chain auxiliary task on the same hotline dataset and checking whether macro F1 falls substantially below 0.802 would test whether these additions drive the reported performance.

Figures

Figures reproduced from arXiv: 2605.10027 by Chao Zhang, Terumi Chiba, Yang Luo, Yongsheng Tong, Ziyun Cui.

Figure 1
Figure 1. Figure 1: Data statistics: (a) duration, (b) age, (c) gender, (d) crisis levels, and (e) relation between duration and crisis level [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The overall framework of our proposed method. Our proposed crisis classification framework processes con￾versational hotline data through a two-step pipeline comprising multimodal data preprocessing and an LLM-based classifier. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Model Training Pipeline. els. The target reasoning text is constructed according to the TAF’s scoring logic: we prompt the gpt-oss-120b model4 [23] to generate step-by-step clinical rationales conditioned on (i) the paralinguistic enriched transcript, (ii) the ground-truth cri￾sis label, and (iii) the explicit TAF criteria across the Affective, Behavioral, and Cognitive domains. This produces supervision s… view at source ↗
read the original abstract

Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an LLM-based framework for three-class psychological crisis level classification from speech. It introduces a paralinguistic injection method that identifies and inserts non-verbal emotional cues into speech transcripts, a reasoning-enhanced training strategy that uses diagnostic reasoning chains as an auxiliary task, and data augmentation. The final system is reported to achieve a macro F1-score of 0.802 and accuracy of 0.805 under 5-fold cross-validation.

Significance. If the performance claims are supported by proper controls and ablations, the work could contribute to scalable automated assessment tools for mental health hotlines, reducing variability from human operators. The paralinguistic injection approach offers a concrete way to incorporate acoustic information into text-only LLMs, which may be useful for other speech-based affective tasks. The use of auxiliary reasoning as regularization is a standard but well-motivated technique here.

major comments (3)
  1. [Methods] Methods section: The paralinguistic cue identification and insertion procedure is described at a high level but supplies no standalone accuracy, error rate, or feature details for the cue detector. This is load-bearing for the central claim because the abstract attributes the 0.802 F1 to the combination of injection, reasoning training, and augmentation; without isolated validation, the contribution of acoustic-nuance injection cannot be separated from transcription artifacts or label noise.
  2. [Experiments] Experiments/Results: No ablation is reported that removes only the paralinguistic injection while holding reasoning-enhanced training and data augmentation fixed. The headline numbers (macro F1 0.802, acc 0.805) therefore cannot be unambiguously credited to the proposed mechanism, undermining the claim that non-verbal cue insertion meaningfully improves classification.
  3. [Results] Results: The paper does not report dataset size, class balance, or comparisons against strong baselines (e.g., standard fine-tuned LLM without injection or reasoning auxiliary). These omissions make the 5-fold CV results difficult to interpret or reproduce and weaken support for the three-class performance claim.
minor comments (2)
  1. [Abstract] Abstract: Add a brief statement of dataset size and the base LLM used to give readers immediate context for the reported F1 and accuracy.
  2. Notation: Ensure consistent use of terms such as 'paralinguistic injection' versus 'cue insertion' across sections to avoid minor ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.

read point-by-point responses
  1. Referee: [Methods] Methods section: The paralinguistic cue identification and insertion procedure is described at a high level but supplies no standalone accuracy, error rate, or feature details for the cue detector. This is load-bearing for the central claim because the abstract attributes the 0.802 F1 to the combination of injection, reasoning training, and augmentation; without isolated validation, the contribution of acoustic-nuance injection cannot be separated from transcription artifacts or label noise.

    Authors: We agree that the current description of the paralinguistic cue detector is insufficiently detailed. In the revised manuscript, we will expand the Methods section to report the standalone accuracy and error rates of the cue identification model on a validation set, along with the specific acoustic features and detection procedure used. This addition will allow readers to better assess the reliability of the injection step. revision: yes

  2. Referee: [Experiments] Experiments/Results: No ablation is reported that removes only the paralinguistic injection while holding reasoning-enhanced training and data augmentation fixed. The headline numbers (macro F1 0.802, acc 0.805) therefore cannot be unambiguously credited to the proposed mechanism, undermining the claim that non-verbal cue insertion meaningfully improves classification.

    Authors: We acknowledge that an ablation isolating the paralinguistic injection is needed to attribute performance gains specifically to this component. We will add this ablation to the Experiments section, comparing the full system against a variant that retains reasoning-enhanced training and data augmentation but omits cue injection. The revised results will include these numbers to clarify the contribution. revision: yes

  3. Referee: [Results] Results: The paper does not report dataset size, class balance, or comparisons against strong baselines (e.g., standard fine-tuned LLM without injection or reasoning auxiliary). These omissions make the 5-fold CV results difficult to interpret or reproduce and weaken support for the three-class performance claim.

    Authors: We will update the Results section to report the dataset size, class distribution, and additional baseline comparisons, including a standard fine-tuned LLM without paralinguistic injection or auxiliary reasoning. These details will improve interpretability and support reproducibility of the 5-fold cross-validation results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance from standard training and CV

full rationale

The paper describes an LLM classification pipeline that combines paralinguistic cue injection into transcripts, auxiliary reasoning-chain generation during training, and data augmentation. The headline numbers (macro F1 0.802, accuracy 0.805 under 5-fold CV) are presented strictly as measured outcomes of model training and evaluation on held-out folds. No equations, first-principles derivations, or closed-loop predictions are offered; the result is not obtained by fitting a parameter and then renaming the fit as a prediction, nor by any self-definitional or self-citation chain that reduces the claim to its own inputs. The method is therefore self-contained as an empirical ML experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work rests on standard assumptions of LLM fine-tuning and the unstated reliability of automatic paralinguistic feature extraction.

pith-pipeline@v0.9.0 · 5461 in / 1039 out tokens · 66234 ms · 2026-05-12T03:24:54.147929+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

  1. [1]

    If not promptly addressed, they may escalate into tragic outcomes such as self- harm, suicide, or even harm to others [1]

    Introduction Psychological crises are acute states in which an individual’s per- ceived difficulties exceed their coping capacity. If not promptly addressed, they may escalate into tragic outcomes such as self- harm, suicide, or even harm to others [1]. Given the potentially severe consequences of psychological crises, support hotlines play a critical rol...

  2. [2]

    Speech-based Psychological Crisis Assessment using LLMs

    Task and Dataset We collected the dataset from a Chinese psychological support hotline, consisting of 154 calls, where all calls were annotated by two experts, both the Chief Director of Hotline Assessment, into three classes: ( 0) no crisis, (1) low crisis, and (2) medium-to- high crisis, and inter-annotator agreement was assessed to verify annotation co...

  3. [3]

    I was forced to move out of the school dorm

    Methods 3.1. Overall Framework L L M Speech Crisis Level ASR SpeechLLM Rich Transcription Figure 2:The overall framework of our proposed method. Our proposed crisis classification framework processes con- versational hotline data through a two-step pipeline comprising multimodal data preprocessing and an LLM-based classifier. As illustrated in Fig. 2, the...

  4. [4]

    Evaluation Protocol A 5-fold cross-validation scheme was applied to ensure a robust evaluation

    Experimental Setup 4.1. Evaluation Protocol A 5-fold cross-validation scheme was applied to ensure a robust evaluation. In each fold, the data was partitioned into an 80% training set and a 20% test set. We report the mean and stan- dard deviation of the macro F1-score across all five folds as our primary performance metric. Accuracy is also reported. 4.2...

  5. [5]

    We evaluate the effectiveness of our proposed method and conduct ablation studies on key components

    Results and Discussion In this section, the empirical results of our proposed crisis classi- fication system alongside detailed discussions are presented. We evaluate the effectiveness of our proposed method and conduct ablation studies on key components. 5.1. Main Result We first evaluate the overall effectiveness of the proposed frame- work. The results...

  6. [6]

    Conclusion This paper proposes a novel framework for psychological crisis classification by fine-tuning LLM on authentic hotline calls. We introduced paralinguistic injection, which effectively capitalises on the superior text-processing capabilities of LLMs by mapping vital acoustic cues into textual markers, allowing LLM reasoning to jointly leverage se...

  7. [7]

    K. J. Richard, W. Julia, and A. M. Rick,Crisis Intervention Strate- gies, Chapter 1, 9th ed. OH, USA: Cengage, 2025

  8. [8]

    Crisis lines: Current status and recommendations for research and policy,

    S. Zabelski, A. R. Kaniuka, R. A. Robertson, and R. J. Cramer, “Crisis lines: Current status and recommendations for research and policy,”Psychiatric services, vol. 74, no. 5, pp. 505–512, 2023

  9. [9]

    Preventing suicide and self-harm,

    P. Tyson, C. Law, S. Reed, E. Johnsey, O. Aruna, and S. Hall, “Preventing suicide and self-harm,”Crisis, vol. 37, no. 5, pp. 353– 360, 2016

  10. [10]

    Hotline services in China during COVID-19 pandemic,

    J. Wang, H. Wei, and L. Zhou, “Hotline services in China during COVID-19 pandemic,”Journal of affective disorders, vol. 275, p. 125, 2020

  11. [11]

    The effectiveness of crisis line services: a systematic review,

    A. S. Hoffberg, K. A. Stearns-Yoder, and L. A. Brenner, “The effectiveness of crisis line services: a systematic review,”Frontiers in public health, vol. 7, p. 399, 2020

  12. [12]

    A study on the com- petence characteristics of psychological hotline counselors during the outbreak of COVID-19,

    L. You, X. Jia, Y . Ding, Q. An, and B. Li, “A study on the com- petence characteristics of psychological hotline counselors during the outbreak of COVID-19,”Frontiers in Psychology, vol. 12, p. 566460, 2021

  13. [13]

    Acoustic features for identifying suicide risk in crisis hotline callers: Ma- chine learning approach,

    Z. Su, H. Jiang, Y . Yang, X. Hou, Y . Su, and L. Yang, “Acoustic features for identifying suicide risk in crisis hotline callers: Ma- chine learning approach,”Journal of Medical Internet Research, vol. 27, p. e67772, 2025

  14. [14]

    A machine learning approach to identifying suicide risk among text-based crisis counseling encoun- ters,

    M. Broadbent, M. Medina Grespan, K. Axford, X. Zhang, V . Sriku- mar, B. Kious, and Z. Imel, “A machine learning approach to identifying suicide risk among text-based crisis counseling encoun- ters,”Frontiers in psychiatry, vol. 14, p. 1110527, 2023

  15. [15]

    Machine learning–based evaluation of suicide risk assessment in crisis counseling calls,

    Z. E. Imel, B. Pace, B. Pendergraft, J. Pruett, M. Tanana, C. S. Soma, K. A. Comtois, and D. C. Atkins, “Machine learning–based evaluation of suicide risk assessment in crisis counseling calls,” Psychiatric services, vol. 75, no. 11, pp. 1068–1074, 2024

  16. [16]

    An ex- ploratory deep learning approach for predicting subsequent suici- dal acts in chinese psychological support hotlines,

    C. Song, Q. Zhao, J. Li, Y . Chen, Y . Tong, and G. Fu, “An ex- ploratory deep learning approach for predicting subsequent suici- dal acts in chinese psychological support hotlines,”arXiv preprint arXiv:2408.16463, 2024

  17. [17]

    Large language models for mental health applications: Systematic review,

    Z. Guo, A. Lai, J. H. Thygesen, J. Farrington, T. Keen, and K. Li, “Large language models for mental health applications: Systematic review,”JMIR mental health, vol. 11, no. 1, p. e57400, 2024

  18. [18]

    Large language models improve alzheimer’s disease diagnosis using multi-modality data,

    Y . Feng, X. Xu, Y . Zhuang, and M. Zhang, “Large language models improve alzheimer’s disease diagnosis using multi-modality data,” inProc. MedAI, 2023

  19. [19]

    Efficient and effective fine-tuning method for depression detection from conversation,

    K. Ikeuchi, T. Kishimoto, F. Nakai, T. Horigome, M. Kitazawa, and T. Ohtsuki, “Efficient and effective fine-tuning method for depression detection from conversation,” inProc. EMBC, 2025

  20. [20]

    Advanced deep learning and large language models for suicide ideation detection on social media,

    M. Qorich and R. El Ouazzani, “Advanced deep learning and large language models for suicide ideation detection on social media,” Progress in Artificial Intelligence, vol. 13, no. 2, pp. 135–147, 2024

  21. [21]

    Deep learning and large language models for audio and text analysis in predicting suicidal acts in chinese psychological support hotlines,

    Y . Chen, J. Li, C. Song, Q. Zhao, Y . Tong, and G. Fu, “Deep learning and large language models for audio and text analysis in predicting suicidal acts in chinese psychological support hotlines,” arXiv preprint arXiv:2409.06164, 2024

  22. [22]

    Evaluating Large Language Models in cri- sis detection: A real-world benchmark from psychological support hotlines,

    G. Deng, S. Rao, T. Lin, A. Dai, P. Wang, J. Xie, H. Song, K. Zhao, D. Xu, Z. Chenget al., “Evaluating Large Language Models in cri- sis detection: A real-world benchmark from psychological support hotlines,”arXiv preprint arXiv:2506.01329, 2025

  23. [23]

    K. J. Richard, W. Julia, and A. M. Rick,Crisis Intervention Strate- gies, Chapter 3, 9th ed. OH, USA: Cengage, 2025

  24. [24]

    Enhancing depression detection with chain-of- thought prompting: From emotion to reasoning using large lan- guage models,

    S. Teng, J. Liu, R. K. Jain, S. Chai, R. Hou, T. Tateyama, L. Lin, and Y .-W. Chen, “Enhancing depression detection with chain-of- thought prompting: From emotion to reasoning using large lan- guage models,” inProc. EMBC, 2025

  25. [25]

    Cognitive-mental-LLM: Evaluating reasoning in large language models for mental health prediction via online text,

    A. Patil and A. K. Gedhu, “Cognitive-mental-LLM: Evaluating reasoning in large language models for mental health prediction via online text,”arXiv preprint arXiv:2503.10095, 2025

  26. [26]

    Qwen2. 5 technical report,

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Weiet al., “Qwen2. 5 technical report,”arXiv e-prints, pp. arXiv–2412, 2024

  27. [27]

    Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition.arXiv preprint arXiv:2206.08317, 2022

    Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End- to-End Speech Recognition,”arXiv preprint arXiv:2206.08317, 2022

  28. [28]

    Step-audio-r1 technical report, 2025

    F. Tian, X. T. Zhang, Y . Zhang, H. Zhang, Y . Li, D. Liu, Y . Deng, D. Wu, J. Chen, L. Zhaoet al., “Step-Audio-R1 technical report,” arXiv preprint arXiv:2511.15848, 2025

  29. [29]

    gpt-oss-120b & gpt-oss-20b Model Card

    S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y . Bai, B. Baker, H. Baoet al., “gpt-oss-120b & gpt-oss-20b model card,”arXiv preprint arXiv:2508.10925, 2025

  30. [30]

    The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,

    F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y . Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,”IEEE transactions on affective computing, vol. 7, no. 2, pp. 190–202, 2015

  31. [31]

    OpenSMILE: The Munich versatile and fast open-source audio feature extractor,

    F. Eyben, M. W¨ollmer, and B. Schuller, “OpenSMILE: The Munich versatile and fast open-source audio feature extractor,” inProc. ACM Multimedia, 2010

  32. [32]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Danget al., “Qwen2. 5-Omni technical report,”arXiv preprint arXiv:2503.20215, 2025

  33. [33]

    LoRA: Low-Rank Adaptation of Large Language Models

    Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wanget al., “LoRA: Low-Rank Adaptation of Large Language Models,”arXiv preprint arXiv:2106.09685, 2021

  34. [34]

    Bench- marking contextual and paralinguistic reasoning in speech-LLMs: A case study with in-the-wild data,

    Q. Wang, H. B. Sailor, T. Liu, W. Zhang, M. Huzaifah, N. Lertcheva, S. Sun, N. F. Chen, J. Wu, and A. Aw, “Bench- marking contextual and paralinguistic reasoning in speech-LLMs: A case study with in-the-wild data,” inProc. EMNLP, 2025

  35. [35]

    Incorporating contextual paralinguistic understanding in large speech-language models,

    Q. Wang, H. B. Sailor, J. H. Wong, T. Liu, S. Sun, W. Zhang, M. Huzaifah, N. Chen, and A. T. Aw, “Incorporating contextual paralinguistic understanding in large speech-language models,” in Proc. ASRU, 2025

  36. [36]

    Librispeech: an asr corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in2015 IEEE international conference on acoustics, speech and signal process- ing (ICASSP). IEEE, 2015, pp. 5206–5210

  37. [37]

    emotion2vec: Self-supervised pre-training for speech emotion representation,

    Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 15 747–15 760