Recognition: no theorem link
Speech-based Psychological Crisis Assessment using LLMs
Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3
The pith
LLM classifies psychological crisis levels from speech by injecting non-verbal cues into transcripts and training on reasoning chains, reaching 80.5% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By injecting identified non-verbal emotional cues into speech transcripts and training the model to generate diagnostic reasoning chains as an auxiliary task, combined with data augmentation, the LLM-based system achieves a macro F1-score of 0.802 and accuracy of 0.805 on three-class crisis classification under 5-fold cross-validation.
What carries the argument
Paralinguistic injection method that inserts non-verbal emotional cues into speech transcripts, paired with reasoning-enhanced training that uses diagnostic reasoning chain generation as an auxiliary task to regularize the classification.
If this is right
- The system supplies consistent crisis-level judgments to support variable human operators.
- It reduces dependence on limited staffing in psychological hotlines.
- The framework raises overall service quality by handling more calls with stable assessment.
- Data augmentation compensates for scarce labeled hotline examples.
Where Pith is reading between the lines
- The injection technique might transfer to other spoken domains where tone signals intent, such as therapy transcripts or emergency dispatch.
- End-to-end speech pipelines could become feasible once cue detection is fully automated rather than separate.
- Performance gains may vary across languages or different base LLMs if cue reliability differs.
Load-bearing premise
Non-verbal emotional cues can be reliably identified from speech and inserting them into text transcripts plus auxiliary reasoning training meaningfully improves LLM classification without introducing new errors or biases.
What would settle it
Ablating the paralinguistic cue injection and the reasoning-chain auxiliary task on the same hotline dataset and checking whether macro F1 falls substantially below 0.802 would test whether these additions drive the reported performance.
Figures
read the original abstract
Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an LLM-based framework for three-class psychological crisis level classification from speech. It introduces a paralinguistic injection method that identifies and inserts non-verbal emotional cues into speech transcripts, a reasoning-enhanced training strategy that uses diagnostic reasoning chains as an auxiliary task, and data augmentation. The final system is reported to achieve a macro F1-score of 0.802 and accuracy of 0.805 under 5-fold cross-validation.
Significance. If the performance claims are supported by proper controls and ablations, the work could contribute to scalable automated assessment tools for mental health hotlines, reducing variability from human operators. The paralinguistic injection approach offers a concrete way to incorporate acoustic information into text-only LLMs, which may be useful for other speech-based affective tasks. The use of auxiliary reasoning as regularization is a standard but well-motivated technique here.
major comments (3)
- [Methods] Methods section: The paralinguistic cue identification and insertion procedure is described at a high level but supplies no standalone accuracy, error rate, or feature details for the cue detector. This is load-bearing for the central claim because the abstract attributes the 0.802 F1 to the combination of injection, reasoning training, and augmentation; without isolated validation, the contribution of acoustic-nuance injection cannot be separated from transcription artifacts or label noise.
- [Experiments] Experiments/Results: No ablation is reported that removes only the paralinguistic injection while holding reasoning-enhanced training and data augmentation fixed. The headline numbers (macro F1 0.802, acc 0.805) therefore cannot be unambiguously credited to the proposed mechanism, undermining the claim that non-verbal cue insertion meaningfully improves classification.
- [Results] Results: The paper does not report dataset size, class balance, or comparisons against strong baselines (e.g., standard fine-tuned LLM without injection or reasoning auxiliary). These omissions make the 5-fold CV results difficult to interpret or reproduce and weaken support for the three-class performance claim.
minor comments (2)
- [Abstract] Abstract: Add a brief statement of dataset size and the base LLM used to give readers immediate context for the reported F1 and accuracy.
- Notation: Ensure consistent use of terms such as 'paralinguistic injection' versus 'cue insertion' across sections to avoid minor ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.
read point-by-point responses
-
Referee: [Methods] Methods section: The paralinguistic cue identification and insertion procedure is described at a high level but supplies no standalone accuracy, error rate, or feature details for the cue detector. This is load-bearing for the central claim because the abstract attributes the 0.802 F1 to the combination of injection, reasoning training, and augmentation; without isolated validation, the contribution of acoustic-nuance injection cannot be separated from transcription artifacts or label noise.
Authors: We agree that the current description of the paralinguistic cue detector is insufficiently detailed. In the revised manuscript, we will expand the Methods section to report the standalone accuracy and error rates of the cue identification model on a validation set, along with the specific acoustic features and detection procedure used. This addition will allow readers to better assess the reliability of the injection step. revision: yes
-
Referee: [Experiments] Experiments/Results: No ablation is reported that removes only the paralinguistic injection while holding reasoning-enhanced training and data augmentation fixed. The headline numbers (macro F1 0.802, acc 0.805) therefore cannot be unambiguously credited to the proposed mechanism, undermining the claim that non-verbal cue insertion meaningfully improves classification.
Authors: We acknowledge that an ablation isolating the paralinguistic injection is needed to attribute performance gains specifically to this component. We will add this ablation to the Experiments section, comparing the full system against a variant that retains reasoning-enhanced training and data augmentation but omits cue injection. The revised results will include these numbers to clarify the contribution. revision: yes
-
Referee: [Results] Results: The paper does not report dataset size, class balance, or comparisons against strong baselines (e.g., standard fine-tuned LLM without injection or reasoning auxiliary). These omissions make the 5-fold CV results difficult to interpret or reproduce and weaken support for the three-class performance claim.
Authors: We will update the Results section to report the dataset size, class distribution, and additional baseline comparisons, including a standard fine-tuned LLM without paralinguistic injection or auxiliary reasoning. These details will improve interpretability and support reproducibility of the 5-fold cross-validation results. revision: yes
Circularity Check
No circularity: empirical performance from standard training and CV
full rationale
The paper describes an LLM classification pipeline that combines paralinguistic cue injection into transcripts, auxiliary reasoning-chain generation during training, and data augmentation. The headline numbers (macro F1 0.802, accuracy 0.805 under 5-fold CV) are presented strictly as measured outcomes of model training and evaluation on held-out folds. No equations, first-principles derivations, or closed-loop predictions are offered; the result is not obtained by fitting a parameter and then renaming the fit as a prediction, nor by any self-definitional or self-citation chain that reduces the claim to its own inputs. The method is therefore self-contained as an empirical ML experiment.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction Psychological crises are acute states in which an individual’s per- ceived difficulties exceed their coping capacity. If not promptly addressed, they may escalate into tragic outcomes such as self- harm, suicide, or even harm to others [1]. Given the potentially severe consequences of psychological crises, support hotlines play a critical rol...
-
[2]
Speech-based Psychological Crisis Assessment using LLMs
Task and Dataset We collected the dataset from a Chinese psychological support hotline, consisting of 154 calls, where all calls were annotated by two experts, both the Chief Director of Hotline Assessment, into three classes: ( 0) no crisis, (1) low crisis, and (2) medium-to- high crisis, and inter-annotator agreement was assessed to verify annotation co...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
I was forced to move out of the school dorm
Methods 3.1. Overall Framework L L M Speech Crisis Level ASR SpeechLLM Rich Transcription Figure 2:The overall framework of our proposed method. Our proposed crisis classification framework processes con- versational hotline data through a two-step pipeline comprising multimodal data preprocessing and an LLM-based classifier. As illustrated in Fig. 2, the...
-
[4]
Evaluation Protocol A 5-fold cross-validation scheme was applied to ensure a robust evaluation
Experimental Setup 4.1. Evaluation Protocol A 5-fold cross-validation scheme was applied to ensure a robust evaluation. In each fold, the data was partitioned into an 80% training set and a 20% test set. We report the mean and stan- dard deviation of the macro F1-score across all five folds as our primary performance metric. Accuracy is also reported. 4.2...
-
[5]
We evaluate the effectiveness of our proposed method and conduct ablation studies on key components
Results and Discussion In this section, the empirical results of our proposed crisis classi- fication system alongside detailed discussions are presented. We evaluate the effectiveness of our proposed method and conduct ablation studies on key components. 5.1. Main Result We first evaluate the overall effectiveness of the proposed frame- work. The results...
-
[6]
Conclusion This paper proposes a novel framework for psychological crisis classification by fine-tuning LLM on authentic hotline calls. We introduced paralinguistic injection, which effectively capitalises on the superior text-processing capabilities of LLMs by mapping vital acoustic cues into textual markers, allowing LLM reasoning to jointly leverage se...
-
[7]
K. J. Richard, W. Julia, and A. M. Rick,Crisis Intervention Strate- gies, Chapter 1, 9th ed. OH, USA: Cengage, 2025
work page 2025
-
[8]
Crisis lines: Current status and recommendations for research and policy,
S. Zabelski, A. R. Kaniuka, R. A. Robertson, and R. J. Cramer, “Crisis lines: Current status and recommendations for research and policy,”Psychiatric services, vol. 74, no. 5, pp. 505–512, 2023
work page 2023
-
[9]
Preventing suicide and self-harm,
P. Tyson, C. Law, S. Reed, E. Johnsey, O. Aruna, and S. Hall, “Preventing suicide and self-harm,”Crisis, vol. 37, no. 5, pp. 353– 360, 2016
work page 2016
-
[10]
Hotline services in China during COVID-19 pandemic,
J. Wang, H. Wei, and L. Zhou, “Hotline services in China during COVID-19 pandemic,”Journal of affective disorders, vol. 275, p. 125, 2020
work page 2020
-
[11]
The effectiveness of crisis line services: a systematic review,
A. S. Hoffberg, K. A. Stearns-Yoder, and L. A. Brenner, “The effectiveness of crisis line services: a systematic review,”Frontiers in public health, vol. 7, p. 399, 2020
work page 2020
-
[12]
L. You, X. Jia, Y . Ding, Q. An, and B. Li, “A study on the com- petence characteristics of psychological hotline counselors during the outbreak of COVID-19,”Frontiers in Psychology, vol. 12, p. 566460, 2021
work page 2021
-
[13]
Z. Su, H. Jiang, Y . Yang, X. Hou, Y . Su, and L. Yang, “Acoustic features for identifying suicide risk in crisis hotline callers: Ma- chine learning approach,”Journal of Medical Internet Research, vol. 27, p. e67772, 2025
work page 2025
-
[14]
M. Broadbent, M. Medina Grespan, K. Axford, X. Zhang, V . Sriku- mar, B. Kious, and Z. Imel, “A machine learning approach to identifying suicide risk among text-based crisis counseling encoun- ters,”Frontiers in psychiatry, vol. 14, p. 1110527, 2023
work page 2023
-
[15]
Machine learning–based evaluation of suicide risk assessment in crisis counseling calls,
Z. E. Imel, B. Pace, B. Pendergraft, J. Pruett, M. Tanana, C. S. Soma, K. A. Comtois, and D. C. Atkins, “Machine learning–based evaluation of suicide risk assessment in crisis counseling calls,” Psychiatric services, vol. 75, no. 11, pp. 1068–1074, 2024
work page 2024
-
[16]
C. Song, Q. Zhao, J. Li, Y . Chen, Y . Tong, and G. Fu, “An ex- ploratory deep learning approach for predicting subsequent suici- dal acts in chinese psychological support hotlines,”arXiv preprint arXiv:2408.16463, 2024
-
[17]
Large language models for mental health applications: Systematic review,
Z. Guo, A. Lai, J. H. Thygesen, J. Farrington, T. Keen, and K. Li, “Large language models for mental health applications: Systematic review,”JMIR mental health, vol. 11, no. 1, p. e57400, 2024
work page 2024
-
[18]
Large language models improve alzheimer’s disease diagnosis using multi-modality data,
Y . Feng, X. Xu, Y . Zhuang, and M. Zhang, “Large language models improve alzheimer’s disease diagnosis using multi-modality data,” inProc. MedAI, 2023
work page 2023
-
[19]
Efficient and effective fine-tuning method for depression detection from conversation,
K. Ikeuchi, T. Kishimoto, F. Nakai, T. Horigome, M. Kitazawa, and T. Ohtsuki, “Efficient and effective fine-tuning method for depression detection from conversation,” inProc. EMBC, 2025
work page 2025
-
[20]
Advanced deep learning and large language models for suicide ideation detection on social media,
M. Qorich and R. El Ouazzani, “Advanced deep learning and large language models for suicide ideation detection on social media,” Progress in Artificial Intelligence, vol. 13, no. 2, pp. 135–147, 2024
work page 2024
-
[21]
Y . Chen, J. Li, C. Song, Q. Zhao, Y . Tong, and G. Fu, “Deep learning and large language models for audio and text analysis in predicting suicidal acts in chinese psychological support hotlines,” arXiv preprint arXiv:2409.06164, 2024
-
[22]
G. Deng, S. Rao, T. Lin, A. Dai, P. Wang, J. Xie, H. Song, K. Zhao, D. Xu, Z. Chenget al., “Evaluating Large Language Models in cri- sis detection: A real-world benchmark from psychological support hotlines,”arXiv preprint arXiv:2506.01329, 2025
-
[23]
K. J. Richard, W. Julia, and A. M. Rick,Crisis Intervention Strate- gies, Chapter 3, 9th ed. OH, USA: Cengage, 2025
work page 2025
-
[24]
S. Teng, J. Liu, R. K. Jain, S. Chai, R. Hou, T. Tateyama, L. Lin, and Y .-W. Chen, “Enhancing depression detection with chain-of- thought prompting: From emotion to reasoning using large lan- guage models,” inProc. EMBC, 2025
work page 2025
-
[25]
A. Patil and A. K. Gedhu, “Cognitive-mental-LLM: Evaluating reasoning in large language models for mental health prediction via online text,”arXiv preprint arXiv:2503.10095, 2025
-
[26]
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Weiet al., “Qwen2. 5 technical report,”arXiv e-prints, pp. arXiv–2412, 2024
work page 2024
-
[27]
Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End- to-End Speech Recognition,”arXiv preprint arXiv:2206.08317, 2022
-
[28]
Step-audio-r1 technical report, 2025
F. Tian, X. T. Zhang, Y . Zhang, H. Zhang, Y . Li, D. Liu, Y . Deng, D. Wu, J. Chen, L. Zhaoet al., “Step-Audio-R1 technical report,” arXiv preprint arXiv:2511.15848, 2025
-
[29]
gpt-oss-120b & gpt-oss-20b Model Card
S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y . Bai, B. Baker, H. Baoet al., “gpt-oss-120b & gpt-oss-20b model card,”arXiv preprint arXiv:2508.10925, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,
F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. Andr ´e, C. Busso, L. Y . Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,”IEEE transactions on affective computing, vol. 7, no. 2, pp. 190–202, 2015
work page 2015
-
[31]
OpenSMILE: The Munich versatile and fast open-source audio feature extractor,
F. Eyben, M. W¨ollmer, and B. Schuller, “OpenSMILE: The Munich versatile and fast open-source audio feature extractor,” inProc. ACM Multimedia, 2010
work page 2010
-
[32]
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Danget al., “Qwen2. 5-Omni technical report,”arXiv preprint arXiv:2503.20215, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
LoRA: Low-Rank Adaptation of Large Language Models
Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wanget al., “LoRA: Low-Rank Adaptation of Large Language Models,”arXiv preprint arXiv:2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[34]
Q. Wang, H. B. Sailor, T. Liu, W. Zhang, M. Huzaifah, N. Lertcheva, S. Sun, N. F. Chen, J. Wu, and A. Aw, “Bench- marking contextual and paralinguistic reasoning in speech-LLMs: A case study with in-the-wild data,” inProc. EMNLP, 2025
work page 2025
-
[35]
Incorporating contextual paralinguistic understanding in large speech-language models,
Q. Wang, H. B. Sailor, J. H. Wong, T. Liu, S. Sun, W. Zhang, M. Huzaifah, N. Chen, and A. T. Aw, “Incorporating contextual paralinguistic understanding in large speech-language models,” in Proc. ASRU, 2025
work page 2025
-
[36]
Librispeech: an asr corpus based on public domain audio books,
V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in2015 IEEE international conference on acoustics, speech and signal process- ing (ICASSP). IEEE, 2015, pp. 5206–5210
work page 2015
-
[37]
emotion2vec: Self-supervised pre-training for speech emotion representation,
Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 15 747–15 760
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.