
arxiv: 2605.17370 · v2 · pith:JITKR2EInew · submitted 2026-05-17 · 💻 cs.AI

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

Pith reviewed 2026-05-20 13:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords CBT-Audio dataset · audio language models · distress estimation · cognitive behavioral therapy · spoken sessions · multimodal AI · vocal cues · patient distress

The pith

Adding audio to transcripts improves distress estimates from CBT sessions in most audio language models tested.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds CBT-Audio, a collection of real spoken cognitive behavioral therapy recordings, to test whether audio language models estimate patient distress more accurately when they hear the patient than when they only read the transcript. It runs the same models on three input types: audio alone, transcript alone, and both together. The combined input beats the transcript alone in eight of the ten model families, with statistically significant gains in four, and the case studies show the largest lift when a patient's tone contradicts their words. This matters because therapists routinely use vocal cues to gauge distress and adapt their responses, yet most current AI systems for therapy work only from text and therefore miss this information.

Core claim

CBT-Audio supplies 1,802 patient turns drawn from 96 publicly available CBT session recordings together with turn-level distress intensity labels that were validated on an expert-annotated subset. Ten open-source audio language models were evaluated under three conditions. Supplying both audio and transcript improved distress estimation over transcript alone in eight of the ten model families, with statistically significant gains in four families. Case studies indicate the improvement is largest precisely when verbal content and vocal delivery diverge.

What carries the argument

The three-condition evaluation (audio only, transcript only, audio plus transcript) performed on the CBT-Audio dataset, which isolates the incremental value of vocal information for patient distress estimation.
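The comparison at the heart of this design can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes mean absolute error (MAE, the metric named in the Figure 2 caption) as the score, and the condition names, the toy 1 to 5 label scale, and the function names `mae` and `modality_gains` are all hypothetical.

```python
def mae(preds, labels):
    """Mean absolute error between predicted and reference distress scores."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def modality_gains(preds_by_condition, labels, baseline="transcript"):
    """Score every input condition and report its MAE improvement over the baseline.

    A positive gain means the condition has lower error than transcript-only.
    """
    scores = {cond: mae(p, labels) for cond, p in preds_by_condition.items()}
    base = scores[baseline]
    gains = {cond: base - s for cond, s in scores.items() if cond != baseline}
    return scores, gains


# Toy data only: four patient turns on an assumed 1-5 distress scale.
labels = [1, 4, 2, 5]
preds = {
    "audio": [2, 3, 2, 3],
    "transcript": [1, 2, 2, 3],
    "audio+transcript": [1, 3, 2, 4],
}
scores, gains = modality_gains(preds, labels)
print(scores)
print(gains)
```

Running the same model under all three conditions against the same labels is what lets the difference between the transcript and audio+transcript scores be read as the incremental value of the voice.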

If this is right

  • Audio language models can detect mismatches between what a patient says and how they say it, which text models miss by design.
  • Therapy-support AI can be built to use vocal cues when deciding how to respond or when to flag high-distress moments.
  • The public CBT-Audio dataset provides a shared benchmark for testing future audio models on mental-health interaction tasks.
  • Similar multimodal evaluations can be applied to other spoken clinical conversations where tone carries clinical meaning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time systems could listen to ongoing sessions and alert therapists to moments when vocal tone signals higher distress than the words alone suggest.
  • The same dataset and evaluation protocol could be extended to video to test whether facial expressions add still more signal beyond audio and text.
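The alerting idea in the first bullet could be prototyped roughly as follows. This is a speculative sketch under stated assumptions: the `TurnEstimate` record, the `flag_divergent_turns` helper, and the `min_gap` threshold are hypothetical, and the paper does not describe any such system.

```python
from dataclasses import dataclass


@dataclass
class TurnEstimate:
    """Distress estimates for one patient turn under two input conditions."""
    turn_id: str
    transcript_only: float
    audio_plus_transcript: float


def flag_divergent_turns(estimates, min_gap=1.0):
    """Return ids of turns where the audio-informed estimate exceeds
    the text-only estimate by at least min_gap (an assumed threshold)."""
    return [
        e.turn_id
        for e in estimates
        if e.audio_plus_transcript - e.transcript_only >= min_gap
    ]


# Toy estimates on an assumed 1-5 scale.
session = [
    TurnEstimate("t1", transcript_only=2.0, audio_plus_transcript=2.5),
    TurnEstimate("t2", transcript_only=1.5, audio_plus_transcript=4.0),
    TurnEstimate("t3", transcript_only=3.0, audio_plus_transcript=2.0),
]
flagged = flag_divergent_turns(session)
print(flagged)
```

Only turns where the voice suggests more distress than the words are flagged here; a real system would need a clinically validated threshold.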

Load-bearing premise

The turn-level distress labels assigned to every patient turn accurately reflect the patient's true state of distress.

What would settle it

Re-annotate the full set of 1,802 turns with multiple experts and re-run the model comparisons; if the performance advantage of audio-plus-transcript disappears on the expert-only labels, the central claim is falsified.
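One way to set up that check is sketched below. It assumes each turn receives several expert ratings that are combined by taking the median, and that the comparison metric is MAE; both choices, along with the function names, are assumptions made for illustration rather than details from the paper.

```python
from statistics import median


def expert_consensus(ratings_per_turn):
    """Collapse several expert ratings per turn into one label (median, by assumption)."""
    return [median(ratings) for ratings in ratings_per_turn]


def mean_abs_error(preds, labels):
    """Mean absolute error between predictions and reference labels."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def audio_gain(transcript_preds, combined_preds, labels):
    """Positive when audio+transcript has lower MAE than transcript alone."""
    return mean_abs_error(transcript_preds, labels) - mean_abs_error(combined_preds, labels)


# Toy data: three turns, three hypothetical expert ratings each.
expert_ratings = [[1, 2, 1], [4, 5, 4], [3, 3, 2]]
consensus = expert_consensus(expert_ratings)
gain = audio_gain([2, 2, 3], [1, 3, 3], consensus)
print(consensus, gain)
```

If the gain computed against expert consensus labels stays positive across model families, the central claim survives the test; if it vanishes, the original labels were likely doing the work.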

Figures

Figures reproduced from arXiv: 2605.17370 by Adam G. Dunn, Amit Saha, Anastasia Serafimovska, Anastasia Suraev, Jinman Kim, Ping-hsiu Lin, Qixuan Hu, Shuchang Ye, Sydney Su, Usman Naseem, Xumou Zhang.

Figure 1: Pipeline from publicly available CBT sessions to model evaluation: (1) Patient turn extraction using speaker …

Figure 2: Modality improvement landscape. The x-axis shows how much audio-only improves MAE over transcript …

Figure 3: Case studies showing how input condition affects distress estimation. Each case shows the preceding …

Figure 4: Label distribution across three independent GPT-audio-1.5 SSR runs and the final aggregated SSR label. The …

Figure 5: Comparison between SSR-based labels and direct numeric prompting.

Original abstract

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the CBT-Audio dataset of 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress intensity labels validated on an expert-annotated subset. It evaluates 10 open-source audio language models under three input conditions (audio only, transcript only, or both) and reports that adding audio to transcript input improves distress estimation over transcript alone in 8 of 10 model families, with significant gains in 4, and the clearest benefits in case studies where verbal content and vocal delivery diverge.

Significance. If the results hold under rigorous validation, this work provides a valuable new benchmark for multimodal audio language models in mental health, addressing the scarcity of spoken CBT data and highlighting the clinical relevance of vocal cues beyond text. The empirical focus on open-source models and the public dataset release are strengths that enable reproducible follow-up research.

major comments (2)
  1. Dataset description (abstract and methods): The distress labels are validated only on an expert-annotated subset rather than the full 1,802 turns. This is load-bearing for the central claim because systematic noise or bias in the unvalidated majority could artifactually produce the reported audio gains, especially in the case studies where verbal content and vocal delivery diverge.
  2. Results section (abstract claims): Performance gains are reported in 8 of 10 model families with significance in 4, yet no details are given on the statistical tests performed, error bars, data splits, or full validation process. This undermines assessment of whether the audio+transcript improvements are reliable.
minor comments (1)
  1. Abstract: Specify the size of the expert-annotated subset and the exact validation procedure to improve reproducibility.
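
As an illustration of the kind of statistical detail the second major comment asks for, the sketch below runs a paired sign-flip permutation test on per-turn absolute errors from two input conditions. The paper's actual test is not stated, so this is one plausible choice rather than the authors' procedure; the function name and parameters are hypothetical.

```python
import random


def paired_permutation_pvalue(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-turn error differences.

    errors_a and errors_b are absolute errors for the same turns under two
    conditions (e.g. transcript-only vs audio+transcript).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("error lists must be paired and equal in length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy errors: transcript-only is consistently worse on 12 turns.
p_clear = paired_permutation_pvalue([2.0] * 12, [0.0] * 12)
p_null = paired_permutation_pvalue([1.0] * 12, [1.0] * 12)
print(p_clear, p_null)
```

Turns from the same session are not independent, so a session-level resampling scheme may be more appropriate on real data; this sketch treats turns as exchangeable for simplicity.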

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. Below we provide point-by-point responses to the major comments and describe the revisions we plan to implement.

Point-by-point responses
  1. Referee: Dataset description (abstract and methods): The distress labels are validated only on an expert-annotated subset rather than the full 1,802 turns. This is load-bearing for the central claim because systematic noise or bias in the unvalidated majority could artifactually produce the reported audio gains, especially in the case studies where verbal content and vocal delivery diverge.

    Authors: We acknowledge the referee's concern regarding the validation of distress labels. The current manuscript indicates that labels were validated on an expert-annotated subset. To address this, we will revise the Methods section to describe in more detail how the labels were obtained for the full 1,802 turns and how the subset was selected for expert validation. We will also report model performance on the validated subset alone, to test whether the observed audio gains persist in this more rigorously labeled portion of the data, and we will add a discussion of potential label noise as a limitation. revision: yes

  2. Referee: Results section (abstract claims): Performance gains are reported in 8 of 10 model families with significance in 4, yet no details are given on the statistical tests performed, error bars, data splits, or full validation process. This undermines assessment of whether the audio+transcript improvements are reliable.

    Authors: We agree that additional details on the statistical analysis are necessary for a complete assessment. In the revised version, we will add comprehensive information on the statistical tests performed to determine significance, include error bars in all relevant figures and tables, specify the data splitting strategy used for evaluation, and elaborate on the full validation process. These additions will allow readers to better evaluate the reliability of the reported improvements when adding audio input. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical dataset introduction and model evaluation

full rationale

The paper presents CBT-Audio as a new dataset of 1,802 patient turns from CBT recordings, with turn-level distress labels, and reports direct empirical comparisons of 10 audio language models under audio-only, transcript-only, and combined inputs. No derivation chain, equations, fitted parameters, or predictions are claimed. Results (audio plus transcript improves over transcript alone in 8 of 10 families) are measured against the introduced labels and do not reduce to self-referential quantities or rest on load-bearing self-citations. The evaluation applies externally developed open-source models to newly introduced data, which satisfies the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical dataset and evaluation paper with no mathematical derivations or new physical entities; it relies on standard assumptions about label quality and data representativeness.

axioms (1)
  • domain assumption: The expert-annotated subset provides reliable validation for turn-level distress labels across the full dataset.
    The paper depends on this to claim label quality without full expert review.

pith-pipeline@v0.9.0 · 5849 in / 1230 out tokens · 69697 ms · 2026-05-20T13:19:24.982987+00:00 · methodology

