CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings
Pith reviewed 2026-05-20 13:19 UTC · model grok-4.3
The pith
Adding audio to transcripts improves distress estimates from CBT sessions in most audio language models tested.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CBT-Audio supplies 1,802 patient turns drawn from 96 publicly available CBT session recordings, together with turn-level distress intensity labels that were validated on an expert-annotated subset. Ten open-source audio language models were evaluated under three input conditions: audio only, transcript only, and audio plus transcript. Supplying both audio and transcript improved distress estimation over transcript alone in eight of the ten model families, with statistically significant gains in four. Case studies indicate the benefit is clearest when verbal content and vocal delivery diverge.
What carries the argument
The three-condition evaluation (audio only, transcript only, audio plus transcript) performed on the CBT-Audio dataset, which isolates the incremental value of vocal information for patient distress estimation.
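A minimal sketch of how such a three-condition comparison could be organised. Everything here is an illustrative assumption rather than the paper's implementation: the `Turn` fields, the `Predictor` interface, the condition names, and the use of mean absolute error as the metric (the summary above does not state the paper's metric or prompting setup).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    audio_path: str   # hypothetical path to the patient's audio clip
    transcript: str   # transcript of the same turn
    label: float      # turn-level distress intensity label


# Hypothetical model interface: either input may be withheld by passing None.
Predictor = Callable[[Optional[str], Optional[str]], float]

# Each condition decides which inputs the model is allowed to see.
CONDITIONS = {
    "audio_only": lambda t: (t.audio_path, None),
    "transcript_only": lambda t: (None, t.transcript),
    "audio_plus_transcript": lambda t: (t.audio_path, t.transcript),
}


def mean_absolute_error(predict: Predictor, turns, condition: str) -> float:
    if not turns:
        raise ValueError("need at least one turn")
    select = CONDITIONS[condition]
    errors = [abs(predict(*select(t)) - t.label) for t in turns]
    return sum(errors) / len(errors)


def audio_gain(predict: Predictor, turns) -> float:
    """Positive value: adding audio lowered the error versus transcript alone."""
    return (mean_absolute_error(predict, turns, "transcript_only")
            - mean_absolute_error(predict, turns, "audio_plus_transcript"))


# Toy stand-in for a model: words and "voice" each contribute to the score.
def toy_predict(audio, text):
    score = 2.0
    if text is not None and "awful" in text:
        score += 1.0
    if audio is not None and "tense" in audio:
        score += 2.0
    return score


toy_turns = [
    Turn("tense_voice.wav", "I'm fine", 4.0),      # words and voice diverge
    Turn("calm_voice.wav", "It was awful", 3.0),   # words carry the signal
]
```

With the toy data, the first turn is the divergent case: the transcript alone underestimates distress, and the error disappears once the audio input is supplied.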
If this is right
- Audio language models can detect mismatches between what a patient says and how they say it, which text models miss by design.
- Therapy-support AI can be built to use vocal cues when deciding how to respond or when to flag high-distress moments.
- The public CBT-Audio dataset provides a shared benchmark for testing future audio models on mental-health interaction tasks.
- Similar multimodal evaluations can be applied to other spoken clinical conversations where tone carries clinical meaning.
Where Pith is reading between the lines
- Real-time systems could listen to ongoing sessions and alert therapists to moments when vocal tone signals higher distress than the words alone suggest.
- The same dataset and evaluation protocol could be extended to video to test whether facial expressions add still more signal beyond audio and text.
Load-bearing premise
The turn-level distress labels assigned to all 1,802 patient turns accurately reflect the patient's true distress, even though only a subset was validated against expert annotation.
What would settle it
Re-annotate the full set of 1,802 turns with multiple experts and re-run the model comparisons; if the performance advantage of audio-plus-transcript disappears on the expert-only labels, the central claim is falsified.
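The check described above could be sketched as follows. This is a hedged illustration only: collapsing several expert ratings into one label by taking the median, and scoring with mean absolute error, are assumptions made for the example, and the dictionaries keyed by turn id are hypothetical data structures.

```python
from statistics import median


def consensus_labels(expert_ratings):
    """Collapse several expert ratings per turn into one label (median)."""
    return {turn_id: median(ratings) for turn_id, ratings in expert_ratings.items()}


def mae(preds, labels):
    """Mean absolute error over the turns that have an expert label."""
    return sum(abs(preds[t] - labels[t]) for t in labels) / len(labels)


def gain_on_expert_labels(text_preds, audio_text_preds, expert_ratings):
    """Error reduction from adding audio, scored against expert labels only."""
    labels = consensus_labels(expert_ratings)
    return mae(text_preds, labels) - mae(audio_text_preds, labels)


# Toy data: three hypothetical experts rate two turns.
expert_ratings = {"t1": [4, 5, 4], "t2": [1, 2, 2]}
text_preds = {"t1": 2, "t2": 2}
audio_text_preds = {"t1": 4, "t2": 3}
gain = gain_on_expert_labels(text_preds, audio_text_preds, expert_ratings)
```

A gain that stays positive on expert-only labels would support the central claim; a gain at or below zero would count against it.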
Original abstract
Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the CBT-Audio dataset of 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress intensity labels validated on an expert-annotated subset. It evaluates 10 open-source audio language models under three input conditions (audio only, transcript only, or both) and reports that adding audio to transcript input improves distress estimation over transcript alone in 8 of 10 model families, with significant gains in 4. Case studies show the clearest benefit where verbal content and vocal delivery diverge.
Significance. If the results hold under rigorous validation, this work provides a valuable new benchmark for multimodal audio language models in mental health, addressing the scarcity of spoken CBT data and highlighting the clinical relevance of vocal cues beyond text. The empirical focus on open-source models and the public dataset release are strengths that enable reproducible follow-up research.
Major comments (2)
- Dataset description (abstract and methods): The distress labels are validated only on an expert-annotated subset rather than on the full 1,802 turns. This is load-bearing for the central claim because systematic noise or bias in the unvalidated majority could artifactually produce the reported audio gains, especially in the case studies where verbal content and vocal delivery diverge.
- Results section (abstract claims): Performance gains are reported in 8 of 10 model families, with significance in 4, yet no details are given on the statistical tests performed, error bars, data splits, or the full validation process. This makes it hard to assess whether the audio-plus-transcript improvements are reliable.
Minor comments (1)
- Abstract: Specify the size of the expert-annotated subset and the exact validation procedure to improve reproducibility.
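As an illustration of what a reported validation procedure might contain, the sketch below compares dataset labels with expert labels on a subset. It is not the paper's procedure, which the abstract does not describe. The function name, the integer label scale, and the choice of exact agreement, within-tolerance agreement, and unweighted Cohen's kappa are all assumptions for the example; a weighted kappa or a rank correlation may suit an ordinal scale better.

```python
from collections import Counter


def agreement_report(dataset_labels, expert_labels, tolerance=1):
    """Summarise agreement between dataset labels and expert labels."""
    if len(dataset_labels) != len(expert_labels) or not dataset_labels:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(dataset_labels, expert_labels))
    n = len(pairs)
    exact = sum(a == b for a, b in pairs) / n
    within = sum(abs(a - b) <= tolerance for a, b in pairs) / n
    # Chance agreement for Cohen's kappa: product of marginal label frequencies.
    freq_a, freq_b = Counter(dataset_labels), Counter(expert_labels)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    kappa = 1.0 if chance == 1.0 else (exact - chance) / (1.0 - chance)
    return {"n": n, "exact": exact, "within_tolerance": within, "kappa": kappa}


# Toy subset of four turns on a hypothetical integer scale.
report = agreement_report([1, 2, 3, 4], [1, 3, 3, 2])
```

Reporting the subset size `n` alongside the agreement figures covers both items requested in the minor comment.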
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. Below we provide point-by-point responses to the major comments and describe the revisions we plan to implement.
- Referee: Dataset description (abstract and methods): The distress labels are validated only on an expert-annotated subset rather than on the full 1,802 turns. This is load-bearing for the central claim because systematic noise or bias in the unvalidated majority could artifactually produce the reported audio gains, especially in the case studies where verbal content and vocal delivery diverge.
  Authors: We acknowledge the referee's concern regarding the validation of distress labels. The current manuscript indicates that labels were validated on an expert-annotated subset. To address this, we will revise the Methods section to describe in more detail how the labels were obtained for the full 1,802 turns and how the subset was selected for expert validation. We will also add experiments reporting model performance exclusively on the validated subset, to test whether the observed audio gains persist in this more rigorously labeled portion of the data, and we will discuss potential label noise as a limitation.
  Revision: yes
- Referee: Results section (abstract claims): Performance gains are reported in 8 of 10 model families, with significance in 4, yet no details are given on the statistical tests performed, error bars, data splits, or the full validation process. This makes it hard to assess whether the audio-plus-transcript improvements are reliable.
  Authors: We agree that additional details on the statistical analysis are necessary for a complete assessment. In the revised version, we will describe the statistical tests used to determine significance, include error bars in all relevant figures and tables, specify the data splitting strategy used for evaluation, and elaborate on the full validation process. These additions will allow readers to better evaluate the reliability of the reported improvements from adding audio input.
  Revision: yes
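One way the requested significance test and error bars could be produced is a paired bootstrap over turns, sketched below. This is an assumption about a reasonable analysis, not the test the authors used, which the source does not name. The function name, the per-turn absolute-error inputs, and the resample count are hypothetical.

```python
import random


def paired_bootstrap(err_text, err_audio_text, n_resamples=2000, seed=0):
    """Bootstrap the mean per-turn error reduction from adding audio."""
    if len(err_text) != len(err_audio_text) or not err_text:
        raise ValueError("need two non-empty error lists of equal length")
    # Paired differences: a positive value means audio helped on that turn.
    diffs = [a - b for a, b in zip(err_text, err_audio_text)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval, using integer arithmetic for the indices.
    lo = means[(25 * n_resamples) // 1000]
    hi = means[min(n_resamples - 1, (975 * n_resamples) // 1000)]
    return {
        "mean_gain": sum(diffs) / n,
        "ci95": (lo, hi),
        # Share of resamples in which audio did not help at all.
        "frac_not_positive": sum(m <= 0 for m in means) / n_resamples,
    }


# Toy input where audio lowers the error by exactly 1.0 on every turn.
result = paired_bootstrap([2.0] * 5, [1.0] * 5, n_resamples=200)
```

Because turns from the same recording are unlikely to be independent, resampling whole recordings rather than individual turns may give more honest intervals; the sketch resamples turns only to stay short.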
Circularity Check
No circularity: purely empirical dataset introduction and model evaluation
The paper presents CBT-Audio as a new dataset of 1,802 patient turns from CBT recordings with turn-level distress labels, and reports direct empirical comparisons of 10 audio language models under audio-only, transcript-only, and combined inputs. It claims no derivation chain, equations, fitted parameters, or predictions. The results (audio plus transcript improves over transcript alone in 8 of 10 families) are measured against the newly introduced labels and do not reduce to previously fitted quantities or rest on load-bearing self-citations. The evaluated systems are external open-source models and the data are new, so the default expectation of no significant circularity holds.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the expert-annotated subset provides reliable validation for turn-level distress labels across the full dataset.
Appendix A: Source Recordings
Table 3 lists the 96 publicly available CBT educational recordings used to construct CBT-Audio. ...
Prompt excerpt:
- TEXT context (previous conversation turns) - to understand the topic being discussed
- AUDIO clip (patient's current turn) - to assess their emotional state
Your task: Based on the AUDIO, describe how the patient FEELS emotionally. Use the audio (tone, pace, voice quality) as evidence, but describe their emotional state, not just how they sound.
The context is provided only to help you understand the conversation topic.
Examples of good d...
"The patient feels anxious and stressed, struggling to maintain composure"