CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

Adam G. Dunn; Amit Saha; Anastasia Serafimovska; Anastasia Suraev; Jinman Kim; Ping-hsiu Lin; Qixuan Hu; Shuchang Ye; Sydney Su; Usman Naseem

arxiv: 2605.17370 · v2 · pith:JITKR2EInew · submitted 2026-05-17 · 💻 cs.AI

Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim

Pith reviewed 2026-05-20 13:19 UTC · model grok-4.3

keywords: CBT-Audio dataset, audio language models, distress estimation, cognitive behavioral therapy, spoken sessions, multimodal AI, vocal cues, patient distress

The pith

Adding audio to transcripts improves distress estimates from CBT sessions in most audio language models tested.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds CBT-Audio, a collection of real spoken cognitive behavioral therapy recordings, to test whether audio language models can judge patient distress more accurately than text-only models. It runs the same models on three input types: audio alone, transcript alone, and both together. Results show that the combined input beats the transcript alone for eight of the ten model families, with clear gains in four, and the largest lift occurs when a patient's tone contradicts their words. A sympathetic reader would care because therapists routinely use vocal cues to gauge distress and adapt their responses, yet most current AI systems for therapy work only from text and therefore miss this information.

Core claim

CBT-Audio supplies 1,802 patient turns drawn from 96 publicly available CBT session recordings together with turn-level distress intensity labels that were validated on an expert-annotated subset. Ten open-source audio language models were evaluated under three conditions. Supplying both audio and transcript improved distress estimation over transcript alone in eight of the ten model families, with statistically significant gains in four families. Case studies indicate the improvement is largest precisely when verbal content and vocal delivery diverge.
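As a rough sanity check on the "eight of ten" count, a one-sided sign test asks how often at least eight of ten families would improve if audio made no difference, so that each family were equally likely to improve or worsen. This is an illustrative calculation only. It is not the significance test the paper reports, and treating the ten model families as independent is an assumption.

```python
from math import comb


def sign_test_p(successes: int, n: int) -> float:
    """One-sided P(X >= successes) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


# 8 of 10 families improved when audio was added to the transcript.
p_one_sided = sign_test_p(8, 10)
print(f"P(at least 8 of 10 improve by chance) = {p_one_sided:.4f}")  # 0.0547
```

Under these assumptions the count alone sits just above the conventional 0.05 threshold, which is one reason the per-family significance results matter more than the tally.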

What carries the argument

The three-condition evaluation (audio only, transcript only, audio plus transcript) performed on the CBT-Audio dataset, which isolates the incremental value of vocal information for patient distress estimation.
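The comparison can be sketched as follows. This is a minimal illustration, assuming predictions and labels on a shared numeric distress scale and mean absolute error (MAE) as the metric; the toy values, the 1 to 5 scale, and the condition names are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def mae(preds, labels):
    """Mean absolute error between predicted and labelled distress."""
    return mean(abs(p - y) for p, y in zip(preds, labels))


# Hypothetical turn-level labels and one model's predictions per condition.
labels = [1, 3, 4, 2, 5]
predictions = {
    "audio_only": [2, 3, 3, 2, 4],
    "transcript_only": [1, 2, 2, 2, 3],
    "audio_plus_transcript": [1, 3, 3, 2, 4],
}

mae_by_condition = {cond: mae(p, labels) for cond, p in predictions.items()}

# Incremental value of audio: how much MAE drops when audio joins the transcript.
audio_gain = (
    mae_by_condition["transcript_only"]
    - mae_by_condition["audio_plus_transcript"]
)
print(mae_by_condition, audio_gain)
```

A positive `audio_gain` means the combined input has lower error than the transcript alone, which is the direction the paper reports for eight of ten families.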

If this is right

Audio language models can detect mismatches between what a patient says and how they say it, which text models miss by design.
Therapy-support AI can be built to use vocal cues when deciding how to respond or when to flag high-distress moments.
The public CBT-Audio dataset provides a shared benchmark for testing future audio models on mental-health interaction tasks.
Similar multimodal evaluations can be applied to other spoken clinical conversations where tone carries clinical meaning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time systems could listen to ongoing sessions and alert therapists to moments when vocal tone signals higher distress than the words alone suggest.
The same dataset and evaluation protocol could be extended to video to test whether facial expressions add still more signal beyond audio and text.

Load-bearing premise

The turn-level distress labels assigned to every patient turn accurately reflect the patient's true state of distress.

What would settle it

Re-annotate the full set of 1,802 turns with multiple experts and re-run the model comparisons; if the performance advantage of audio-plus-transcript disappears on the expert-only labels, the central claim is falsified.
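A minimal sketch of that check, assuming each turn carries both the released label and a hypothetical expert re-annotation. The field names and values below are invented for illustration; no such re-annotation exists in the source.

```python
from statistics import mean


def audio_gain(turns, label_key):
    """MAE(transcript only) minus MAE(audio + transcript) against one label set."""
    if not turns:
        raise ValueError("need at least one turn")
    mae_text = mean(abs(t["pred_transcript"] - t[label_key]) for t in turns)
    mae_both = mean(abs(t["pred_both"] - t[label_key]) for t in turns)
    return mae_text - mae_both


# Hypothetical turns: released label, expert re-annotation, two model outputs.
turns = [
    {"label": 4, "expert_label": 4, "pred_transcript": 2, "pred_both": 4},
    {"label": 2, "expert_label": 3, "pred_transcript": 2, "pred_both": 3},
    {"label": 5, "expert_label": 4, "pred_transcript": 3, "pred_both": 4},
]

gain_released = audio_gain(turns, "label")
gain_expert = audio_gain(turns, "expert_label")
print(gain_released, gain_expert)
```

If the gain measured against expert labels fell to zero or below while the gain against released labels stayed positive, that would point to label artifacts rather than real vocal signal.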

Figures

Figures reproduced from arXiv: 2605.17370 by Adam G. Dunn, Amit Saha, Anastasia Serafimovska, Anastasia Suraev, Jinman Kim, Ping-hsiu Lin, Qixuan Hu, Shuchang Ye, Sydney Su, Usman Naseem, Xumou Zhang.

**Figure 1.** Pipeline from publicly available CBT sessions to model evaluation: (1) Patient turn extraction using speaker ...

**Figure 2.** Modality improvement landscape. The x-axis shows how much audio-only improves MAE over transcript ...

**Figure 3.** Case studies showing how input condition affects distress estimation. Each case shows the preceding ...

**Figure 4.** Label distribution across three independent GPT-audio-1.5 SSR runs and the final aggregated SSR label. The ...

**Figure 5.** Comparison between SSR-based labels and direct numeric prompting.

Abstract

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CBT-Audio gives a new spoken dataset and basic audio-vs-text comparisons for distress in CBT, but the partial label validation is the weakest link and makes the reported gains hard to trust.

The main takeaway is that this paper supplies a dataset of 1,802 patient turns from 96 public CBT recordings and runs a straightforward three-condition test on ten open audio language models. Adding audio to transcripts beats transcript-only input in eight of the ten cases, with significance in four, and the mismatch examples illustrate where vocal cues help.

That is a useful step because spoken CBT data has been scarce for privacy reasons, and the work directly targets the gap between what patients say and how they say it. The authors deserve credit for releasing the dataset and for keeping the evaluation simple enough to compare audio-only, text-only, and combined inputs across model families. The case studies on verbal-vocal divergence are the clearest part of the story and show why audio might matter in real sessions.

The soft spot is the label quality. The paper states that turn-level distress labels were validated only on an expert-annotated subset rather than the full set. If the unvalidated majority carries systematic noise or bias, then the apparent audio gains, especially in the mismatch cases, could be artifacts rather than real signal. The abstract also gives no details on the statistical tests, error bars, or exact train-test splits, so it is difficult to judge how stable the four significant improvements actually are.

This work is aimed at researchers building multimodal models for mental-health interaction or CBT support tools. A reader who needs a starting point for audio distress benchmarks will find the dataset and the basic comparison useful even if the claims need tightening. I would send it to peer review because the dataset itself is new and the question is relevant; the authors can address the validation coverage and reporting gaps in revision without starting over.

Referee Report

2 major / 1 minor

Summary. The paper introduces the CBT-Audio dataset of 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress intensity labels validated on an expert-annotated subset. It evaluates 10 open-source audio language models under three input conditions (audio only, transcript only, or both) and reports that adding audio to transcript input improves distress estimation over transcript alone in 8 of 10 model families, with significant gains in 4, and clearest benefits in case studies where verbal content and vocal delivery diverge.

Significance. If the results hold under rigorous validation, this work provides a valuable new benchmark for multimodal audio language models in mental health, addressing the scarcity of spoken CBT data and highlighting the clinical relevance of vocal cues beyond text. The empirical focus on open-source models and the public dataset release are strengths that enable reproducible follow-up research.

major comments (2)

Dataset description (abstract and methods): The distress labels are validated only on an expert-annotated subset rather than the full 1,802 turns. This is load-bearing for the central claim because systematic noise or bias in the unvalidated majority could produce the reported audio gains as an artifact, especially in the case studies where verbal content and vocal delivery diverge.
Results section (abstract claims): Performance gains are reported in 8 of 10 models with significance in 4, yet no details are given on the statistical tests performed, error bars, data splits, or full validation process. This undermines assessment of whether the audio+transcript improvements are reliable.
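One common way to attach an interval to such a gain is a paired bootstrap over turns, sketched below. This is an assumed procedure chosen for illustration; the source does not say which test the paper used. The per-turn absolute errors are hypothetical toy values.

```python
import random
from statistics import mean


def paired_bootstrap_ci(err_text, err_both, n_resamples=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for mean(err_text - err_both).

    Resamples turns with replacement, keeping each turn's two errors paired.
    """
    if len(err_text) != len(err_both) or not err_text:
        raise ValueError("error lists must be non-empty and equal in length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(err_text, err_both)]
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical per-turn absolute errors for one model family.
err_transcript = [2, 1, 2, 1, 3, 2]
err_audio_plus_transcript = [1, 0, 1, 1, 2, 1]

gain, ci_low, ci_high = paired_bootstrap_ci(err_transcript, err_audio_plus_transcript)
print(f"MAE gain = {gain:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Reporting an interval of this kind per model family would let a reader judge whether the four significant gains are stable, which is the gap this comment raises.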

minor comments (1)

Abstract: Specify the size of the expert-annotated subset and the exact validation procedure to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. Below we provide point-by-point responses to the major comments and describe the revisions we plan to implement.

Referee: Dataset description (abstract and methods): The distress labels are validated only on an expert-annotated subset rather than the full 1,802 turns. This is load-bearing for the central claim because systematic noise or bias in the unvalidated majority could produce the reported audio gains as an artifact, especially in the case studies where verbal content and vocal delivery diverge.

Authors: We acknowledge the referee's concern regarding the validation of distress labels. The current manuscript indicates that labels were validated on an expert-annotated subset. To address this, we will revise the Methods section to provide a more detailed description of how the labels were obtained for the full 1,802 turns and how the subset was selected for expert validation. Furthermore, we will include additional experiments reporting model performance exclusively on the validated subset to demonstrate that the observed audio gains persist in this more rigorously labeled portion of the data. We will also add a discussion of potential label noise as a limitation.

Revision: yes

Referee: Results section (abstract claims): Performance gains are reported in 8 of 10 models with significance in 4, yet no details are given on the statistical tests performed, error bars, data splits, or full validation process. This undermines assessment of whether the audio+transcript improvements are reliable.

Authors: We agree that additional details on the statistical analysis are necessary for a complete assessment. In the revised version, we will add comprehensive information on the statistical tests performed to determine significance, include error bars in all relevant figures and tables, specify the data splitting strategy used for evaluation, and elaborate on the full validation process. These additions will allow readers to better evaluate the reliability of the reported improvements when adding audio input.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical dataset introduction and model evaluation

The paper presents CBT-Audio as a new dataset of 1,802 patient turns from CBT recordings, with turn-level distress labels, and reports direct empirical comparisons of 10 audio language models under audio-only, transcript-only, and combined inputs. No derivation chain, equations, fitted parameters, or predictions are claimed. Results (audio+transcript improves over transcript in 8/10 families) are measured against the introduced labels and do not reduce to self-referential quantities or depend on load-bearing self-citations. The evaluation uses externally developed open-source models and newly introduced data, which satisfies the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical dataset and evaluation paper with no mathematical derivations or new physical entities; relies on standard assumptions about label quality and data representativeness.

axioms (1)

Domain assumption: The expert-annotated subset provides reliable validation for turn-level distress labels across the full dataset. The paper depends on this to claim label quality without full expert review.

pith-pipeline@v0.9.0 · 5849 in / 1230 out tokens · 69697 ms · 2026-05-20T13:19:24.982987+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

[1]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Responsible design, integration, and use of generative ai in mental health.JMIR Mental Health, 12(1):e70439, 2025

Oren Asman, John Torous, and Amir Tal. Responsible design, integration, and use of generative ai in mental health.JMIR Mental Health, 12(1):e70439, 2025

work page 2025
[3]

Whisperx: Time-accurate speech transcription of long-form audio.INTERSPEECH 2023, 2023

Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. Whisperx: Time-accurate speech transcription of long-form audio.INTERSPEECH 2023, 2023

work page 2023
[4]

Trauma, mental health workforce shortages, and health equity: A crisis in public health.International Journal of Environmental Research and Public Health, 22(4):620, 2025

Suha Ballout. Trauma, mental health workforce shortages, and health equity: A crisis in public health.International Journal of Environmental Research and Public Health, 22(4):620, 2025

work page 2025
[5]

Suhas Bn, Dominik Mattioli, Andrew M Sherrill, Rosa I Arriaga, Christopher Wiese, and Saeed Abdullah. How real are synthetic therapy conversations? evaluating fidelity in prolonged exposure dialogues.Findings of the Association for Computational Linguistics: EMNLP, 2025:20986–20995, 2025

work page 2025
[6]

Sherrill, Rosa I

Suhas BN, Andrew M. Sherrill, Rosa I. Arriaga, Christopher Wiese, and Saeed Abdullah. Thousand voices of trauma: A large-scale synthetic dataset for modeling prolonged exposure therapy conversations. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URLhttps://openreview.net/forum?id=qrFvHgZa7l

work page 2025
[7]

pyannote

Hervé Bredin. pyannote. audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In24th Interspeech Conference (INTERSPEECH 2023), pages 1983–1987. ISCA, 2023

work page 2023
[8]

it’s not only attention we need

Andreas Bucher, Sarah Egger, Inna Vashkite, Wenyuan Wu, and Gerhard Schwabe. “it’s not only attention we need”: Systematic review of large language models in mental health care.JMIR mental health, 12:e78410, 2025

work page 2025
[9]

Iemocap: Interactive emotional dyadic motion capture database

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335–359, 2008

work page 2008
[10]

The empirical status of cognitive- behavioral therapy: A review of meta-analyses.Clinical psychology review, 26(1):17–31, 2006

Andrew C Butler, Jason E Chapman, Evan M Forman, and Aaron T Beck. The empirical status of cognitive- behavioral therapy: A review of meta-analyses.Clinical psychology review, 26(1):17–31, 2006

work page 2006
[11]

V oice acoustical measure- ment of the severity of major depression.Brain and cognition, 56(1):30–35, 2004

Michael Cannizzaro, Brian Harel, Nicole Reilly, Phillip Chappell, and Peter J Snyder. V oice acoustical measure- ment of the severity of major depression.Brain and cognition, 56(1):30–35, 2004. 9

work page 2004
[12]

Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024

work page 2024
[13]

John R Crawford and Julie D Henry. The positive and negative affect schedule (panas): Construct validity, measurement properties and normative data in a large non-clinical sample.British journal of clinical psychology, 43(3):245–265, 2004

work page 2004
[14]

Kimi-Audio Technical Report

Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report.arXiv preprint arXiv:2504.18425, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Ultravox: A fast multimodal llm for real-time voice

AI Fixie. Ultravox: A fast multimodal llm for real-time voice. https://huggingface.co/fixie-ai/ ultravox-v0_5-llama-3_1-8b , 2024. Official project website: https://ultravox.ai. Model evaluated: ultravox-v0_5-llama-3_1-8b

work page 2024
[16]

Nonverbal communication in psychotherapy.Psychiatry (Edgmont), 7(6): 38, 2010

Gretchen N Foley and Julie P Gentile. Nonverbal communication in psychotherapy.Psychiatry (Edgmont), 7(6): 38, 2010

work page 2010
[17]

Using psychological artificial intelligence (tess) to relieve symptoms of depression and anxiety: randomized controlled trial.JMIR mental health, 5(4):e9782, 2018

Russell Fulmer, Angela Joerin, Breanna Gentile, Lysanne Lakerink, and Michiel Rauws. Using psychological artificial intelligence (tess) to relieve symptoms of depression and anxiety: randomized controlled trial.JMIR mental health, 5(4):e9782, 2018

work page 2018
[18]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Gemma 4 e4b instruct (gemma-4-e4b-it)

Google DeepMind. Gemma 4 e4b instruct (gemma-4-e4b-it). https://huggingface.co/google/ gemma-4-E4B-it , 2026. Official release announcement on Hugging Face: https://huggingface.co/blog/ gemma4. Model evaluated: gemma-4-E4B-it

work page 2026
[20]

Exploring physicians’ verbal and nonverbal responses to cues/concerns: Learning from incongruent communication.Patient education and counseling, 100(11):1979–1989, 2017

Rita Gorawara-Bhat, Linda Hafskjold, Paul Gulbrandsen, and Hilde Eide. Exploring physicians’ verbal and nonverbal responses to cues/concerns: Learning from incongruent communication.Patient education and counseling, 100(11):1979–1989, 2017

work page 1979
[21]

The distress analysis interview corpus of human and computer interviews

Jonathan Gratch, Ron Artstein, Gale M Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, et al. The distress analysis interview corpus of human and computer interviews. InLrec, volume 14, pages 3123–3128. Reykjavik, 2014

work page 2014
[22]

Can large language models replace therapists? evaluating performance at simple cognitive behavioral therapy tasks.JMIR AI, 3(1):e52500, 2024

Nathan Hodson and Simon Williamson. Can large language models replace therapists? evaluating performance at simple cognitive behavioral therapy tasks.JMIR AI, 3(1):e52500, 2024

work page 2024
[23]

The efficacy of cognitive behavioral therapy: A review of meta-analyses.Cognitive therapy and research, 36(5):427–440, 2012

Stefan G Hofmann, Anu Asnaani, Imke JJ V onk, Alice T Sawyer, and Angela Fang. The efficacy of cognitive behavioral therapy: A review of meta-analyses.Cognitive therapy and research, 36(5):427–440, 2012

work page 2012
[24]

A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, 2025

Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, 2025

work page 2025
[25]

Speech emotion recognition in mental health: Systematic review of voice-based applications

Eric Jordan, Raphaël Terrisse, Valeria Lucarini, Motasem Alrahabi, Marie-Odile Krebs, Julien Desclés, and Christophe Lemey. Speech emotion recognition in mental health: Systematic review of voice-based applications. JMIR mental health, 12(1):e74260, 2025

work page 2025
[26]

Mitchel Kappen, Gert Vanhollebeke, Jonas Van Der Donckt, Sofie Van Hoecke, and Marie-Anne Vanderhasselt. Acoustic and prosodic speech features reflect physiological stress but not isolated negative affect: a multi-paradigm study on psychosocial stressors.Scientific Reports, 14(1):5515, 2024

work page 2024
[27]

Cactus: Towards psychological counseling conversations using cognitive behavioral theory

Suyeon Lee, Sunghwan Mac Kim, Minju Kim, Dongjin Kang, Dongil Yang, Harim Kim, Minseok Kang, Dayi Jung, Min Hee Kim, Seungbeen Lee, et al. Cactus: Towards psychological counseling conversations using cognitive behavioral theory. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 14245–14274, 2024

work page 2024
[28]

V oxtral.arXiv preprint arXiv:2507.13264, 2025

Alexander H Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, et al. V oxtral.arXiv preprint arXiv:2507.13264, 2025

work page arXiv 2025
[29]

Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5): e0196391, 2018

work page 2018
[30]

Llms reproduce human purchase intent via semantic similarity elicitation of likert ratings.arXiv preprint arXiv:2510.08338, 2025

Benjamin F Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C Luhmann, Robbie Dow, Kli Pappas, and Thomas V Wiecki. Llms reproduce human purchase intent via semantic similarity elicitation of likert ratings.arXiv preprint arXiv:2510.08338, 2025. 10

work page arXiv 2025
[31]

Opportunities and risks of large language models in psychiatry.NPP—Digital Psychiatry and Neuroscience, 2(1): 8, 2024

Nick Obradovich, Sahib S Khalsa, Waqas U Khan, Jina Suh, Roy H Perlis, Olusola Ajilore, and Martin P Paulus. Opportunities and risks of large language models in psychiatry.NPP—Digital Psychiatry and Neuroscience, 2(1): 8, 2024

work page 2024
[32]

Oxford University Press, 2003

Pierre Philippot, Robert S Feldman, and Erik J Coats.Nonverbal behavior in clinical settings. Oxford University Press, 2003

work page 2003
[33]

Powerset multi-class cross entropy loss for neural speaker diarization

Alexis Plaquet and Hervé Bredin. Powerset multi-class cross entropy loss for neural speaker diarization. In24th Interspeech Conference (INTERSPEECH 2023), pages 3222–3226. ISCA, 2023

work page 2023
[34]

Meld: A multimodal multi-party dataset for emotion recognition in conversations

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019

work page 2019
[35]

Smile: Single-turn to multi-turn inclu- sive language expansion via chatgpt for mental health support

Huachuan Qiu, Hongliang He, Shuai Zhang, Anqi Li, and Zhenzhong Lan. Smile: Single-turn to multi-turn inclu- sive language expansion via chatgpt for mental health support. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 615–636, 2024

work page 2024
[36]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

work page 2023
[37]

Beyond verbal behavior: An empirical analysis of speech rates in psychotherapy sessions.Frontiers in psychology, 9:978, 2018

Diego Rocco, Massimiliano Pastore, Alessandro Gennaro, Sergio Salvatore, Mauro Cozzolino, and Maristella Scorza. Beyond verbal behavior: An empirical analysis of speech rates in psychotherapy sessions.Frontiers in psychology, 9:978, 2018

work page 2018
[38]

Barriers to improvement of mental health services in low-income and middle-income countries.The Lancet, 370(9593):1164–1174, 2007

Benedetto Saraceno, Mark van Ommeren, Rajaie Batniji, Alex Cohen, Oye Gureje, John Mahoney, Devi Sridhar, and Chris Underhill. Barriers to improvement of mental health services in low-income and middle-income countries.The Lancet, 370(9593):1164–1174, 2007

work page 2007
[39]

Are large language models possible to conduct cognitive behavioral therapy? In2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 3695–3700

Hao Shen, Zihan Li, Minqiang Yang, Minghui Ni, Yongfeng Tao, Zhengyang Yu, Weihao Zheng, Chen Xu, and Bin Hu. Are large language models possible to conduct cognitive behavioral therapy? In2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 3695–3700. IEEE, 2024

work page 2024
[40]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Rate of speech and emotional-cognitive regulation in the psychotherapeutic process: a pilot study.Research in Psychotherapy: Psychopathology, Process and Outcome, 19(2), 2016

Marco Tonti and Omar CG Gelo. Rate of speech and emotional-cognitive regulation in the psychotherapeutic process: a pilot study.Research in Psychotherapy: Psychopathology, Process and Outcome, 19(2), 2016

work page 2016
[42]

Mental status examination

Rachel M V oss and Joe M Das. Mental status examination. InStatPearls [Internet]. StatPearls Publishing, 2024

work page 2024
[43]

A systematic review of large language models in mental health: Opportunities, challenges, and future directions.Electronics, 15(3):524, 2026

Evdokia V oultsiou and Lefteris Moussiades. A systematic review of large language models in mental health: Opportunities, challenges, and future directions.Electronics, 15(3):524, 2026

work page 2026
[44]

Feel the difference? a comparative analysis of emotional arcs in real and llm-generated cbt sessions.arXiv preprint arXiv:2508.20764, 2025

Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, and Honglei Guo. Feel the difference? a comparative analysis of emotional arcs in real and llm-generated cbt sessions.arXiv preprint arXiv:2508.20764, 2025

work page arXiv 2025
[45]

Development and validation of brief measures of positive and negative affect: the panas scales.Journal of personality and social psychology, 54(6):1063, 1988

David Watson, Lee Anna Clark, and Auke Tellegen. Development and validation of brief measures of positive and negative affect: the panas scales.Journal of personality and social psychology, 54(6):1063, 1988

work page 1988
[46]

Therapist emotional reactions and client resistance in cognitive behavioral therapy.Psychotherapy, 49(2):163, 2012

Henny A Westra, Adi Aviram, Laura Connors, Angela Kertes, and Mariyam Ahmed. Therapist emotional reactions and client resistance in cognitive behavioral therapy.Psychotherapy, 49(2):163, 2012

work page 2012
[47]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025. URL https://arxiv.org/abs/2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Cbt-bench: Evaluating large language models on assisting cognitive behavior therapy

Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C Chiu, Shaun M Eack, Fei Fang, William Yang Wang, and Zhiyu Chen. Cbt-bench: Evaluating large language models on assisting cognitive behavior therapy. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tec...

work page 2025
[51]

Yougen Zhou, Ningning Zhou, Qin Chen, Jie Zhou, Aimin Zhou, and Liang He. DiaCBT: A long-periodic dialogue corpus guided by cognitive conceptualization diagram for CBT-based psychological counseling. arXiv preprint arXiv:2509.02999, 2025.

A Source Recordings

Table 3 lists the 96 publicly available CBT educational recordings used to construct CBT-Audio. ...

[52]

TEXT context (previous conversation turns) - to understand the topic being discussed

[53]

The patient feels anxious and stressed, struggling to maintain composure

AUDIO clip (patient's current turn) - to assess their emotional state

Your task: Based on the AUDIO, describe how the patient FEELS emotionally. Use the audio (tone, pace, voice quality) as evidence, but describe their emotional state, not just how they sound. The context is provided only to help you understand the conversation topic. Examples of good d...

