
arxiv: 2605.22286 · v1 · pith:XAF777QMnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes

Pith reviewed 2026-05-22 07:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords depression tracking · counseling transcripts · PHQ-8 prediction · LLM clinical signals · longitudinal mental health · semantic embeddings · session regimes

The pith

EmoTrack predicts PHQ-8 depression scores from counseling transcripts by combining LLM clinical signals with frozen semantic embeddings for use in both single and repeated sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops EmoTrack to predict depression severity measured by PHQ-8 from text transcripts of counseling conversations. It targets the practical problem of making accurate predictions whether only one session is available or multiple sessions provide ongoing context. The method pulls clinical signals from a large language model, pairs them with fixed turn-level semantic embeddings, and trains separate predictors for individual symptoms on the combined representation. When earlier sessions exist, a compact memory mechanism folds in that history without rebuilding the full model. The authors also release LongCounsel, a new multi-session dataset, to test tracking when symptom information is only partially revealed across visits.

Core claim

EmoTrack predicts PHQ-8 scores by extracting clinical signals with an LLM and combining them with frozen turn-level semantic embeddings to form a transcript representation, then training symptom-specific predictors on it; when prior sessions are available, they can be incorporated through compact cross-session memory. This yields a 13.5% relative reduction in mean absolute error (MAE) over the strongest baseline on the DAIC-WOZ single-session benchmark, while remaining competitive with the strongest longitudinal baseline on the new LongCounsel dataset.
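
To make the headline number concrete, the sketch below shows how a relative MAE reduction is computed. The two MAE values are hypothetical placeholders chosen only so that the arithmetic lands on 13.5%; they are not figures reported in the paper, and the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted PHQ-8 totals."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mae_reduction(baseline_mae, model_mae):
    """Fraction by which the model lowers MAE relative to the baseline."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - model_mae) / baseline_mae


# Hypothetical MAE values (assumptions, not results from the paper).
baseline_mae = 4.00
model_mae = 3.46
reduction = relative_mae_reduction(baseline_mae, model_mae)
print(f"relative MAE reduction: {reduction:.1%}")
```

A relative reduction depends on the baseline it is measured against, so the same absolute improvement would read as a smaller percentage against a weaker baseline.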

What carries the argument

LLM-extracted clinical signals combined with frozen turn-level semantic embeddings to build a transcript representation, followed by symptom-specific predictors and optional compact cross-session memory for longitudinal use.
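
A minimal sketch of how such a pipeline could be wired together, under loudly stated assumptions: mean pooling of turn embeddings, plain concatenation with the signal vector, and linear per-item heads are all choices made here for illustration, not details confirmed by this page. The names `build_representation` and `SymptomHead` are hypothetical. The only fixed facts used are that PHQ-8 has eight items, each scored 0 to 3.

```python
NUM_PHQ8_ITEMS = 8  # PHQ-8 has eight items, each scored 0-3 (total 0-24)


def mean_pool(turn_embeddings):
    """Average frozen turn-level embeddings into one vector (assumed pooling)."""
    n = len(turn_embeddings)
    dim = len(turn_embeddings[0])
    return [sum(e[d] for e in turn_embeddings) / n for d in range(dim)]


def build_representation(clinical_signals, turn_embeddings):
    """Concatenate LLM-extracted signals with pooled embeddings (assumed fusion)."""
    return list(clinical_signals) + mean_pool(turn_embeddings)


class SymptomHead:
    """One lightweight linear predictor per PHQ-8 item (hypothetical form)."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias

    def predict(self, features):
        if len(features) != len(self.weights):
            raise ValueError("feature and weight dimensions differ")
        raw = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return min(3.0, max(0.0, raw))  # clamp to the valid item range


def predict_phq8(features, heads):
    """Return per-item scores and their sum, the PHQ-8 total."""
    item_scores = [head.predict(features) for head in heads]
    return item_scores, sum(item_scores)


# Toy inputs: two signal values and two 2-d turn embeddings (all made up).
signals = [1.0, 0.0]
turns = [[0.25, 0.5], [0.75, 0.0]]
features = build_representation(signals, turns)
heads = [SymptomHead([0.5, 0.0, 1.0, 0.0], bias=0.25) for _ in range(NUM_PHQ8_ITEMS)]
item_scores, total = predict_phq8(features, heads)
```

In a real system the head weights would be learned from labeled transcripts while the embedding model stays frozen; here they are fixed by hand so the data flow is easy to follow.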

If this is right

  • Higher single-session accuracy would let automated systems flag high-severity cases for immediate human review in text-based counseling.
  • Frozen embeddings and pre-extracted signals reduce the need for large labeled datasets and limit overfitting when data is scarce.
  • Compact cross-session memory supports tracking symptom changes over time without full model retraining on each new visit.
  • The LongCounsel dataset provides a benchmark for testing repeated-session prediction under partial symptom disclosure.
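
The compact cross-session memory mentioned above is not specified on this page, so the following is only one plausible shape for it: a fixed-size vector updated by an exponential moving average over per-session summaries. The class name, the `decay` parameter, and the zero-padding behavior for first sessions are all assumptions made for illustration.

```python
class CrossSessionMemory:
    """Fixed-size running summary of earlier sessions (hypothetical design)."""

    def __init__(self, size, decay=0.5):
        if not 0.0 <= decay <= 1.0:
            raise ValueError("decay must lie in [0, 1]")
        self.size = size
        self.decay = decay
        self.state = [0.0] * size
        self.sessions_seen = 0

    def update(self, session_summary):
        """Fold one finished session into the memory without retraining."""
        if len(session_summary) != self.size:
            raise ValueError("summary has the wrong dimension")
        if self.sessions_seen == 0:
            self.state = list(session_summary)
        else:
            self.state = [
                self.decay * m + (1.0 - self.decay) * s
                for m, s in zip(self.state, session_summary)
            ]
        self.sessions_seen += 1

    def augment(self, current_features):
        """Append the memory to the current-session features.

        With no prior sessions the memory is all zeros, so the output
        dimension is the same in single- and multi-session regimes.
        """
        return list(current_features) + list(self.state)


memory = CrossSessionMemory(size=2, decay=0.5)
first_visit = memory.augment([1.0, 2.0])   # no history yet
memory.update([2.0, 0.0])                  # summary of session 1
memory.update([0.0, 4.0])                  # summary of session 2
later_visit = memory.augment([1.0, 2.0])   # history now available
```

Keeping the memory at a constant size means the cost of a prediction does not grow with the number of earlier visits, which is the practical appeal of a compact design.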

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same signal-plus-embedding structure could be adapted to predict other clinical scales if matching LLM prompts are written for them.
  • Integration into live counseling platforms could help route therapist attention toward sessions with rising risk scores.
  • Direct tests on transcripts from varied cultural or linguistic settings would be required to confirm that the LLM signals stay consistent outside the original training domains.

Load-bearing premise

The clinical signals pulled from the LLM remain reliable and unbiased no matter the counseling style, language, or amount of symptom disclosure in the transcripts.

What would settle it

A clear drop in accuracy on transcripts from a different language or counseling style, where the LLM signals turn out inconsistent or biased, would show the method does not deliver the claimed robustness.
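
Such a test amounts to comparing error across subgroups of transcripts. A minimal sketch, assuming each evaluation record carries a group label; the group names and scores below are invented for illustration and do not come from the paper.

```python
from collections import defaultdict


def mae_by_group(records):
    """records: iterable of (group, true_score, predicted_score) tuples."""
    errors = defaultdict(list)
    for group, y_true, y_pred in records:
        errors[group].append(abs(y_true - y_pred))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


# Made-up PHQ-8 totals for an in-domain set and a shifted set.
records = [
    ("in_domain", 10, 9),
    ("in_domain", 4, 6),
    ("shifted", 12, 6),
    ("shifted", 3, 9),
]
per_group = mae_by_group(records)
gap = per_group["shifted"] - per_group["in_domain"]
print(per_group, gap)
```

A large positive gap on the shifted group would count as evidence against the robustness claim; a small gap would not prove robustness but would fail to refute it.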

Figures

Figures reproduced from arXiv: 2605.22286 by Bingsheng He, Jiayi Li, Zhaomin Wu.

Figure 1: Overview of EmoTrack model. Given the client-side current-session transcript, the model …
Figure 2: Cross-benchmark comparison and ablations. The left panel summarizes cross-benchmark …
Figure 3: MAE of EmoTrack against baselines on LONGCOUNSEL-8. … the full model, while the AIDA-only variant performs substantially worse, indicating that turn-level semantic representations are especially important for capturing symptom expressions in multi-session counseling dialogue. In contrast, DAIC-WOZ is more sensitive to the removal of AIDA features: the embedding-only variant exhibits a larger performance drop…
Figure 4: Runtime-matched self-report fidelity audit for behavior-cue generation. Pruning improves …
Figure 5: RealCBT similarity audit for generated conversations.

Original abstract

Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions requiring timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly under data scarcity, while prompt-based LLM methods are data-efficient but usually treat each transcript holistically and provide limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors over the resulting transcript representation. When prior sessions are available, EmoTrack can further incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and remains competitive with the strongest longitudinal baseline on LongCounsel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EmoTrack, a PHQ-8 prediction framework that extracts clinical signals via LLM, combines them with frozen turn-level semantic embeddings, and trains symptom-specific predictors; optional compact cross-session memory handles longitudinal context. A new multi-session counseling dataset LongCounsel is presented for evaluating tracking under partial disclosure. Experiments report a 13.5% relative MAE reduction versus the strongest DAIC-WOZ baseline on single-session data and competitive results against longitudinal baselines on LongCounsel.

Significance. If the gains survive proper controls, the work supplies a practically relevant dataset and a hybrid architecture that improves data efficiency while supporting cross-regime robustness. The emphasis on symptom-specific predictors and memory for continuity is a constructive direction for mental-health transcript modeling. The empirical focus on real single- and multi-session regimes adds value beyond purely prompt-based or fully fine-tuned baselines.

major comments (2)
  1. [§3.2] LLM clinical signal extraction: No inter-annotator agreement, prompt-sensitivity analysis, or correlation with gold PHQ-8 items on the target counseling transcripts is reported. Because these signals are the direct input to the symptom-specific predictors, the absence of validation leaves open the possibility that the reported 13.5% MAE gain is an artifact of the particular LLM and prompt rather than the EmoTrack architecture.
  2. [Experiments section and Table 2] The central single-session result lacks accompanying ablation tables that isolate the contribution of the LLM signals versus the embedding + predictor components, and no error analysis or leakage controls between DAIC-WOZ and LongCounsel regimes are shown. This makes it impossible to confirm that the improvement survives standard hyperparameter and data-partition checks.
minor comments (2)
  1. [§3.3] Notation for the cross-session memory update rule is introduced without an explicit equation or pseudocode listing the size and update parameters.
  2. [Figure 1] Figure 1 would benefit from an additional panel showing an example transcript with the extracted clinical signals highlighted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive and detailed review. The comments identify important opportunities to strengthen the validation and experimental rigor of EmoTrack. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] LLM clinical signal extraction: No inter-annotator agreement, prompt-sensitivity analysis, or correlation with gold PHQ-8 items on the target counseling transcripts is reported. Because these signals are the direct input to the symptom-specific predictors, the absence of validation leaves open the possibility that the reported 13.5% MAE gain is an artifact of the particular LLM and prompt rather than the EmoTrack architecture.

    Authors: We thank the referee for this observation. Inter-annotator agreement does not apply here, as the clinical signals are produced by a deterministic LLM prompt rather than multiple human annotators. We agree, however, that explicit checks on prompt robustness and alignment with PHQ-8 items would reduce the concern that gains are prompt-specific artifacts. In the revised manuscript we will add a short prompt-sensitivity subsection to §3.2 that reports MAE variance across three alternative prompt phrasings on the DAIC-WOZ test set. We will also report Pearson correlations between each extracted symptom signal and its corresponding PHQ-8 item on the same held-out data, thereby providing direct evidence that the signals carry clinically relevant information. revision: yes

  2. Referee: [Experiments section and Table 2] The central single-session result lacks accompanying ablation tables that isolate the contribution of the LLM signals versus the embedding + predictor components, and no error analysis or leakage controls between DAIC-WOZ and LongCounsel regimes are shown. This makes it impossible to confirm that the improvement survives standard hyperparameter and data-partition checks.

    Authors: We accept that the current experimental presentation would be clearer with additional controls. While Table 2 already compares the complete EmoTrack system against strong baselines, we will insert a dedicated ablation table that isolates the LLM-signal component by comparing the full model against an otherwise identical architecture that uses only the frozen turn-level embeddings and symptom-specific predictors. We will also add a concise error-analysis paragraph discussing representative over- and under-prediction cases. Finally, we will expand the data-partition description to explicitly state that DAIC-WOZ and LongCounsel contain disjoint participants and sessions, document the exact train/validation/test splits, and confirm that hyperparameter selection was performed solely on the validation folds of each dataset independently, thereby ruling out cross-regime leakage. revision: yes
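
Two of the checks promised in these responses are simple enough to sketch: a Pearson correlation between an extracted symptom signal and its PHQ-8 item, and a participant-overlap check across data splits. The helper names `pearson_r` and `assert_disjoint` are hypothetical, and the scores and participant IDs below are invented toy values, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def assert_disjoint(splits):
    """Raise if any participant ID appears in more than one split."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = set(splits[first]) & set(splits[second])
            if overlap:
                raise ValueError(
                    f"{first} and {second} share participants: {sorted(overlap)}"
                )


# Toy example: an LLM-extracted signal versus the matching PHQ-8 item (0-3).
extracted_signal = [0, 1, 2, 3]
gold_item_score = [0, 1, 2, 3]
r = pearson_r(extracted_signal, gold_item_score)

# Toy example: participant IDs per split; passes silently when disjoint.
assert_disjoint({"train": ["p1", "p2"], "val": ["p3"], "test": ["p4"]})
```

Checking overlap at the participant level rather than the session level matters for multi-session data, since sessions from one person in both train and test would let a model benefit from memorized individual traits.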

Circularity Check

0 steps flagged

No circularity: empirical performance on held-out benchmarks

Full rationale

The paper introduces a new dataset (LongCounsel) and an EmoTrack framework that extracts clinical signals via LLM, combines them with embeddings, and trains symptom-specific predictors. All central claims are empirical MAE numbers on held-out test sets (DAIC-WOZ single-session benchmark and LongCounsel). No mathematical derivation, fitted parameter renamed as prediction, or self-citation chain is present in the provided text. The 13.5% relative MAE reduction is a measured outcome on external data, not a quantity forced by construction from the model's inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard supervised learning assumptions plus the domain assumption that LLM-generated clinical signals are sufficiently accurate proxies for depression symptoms. No new physical entities or ad-hoc constants are introduced.

free parameters (2)
  • symptom-specific predictor weights
    Learned during training on the target dataset; central to the reported performance.
  • cross-session memory size and update rule
    Chosen to balance context retention with computational cost.
axioms (2)
  • domain assumption: LLM-extracted clinical signals are faithful to the underlying depression symptoms
    Invoked when the framework treats LLM outputs as reliable input features without further calibration.
  • domain assumption: Frozen turn-level embeddings preserve clinically relevant semantics
    Required for the claim that only lightweight predictors need training.

pith-pipeline@v0.9.0 · 5751 in / 1511 out tokens · 45014 ms · 2026-05-22T07:50:09.308954+00:00 · methodology

