EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes
Pith reviewed 2026-05-22 07:50 UTC · model grok-4.3
The pith
EmoTrack predicts PHQ-8 depression scores from counseling transcripts by combining LLM clinical signals with frozen semantic embeddings for use in both single and repeated sessions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmoTrack predicts PHQ-8 scores by extracting clinical signals with an LLM, combining them with frozen turn-level semantic embeddings to form a transcript representation, and training symptom-specific predictors on that representation; when prior sessions are available, they can be incorporated through compact cross-session memory. The paper reports a 13.5% relative reduction in mean absolute error (MAE) over the strongest baseline on the DAIC-WOZ single-session benchmark, while remaining competitive with the strongest longitudinal baseline on the new LongCounsel dataset.
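As a reading aid, the arithmetic behind a relative MAE reduction can be sketched as follows. The two MAE values are hypothetical numbers chosen only so that the ratio comes out at 13.5%; the source text does not state the underlying baseline or model MAE, and the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between gold and predicted PHQ-8 totals."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mae_reduction(baseline_mae, model_mae):
    """Fraction by which model_mae improves on baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - model_mae) / baseline_mae


# Hypothetical values, NOT taken from the paper: they only illustrate
# how a 13.5% relative reduction is computed.
baseline_mae = 4.00
model_mae = 3.46
reduction = relative_mae_reduction(baseline_mae, model_mae)
print(f"relative MAE reduction: {reduction:.1%}")
```

Because the figure is relative, the same 13.5% can correspond to very different absolute gains depending on the baseline error, which is worth keeping in mind when comparing across benchmarks.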
What carries the argument
LLM-extracted clinical signals are combined with frozen turn-level semantic embeddings to build a transcript representation, which feeds symptom-specific predictors; an optional compact cross-session memory supports longitudinal use.
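The pipeline described above can be pictured with a minimal sketch. Several choices here are assumptions made for illustration and are not confirmed by the source: mean pooling of the turn embeddings, simple concatenation, and one linear head per symptom. The only fixed facts used are that the PHQ-8 has eight items, each scored 0 to 3, for a total between 0 and 24.

```python
PHQ8_ITEMS = 8   # the PHQ-8 has eight symptom items
ITEM_MAX = 3     # each item is scored from 0 to 3


def mean_pool(turn_embeddings):
    """Average frozen turn-level embeddings into one vector (assumed pooling)."""
    n = len(turn_embeddings)
    dim = len(turn_embeddings[0])
    return [sum(e[i] for e in turn_embeddings) / n for i in range(dim)]


def build_representation(clinical_signals, turn_embeddings):
    """Concatenate LLM-derived signals with the pooled embedding (assumed fusion)."""
    return list(clinical_signals) + mean_pool(turn_embeddings)


def predict_items(representation, weights, biases):
    """One hypothetical linear head per symptom, clipped to the 0..3 item range."""
    scores = []
    for w, b in zip(weights, biases):
        raw = sum(wi * xi for wi, xi in zip(w, representation)) + b
        scores.append(min(max(raw, 0.0), float(ITEM_MAX)))
    return scores


# Toy inputs: two made-up signals and three turns of 2-d embeddings.
signals = [1.0, 0.0]
turns = [[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]]
rep = build_representation(signals, turns)

# Made-up weights; in practice only these heads would be trained.
weights = [[0.5, 0.0, 1.0, 0.0] for _ in range(PHQ8_ITEMS)]
biases = [0.0] * PHQ8_ITEMS
items = predict_items(rep, weights, biases)
total = sum(items)  # predicted PHQ-8 total, bounded by 0 and 24
```

Keeping the embedding model frozen means the only trainable parameters in this sketch are the per-symptom heads, which is one way to read the paper's data-efficiency argument.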
If this is right
- Higher single-session accuracy would let automated systems flag high-severity cases for immediate human review in text-based counseling.
- Frozen embeddings and pre-extracted signals reduce the need for large labeled datasets and limit overfitting when data is scarce.
- Compact cross-session memory supports tracking symptom changes over time without full model retraining on each new visit.
- The LongCounsel dataset provides a benchmark for testing repeated-session prediction under partial symptom disclosure.
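To make the idea of a compact cross-session memory concrete, here is one possible form it could take: a fixed-size vector updated by an exponential moving average after each session. This is purely an assumption for illustration; the source does not specify the memory's size or update rule, and `update_memory` and `decay` are hypothetical names.

```python
def update_memory(memory, session_summary, decay=0.7):
    """Blend the previous memory with the newest session summary.

    memory is None before the first session. decay is a hypothetical
    hyperparameter: higher values retain more of the earlier sessions.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if memory is None:
        return list(session_summary)
    if len(memory) != len(session_summary):
        raise ValueError("memory and summary must have the same size")
    return [decay * m + (1.0 - decay) * s
            for m, s in zip(memory, session_summary)]


# Three made-up session summaries for one client.
memory = None
for summary in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    memory = update_memory(memory, summary, decay=0.5)
```

The memory stays the same size no matter how many sessions are seen, which is what would allow a new visit to be scored without retraining the predictors.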
Where Pith is reading between the lines
- The same signal-plus-embedding structure could be adapted to predict other clinical scales if matching LLM prompts are written for them.
- Integration into live counseling platforms could help route therapist attention toward sessions with rising risk scores.
- Direct tests on transcripts from varied cultural or linguistic settings would be required to confirm that the LLM signals stay consistent outside the original training domains.
Load-bearing premise
The clinical signals pulled from the LLM remain reliable and unbiased no matter the counseling style, language, or amount of symptom disclosure in the transcripts.
What would settle it
A clear drop in accuracy on transcripts from a different language or counseling style, where the LLM signals turn out inconsistent or biased, would show the method does not deliver the claimed robustness.
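Such a test could be run as a per-group error comparison. The sketch below assumes each evaluated transcript carries a group label (for example, language or counseling style); the records and group names are invented for illustration.

```python
from collections import defaultdict


def mae_by_group(records):
    """records: iterable of (group, gold_score, predicted_score) tuples."""
    errors = defaultdict(list)
    for group, gold, pred in records:
        errors[group].append(abs(gold - pred))
    return {g: sum(v) / len(v) for g, v in errors.items()}


# Made-up evaluation records; "shifted" stands in for a new language or style.
records = [
    ("in_domain", 10, 8), ("in_domain", 4, 5),
    ("shifted", 12, 6), ("shifted", 3, 9),
]
per_group = mae_by_group(records)
gap = per_group["shifted"] - per_group["in_domain"]
print(per_group, gap)
```

A large positive gap on real data would count as evidence against the robustness claim; a small one would be consistent with it but would not establish it on its own.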
Original abstract
Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions requiring timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly under data scarcity, while prompt-based LLM methods are data-efficient but usually treat each transcript holistically and provide limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors over the resulting transcript representation. When prior sessions are available, EmoTrack can further incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and remains competitive with the strongest longitudinal baseline on LongCounsel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EmoTrack, a PHQ-8 prediction framework that extracts clinical signals with an LLM, combines them with frozen turn-level semantic embeddings, and trains symptom-specific predictors; an optional compact cross-session memory handles longitudinal context. A new multi-session counseling dataset, LongCounsel, is presented for evaluating tracking under partial symptom disclosure. Experiments report a 13.5% relative MAE reduction versus the strongest DAIC-WOZ baseline on single-session data and competitive results against longitudinal baselines on LongCounsel.
Significance. If the gains survive proper controls, the work supplies a practically relevant dataset and a hybrid architecture that improves data efficiency while supporting cross-regime robustness. The emphasis on symptom-specific predictors and memory for continuity is a constructive direction for mental-health transcript modeling. The empirical focus on real single- and multi-session regimes adds value beyond purely prompt-based or fully fine-tuned baselines.
major comments (2)
- [§3.2, LLM clinical signal extraction] No inter-annotator agreement, prompt-sensitivity analysis, or correlation with gold PHQ-8 items on the target counseling transcripts is reported. Because these signals are the direct input to the symptom-specific predictors, the absence of validation leaves open the possibility that the reported 13.5% MAE gain is an artifact of the particular LLM and prompt rather than the EmoTrack architecture.
- [Experiments section and Table 2] The central single-session result lacks accompanying ablation tables that isolate the contribution of the LLM signals versus the embedding + predictor components, and no error analysis or leakage controls between DAIC-WOZ and LongCounsel regimes are shown. This makes it impossible to confirm that the improvement survives standard hyperparameter and data-partition checks.
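The signal validation asked for in the first major comment could look roughly like the sketch below: a Pearson correlation between one extracted signal and its gold PHQ-8 item, plus the variance of MAE across prompt phrasings. All numbers are invented, and `pearson_r` is a hand-written helper rather than anything from the paper.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Made-up values: one extracted symptom signal vs. its gold PHQ-8 item (0..3).
signal = [0, 1, 2, 3, 1]
gold_item = [0, 1, 3, 3, 2]
r = pearson_r(signal, gold_item)

# Made-up MAE values from three alternative prompt phrasings.
prompt_maes = [3.4, 3.6, 3.5]
mae_variance = statistics.variance(prompt_maes)  # sample variance
print(f"r = {r:.3f}, MAE variance across prompts = {mae_variance:.4f}")
```

A low correlation for a given item, or a large spread across prompts, would support the referee's concern that the gain depends on the particular LLM and prompt.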
minor comments (2)
- [§3.3] Notation for the cross-session memory update rule is introduced without an explicit equation or pseudocode listing the size and update parameters.
- [Figure 1] Figure 1 would benefit from an additional panel showing an example transcript with the extracted clinical signals highlighted.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive and detailed review. The comments identify important opportunities to strengthen the validation and experimental rigor of EmoTrack. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: [§3.2, LLM clinical signal extraction] No inter-annotator agreement, prompt-sensitivity analysis, or correlation with gold PHQ-8 items on the target counseling transcripts is reported. Because these signals are the direct input to the symptom-specific predictors, the absence of validation leaves open the possibility that the reported 13.5% MAE gain is an artifact of the particular LLM and prompt rather than the EmoTrack architecture.
  Authors: We thank the referee for this observation. Inter-annotator agreement does not apply here, as the clinical signals are produced by a deterministic LLM prompt rather than multiple human annotators. We agree, however, that explicit checks on prompt robustness and alignment with PHQ-8 items would reduce the concern that gains are prompt-specific artifacts. In the revised manuscript we will add a short prompt-sensitivity subsection to §3.2 that reports MAE variance across three alternative prompt phrasings on the DAIC-WOZ test set. We will also report Pearson correlations between each extracted symptom signal and its corresponding PHQ-8 item on the same held-out data, thereby providing direct evidence that the signals carry clinically relevant information.
  Revision: yes
- Referee: [Experiments section and Table 2] The central single-session result lacks accompanying ablation tables that isolate the contribution of the LLM signals versus the embedding + predictor components, and no error analysis or leakage controls between DAIC-WOZ and LongCounsel regimes are shown. This makes it impossible to confirm that the improvement survives standard hyperparameter and data-partition checks.
  Authors: We accept that the current experimental presentation would be clearer with additional controls. While Table 2 already compares the complete EmoTrack system against strong baselines, we will insert a dedicated ablation table that isolates the LLM-signal component by comparing the full model against an otherwise identical architecture that uses only the frozen turn-level embeddings and symptom-specific predictors. We will also add a concise error-analysis paragraph discussing representative over- and under-prediction cases. Finally, we will expand the data-partition description to explicitly state that DAIC-WOZ and LongCounsel contain disjoint participants and sessions, document the exact train/validation/test splits, and confirm that hyperparameter selection was performed solely on the validation folds of each dataset independently, thereby ruling out cross-regime leakage.
  Revision: yes
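The leakage control promised in the second response amounts to checking that no participant appears in more than one split. A minimal sketch of such a check follows, assuming each split is available as a list of participant IDs; the split names and IDs are invented.

```python
def find_overlaps(splits):
    """Return participant IDs shared between any two splits.

    splits: dict mapping a split name to an iterable of participant IDs.
    The result maps (split_a, split_b) to the set of shared IDs and is
    empty when all splits are pairwise disjoint.
    """
    names = list(splits)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a]) & set(splits[b])
            if shared:
                overlaps[(a, b)] = shared
    return overlaps


# Made-up IDs; "p02" is deliberately leaked into the test split.
splits = {
    "train": ["p01", "p02"],
    "validation": ["p03"],
    "test": ["p04", "p02"],
}
overlaps = find_overlaps(splits)
print(overlaps)
```

For multi-session data the check has to run on participant IDs rather than session IDs, since two sessions from the same client in different splits would still leak information.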
Circularity Check
No circularity: empirical performance on held-out benchmarks
The paper introduces a new dataset (LongCounsel) and an EmoTrack framework that extracts clinical signals via LLM, combines them with embeddings, and trains symptom-specific predictors. All central claims are empirical MAE numbers on held-out test sets (DAIC-WOZ single-session benchmark and LongCounsel). No mathematical derivation, fitted parameter renamed as prediction, or self-citation chain is present in the provided text. The 13.5% relative MAE reduction is a measured outcome on external data, not a quantity forced by construction from the model's inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- symptom-specific predictor weights
- cross-session memory size and update rule
axioms (2)
- domain assumption: LLM-extracted clinical signals are faithful to the underlying depression symptoms
- domain assumption: Frozen turn-level embeddings preserve clinically relevant semantics
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "EmoTrack ... combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors ... compact cross-session memory"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.