
arxiv: 2606.10796 · v1 · pith:JA3UI6SHnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Pith reviewed 2026-06-27 13:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords depression detection · training-free LLM · clinical interviews · chain-of-thought · token entropy · multi-factor prediction · DAIC-WOZ · E-DAIC

The pith

A training-free framework uses any frozen LLM to diagnose depression by breaking interviews into five clinical themes, scoring rationale reliability with token entropy, and combining weighted signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dep-LLM as a method that mirrors a clinical psychiatrist's reasoning on off-the-shelf LLMs to detect depression from lengthy interviews. It first decomposes each dialogue into five themes with evidence-based rationales via chain-of-thought, then modulates each signal by its token-level entropy to boost reliable parts and dampen uncertain ones, and finally aggregates the signals into a diagnosis. This setup targets two problems: sparse clues in long multi-topic interviews and the scarcity of labeled data caused by clinical privacy. The paper reports gains over zero-shot baselines on nearly all of 21 LLMs and over supervised models on DAIC-WOZ and E-DAIC, without any training.

Core claim

Dep-LLM is a three-stage training-free framework on frozen foundation LLMs: Chain-of-Thought Depression Multi-factor Analysis decomposes long dialogues into five clinically aligned themes and produces evidence-grounded rationales; Confidence Analysis and Modulation quantifies epistemic reliability via token-level entropy and applies intra-label and inter-theme modulation to amplify trustworthy signals; Collaborative Multi-factor Prediction dynamically integrates the weighted multi-factor signals into the final diagnosis, achieving higher accuracy, macro F1, and weighted F1 than zero-shot and supervised baselines on the two datasets.
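
The final integration stage can be pictured with a toy aggregation step. The sketch below is a minimal illustration, not the paper's formula: the placeholder theme labels, the score scale from -1 to +1, the confidence weights, and the `threshold` argument are all hypothetical, and a confidence-weighted mean stands in for whatever integration rule Dep-LLM actually uses.

```python
from dataclasses import dataclass


@dataclass
class ThemeSignal:
    """One theme's output: a depression score and a confidence weight."""
    theme: str         # placeholder label; the five themes are not listed on this page
    score: float       # hypothetical scale: -1 (not depressed) to +1 (depressed)
    confidence: float  # hypothetical weight in [0, 1], e.g. derived from entropy


def aggregate(signals, threshold=0.0):
    """Return (fused score, decision) from a confidence-weighted mean."""
    total_weight = sum(s.confidence for s in signals)
    if total_weight == 0:
        # No theme is trusted at all: fall back to an unweighted mean.
        fused = sum(s.score for s in signals) / len(signals)
    else:
        fused = sum(s.score * s.confidence for s in signals) / total_weight
    return fused, fused > threshold


# Invented values for illustration only.
signals = [
    ThemeSignal("theme_1", 1.0, 0.5),
    ThemeSignal("theme_2", -1.0, 0.25),
    ThemeSignal("theme_3", 1.0, 0.25),
    ThemeSignal("theme_4", -1.0, 0.0),
    ThemeSignal("theme_5", 1.0, 0.0),
]
fused, depressed = aggregate(signals)
print(fused, depressed)
```

Themes with zero confidence drop out of the weighted mean entirely, which is the "suppress uncertain signals" behaviour in its most extreme form; a softer scheme would only shrink their weight.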

What carries the argument

The Confidence Analysis and Modulation module, which derives epistemic reliability from token-level entropy of each rationale and performs intra-label and inter-theme modulation to strengthen trustworthy signals without training.
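
To make the entropy-to-reliability idea concrete, here is a minimal sketch assuming access to a full probability distribution for every generated token. The function names and the mapping "confidence = 1 minus mean normalised entropy" are illustrative assumptions, not the paper's modulation rule. Many inference APIs expose only the top-k log-probabilities, in which case the entropy computed this way would be an approximation.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rationale_confidence(token_dists):
    """Map a rationale's mean normalised token entropy to a score in [0, 1].

    Each token's entropy is divided by log(number of outcomes), its maximum
    possible value, so 1.0 means fully peaked and 0.0 means uniform.
    This mapping is an illustrative assumption.
    """
    if not token_dists or any(len(d) < 2 for d in token_dists):
        raise ValueError("need at least one distribution with 2+ outcomes")
    normalised = [token_entropy(d) / math.log(len(d)) for d in token_dists]
    return 1.0 - sum(normalised) / len(normalised)


# Invented distributions: one peaked rationale, one spread-out rationale.
confident = [[0.97, 0.01, 0.01, 0.01], [1.0, 0.0, 0.0, 0.0]]
hesitant = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
print(rationale_confidence(confident) > rationale_confidence(hesitant))
```

Note that this score measures how peaked the model's token choices are, not whether the rationale is clinically correct; that gap is exactly what the load-bearing premise below has to bridge.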

Load-bearing premise

Token-level entropy from the LLM output gives a reliable measure of how trustworthy each generated rationale is, so the modulation step can correctly boost good signals and suppress weak ones.

What would settle it

Run the full Dep-LLM pipeline against plain zero-shot prompting, and against the same pipeline with modulation disabled, on the same LLMs and DAIC-WOZ interviews; if accuracy and F1 scores show no consistent improvement over the unmodulated baseline, or fall below it, the entropy-based modulation adds no value.
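
A comparison of this kind rests on the three headline metrics. The sketch below computes accuracy, macro F1, and weighted F1 in plain Python; the helper names and the toy labels are invented for illustration and are not results from the paper.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true, y_pred):
    """Accuracy, macro F1 (plain class mean), weighted F1 (support-weighted)."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    f1 = {c: f1_per_class(y_true, y_pred, c) for c in labels}
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_f1": sum(f1.values()) / len(labels),
        "weighted_f1": sum(f1[c] * support[c] for c in labels) / len(y_true),
    }


# Toy, imbalanced labels (1 = depressed); purely illustrative.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_negative = [0, 0, 0, 0, 0, 0, 0, 0]
mixed = [1, 0, 0, 0, 0, 0, 0, 1]
print(summarize(y_true, always_negative))
print(summarize(y_true, mixed))
```

In this toy case both predictors reach the same accuracy, yet the one that never predicts the minority class scores a much lower macro F1. On imbalanced screening data, that is why the comparison should look at all three metrics rather than accuracy alone.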

Figures

Figures reproduced from arXiv: 2606.10796 by Buzhou Tang, Ronghuan Jiang, Xianbing Zhao, Yiqing Lyu.

Figure 1: Dep-LLM decomposes and analyzes dialogues via structural
Figure 2: Schematic illustration of the proposed Dep-LLM framework with three components. The CoT Depression Multi-factor Analysis applies CoT techniques
Figure 3: Possibility, rationale and confidence in multi-factor Analysis of two case example.
Figure 4: Quantitative impact on prediction of confidence and modulation
Figure 5: The impact that value shift of δ* exerts on WA*F1 score of different models. Recoverable plot labels: panels (a) Gemma3-12B-Instruct, (b) Llama4-17B-Instruct, (c) Qwen2.5-14B-Instruct; axes λ+ and λ−; WA*F1 scale from low to high.
Figure 6: The impact that value shift of λ+, λ− exerts on WA*F1 score of different models.

Since Dep-LLM is a training-free framework, the two decision hyperparameters introduced in Section III-D (depression threshold δ* and neutral label calibrators λ+, λ−) are not learned from data but configured by design. To verify the robustness of the default configuration δ* = 1, λ+ = λ− = 1 and to provide practical c…
original abstract

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) scarcity of labeled data due to clinical privacy, together with the high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce a Confidence Analysis and Modulation module that quantifies epistemic reliability from the token-level entropy of each rationale and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses the zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Dep-LLM, a training-free framework for automatic depression detection from clinical interviews on frozen LLMs. It decomposes dialogues via CoT into five clinically aligned themes for evidence-grounded rationales, applies token-level entropy to quantify epistemic reliability followed by intra-label and inter-theme modulation, and integrates weighted multi-factor signals for the final diagnosis. The authors report that, on DAIC-WOZ and E-DAIC, the method surpasses zero-shot baselines on nearly all of 21 foundation LLMs across 9 metrics (including accuracy, macro F1, and weighted F1) and further outperforms SOTA supervised domain-specific and commercial LLMs.

Significance. If the entropy-modulation mechanism is validated and the reported gains hold, the work would be significant for enabling privacy-preserving, training-free ADD systems that generalize across many LLMs without labeled data or fine-tuning. The structured mirroring of clinical psychiatrist reasoning and the emphasis on multi-factor evidence handling in long contexts are clear strengths.

major comments (1)
  1. [Confidence Analysis and Modulation module (as described in abstract and §3)] The headline performance claim (outperformance on 21 LLMs and SOTA models) is load-bearing on the Confidence Analysis and Modulation module correctly amplifying trustworthy signals. However, no ablation, correlation analysis between token entropy and rationale correctness on depression indicators, or human validation is supplied to show that lower entropy predicts higher accuracy; without this, the modulation step cannot be credited for the gains and the training-free contribution reduces to structured CoT alone.
minor comments (2)
  1. [Abstract] The abstract asserts quantitative superiority across 9 metrics but supplies no numeric values, tables, error bars, or statistical tests, which hinders immediate assessment of effect sizes.
  2. [Experimental setup] Reproducibility would benefit from explicit listing of the five clinical themes, exact prompt templates, and LLM inference hyperparameters (temperature, top-p) used in the experiments.
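
The correlation analysis requested in the major comment could be as simple as a point-biserial correlation, which is the Pearson correlation between per-rationale entropy and a 0/1 correctness flag. The sketch below is one possible form of that check under stated assumptions: the `pearson` helper and the numbers are invented for illustration, and no such analysis appears in the material summarized here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-rationale mean entropies and correctness flags (1 = correct).
entropies = [0.2, 0.4, 1.1, 1.5, 0.3, 1.3]
correct = [1, 1, 0, 0, 1, 0]
r = pearson(entropies, correct)
print(round(r, 3))
```

A clearly negative coefficient on real annotations would support the premise that lower entropy goes with more accurate rationales; a value near zero would suggest the modulation is reweighting on noise.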

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on validating the Confidence Analysis and Modulation module. We address the concern directly below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Confidence Analysis and Modulation module (as described in abstract and §3)] The headline performance claim (outperformance on 21 LLMs and SOTA models) is load-bearing on the Confidence Analysis and Modulation module correctly amplifying trustworthy signals. However, no ablation, correlation analysis between token entropy and rationale correctness on depression indicators, or human validation is supplied to show that lower entropy predicts higher accuracy; without this, the modulation step cannot be credited for the gains and the training-free contribution reduces to structured CoT alone.

    Authors: We agree that the current manuscript lacks explicit ablation studies isolating the modulation module, correlation analyses between token entropy and rationale correctness on depression indicators, and human validation of the entropy-accuracy link. The reported gains are shown relative to zero-shot baselines, but these do not fully disentangle the modulation from the structured CoT decomposition. The module design draws on established LLM literature treating token entropy as a proxy for epistemic uncertainty, yet we acknowledge this does not constitute direct validation for the depression-detection task. In the revised manuscript we will add: (1) an ablation removing the Confidence Analysis and Modulation module while retaining the CoT and collaborative prediction stages, (2) quantitative correlation analysis between per-rationale entropy and indicator-level accuracy where ground-truth alignment permits, and (3) discussion of proxy validation approaches given clinical data constraints. These changes will clarify the module's incremental contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework evaluated on public datasets

full rationale

The paper describes a three-stage empirical framework (CoT multi-factor analysis, entropy-based confidence modulation, collaborative prediction) applied to frozen LLMs and evaluated on DAIC-WOZ and E-DAIC datasets. No equations, derivations, or self-citations are presented that reduce performance claims to fitted parameters, self-definitions, or load-bearing prior work by the same authors. Results are reported as direct comparisons against zero-shot baselines, supervised models, and commercial LLMs, rendering the claims externally falsifiable without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract; the framework rests on domain assumptions about LLM reasoning capabilities rather than introducing fitted parameters or new entities.

axioms (2)
  • domain assumption Frozen LLMs can generate clinically aligned rationales when given chain-of-thought prompts structured around five depression-related themes
    Invoked by the first module to handle long-context dependencies.
  • domain assumption Token-level entropy is a valid proxy for the epistemic reliability of each rationale
    Central to the Confidence Analysis and Modulation module.
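
For reference, "token-level entropy" usually denotes the Shannon entropy of the model's next-token distribution, averaged over the generated rationale. The paper's exact formulation is not reproduced on this page, so the notation below is an assumption: $p_t$ is the distribution at generation step $t$, $\mathcal{V}$ the vocabulary, $T$ the rationale length in tokens, and $\bar{H}$ the per-rationale mean.

```latex
H_t = -\sum_{v \in \mathcal{V}} p_t(v) \log p_t(v),
\qquad
\bar{H} = \frac{1}{T} \sum_{t=1}^{T} H_t
```

Under this definition $\bar{H}$ is bounded by $\log |\mathcal{V}|$, and the second axiom amounts to assuming that a smaller $\bar{H}$ marks a more trustworthy rationale.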

pith-pipeline@v0.9.1-grok · 5835 in / 1399 out tokens · 26673 ms · 2026-06-27T13:33:46.303208+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1]

    https://www.who.int/ publications/i/item/9789240031029

    World Health Organization,World Mental Health Today and Mental Health Atlas 2024, World Health Organization, Geneva, 2024. [Online]. Available: https://www.who.int/publications/i/item/9789240114487

  2. [2]

    Zero-shot speech-based depression and anxiety assessment with llms,

    E. Loweimi and Others, “Zero-shot speech-based depression and anxiety assessment with llms,” inProceedings of Interspeech 2025, 2025, pp. 489–

  3. [3]

    Available: https://www.research.ed.ac.uk/en/publications/ zero-shot-speech-based-depression-and-anxiety-assessment-with-llm

    [Online]. Available: https://www.research.ed.ac.uk/en/publications/ zero-shot-speech-based-depression-and-anxiety-assessment-with-llm

  4. [4]

    Arlington, V A: American Psychiatric Publishing, 2013

    American Psychiatric Association,Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Arlington, V A: American Psychiatric Publishing, 2013. [Online]. Available: https://doi.org/10.1176/appi.books. 9780890425596

  5. [5]

    Rapid and accurate diagnosis of mental disorders in the general population: Validity of the computerized adaptive testing–mental health (cat-mh) module,

    B. B. Brodey, S. E. Purcell, K. Rhea, P. Maier, M. B. First, L. Zweede, M. Sinisterra, M. B. Nunn, M.-P. Austin, and I. S. Brodey, “Rapid and accurate diagnosis of mental disorders in the general population: Validity of the computerized adaptive testing–mental health (cat-mh) module,”Journal of Medical Internet Research, vol. 20, no. 3, p. e10685, 2018. [...

  6. [6]

    Reliability and validity of severity dimensions of psychopathology assessed using the structured clinical interview for dsm-5 (scid),

    S. A. Shankman, C. J. Funkhouser, D. N. Klein, J. Davila, D. Lerner, and D. Hee, “Reliability and validity of severity dimensions of psychopathology assessed using the structured clinical interview for dsm-5 (scid),”International Journal of Methods in Psychiatric Research, vol. 27, no. 1, p. e1590, 2018. [Online]. Available: https://pubmed.ncbi.nlm.nih.go...

  7. [7]

    Leveraging large language models for automated depression screening,

    B. G. Teferra and A. Perivolaris, “Leveraging large language models for automated depression screening,”Frontiers in Psychiatry, vol. 15, 2024. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC12303271/

  8. [8]

    SpeechT-RAG: Reliable depression detection in LLMs with retrieval-augmented generation using speech timing information,

    X. Zhang, H. Liu, Q. Zhang, B. Ahmed, and J. Epps, “SpeechT-RAG: Reliable depression detection in LLMs with retrieval-augmented generation using speech timing information,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguisti...

  9. [9]

    A survey of large language models in mental health disorder detection on social media,

    Z. Ge, N. Hu, D. Li, Y . Wang, S. Qi, Y . Xu, H. Shi, and J. Zhang, “A survey of large language models in mental health disorder detection on social media,” in2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW). IEEE, May 2025, p. 164–176. [Online]. Available: http://dx.doi.org/10.1109/ICDEW67478.2025.00027

  10. [10]

    Predicting depression in screening interviews from latent categorization of interview prompts,

    A. Rinaldi, J. Fox Tree, and S. Chaturvedi, “Predicting depression in screening interviews from latent categorization of interview prompts,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds. Online: Association for Computational Linguistics, Jul. 2020, pp. 7...

  11. [11]

    Hique: Hierarchical question embedding network for multimodal depression detection,

    J. Jung, C. Kang, J. Yoon, S. Kim, and J. Han, “Hique: Hierarchical question embedding network for multimodal depression detection,” in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, ser. CIKM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 1049–1059. [Online]. Available: https://doi.or...

  12. [12]

    Explainable depression detection with multi-aspect features using a hybrid deep learning model on social media,

    H. Zogan, I. Razzak, X. Wang, S. Jameel, and G. Xu, “Explainable depression detection with multi-aspect features using a hybrid deep learning model on social media,”World Wide Web, vol. 25, no. 1, pp. 281–304, 2022. [Online]. Available: https://link.springer.com/article/10. 1007/s11280-021-00992-2

  13. [13]

    Mitigating interviewer bias in multimodal depression detection: An approach with adversarial learning and contextual positional encoding,

    E. Zhang and C. Poellabauer, “Mitigating interviewer bias in multimodal depression detection: An approach with adversarial learning and contextual positional encoding,” inFindings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, 2025, pp. 12 169–12 188. [Online]. Available: https://aclanthology.org/2...

  14. [14]

    MentalBERT: Publicly available pretrained language models for mental healthcare,

    S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, and E. Cambria, “MentalBERT: Publicly available pretrained language models for mental healthcare,” inProceedings of the Thirteenth Language Resources and Evaluation Conference, N. Calzolari, F. B ´echet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Od...

  15. [15]

    Mental-llm: Leveraging large language models for mental health prediction via online text data,

    X. Xu, B. Yao, Y . Dong, S. Gabriel, H. Yu, J. Hendler, M. Ghassemi, A. K. Dey, and D. Wang, “Mental-llm: Leveraging large language models for mental health prediction via online text data,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 1, pp. 1–32, 2024. [Online]. Available: https://arxiv.org/abs/2307.14385

  16. [16]

    Proceedings of the

    K. Yang, T. Zhang, Z. Kuang, Q. Xie, J. Huang, and S. Ananiadou, “Mentallama: Interpretable mental health analysis on social media with large language models,” inProceedings of the ACM Web Conference 2024. ACM, May 2024, p. 4489–4500. [Online]. Available: http://dx.doi.org/10.1145/3589334.3648137

  17. [17]

    Explainable depression detection in clinical interviews with personalized retrieval-augmented generation,

    L. Zhang, Z. Gao, D. Zhou, and Y . He, “Explainable depression detection in clinical interviews with personalized retrieval-augmented generation,” inFindings of the Association for Computational Linguistics: ACL

  18. [18]

    9927–9944

    Association for Computational Linguistics, 2025, pp. 9927–9944. [Online]. Available: https://aclanthology.org/2025.findings-acl.517

  19. [19]

    Agentmental: An interactive multi-agent framework for explainable and adaptive mental health assessment,

    J. Hu and Others, “Agentmental: An interactive multi-agent framework for explainable and adaptive mental health assessment,”arXiv preprint arXiv:2508.11567, 2025. [Online]. Available: https://arxiv.org/abs/2508. 11567

  20. [20]

    Intermind: Doctor-patient-family interactive depression assessment empowered by large language models,

    Z. Zhou, J. Liu, S. Wang, S. Hao, Y . Guo, and R. Hong, “Intermind: Doctor-patient-family interactive depression assessment empowered by large language models,” inProceedings of the 33rd ACM International Conference on Multimedia, ser. MM ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 5480–5489. [Online]. Available: https://doi.org...

  21. [21]

    Depression detection based on multilevel semantic features,

    X. Yao, L. Ying, T. He, L. Ren, R. Xu, and K. Mao, “Depression detection based on multilevel semantic features,” inArtificial Neural Networks and Machine Learning – ICANN 2024: 33rd International Conference on Artificial Neural Networks, Lugano, Switzerland, September 17–20, 2024, Proceedings, Part VIII. Berlin, Heidelberg: Springer-Verlag, 2024, p. 44–55...

  22. [22]

    AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition,

    F. Ringeval, B. Schuller, M. Valstar, N. Cummins, R. Cowie, L. Tavabi, M. Schmitt, S. Alisamir, S. Amiriparian, E.-M. Messner, S. Song, S. Liu, Z. Zhao, A. Mallol-Ragolta, Z. Ren, M. Soleymani, and M. Pantic, “Avec 2019 workshop and challenge: State-of-mind, detecting depression with ai, and cross-cultural affect recognition,” p. 3–12, 2019. [Online]. Ava...

  23. [23]

    Medical hallucination in foundation models and their impact on healthcare,

    Y . Kim, H. Jeong, S. Chen, S. S. Li, M. Lu, K. Alhamoud, J. Mun, C. Grau, M. Jung, R. Gameiro, L. Fan, E. Park, T. Lin, J. Yoon, W. Yoon, M. Sap, Y . Tsvetkov, P. Liang, X. Xu, X. Liu, D. McDuff, H. Lee, H. W. Park, S. Tulebaev, and C. Breazeal, “Medical hallucination in foundation models and their impact on healthcare,”medRxiv, 2025. [Online]. Available...

  24. [24]

    doi:10.1038/s41746-025-01670-7 , issn =

    E. Asgari, N. Monta ˜na-Brown, M. Dubois, S. Khalil, J. Balloch, J. A. Yeung, and D. Pimenta, “A framework to assess clinical safety and hallucination rates of llms for medical text summarisation,”npj Digital Medicine, vol. 8, no. 1, p. 274, 2025. [Online]. Available: https://doi.org/10.1038/s41746-025-01670-7

  25. [25]

    Predicting depression in screening interviews from interactive multi-theme collaboration,

    X. Zhao, Y . Lyu, D. Wang, and B. Tang, “Predicting depression in screening interviews from interactive multi-theme collaboration,” in Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 23 025–23 035. [Onlin...

  26. [26]

    Integrating expert knowledge into large language models improves performance for psychiatric reasoning and diagnosis,

    K. V . Sarma, K. E. Hanss, A. J. M. Halls, A. Krystal, D. F. Becker, A. L. Glowinski, and A. J. Butte, “Integrating expert knowledge into large language models improves performance for psychiatric reasoning and diagnosis,”Psychiatry Research, vol. 355, p. 116844,

  27. [27]

    Available: https://www.sciencedirect.com/science/article/ pii/S0165178125004895

    [Online]. Available: https://www.sciencedirect.com/science/article/ pii/S0165178125004895

  28. [28]

    Interpretable depression assessment using a large language model,

    J.-J. Lee, J. Han, and C.-W. Woo, “Interpretable depression assessment using a large language model,”PLOS Digital Health, vol. 5, no. 2, p. e0001205, Feb. 2026. [Online]. Available: https://doi.org/10.1371/journal.pdig.0001205

  29. [29]

    Enhancing textgcn for depression detection on social media with emotion representation,

    H. Mao and Q. Han, “Enhancing textgcn for depression detection on social media with emotion representation,”Frontiers in Psychology, JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12 vol. 16, p. 1612769, 2025. [Online]. Available: https://www.frontiersin. org/journals/psychology/articles/10.3389/fpsyg.2025.1612769/full

  30. [30]

    Large language models in biomedicine and healthcare,

    J. Zhou, H. Li, S. Chen, Z. Chen, Z. Han, and X. Gao, “Large language models in biomedicine and healthcare,”npj Artificial Intelligence, vol. 1, no. 1, p. 44, dec 2025. [Online]. Available: https://doi.org/10.1038/s44387-025-00047-1

  31. [31]

    Integrating retrieval-augmented generation with large language models in nephrology: Advancing practical applications,

    J. Miao, C. Thongprayoon, S. Suppadungsuk, O. A. Garcia Valencia, and W. Cheungpasitporn, “Integrating retrieval-augmented generation with large language models in nephrology: Advancing practical applications,”Medicina, vol. 60, no. 3, 2024. [Online]. Available: https://www.mdpi.com/1648-9144/60/3/445

  32. [32]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems, vol. 35, 2022, pp. 24 824–24 837. [Online]. Available: https://arxiv.org/abs/2201.11903

  33. [33]

    Enhancing depression detection with chain-of-thought prompting: From emotion to reasoning using large language models,

    S. Teng, J. Liu, R. K. Jain, S. Chai, R. Hou, T. Tateyama, L. Lin, and Y .-W. Chen, “Enhancing depression detection with chain-of-thought prompting: From emotion to reasoning using large language models,”arXiv preprint arXiv:2502.05879, 2025. [Online]. Available: https://arxiv.org/abs/2502.05879

  34. [34]

    Detecting hallucinations in large language models using semantic entropy,

    S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal, “Detecting hallucinations in large language models using semantic entropy,”Nature, vol. 630, no. 8017, pp. 625–630, 2024

  35. [35]

    Beyond semantic entropy: Boosting LLM uncertainty quantification with pairwise semantic similarity,

    D. Nguyen, A. Payani, and B. Mirzasoleiman, “Beyond semantic entropy: Boosting LLM uncertainty quantification with pairwise semantic similarity,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4530–454...

  36. [36]

    Measuring large language model uncertainty in women’s health using semantic entropy and perplexity: a comparative study,

    J. C. Penny-Dimri, M. Bachmann, W. R. Cooke, S. Mathewlynn, S. Dockree, J. Tolladay, J. Kossen, L. Li, Y . Gal, and G. Davis Jones, “Measuring large language model uncertainty in women’s health using semantic entropy and perplexity: a comparative study,”The Lancet Obstetrics, Gynaecology, & Women’s Health, vol. 1, no. 1, pp. e47–e56, Sep. 2025. [Online]. ...

  37. [37]

    The distress analysis interview corpus of human and computer interviews

    J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsellaet al., “The distress analysis interview corpus of human and computer interviews.” inLREC. Reykjavik, 2014, pp. 3123–3128. [Online]. Available: https://api.semanticscholar.org/CorpusID:14488823

  38. [38]

    Simsensei kiosk: a virtual human interviewer for healthcare decision support,

    D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, G. Lucas, S. Marsella, F. Morbini, A. Nazarian, S. Scherer, G. Stratou, A. Suri, D. Traum, R. Wood, Y . Xu, A. Rizzo, and L.-P. Morency, “Simsensei kiosk: a virtual human interviewer for healthcare decision support,” inProceedings of the 2014 Int...

  39. [39]

    Topic modeling based multi-modal depression detection,

    Y . Gong and C. Poellabauer, “Topic modeling based multi-modal depression detection,” inProceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, ser. A VEC ’17. New York, NY , USA: Association for Computing Machinery, 2017, p. 69–76. [Online]. Available: https://doi.org/10.1145/3133944.3133945

  40. [40]

    Tensor fusion network for multimodal sentiment analysis,

    A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” inProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. Palmer, R. Hwa, and S. Riedel, Eds. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 1103–1114. [Online]. Available...

  41. [41]

    Multimodal transformer for unaligned multimodal language sequences,

    Y .-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. M `arquez, Eds. Florence, Italy: Association for Computational Linguistics, Ju...

  42. [42]

    Misa: Modality-invariant and -specific representations for multimodal sentiment analysis,

    D. Hazarika, R. Zimmermann, and S. Poria, “Misa: Modality-invariant and -specific representations for multimodal sentiment analysis,” 2020. [Online]. Available: https://arxiv.org/abs/2005.03545

  43. [43]

    A hierar- chical attention network-based approach for depression detection from transcribed clinical interviews,

    A. Mallol-Ragolta, Z. Zhao, L. Stappen, and B. Schuller, “A hierar- chical attention network-based approach for depression detection from transcribed clinical interviews,” 09 2019, pp. 221–225

  44. [44]

    Depmstat: Multimodal spatio-temporal attentional transformer for depression detection,

    Y . Tao, M. Yang, H. Li, Y . Wu, and B. Hu, “Depmstat: Multimodal spatio-temporal attentional transformer for depression detection,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 7, pp. 2956–2966, 2024

  45. [45]

    Ttfnet: Temporal-frequency features fusion network for speech based automatic depression recognition and assessment,

    X. Chen, Z. Shao, Y . Jiang, R. Chen, Y . Wang, B. Li, M. Niu, H. Chen, Q. Hu, J. Wu, C. Yang, and Y . Shang, “Ttfnet: Temporal-frequency features fusion network for speech based automatic depression recognition and assessment,”IEEE Journal of Biomedical and Health Informatics, vol. 29, no. 10, pp. 7536–7548, 2025

  46. [46]

    Depmamba: Progressive fusion mamba for multimodal depression detection,

    J. Ye, J. Zhang, and H. Shan, “Depmamba: Progressive fusion mamba for multimodal depression detection,” inICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  47. [47]

    Mmpf: Multimodal purification fusion for automatic depression detection,

    B. Yang, M. Cao, X. Zhu, S. Wang, C. Yang, R. Ni, and X. Liu, “Mmpf: Multimodal purification fusion for automatic depression detection,”IEEE Transactions on Computational Social Systems, vol. 11, no. 6, pp. 7421– 7434, 2024

  48. [48]

    Wavface: A multimodal transformer-based model for depression screening,

    R. Flores, M. Tlachac, A. Shrestha, and E. A. Rundensteiner, “Wavface: A multimodal transformer-based model for depression screening,”IEEE Journal of Biomedical and Health Informatics, vol. 29, no. 5, pp. 3632– 3641, 2025

  49. [49]

    Depression detection in clinical interviews with LLM-empowered structural element graph,

    Z. Chen, J. Deng, J. Zhou, J. Wu, T. Qian, and M. Huang, “Depression detection in clinical interviews with LLM-empowered structural element graph,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard, Eds. ...

  50. [50]

    Psycollm: Enhancing llm for psychological understanding and evaluation,

    J. Hu, T. Dong, G. Luo, H. Ma, P. Zou, X. Sun, D. Guo, X. Yang, and M. Wang, “Psycollm: Enhancing llm for psychological understanding and evaluation,”IEEE Transactions on Computational Social Systems, vol. 12, no. 2, pp. 539–551, 2025

  51. [51]

    Depressllm: Interpretable domain-adapted language model for depression detection from real-world narratives,

    S. Moon, A. Lee, J. E. Kim, H.-J. Kang, I.-S. Shin, S.-W. Kim, J.-M. Kim, M. Jhon, and J.-W. Kim, “Depressllm: Interpretable domain-adapted language model for depression detection from real-world narratives.” [Online]. Available: https://arxiv.org/abs/2508.08591

  53. [53]

    Harnessing multimodal approaches for depression detection using large language models and facial expressions,

    M. Sadeghi, R. Richer, B. Egger, L. Schindler-Gmelch, L. H. Rupp, F. Rahimi, M. Berking, and B. M. Eskofier, “Harnessing multimodal approaches for depression detection using large language models and facial expressions,” npj Mental Health Research, vol. 3, no. 1, p. 66. [Online]. Available: https://doi.org/10.1038/s44184-024-00112-8

  55. [55]

    Large language models for depression recognition in spoken language integrating psychological knowledge,

    Y. Li, S. Shao, M. Milling, and B. W. Schuller, “Large language models for depression recognition in spoken language integrating psychological knowledge,” Frontiers in Computer Science, vol. 7, Aug. 2025. [Online]. Available: http://dx.doi.org/10.3389/fcomp.2025.1629725

  56. [56]

    MAGI: Multi-agent guided interview for psychiatric assessment,

    G. Bi, Z. Chen, Z. Liu, H. Wang, X. Xiao, Y. Xie, W. Zhang, Y. Huang, Y. Chen, L. Peng, and M. Huang, “MAGI: Multi-agent guided interview for psychiatric assessment,” in Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics,...

  57. [57]

    Wisemind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis,

    Y. Wu, G. Wan, J. Li, S. Zhao, L. Ma, T. Ye, M. Zhang, I. Pop, Y. Zhang, and J. Chen, “Wisemind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis,” npj Digital Medicine. [Online]. Available: https://doi.org/10.1038/s41746-026-02559-9

  59. [59]

    Eeyore: Realistic depression simulation via expert-in-the-loop supervised and preference optimization,

    S. Liu, B. Brie, W. Li, L. Biester, A. Lee, J. Pennebaker, and R. Mihalcea, “Eeyore: Realistic depression simulation via expert-in-the-loop supervised and preference optimization,” in Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational ...

  60. [60]

    Zero-shot strike: Testing the generalisation capabilities of out-of-the-box llm models for depression detection,

    J. Ohse, B. Hadžić, P. Mohammed, N. Peperkorn, M. Danner, A. Yorita, N. Kubota, M. Rätsch, and Y. Shiban, “Zero-shot strike: Testing the generalisation capabilities of out-of-the-box llm models for depression detection,” Computer Speech & Language, vol. 88, p. 101663, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0885...

  61. [61]

    Depression detection on social media with large language models,

    X. Lan, Z. Han, Y. Cheng, L. Sheng, J. Feng, C. Gao, and Y. Li, “Depression detection on social media with large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, S. Potdar, L. Rojas-Barahona, and S. Montella, Eds. Suzhou (China): Association for Computational Linguistics, Nov. 20...

  62. [62]

    Towards explainable multimodal depression recognition for clinical interviews,

    W. Zheng, Q. Xie, Z. Wang, J. Yu, and R. Xia, “Towards explainable multimodal depression recognition for clinical interviews,” 2025. [Online]. Available: https://arxiv.org/abs/2501.16106

  63. [63]

    Uncertainty estimation of large language models in medical question answering,

    J. Wu, Y. Yu, and H.-Y. Zhou, “Uncertainty estimation of large language models in medical question answering,” 2024. [Online]. Available: https://arxiv.org/abs/2407.08662

  64. [64]

    Language Models (Mostly) Know What They Know

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds et al., “Language models (mostly) know what they know,” arXiv preprint arXiv:2207.05221, 2022. [Online]. Available: https://arxiv.org/abs/2207.05221

  65. [65]

    SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,

    P. Manakul, A. Liusie, and M. Gales, “SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 9004–9017. [Online]....

  66. [66]

    Geometric uncertainty for detecting and correcting hallucinations in llms,

    E. Phillips, S. Wu, S. Molaei, D. Belgrave, A. Thakur, and D. Clifton, “Geometric uncertainty for detecting and correcting hallucinations in llms,” arXiv preprint arXiv:2509.13813, 2025. [Online]. Available: https://arxiv.org/abs/2509.13813

  67. [67]

    ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

    K. Huang, J. Altosaar, and R. Ranganath, “Clinicalbert: Modeling clinical notes and predicting hospital readmission,” 2020. [Online]. Available: https://arxiv.org/abs/1904.05342

  68. [68]

    Biomistral: A collection of open-source pretrained large language models for medical domains,

    Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour, “Biomistral: A collection of open-source pretrained large language models for medical domains,” 2024. [Online]. Available: https://arxiv.org/abs/2402.10373

  69. [69]

    MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

    Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan, A. Köpf, A. Mohtashami, A. Sallinen, A. Sakhaeirad, V. Swamy, I. Krawczuk, D. Bayazit, A. Marmet, S. Montariol, M.-A. Hartley, M. Jaggi, and A. Bosselut, “Meditron-70b: Scaling medical pretraining for large language models,” 2023. [Online]. Available: https://arxiv...