pith. sign in

arxiv: 2605.23977 · v1 · pith:65MM57DBnew · submitted 2026-05-13 · 💻 cs.CL · cs.SD· eess.AS

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

Pith reviewed 2026-06-30 21:17 UTC · model grok-4.3

classification 💻 cs.CL cs.SDeess.AS
keywords depression detectionclinical interviewsbenchmark auditcross-validation stabilitymodality comparisongeneralizationE-DAIC dataset
0
0 comments X

The pith

Cross-validation and official test rankings disagree on top models for depression detection from interviews

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits evaluation practices in clinical-interview depression detection across multiple datasets and probes. It shows that rankings from development cross-validation align only moderately with official test rankings, with no overlap in top-3 models and the apparent winner being top only 32 percent of the time under bootstrapping. External validation reveals weak generalization, and stress tests indicate text models benefit more from symptom-dense interview segments than audio models do. These findings suggest that current benchmark setups may not reliably identify superior models or support claims of progress.

Core claim

Under subject-disjoint leave-one-subject-out cross-validation on E-DAIC, a hybrid model reaches macro-F1 of 0.723. Sweeping 96 configurations shows development CV and official test rankings align moderately, with the best CV model ranking 20th on test, the test winner 41st on CV, zero top-3 overlap, and the winner ranking first in only 32.3% of subject bootstraps. Strong in-domain baselines on CMDC and ANDROIDS transfer poorly zero-shot. On symptom-dense slices, text scores rise while audio scores stay flat.

What carries the argument

The four-probe audit consisting of strict subject-disjoint cross-validation, model configuration sweeps for ranking stability, external corpus validation, and symptom-density partitioned stress testing.

If this is right

  • Strict CV provides a conservative reference without relying on privileged holdout.
  • Official test rankings are unreliable due to low alignment with CV.
  • Zero-shot transfer to other datasets is weak despite high in-domain performance.
  • Text models show greater improvement on symptom-dense slices compared to audio.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If rankings are this unstable, reported state-of-the-art results may reflect split-specific artifacts more than true capability.
  • Audio features might provide more consistent signals across varying levels of expressed symptoms.
  • Future benchmarks could benefit from ensemble evaluation or multiple holdouts to assess ranking robustness.
  • The differential behavior of text and audio suggests hybrid systems or modality-specific evaluation strategies.

Load-bearing premise

The SRDS-based annotator correctly identifies symptom-dense versus symptom-light interview slices without systematic bias affecting text versus audio model comparisons.

What would settle it

Re-annotating the interview slices with a different method or human experts and re-running the stress-test to check if the text-audio gap persists.

Figures

Figures reproduced from arXiv: 2605.23977 by Jon Duke, Takehiro Ishikawa.

Figure 1
Figure 1. Figure 1: Mean symptom-dense-minus-symptom-light probability shift for audio and text models in the paired same-participant SRDS-defined symptom-density stress test on E-DAIC slices [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
read the original abstract

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 - the highest reported under this protocol, to our knowledge - providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper audits clinical-interview depression detection benchmarks across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH via four probes: re-evaluation of E-DAIC under subject-disjoint LOSO CV yielding macro-F1=0.723; a sweep of 96 model configurations showing moderate misalignment between dev CV and official test rankings (best CV config ranks 20th on test, official-test winner ranks 41st on CV, zero top-3 overlap, 32.3% bootstrap rank-1 stability); external validation with strong in-domain but weak zero-shot transfer performance; and a stress-test on symptom-dense vs. symptom-light slices (defined by SRDS annotator) where text models improve but audio models remain flat.

Significance. If the reported misalignment between CV and test rankings is shown to exceed sampling variability, the work would indicate that official E-DAIC splits may not reliably support fine-grained leaderboard comparisons, with implications for benchmark design in clinical NLP. The multi-dataset external validation and bootstrap stability analysis are constructive elements that strengthen the audit's empirical grounding.

major comments (1)
  1. [96-configuration sweep (second probe)] In the section describing the sweep of 96 model configurations: the central claim that dev CV and official-test rankings align only moderately (best CV at rank 20 on test, test winner at rank 41 on CV, zero top-3 overlap) is presented without per-configuration macro-F1 values, standard errors, or a null distribution of rank instability under random subject-disjoint splits of comparable size; this leaves open whether the observed shifts exceed what finite test-set variance would produce, as noted by the limited number of test subjects.
minor comments (2)
  1. [Abstract and methods] The abstract and methods description provide specific performance numbers but omit details on exact model implementations, hyperparameter choices, data exclusion criteria, and the statistical tests used for the bootstrap analysis.
  2. [Stress-test section (fourth probe)] The stress-test section would benefit from explicit description of how the SRDS-based annotator was applied to define symptom-dense versus symptom-light slices, including any inter-annotator agreement metrics if available.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the 96-configuration sweep. The concern about whether observed rank shifts exceed finite-sample variance is well-taken, and we address it directly below.

read point-by-point responses
  1. Referee: In the section describing the sweep of 96 model configurations: the central claim that dev CV and official-test rankings align only moderately (best CV at rank 20 on test, test winner at rank 41 on CV, zero top-3 overlap) is presented without per-configuration macro-F1 values, standard errors, or a null distribution of rank instability under random subject-disjoint splits of comparable size; this leaves open whether the observed shifts exceed what finite test-set variance would produce, as noted by the limited number of test subjects.

    Authors: The manuscript already reports a subject-bootstrap analysis showing the apparent winner is rank-1 in only 32.3% of resamples, which directly quantifies rank instability. However, we agree that a formal null distribution under random subject-disjoint splits of the test set would strengthen the claim that the misalignment exceeds sampling variability. In revision we will add (i) the full table of 96 per-configuration macro-F1 scores with standard errors (in an appendix) and (ii) a comparison of observed rank shifts against the distribution obtained from 1000 random subject-disjoint partitions of the test subjects. This will allow readers to evaluate whether the reported misalignment is statistically distinguishable from test-set variance alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical audit on public data with standard protocols

full rationale

The paper conducts an empirical audit using subject-disjoint cross-validation, model sweeps, external transfer tests, and symptom-slice stress tests on public datasets (DAIC/E-DAIC, CMDC, etc.). All reported quantities—macro-F1 scores, rank alignments, bootstrap stabilities, and modality gaps—are direct measurements from these protocols. No equations, fitted parameters, or self-citations are invoked to derive the central observations by construction; the misalignment findings and transfer results stand as independent empirical outcomes without reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on standard assumptions about the representativeness of the listed datasets for clinical interviews and the reliability of the SRDS annotator for defining symptom density; no free parameters or invented entities are introduced because the work is an empirical audit rather than a new modeling derivation.

axioms (2)
  • domain assumption Subject-disjoint leave-one-subject-out cross-validation is the appropriate protocol for obtaining conservative out-of-fold performance estimates on E-DAIC.
    Invoked explicitly in the first probe to establish the macro-F1 reference point.
  • domain assumption The SRDS-based annotator provides an unbiased partition of interview slices into symptom-dense and symptom-light categories.
    Required for the fourth probe's stress-test comparing text and audio model behavior.

pith-pipeline@v0.9.1-grok · 5795 in / 1709 out tokens · 54401 ms · 2026-06-30T21:17:20.209413+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Depressive disorder (depression),

    World Health Organization, "Depressive disorder (depression)," WHO Fact Sheet, Aug. 29, 2025. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/depression. Accessed: May 4, 2026

  2. [2]

    Global, regional and national burden of depressive disorders and attributable risk factors, from 1990 to 2021: results from the 2021 Global Burden of Disease study,

    J. Rong, X. Wang, P. Cheng, D. Li, and D. Zhao, “Global, regional and national burden of depressive disorders and attributable risk factors, from 1990 to 2021: results from the 2021 Global Burden of Disease study,” The British Journal of Psychiatry, vol. 227, no. 4, pp. 688–697, 2025, doi: 10.1192/bjp.2024.266

  3. [3]

    Digital health tools for the passive monitoring of depression: a systematic review of methods,

    V. De Angel et al., “Digital health tools for the passive monitoring of depression: a systematic review of methods,” npj Digital Medicine, vol. 5, art. 3, 2022, doi: 10.1038/s41746-021-00548-8

  4. [4]

    The role of digital biomarkers in physiological signal-based depression assessment: systematic review and meta-analysis,

    H. Lee, S.-G. Kang, and S. Lee, “The role of digital biomarkers in physiological signal-based depression assessment: systematic review and meta-analysis,” Journal of Medical Internet Research, vol. 28, e76432, 2026, doi: 10.2196/76432

  5. [5]

    The Distress Analysis Interview Corpus of human and computer interviews,

    J. Gratch et al., "The Distress Analysis Interview Corpus of human and computer interviews," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 3123-3128, 2014

  6. [6]

    AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition,

    F. Ringeval et al., "AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition," in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop (AVEC '19), 2019, doi: 10.1145/3347320.3357688

  7. [7]

    Multimodal Mental Health Digital Biomarker Analysis From Remote Interviews Using Facial, Vocal, Linguistic, and Cardiovascular Patterns,

    Z. Jiang, S. Seyedi, E. Griner, A. Abbasi, A. B. Rad, H. Kwon, R. O. Cotes, and G. D. Clifford, "Multimodal Mental Health Digital Biomarker Analysis From Remote Interviews Using Facial, Vocal, Linguistic, and Cardiovascular Patterns," IEEE Journal of Biomedical and Health Informatics, vol. 28, no. 3, pp. 1680- 1691, Mar. 2024, doi: 10.1109/JBHI.2024.3352075

  8. [8]

    Language-based detection of depression with machine learning: systematic review and meta-analysis,

    H. Fisher et al., "Language-based detection of depression with machine learning: systematic review and meta-analysis," npj Digital Medicine, vol. 9, art. 273, 2026, doi: 10.1038/s41746-026-02448-1

  9. [9]

    Diagnostic accuracy of deep learning using speech samples in depression: a systematic review and meta-analysis,

    L. Liu et al., "Diagnostic accuracy of deep learning using speech samples in depression: a systematic review and meta-analysis," Journal of the American Medical Informatics Association, vol. 31, no. 10, pp. 2394-2404, 2024, doi: 10.1093/jamia/ocae189

  10. [10]

    On over-fitting in model selection and subsequent selection bias in performance evaluation,

    G. C. Cawley and N. L. C. Talbot, "On over-fitting in model selection and subsequent selection bias in performance evaluation," Journal of Machine Learning Research, vol. 11, pp. 2079-2107, 2010

  11. [11]

    Selection bias in gene extraction on the basis of microarray gene- expression data,

    C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene- expression data," Proceedings of the National Academy of Sciences, vol. 99, no. 10, pp. 6562-6566, 2002

  12. [12]

    Bias in error estimation when using cross-validation for model selection,

    V. S. Varma and R. Simon, "Bias in error estimation when using cross-validation for model selection," BMC Bioinformatics, vol. 7, art. 91, 2006

  13. [13]

    No unbiased estimator of the variance of k-fold cross-validation,

    Y. Bengio and Y. Grandvalet, "No unbiased estimator of the variance of k-fold cross-validation," Journal of Machine Learning Research, vol. 5, pp. 1089-1105, 2004

  14. [14]

    Cross-validation pitfalls when selecting and assessing regression and classification models,

    D. Krstajic et al., "Cross-validation pitfalls when selecting and assessing regression and classification models," Journal of Cheminformatics, vol. 6, art. 10, 2014

  15. [15]

    Cross-validation failure: Small sample sizes lead to large error bars,

    G. Varoquaux, "Cross-validation failure: Small sample sizes lead to large error bars," NeuroImage, vol. 180, pp. 68-77, 2018

  16. [16]

    Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study,

    I. Danylenko and O. Unold, "Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study," Applied Sciences, vol. 16, no. 1, art. 422, 2026, doi: 10.3390/app16010422

  17. [17]

    Assessing speaker independence on a speech-based depression level estimation system,

    P. López-Otero, L. Docío-Fernández, and C. García-Mateo, "Assessing speaker independence on a speech-based depression level estimation system," Pattern Recognition Letters, vol. 68, pp. 343-350, 2015, doi: 10.1016/j.patrec.2015.05.017

  18. [18]

    The need to approximate the use-case in clinical machine learning,

    S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P. Kording, "The need to approximate the use-case in clinical machine learning," GigaScience, vol. 6, no. 5, pp. 1-9, 2017, doi: 10.1093/gigascience/gix019

  19. [19]

    Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

    H.-C. Yeh, L. Sun, A. Mahapatra, S. S. Chandra, E. Mower Provost, and B. Sisman, "Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection," arXiv preprint arXiv:2604.14354, 2026, doi: 10.48550/arXiv.2604.14354

  20. [20]

    Semi-Structural Interview-Based Chinese Multimodal Depression Corpus Towards Automatic Preliminary Screening of Depressive Disorders,

    B. Zou et al., "Semi-Structural Interview-Based Chinese Multimodal Depression Corpus Towards Automatic Preliminary Screening of Depressive Disorders," IEEE Transactions on Affective Computing, vol. 14, no. 4, pp. 2823-2838, 2023, doi: 10.1109/TAFFC.2022.3181210

  21. [21]

    The Androids Corpus: A New Publicly Available Benchmark for Speech Based Depression Detection,

    F. Tao, A. Esposito, and A. Vinciarelli, "The Androids Corpus: A New Publicly Available Benchmark for Speech Based Depression Detection," in Proceedings of Interspeech 2023, pp. 4149-4153, 2023, doi: 10.21437/Interspeech.2023-894

  22. [22]

    Subject-level depression detection from Chinese clinical interview texts requires dataset- specific aggregation,

    S. Xia, "Subject-level depression detection from Chinese clinical interview texts requires dataset- specific aggregation," Research Square preprint, 2026, doi: 10.21203/rs.3.rs-9258754/v1

  23. [23]

    Depression detection in read and spontaneous speech: a multimodal approach for lesser-resourced languages,

    K. Daly and O. Olukoya, "Depression detection in read and spontaneous speech: a multimodal approach for lesser-resourced languages," Biomedical Signal Processing and Control, vol. 108, art. 107959, 2025, doi: 10.1016/j.bspc.2025.107959

  24. [24]

    Speech-Based Depression Assessment: A Comprehensive Survey,

    S. S. Leal, S. Ntalampiras, and R. Sassi, "Speech-Based Depression Assessment: A Comprehensive Survey," IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1318-1333, Jul.-Sep. 2025, doi: 10.1109/TAFFC.2024.3521327

  25. [25]

    A systematic review on automated clinical depression diagnosis,

    K. Mao, Y. Wu, and J. Chen, "A systematic review on automated clinical depression diagnosis," npj Mental Health Research, vol. 2, art. 20, 2023, doi: 10.1038/s44184-023-00040-z

  26. [26]

    Speech and Text Foundation Models for Depression Detection: Cross-Task and Cross-Language Evaluation,

    L. Gomez-Zaragoza, J. Marin-Morales, M. Alcaniz, and M. Soleymani, "Speech and Text Foundation Models for Depression Detection: Cross-Task and Cross-Language Evaluation," in Proceedings of Interspeech 2025, pp. 5253-5257, 2025, doi: 10.21437/Interspeech.2025-1035

  27. [27]

    Detecting depression using vocal, facial and semantic communication cues,

    J. R. Williamson, E. Godoy, M. Cha, A. Schwarzentruber, P. Khorrami, Y. Gwon, H.-T. Kung, C. Dagli, and T. F. Quatieri, “Detecting depression using vocal, facial and semantic communication cues,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge (AVEC ’16), 2016, pp. 11–18, doi: 10.1145/2988257.2988263

  28. [28]

    Detecting depression with word-level multimodal fusion,

    M. Rohanian, J. Hough, and M. Purver, “Detecting depression with word-level multimodal fusion,” in Proceedings of Interspeech 2019, 2019, pp. 1443–1447, doi: 10.21437/Interspeech.2019-2283

  29. [29]

    Multi-level attention network using text, audio and video for depression prediction,

    A. Ray, S. Kumar, R. Reddy, P. Mukherjee, and R. Garg, “Multi-level attention network using text, audio and video for depression prediction,” in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop (AVEC ’19), 2019, pp. 81–88, doi: 10.1145/3347320.3357697

  30. [30]

    Multi-level attention network using text, audio and video for depression prediction,

    M. R. Makiuchi, T. Warnita, K. Uto, and K. Shinoda, “Multimodal fusion of BERT-CNN and gated CNN representations for depression detection,” in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop (AVEC ’19), 2019, pp. 55–63, doi: 10.1145/3347320.3357694

  31. [31]

    Harnessing multimodal approaches for depression detection using large language models and facial expressions,

    M. Sadeghi, R. Richer, B. Egger, L. Schindler-Gmelch, L. H. Rupp, F. Rahimi, M. Berking, and B. M. Eskofier, “Harnessing multimodal approaches for depression detection using large language models and facial expressions,” npj Mental Health Research, vol. 3, art. 66, 2024, doi: 10.1038/s44184-024- 00112-8

  32. [32]

    When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews,

    H. Watawana et al., "When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews," arXiv preprint arXiv:2603.24651, 2026

  33. [33]

    DAIC-WOZ: On the Validity of Using the Therapist's prompts in Automatic Depression Detection from Clinical Interviews,

    S. Burdisso et al., "DAIC-WOZ: On the Validity of Using the Therapist's prompts in Automatic Depression Detection from Clinical Interviews," in Proceedings of the 6th Clinical Natural Language Processing Workshop, Mexico City, Mexico, Jun. 2024, pp. 82-90, doi: 10.18653/v1/2024.clinicalnlp- 1.8

  34. [34]

    T+L and L-only code for Beyond Headline Scores: A Multi-Probe Audit of Clinical- Interview Depression Detection Benchmarks,

    T. Ishikawa, "T+L and L-only code for Beyond Headline Scores: A Multi-Probe Audit of Clinical- Interview Depression Detection Benchmarks," Zenodo, Version v2, Apr. 27, 2026, doi: 10.5281/zenodo.19813142. Concept DOI: 10.5281/zenodo.19813141

  35. [35]

    MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis,

    H. Cai, Y. Gao, S. Sun, N. Li, F. Tian, H. Xiao, J. Li, Z. Yang, X. Li, Q. Zhao, Z. Liu, Z. Yao, M. Yang, H. Peng, J. Zhu, X. Zhang, X. Hu, and B. Hu, "MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis," arXiv preprint arXiv:2002.09283, 2020

  36. [36]

    A multi-modal open dataset for mental-disorder analysis,

    H. Cai, Z. Yuan, Y. Gao, et al., "A multi-modal open dataset for mental-disorder analysis," Scientific Data, vol. 9, art. 178, 2022, doi: 10.1038/s41597-022-01211-x

  37. [37]

    A Multimodal Depression Consultation Dataset of Speech and Text with HAMD-17 Assessments,

    P. Cao et al., "A Multimodal Depression Consultation Dataset of Speech and Text with HAMD-17 Assessments," Scientific Data, vol. 12, art. 1577, 2025, doi: 10.1038/s41597-025-05817-9