Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness
Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3
The pith
LLMs differ sharply in how reliably they score HADS from noisy speech transcripts
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR, and their keyword groundedness exceeds 93 percent. Llama-3.1-8B is ASR-fragile: its ICC falls from 0.82 to 0.36 at 10 percent WER, and its groundedness is only 77-81 percent. Inter-model keyword agreement is far lower than score-level agreement, a score-evidence dissociation that appears even though predictive validity holds for the stronger models.
What carries the argument
Intra-model consistency via intraclass correlation coefficient across three runs, robustness tested with Whisper Large/Medium/Small at varying word error rates, and keyword groundedness as proxy for evidence faithfulness in zero-shot HADS estimation from speech.
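As a rough illustration of the consistency measure, the sketch below computes ICC(3,1) (two-way mixed effects, consistency, single measurement, in the Shrout and Fleiss scheme) from a participants-by-runs score table. The function name `icc_3_1` and the toy scores are assumptions for illustration; the paper's own implementation is not shown on this page.

```python
def icc_3_1(scores):
    """ICC(3,1) for a table with one row per participant and one column per run.

    Hypothetical helper: a plain-Python sketch of the standard two-way
    ANOVA decomposition, not the authors' code.
    """
    n = len(scores)        # participants
    k = len(scores[0])     # runs (treated as fixed "raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Toy data: three participants, three runs with a constant per-run offset.
# A consistency ICC ignores such offsets, so agreement is perfect here.
toy = [[3, 4, 5], [10, 11, 12], [7, 8, 9]]
print(round(icc_3_1(toy), 6))
```

Because ICC(3,1) is a consistency coefficient, a run that shifts every score by the same amount does not lower it; only changes in the ranking or spacing of participants across runs do.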
If this is right
- Predictive validity of the HADS estimates remains largely preserved under ASR errors for the consistent models.
- Inter-model agreement at the score level greatly exceeds agreement on the specific keywords used, indicating divergent reasoning paths.
- Models with fragile consistency and lower groundedness may produce numeric outputs unsuitable for clinical settings where evidence matters.
- High-groundedness models could enable more interpretable automated voice-based screening without major loss from real-world transcription noise.
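One way to quantify keyword-level agreement, for comparison against score-level agreement, is a set-overlap measure over the keywords each model cites. The sketch below uses Jaccard overlap, which is an assumption: this page does not say which agreement measure the paper uses, and the helper names and example keywords are hypothetical.

```python
from itertools import combinations


def jaccard(keywords_a, keywords_b):
    """Jaccard overlap of two keyword lists, case-insensitive."""
    a = {k.strip().lower() for k in keywords_a if k.strip()}
    b = {k.strip().lower() for k in keywords_b if k.strip()}
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets agree fully
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(keywords_by_model):
    """Average Jaccard overlap over all unordered model pairs."""
    pairs = list(combinations(sorted(keywords_by_model), 2))
    if not pairs:
        raise ValueError("need at least two models")
    total = sum(jaccard(keywords_by_model[m], keywords_by_model[n]) for m, n in pairs)
    return total / len(pairs)


# Hypothetical keywords cited by three models for the same transcript.
cited = {
    "model_a": ["worry", "tired"],
    "model_b": ["Worry", "hopeless"],
    "model_c": ["worry"],
}
print(mean_pairwise_jaccard(cited))
```

Models can produce near-identical numeric scores while this overlap stays low, which is the pattern the second bullet describes.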
Where Pith is reading between the lines
- If keyword counts prove too crude a faithfulness check, full human rating of model explanations would be required to confirm the dissociation findings.
- The pattern of model differences may extend to other clinical questionnaires or languages, calling for parallel ASR-robustness tests on those tasks.
- Prompts that force explicit evidence citation could narrow the score-evidence gap in weaker models like Llama-3.1.
- Deployment decisions for speech-based mental health tools should prioritize models that keep both consistency and grounding rather than relying on numeric agreement alone.
Load-bearing premise
That matching HADS-related keywords in the transcript is a sufficient and unbiased way to judge whether the model's assigned score is faithfully supported by the evidence.
What would settle it
If independent clinicians review the full model outputs and transcripts and find that high keyword groundedness does not match actual faithful reasoning or that low-groundedness scores are still clinically accurate.
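To make the proxy under discussion concrete, the sketch below treats keyword groundedness as the share of model-cited keywords that occur verbatim in the transcript. That definition, the tokenisation rule, and the function names are assumptions for illustration; the exact matching rule used in the paper is not given on this page.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit, or apostrophe."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def is_grounded(keyword, transcript_tokens):
    """True if the keyword's tokens appear as a contiguous run in the transcript."""
    kw = tokenize(keyword)
    if not kw:
        return False
    n = len(kw)
    return any(
        transcript_tokens[i:i + n] == kw
        for i in range(len(transcript_tokens) - n + 1)
    )


def groundedness(keywords, transcript):
    """Fraction of cited keywords found in the transcript; None if none were cited."""
    if not keywords:
        return None
    tokens = tokenize(transcript)
    hits = sum(is_grounded(k, tokens) for k in keywords)
    return hits / len(keywords)


# Hypothetical transcript and model-cited keywords.
transcript = "I just, erm, cannot cope with the worry anymore."
print(groundedness(["cannot cope", "worry", "hopeless"], transcript))
```

A check like this only confirms that a cited phrase exists in the input. It says nothing about whether the phrase actually drove the score, which is exactly the gap the premise above points to.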
Original abstract
LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
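For readers unfamiliar with the WER figure quoted in the abstract, the sketch below computes word error rate as the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The lowercase whitespace tokenisation and the example sentences are assumptions; the paper's text normalisation is not described on this page.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of ten.
ref = "i feel tense and wound up most of the time"
hyp = "i feel tents and wound up most of the time"
print(word_error_rate(ref, hyp))
```

In this toy case a single substitution in a ten-word utterance already gives 10% WER, and the substituted word happens to be the clinically relevant one.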
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates three LLMs (Phi-4, Gemma-2-9B, Llama-3.1-8B) for zero-shot HADS score estimation from speech on 111 participants. It uses ground-truth transcripts and three Whisper ASR outputs, with three runs per model-condition pair, to assess intra-model consistency (ICC), ASR robustness, predictive validity, and evidence faithfulness via keyword groundedness. Findings show ICC > 0.89 with minimal ASR degradation for Phi-4 and Gemma-2-9B, ICC drop from 0.82 to 0.36 at 10% WER for Llama-3.1-8B, preserved predictive validity for robust models, and groundedness >93% vs. 77-81%, indicating score-evidence dissociation.
Significance. If the results hold, the work supplies useful empirical benchmarks on LLM reliability for mental health screening. The design with repeated runs, multiple ASR conditions, and concrete metrics on 111 participants enables assessment of consistency and robustness. Model-specific differences and the dissociation observation carry implications for clinical interpretability and safe deployment.
major comments (2)
- [Results (keyword groundedness)] Results section on keyword groundedness: The evidence faithfulness claim and score-evidence dissociation rest on a keyword-matching proxy that counts pre-defined HADS-related terms in outputs (>93% for Phi-4/Gemma-2-9B, 77-81% for Llama-3.1-8B). This proxy does not establish that numeric scores are conditioned on transcript content rather than priors or prompt artifacts, especially in zero-shot prompting. The manuscript includes no human faithfulness ratings, keyword ablations, or checks correlating groundedness with transcript-grounded reasoning, so the dissociation may be an artifact of the metric.
- [Methods] Methods section (experimental setup): The paper does not detail the exact zero-shot prompts, post-processing or exclusion rules for outputs, or validation of HADS ground-truth labels (e.g., self-report reliability). These omissions affect reproducibility of the ICC drops and predictive validity results, and could alter interpretation of ASR fragility for Llama-3.1-8B.
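The kind of post-processing and exclusion rule the referee asks the authors to document could look like the sketch below. It assumes, hypothetically, that the model prints lines such as `HADS Anxiety score: 12`, and it excludes any output that lacks two in-range scores on the 0-21 subscales. The output format, the regular expression, and the name `parse_hads_scores` are illustrative assumptions, not the paper's pipeline.

```python
import re

# Subscale name, any text up to a colon or equals sign, then a 1-2 digit number.
SCORE_PATTERN = re.compile(
    r"\b(anxiety|depression)\b[^\n:=]*[:=]\s*(\d{1,2})\b",
    re.IGNORECASE,
)


def parse_hads_scores(output):
    """Return {'anxiety': int, 'depression': int}, or None if the output is invalid."""
    scores = {}
    for match in SCORE_PATTERN.finditer(output):
        subscale = match.group(1).lower()
        value = int(match.group(2))
        # Keep the first in-range value per subscale; HADS subscales run 0-21.
        if 0 <= value <= 21 and subscale not in scores:
            scores[subscale] = value
    if set(scores) != {"anxiety", "depression"}:
        return None  # exclusion rule: both subscales must be present and valid
    return scores


print(parse_hads_scores("HADS Anxiety score: 12\nHADS Depression score: 7"))
print(parse_hads_scores("Anxiety: 35\nDepression: 4"))
```

Whether invalid outputs are dropped, retried, or clipped changes how many runs enter each ICC estimate, which is why the referee ties this detail to reproducibility.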
minor comments (2)
- [Abstract] Abstract: The statement that 'predictive validity is largely preserved' should include the specific metric and values (e.g., correlation or MAE) for precision.
- [Results] Tables/figures: Label all ASR conditions (Large/Medium/Small) and runs clearly, and include confidence intervals for ICC values.
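One common way to obtain the confidence intervals requested in the last comment is a percentile bootstrap that resamples participants with replacement. The sketch below is generic over the statistic; the function name `bootstrap_ci`, the defaults, and the toy statistic are assumptions, and nothing on this page indicates which interval method the authors would use.

```python
import random


def bootstrap_ci(rows, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for statistic(rows), resampling whole rows.

    rows: one entry per participant (e.g. that participant's scores per run).
    statistic: any callable mapping a list of rows to a float. It must cope
    with resamples containing repeated rows; some statistics (an ICC, say)
    can be undefined on degenerate resamples.
    """
    if not rows:
        raise ValueError("need at least one participant row")
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = max(lo_idx, min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1))
    return estimates[lo_idx], estimates[hi_idx]


def mean_first_run(rows):
    """Toy statistic: mean score of the first run."""
    return sum(r[0] for r in rows) / len(rows)


# Hypothetical scores: five participants, three runs each.
toy_rows = [[3, 4, 3], [10, 9, 11], [7, 7, 8], [15, 14, 15], [1, 2, 1]]
low, high = bootstrap_ci(toy_rows, mean_first_run)
print(low, high)
```

Resampling rows rather than individual scores keeps each participant's three runs together, so the within-participant structure that an ICC depends on is preserved.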
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help clarify key aspects of our work on LLM reliability for HADS scoring. We address each major comment point by point below, indicating planned revisions where appropriate.
- Referee: Results section on keyword groundedness: The evidence faithfulness claim and score-evidence dissociation rest on a keyword-matching proxy that counts pre-defined HADS-related terms in outputs (>93% for Phi-4/Gemma-2-9B, 77-81% for Llama-3.1-8B). This proxy does not establish that numeric scores are conditioned on transcript content rather than priors or prompt artifacts, especially in zero-shot prompting. The manuscript includes no human faithfulness ratings, keyword ablations, or checks correlating groundedness with transcript-grounded reasoning, so the dissociation may be an artifact of the metric.
  Authors: We agree that keyword groundedness serves as an objective but limited proxy and does not causally establish that scores are conditioned on transcript content versus model priors in zero-shot settings. The dissociation observation is supported by the large gap between high inter-model score agreement and low keyword agreement, which we interpret as evidence that models may converge on scores via divergent internal processes. To strengthen this, the revised manuscript will explicitly describe the metric as a preliminary proxy, add a dedicated limitations paragraph on its shortcomings, and include a new analysis correlating per-sample groundedness with intra-model consistency (ICC). We will also add a keyword ablation experiment (masking HADS terms in transcripts) to test sensitivity. Human ratings are not feasible to add at this stage without new data collection, but the proxy differences remain informative for the reported model-specific patterns.
  Revision: partial
- Referee: Methods section (experimental setup): The paper does not detail the exact zero-shot prompts, post-processing or exclusion rules for outputs, or validation of HADS ground-truth labels (e.g., self-report reliability). These omissions affect reproducibility of the ICC drops and predictive validity results, and could alter interpretation of ASR fragility for Llama-3.1-8B.
  Authors: We thank the referee for highlighting these gaps. The revised Methods section will include the full zero-shot prompts, a complete description of post-processing steps and any output exclusion rules (e.g., invalid score filtering), and additional details on HADS ground-truth validation, including citations to established psychometric reliability studies for the self-report instrument. These changes will directly support reproducibility of the ICC and predictive validity findings and clarify the interpretation of Llama-3.1-8B's ASR fragility.
  Revision: yes
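The keyword ablation the authors propose (masking HADS terms in transcripts) could be prototyped roughly as below: mask a list of terms, re-run the model on the masked transcript, and compare scores. The term list, the `[MASK]` token, and the function name are hypothetical; the rebuttal does not specify how the masking would be done.

```python
import re


def mask_terms(transcript, terms, mask="[MASK]"):
    """Replace whole-word, case-insensitive occurrences of each term with a mask token."""
    cleaned = [t.strip() for t in terms if t.strip()]
    if not cleaned:
        return transcript
    # Longest terms first so multi-word phrases win over their sub-phrases.
    ordered = sorted(cleaned, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, transcript)


# Hypothetical transcript and term list.
text = "I worry a lot and feel hopeless. Worrying helps nobody."
print(mask_terms(text, ["worry", "hopeless"]))
```

If scores barely move after masking, that would suggest the numeric output is not conditioned on the cited terms, which is the referee's concern. Note that whole-word matching leaves inflected forms such as "Worrying" untouched, so a real ablation would need a decision on stemming.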
Circularity Check
No circularity: purely empirical measurements on external data
The paper reports direct empirical results from running three LLMs on 111 participants' ground-truth transcripts and ASR outputs, computing ICC for consistency, degradation under WER, and a keyword-count proxy for groundedness. No equations, fitted parameters, predictions, or derivations appear; no self-citations are invoked to justify uniqueness, ansatzes, or load-bearing premises. All quantities are computed from the external transcripts/ASR outputs and model generations without reducing to quantities defined inside the study itself.
Reference graph
Works this paper leans on
- [1] Introduction. Mental health disorders such as anxiety and depression impact millions worldwide each year [1]. Although early detection is critical for effective intervention [2], traditional screening relies on clinical interviews [3] or the Hospital Anxiety and Depression Scale (HADS) [4], approaches that are resource-intensive, subjective, and difficul...
- [2] Are LLM predictions stable across repeated runs, and does ASR affect this stability?
- [3] Does ASR degrade predictive validity (correlation with HADS ground truth)?
- [4] Are LLM-cited keywords grounded in the transcript, and is keyword evidence stable across runs and models?
- [5] Are there model-specific vulnerability patterns that inform deployment decisions? Having introduced the dataset, HADS instrument, and prompt design in Section 2, we present the statistical analysis framework underpinning our evaluation in Section 3. Section 4 describes the experimental setup. Section 5 presents resul... (arXiv:2605.09634v1 [cs.CL], 10 May 2026)
- [6] Data, HADS, and Prompt Design. 2.1. The PsyVoiD corpus. We use the PsyVoiD corpus [35], collected in Scotland (UK) during the COVID-19 lockdown to investigate the relationship between spontaneous English speech and psychological traits. The dataset comprises 111 participants (70 female, 41 male), aged 21–86 (median 62), 34 of whom (31%) report a prior clin...
- [7] Statistical Analysis Framework. In this section we describe the measures used to evaluate the three reliability dimensions. All statistical tests are non-parametric, reflecting the ordinal nature of HADS scores. You are a clinical psychologist and linguist, analyzing a spontaneous speech transcript recorded during the Covid-19 lockdown. Your task is to es...
- [8] Psychological and Emotional Features - Detect anxiety-related cues (e.g., excessive worry, nervousness). - Detect depression-related cues (e.g., lack of motivation, hopelessness)
- [9] Linguistic and Behavioural Features - Identify hesitation markers (e.g., erm, uh, pauses) and their frequency. - Detect negative self-statements (e.g., I do not feel good, I cannot cope). - Assess certainty level (confident, unsure, detached)
- [10] Psychological Score Predictions - Predict HADS Anxiety score (0-21 scale, higher indicates greater anxiety) - Predict HADS Depression score (0-21 scale, higher indicates greater depression)
- [11] Justification Using Keywords - Provide keywords from the transcript that influenced each prediction. Transcript: {transcript} Figure 1: Zero-shot prompt used for HADS score estimation, integrating role specification, step-by-step decomposition, score prediction, and keyword justification. Adapted from [26]. 3.1. Intra-model consistency. To assess whether re...
- [12] Experimental Setup. 4.1. LLM configurations. We evaluate three open-weight instruction-tuned LLMs, selected to span distinct model families, training pipelines, and parameter scales: Phi-4 (14.7B, Microsoft [41]), Gemma-2-9B (9B, Google [42]), and Llama-3.1-8B-Instruct (8B, Meta [43]). Following [26, 27], each model receives the zero-shot prompt (Section ...
- [13] Experimental Results and Discussion. 5.1. Consistency across runs. Table 1 reports ICC(3,1) and Friedman p-values across all model–condition–subscale combinations. Phi-4 and Gemma-2-9B show excellent consistency with minimal degradation across ASR conditions. Phi-4’s ICC ranges from 0.890 to 0.925 across all conditions and subscales (Δ_max = 0.035). Gemma-2-...
- [14] Conclusions and Future Work. We presented the first joint analysis of intra-model consistency, ASR robustness, and keyword evidence faithfulness for LLM-based mental health screening. Phi-4 and Gemma-2-9B demonstrate excellent consistency (ICC > 0.89) and stable predictive validity (ρ_s = 0.38–0.56) across ASR conditions, whereas Llama-3.1-8B exhibits sev...
- [15] N. Salari, A. Hosseinian-Far, R. Jalali, A. Vaisi-Raygani, S. Rasoulpoor, M. Mohammadi, S. Rasoulpoor, and B. Khaledi-Paveh, “Prevalence of stress, anxiety, depression among the general population during the COVID-19 pandemic: a systematic review and meta-analysis,” Globalization and Health, vol. 16, p. 57, 2020.
- [16] P. D. McGorry, A. Ratheesh, and B. O’Donoghue, “Early intervention—an implementation challenge for 21st century mental health care,” JAMA Psychiatry, vol. 75, no. 6, pp. 545–546, 2018.
- [17] M. Von Korff, S. Shapiro, J. D. Burke, M. Teitlebaum, E. A. Skinner, P. German, R. W. Turner, L. Klein, and B. Burns, “Anxiety and depression in a primary care clinic,” Archives of General Psychiatry, vol. 44, no. 2, pp. 152–156, 1987.
- [18] A. S. Zigmond and R. P. Snaith, “The Hospital Anxiety and Depression Scale,” Acta Psychiatrica Scandinavica, vol. 67, no. 6, pp. 361–370, 1983.
- [19] J. J. Newson, D. Hunter, and T. C. Thiagarajan, “The heterogeneity of mental health assessment,” Frontiers in Psychiatry, vol. 11, p. 76, 2020.
- [20] S. Rude, E.-M. Gortner, and J. Pennebaker, “Language use of depressed and depression-vulnerable college students,” Cognition & Emotion, vol. 18, no. 8, pp. 1121–1133, 2004.
- [21] Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” J. Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.
- [22] E. C. Stade et al., “Depression and anxiety have distinct and overlapping language patterns: results from a clinical interview,” J. Psychopathology and Clinical Science, 2023.
- [23] S. C. Guntuku, D. B. Yaden, M. L. Kern, L. H. Ungar, and J. C. Eichstaedt, “Detecting depression and mental illness on social media: an integrative review,” Current Opinion in Behavioral Sciences, vol. 18, pp. 43–49, 2017.
- [24] J. C. Eichstaedt et al., “Facebook language predicts depression in medical records,” Proc. Natl Acad. Sci. USA, vol. 115, no. 44, pp. 11203–11208, 2018.
- [25] A. Malhotra and R. Jindal, “Deep learning techniques for suicide and depression detection from online social media: a scoping review,” Applied Soft Computing, vol. 130, p. 109713, 2022.
- [26] R. A. Calvo, D. N. Milne, M. S. Hussain, and H. Christensen, “Natural language processing in mental health applications using non-clinical texts,” Natural Language Engineering, vol. 23, no. 5, pp. 649–685, 2017.
- [27] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, 2015.
- [28] S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech Communication, vol. 53, no. 5, pp. 768–785, 2011.
- [29] D. M. Low, K. H. Bentley, and S. S. Ghosh, “Automated assessment of psychiatric disorders using speech: a systematic review,” Laryngoscope Investigative Otolaryngology, vol. 5, no. 1, pp. 96–116, 2020.
- [30] T. Al Hanai, M. Ghassemi, and J. Glass, “Detecting depression with audio/text sequence modeling of interviews,” in Proc. Interspeech, 2018, pp. 1716–1720.
- [31] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901.
- [32] X. Lan et al., “Depression detection on social media with large language models,” arXiv preprint arXiv:2403.10750, 2024.
- [33] Y. Tao et al., “Classifying anxiety and depression through LLMs virtual interactions: a case study with ChatGPT,” in Proc. IEEE BIBM, 2023, pp. 2259–2264.
- [34] S. Xu et al., “Identifying psychiatric manifestations in outpatients with depression and anxiety: a large language model-based approach,” medRxiv, 2025.
- [35] S. K. Lho et al., “Large language models and text embeddings for detecting depression and suicide in patient narratives,” JAMA Network Open, vol. 8, no. 5, 2025.
- [36] Z. Guo et al., “Large language models for mental health applications: systematic review,” JMIR Mental Health, vol. 11, no. 1, p. e57400, 2024.
- [37] J. M. Liu et al., “Enhanced large language models for effective screening of depression and anxiety,” arXiv preprint arXiv:2501.08769, 2025.
- [38] B. Lamichhane, “Evaluation of ChatGPT for NLP-based mental health applications,” arXiv preprint arXiv:2303.15727, 2023.
- [39] K. Yang et al., “Towards interpretable mental health analysis with large language models,” in Proc. EMNLP, 2024.
- [40] E. Loweimi, S. de la Fuente Garcia, and S. Luz, “Zero-shot speech-based depression and anxiety assessment with LLMs,” in Proc. Interspeech, 2025, pp. 489–493.
- [41] E. Loweimi, S. de la Fuente Garcia, and S. Luz, “Zero-shot speech-based mental health and affective state assessment using LLMs,” IEEE J. Selected Topics in Signal Processing, 2025, under review.
- [42] K. Roy, H. Surana, D. Mullen, K. Haut, J. Flint, and J. Baxter, “Large language models for mental health diagnostic assessments: exploring the potential of LLMs for assisting with mental health diagnostic assessments,” arXiv preprint arXiv:2501.01305, 2025.
- [43] S. K. Ernala et al., “Methodological gaps in predicting mental health states from social media,” in Proc. CHI, 2019.
- [44] M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting,” in Proc. NeurIPS, 2024.
- [45] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
- [46] D. W. Joyce, G. Aref-Adib, A. Meyer, B. Shivaprasad, and S. Abrahams, “Explainable artificial intelligence for mental health through transparency and interpretability for understandability,” npj Digital Medicine, vol. 6, no. 6, 2023.
- [47] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023.
- [48] S. de la Fuente Garcia, F. Haider, and S. Luz, “COVID-19: affect recognition through voice analysis during the winter lockdown in Scotland,” in Proc. IEEE EMBC, 2021, pp. 2326–2329.
- [49] S. de la Fuente Garcia and S. Luz, “PsyVoiD—investigating the relationship between spontaneous speech features and psychology in the context of the COVID-19 pandemic and lockdown,” 2023, dataset, University of Edinburgh.
- [50] I. Bjelland, A. A. Dahl, T. T. Haug, and D. Neckelmann, “The validity of the Hospital Anxiety and Depression Scale: an updated literature review,” J. Psychosomatic Research, vol. 52, no. 2, pp. 69–77, 2002.
- [51] C. Herrmann, “International experiences with the Hospital Anxiety and Depression Scale—a review of validation data and clinical results,” J. Psychosomatic Research, vol. 42, no. 1, pp. 17–41, 1997.
- [52] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [53] P. E. Shrout and J. L. Fleiss, “Intraclass correlations: uses in assessing rater reliability,” Psychological Bulletin, vol. 86, no. 2, pp. 420–428, 1979.
- [54] T. K. Koo and M. Y. Li, “A guideline of selecting and reporting intraclass correlation coefficients for reliability research,” J. Chiropractic Medicine, vol. 15, no. 2, pp. 155–163, 2016.
- [55] M. Abdin et al., “Phi-4 technical report,” arXiv preprint arXiv:2412.08905, 2024.
- [56] Google Team et al., “Gemma 2: improving open language models at a practical size,” arXiv preprint arXiv:2408.00118, 2024.
- [57] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.