pith. machine review for the scientific record.

arxiv: 2605.09634 · v1 · submitted 2026-05-10 · 💻 cs.CL


Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness


Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLMs · HADS · mental health screening · consistency · ASR robustness · evidence faithfulness · zero-shot · reliability

The pith

LLMs differ sharply in reliable HADS scoring from noisy speech transcripts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can reliably assign Hospital Anxiety and Depression Scale scores from spoken participant responses in a zero-shot manner. It checks three reliability dimensions: consistency of scores across repeated runs of the same model, how little those scores change when transcripts contain automatic speech recognition errors, and how closely the scores tie to actual keywords in the input that relate to anxiety or depression items. Three models are run on data from 111 English speakers using both clean transcripts and outputs from three Whisper ASR systems. Phi-4 and Gemma-2-9B keep strong consistency above 0.89 ICC with little ASR impact and over 93 percent keyword grounding, while Llama-3.1-8B drops to 0.36 ICC at 10 percent word error rate and 77-81 percent grounding. The work shows that numeric score agreement can hide differences in how much the output actually rests on the evidence.

Core claim

Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR and keyword groundedness above 93 percent; Llama-3.1-8B shows ASR-fragile consistency with ICC falling from 0.82 to 0.36 at 10 percent WER and groundedness of 77-81 percent, revealing score-evidence dissociation even as predictive validity holds for the stronger models.
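The "10 percent WER" figure refers to word error rate, the word-level edit distance between an ASR hypothesis and the reference transcript divided by the reference length. The following is a minimal sketch of that standard definition; the function name and the lowercase, whitespace-split tokenisation are assumptions for illustration, not the paper's scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalisation: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ten-word utterance; the ASR output drops the filler "erm".
reference = "i feel erm quite worried about the future today ok"
hypothesis = "i feel quite worried about the future today ok"
wer = word_error_rate(reference, hypothesis)
```

A single dropped filler in a ten-word utterance already gives a WER of 0.1, which is a reminder that the same WER can hide very different kinds of errors.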

What carries the argument

Intra-model consistency via intraclass correlation coefficient across three runs, robustness tested with Whisper Large/Medium/Small at varying word error rates, and keyword groundedness as proxy for evidence faithfulness in zero-shot HADS estimation from speech.
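For readers unfamiliar with the consistency statistic, here is a minimal pure-Python sketch of ICC(3,1) as defined by Shrout and Fleiss (two-way mixed, single measures, consistency): (MS_rows − MS_error) / (MS_rows + (k − 1) · MS_error). Treating participants as rows and repeated runs as columns is an assumption about how the paper arranges its data; the function name is illustrative.

```python
def icc_3_1(scores):
    """ICC(3,1) for a table with one row per participant, one column per run."""
    n = len(scores)      # participants (targets)
    k = len(scores[0])   # repeated runs (raters)
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a rectangular table with at least 2 rows and 2 columns")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("scores have no variance; ICC is undefined")
    return (ms_rows - ms_error) / denominator


# Worked example from Shrout & Fleiss (1979): 6 targets rated by 4 judges.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_example = icc_3_1(shrout_fleiss)

# Runs that differ only by a constant offset are perfectly *consistent*.
offset_runs = [[5, 6, 7], [10, 11, 12], [3, 4, 5], [8, 9, 10]]
icc_offset = icc_3_1(offset_runs)
```

Because the consistency form removes column (run) effects, a run that is uniformly one point higher does not lower ICC(3,1); only changes in the relative ordering or spacing of participants do.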

If this is right

  • Predictive validity of the HADS estimates remains largely preserved under ASR errors for the consistent models.
  • Inter-model agreement at the score level greatly exceeds agreement on the specific keywords used, indicating divergent reasoning paths.
  • Models with fragile consistency and lower groundedness may produce numeric outputs unsuitable for clinical settings where evidence matters.
  • High-groundedness models could enable more interpretable automated voice-based screening without major loss from real-world transcription noise.
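The gap between score-level and keyword-level agreement can be illustrated with a set-overlap measure. The sketch below uses Jaccard similarity, which is an assumption: this page does not say which agreement statistic the paper applies to keywords. The keyword lists and scores are hypothetical.

```python
def keyword_jaccard(keywords_a, keywords_b):
    """Jaccard overlap between two keyword lists, ignoring case and duplicates."""
    a = {k.strip().lower() for k in keywords_a if k.strip()}
    b = {k.strip().lower() for k in keywords_b if k.strip()}
    if not a and not b:
        return 1.0  # two empty lists agree trivially
    return len(a & b) / len(a | b)


# Hypothetical outputs from two models for the same transcript.
model_a = {"anxiety_score": 9, "keywords": ["erm", "worried", "can't sleep"]}
model_b = {"anxiety_score": 9, "keywords": ["Erm", "alone", "tired"]}

same_score = model_a["anxiety_score"] == model_b["anxiety_score"]
overlap = keyword_jaccard(model_a["keywords"], model_b["keywords"])
```

Here the two hypothetical models give the same score while sharing only one of five distinct keywords, which is the kind of pattern the second bullet describes.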

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If keyword counts prove too crude a faithfulness check, full human rating of model explanations would be required to confirm the dissociation findings.
  • The pattern of model differences may extend to other clinical questionnaires or languages, calling for parallel ASR-robustness tests on those tasks.
  • Prompts that force explicit evidence citation could narrow the score-evidence gap in weaker models like Llama-3.1.
  • Deployment decisions for speech-based mental health tools should prioritize models that keep both consistency and grounding rather than relying on numeric agreement alone.

Load-bearing premise

That matching HADS-related keywords in the transcript is a sufficient and unbiased way to judge whether the model's assigned score is faithfully supported by the evidence.
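To show how crude such a proxy can be, here is a minimal sketch of one possible groundedness measure: the fraction of model-cited keywords that occur verbatim in the transcript. The definition, the normalisation, and the function name are assumptions for illustration; the paper's exact matching rules are not given on this page.

```python
import re


def _normalise(text):
    """Lowercase and keep only letters, digits, apostrophes and spaces."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def keyword_groundedness(keywords, transcript):
    """Share of cited keywords found as whole-word, contiguous matches.

    Returns None when the model cited no usable keywords.
    """
    padded = " " + " ".join(_normalise(transcript)) + " "
    cited = [" ".join(_normalise(k)) for k in keywords]
    cited = [k for k in cited if k]
    if not cited:
        return None
    grounded = sum(1 for k in cited if f" {k} " in padded)
    return grounded / len(cited)


# Hypothetical transcript and cited keywords.
transcript = "I erm I can't really cope and I worry a lot"
cited = ["worry", "can't cope", "erm", "hopeless"]
score = keyword_groundedness(cited, transcript)
```

In this example "can't cope" counts as ungrounded because the transcript says "can't really cope", even though a human reader would accept it as supported. Verbatim matching can therefore both understate and overstate faithfulness.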

What would settle it

Independent clinician review of the full model outputs and transcripts would settle it, either by finding that high keyword groundedness does not correspond to faithful reasoning or by finding that low-groundedness scores are nonetheless clinically accurate.

Figures

Figures reproduced from arXiv: 2605.09634 by Erfan Loweimi, Hadi Daneshvar, Samira Loveymi, Saturnino Luz, Sofia de la Fuente Garcia.

Figure 1
Figure 1: Zero-shot prompt used for HADS score estimation, integrating role specification, step-by-step decomposition, score prediction, and keyword justification. Adapted from [26].
3.1. Intra-model consistency. To assess whether repeated LLM inference yields stable predictions, we use the Friedman test to detect systematic inter-run differences, and ICC(3,1) (two-way mixed, single measures, consistency [39]) to qu…
Figure 2
Figure 2: Top: Predictive validity (Spearman ρs with HADS ground truth) vs WER. Bottom: Intra-model consistency (ICC) vs WER. Note the sharp ICC decline for Llama-3.1-8B at higher WER, contrasting the stability of Phi-4 and Gemma-2-9B.
Figure 3
Figure 3: Top 15 keywords per model for HADS-Anxiety (GT transcripts). Note the dominance of “erm” across models and divergent keyword vocabularies underlying similar predictions.
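The Figure 1 excerpt names the Friedman test as the check for systematic inter-run differences. Below is a minimal pure-Python sketch of the tie-corrected Friedman statistic, with a p-value helper that is valid only for three runs (the chi-square approximation with 2 degrees of freedom has the closed-form tail exp(−Q/2)). The function names are illustrative, and the data layout (participants as rows, runs as columns) is an assumption.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(scores):
    """Tie-corrected Friedman chi-square for rows=participants, cols=runs."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
        for value in set(row):
            t = row.count(value)
            tie_term += t * (t * t - 1)
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        return 0.0  # every participant got identical scores in all runs
    return q / correction


def friedman_p_three_runs(q):
    """Asymptotic p-value for k = 3 runs only (chi-square with 2 df)."""
    return math.exp(-q / 2.0)


# Hypothetical scores: run 3 is always highest, run 1 always lowest.
runs = [[1, 2, 3], [4, 5, 6], [2, 5, 9]]
q_stat = friedman_statistic(runs)
p_value = friedman_p_three_runs(q_stat)
```

The tie correction matters here because integer HADS predictions from repeated runs of the same model will often be identical for a given participant.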
Original abstract

LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three LLMs (Phi-4, Gemma-2-9B, Llama-3.1-8B) for zero-shot HADS score estimation from speech on 111 participants. It uses ground-truth transcripts and three Whisper ASR outputs, with three runs per model-condition pair, to assess intra-model consistency (ICC), ASR robustness, predictive validity, and evidence faithfulness via keyword groundedness. Findings show ICC > 0.89 with minimal ASR degradation for Phi-4 and Gemma-2-9B, ICC drop from 0.82 to 0.36 at 10% WER for Llama-3.1-8B, preserved predictive validity for robust models, and groundedness >93% vs. 77-81%, indicating score-evidence dissociation.

Significance. If the results hold, the work supplies useful empirical benchmarks on LLM reliability for mental health screening. The design with repeated runs, multiple ASR conditions, and concrete metrics on 111 participants enables assessment of consistency and robustness. Model-specific differences and the dissociation observation carry implications for clinical interpretability and safe deployment.

major comments (2)
  1. [Results (keyword groundedness)] Results section on keyword groundedness: The evidence faithfulness claim and score-evidence dissociation rest on a keyword-matching proxy that counts pre-defined HADS-related terms in outputs (>93% for Phi-4/Gemma-2-9B, 77-81% for Llama-3.1-8B). This proxy does not establish that numeric scores are conditioned on transcript content rather than priors or prompt artifacts, especially in zero-shot prompting. The manuscript includes no human faithfulness ratings, keyword ablations, or checks correlating groundedness with transcript-grounded reasoning, so the dissociation may be an artifact of the metric.
  2. [Methods] Methods section (experimental setup): The paper does not detail the exact zero-shot prompts, post-processing or exclusion rules for outputs, or validation of HADS ground-truth labels (e.g., self-report reliability). These omissions affect reproducibility of the ICC drops and predictive validity results, and could alter interpretation of ASR fragility for Llama-3.1-8B.
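To make concrete what a documented post-processing and exclusion rule might look like, here is a hedged sketch. The output format it parses ("HADS Anxiety score: 9") is purely an assumption, as are the function name and the decision to exclude any output with a missing or out-of-range score; the paper's actual rules are exactly what the referee says is unspecified.

```python
import re

# Assumed output pattern, e.g. "HADS Anxiety score: 9" or "HADS-Depression = 4".
_SCORE_RE = re.compile(
    r"HADS[\s-]*(Anxiety|Depression)(?:\s+score)?\s*[:=]\s*(\d+)",
    re.IGNORECASE,
)


def parse_hads_scores(model_output):
    """Return {'anxiety': int, 'depression': int}, or None if output is unusable.

    Hypothetical exclusion rule: both subscales must be present and each
    score must lie in the valid HADS subscale range 0-21.
    """
    found = {}
    for match in _SCORE_RE.finditer(model_output):
        subscale = match.group(1).lower()
        # Keep the first score reported for each subscale.
        found.setdefault(subscale, int(match.group(2)))
    if set(found) != {"anxiety", "depression"}:
        return None
    if any(not 0 <= value <= 21 for value in found.values()):
        return None
    return found


valid = parse_hads_scores("HADS Anxiety score: 9\nHADS Depression score: 4")
out_of_range = parse_hads_scores("HADS Anxiety: 25, HADS Depression: 3")
missing = parse_hads_scores("The speaker sounds somewhat anxious.")
```

Whatever rule is actually used, reporting how many outputs per model and ASR condition were excluded would matter, since exclusions that vary by condition could themselves shift ICC.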
minor comments (2)
  1. [Abstract] Abstract: The statement that 'predictive validity is largely preserved' should include the specific metric and values (e.g., correlation or MAE) for precision.
  2. [Results] Tables/figures: Label all ASR conditions (Large/Medium/Small) and runs clearly, and include confidence intervals for ICC values.
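On the first minor comment, the Figure 2 caption identifies the predictive validity measure as Spearman's ρ against HADS ground truth. As a reference point, the sketch below computes Spearman's ρ in plain Python as the Pearson correlation of average ranks, which handles ties. The function names and the example data are illustrative assumptions.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical predicted vs. self-reported scores with one tie.
predicted = [1, 2, 2, 3]
self_report = [1, 2, 3, 4]
rho = spearman_rho(predicted, self_report)
```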

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify key aspects of our work on LLM reliability for HADS scoring. We address each major comment point by point below, indicating planned revisions where appropriate.

point-by-point responses
  1. Referee: Results section on keyword groundedness: The evidence faithfulness claim and score-evidence dissociation rest on a keyword-matching proxy that counts pre-defined HADS-related terms in outputs (>93% for Phi-4/Gemma-2-9B, 77-81% for Llama-3.1-8B). This proxy does not establish that numeric scores are conditioned on transcript content rather than priors or prompt artifacts, especially in zero-shot prompting. The manuscript includes no human faithfulness ratings, keyword ablations, or checks correlating groundedness with transcript-grounded reasoning, so the dissociation may be an artifact of the metric.

    Authors: We agree that keyword groundedness serves as an objective but limited proxy and does not causally establish that scores are conditioned on transcript content versus model priors in zero-shot settings. The dissociation observation is supported by the large gap between high inter-model score agreement and low keyword agreement, which we interpret as evidence that models may converge on scores via divergent internal processes. To strengthen this, the revised manuscript will explicitly describe the metric as a preliminary proxy, add a dedicated limitations paragraph on its shortcomings, and include a new analysis correlating per-sample groundedness with intra-model consistency (ICC). We will also add a keyword ablation experiment (masking HADS terms in transcripts) to test sensitivity. Human ratings are not feasible to add at this stage without new data collection, but the proxy differences remain informative for the reported model-specific patterns. revision: partial

  2. Referee: Methods section (experimental setup): The paper does not detail the exact zero-shot prompts, post-processing or exclusion rules for outputs, or validation of HADS ground-truth labels (e.g., self-report reliability). These omissions affect reproducibility of the ICC drops and predictive validity results, and could alter interpretation of ASR fragility for Llama-3.1-8B.

    Authors: We thank the referee for highlighting these gaps. The revised Methods section will include the full zero-shot prompts, a complete description of post-processing steps and any output exclusion rules (e.g., invalid score filtering), and additional details on HADS ground-truth validation, including citations to established psychometric reliability studies for the self-report instrument. These changes will directly support reproducibility of the ICC and predictive validity findings and clarify the interpretation of Llama-3.1-8B's ASR fragility. revision: yes
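The keyword ablation proposed in the first response (masking HADS terms in transcripts) could be prototyped roughly as follows. This is a sketch under stated assumptions: the term list, the mask token, and the function name are hypothetical, and whole-word, case-insensitive matching is one choice among several.

```python
import re


def mask_terms(transcript, terms, mask="[MASKED]"):
    """Replace whole-word, case-insensitive occurrences of each term."""
    cleaned = [t for t in terms if t]
    if not cleaned:
        return transcript
    # Longest terms first so multi-word phrases win over their sub-words.
    alternatives = "|".join(
        re.escape(t) for t in sorted(cleaned, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(mask, transcript)


# Hypothetical term list; the paper's HADS-related vocabulary is not shown here.
hads_terms = ["worry", "hopeless"]
original = "I worry a lot and feel Hopeless, erm, worrying"
ablated = mask_terms(original, hads_terms)
```

Re-scoring the ablated transcripts and comparing against the original scores would then show how sensitive each model is to the masked terms. Note that whole-word matching leaves inflected forms such as "worrying" untouched, so a real ablation would need to decide whether to mask those as well.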

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on external data

full rationale

The paper reports direct empirical results from running three LLMs on 111 participants' ground-truth transcripts and ASR outputs, computing ICC for consistency, degradation under WER, and a keyword-count proxy for groundedness. No equations, fitted parameters, predictions, or derivations appear; no self-citations are invoked to justify uniqueness, ansatzes, or load-bearing premises. All quantities are computed from the external transcripts/ASR outputs and model generations without reducing to quantities defined inside the study itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation study with no mathematical derivations, free parameters, or postulated entities; relies on standard statistical tools (ICC) and off-the-shelf ASR systems treated as black-box inputs.

pith-pipeline@v0.9.0 · 5549 in / 1231 out tokens · 58259 ms · 2026-05-12T03:49:58.632351+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 4 internal anchors

  1. [1]

    Introduction Mental health disorders such as anxiety and depression impact millions worldwide each year [1]. Although early detection is critical for effective intervention [2], traditional screening relies on clinical interviews [3] or the Hospital Anxiety and Depression Scale (HADS) [4], approaches that are resource-intensive, subjective, and difficul...

  2. [2]

    Are LLM predictions stable across repeated runs, and does ASR affect this stability?

  3. [3]

    Does ASR degrade predictive validity (correlation with HADS ground truth)?

  4. [4]

    Are LLM-cited keywords grounded in the transcript, and is keyword evidence stable across runs and models?

  5. [5]

    Are there model-specific vulnerability patterns that inform deployment decisions? Having introduced the dataset, HADS instrument, and prompt design in Section 2, we present the statistical analysis framework underpinning our evaluation in Section 3. Section 4 describes the experimental setup. Section 5 presents resul...

  6. [6]

    Data, HADS, and Prompt Design 2.1. The PsyVoiD corpus We use the PsyVoiD corpus [35], collected in Scotland (UK) during the COVID-19 lockdown to investigate the relationship between spontaneous English speech and psychological traits. The dataset comprises 111 participants (70 female, 41 male), aged 21–86 (median 62), 34 of whom (31%) report a prior clin...

  7. [7]

    All statistical tests are non-parametric, reflecting the ordinal nature of HADS scores

    Statistical Analysis Framework In this section we describe the measures used to evaluate the three reliability dimensions. All statistical tests are non-parametric, reflecting the ordinal nature of HADS scores. You are a clinical psychologist and linguist, analyzing a spontaneous speech transcript recorded during the Covid-19 lockdown. Your task is to es...

  8. [8]

    - Detect depression-related cues (e.g., lack of motivation, hopelessness)

    Psychological and Emotional Features - Detect anxiety-related cues (e.g., excessive worry, nervousness). - Detect depression-related cues (e.g., lack of motivation, hopelessness)

  9. [9]

    - Detect negative self-statements (e.g., I do not feel good, I cannot cope)

    Linguistic and Behavioural Features - Identify hesitation markers (e.g., erm, uh, pauses) and their frequency. - Detect negative self-statements (e.g., I do not feel good, I cannot cope). - Assess certainty level (confident, unsure, detached)

  10. [10]

    Psychological Score Predictions - Predict HADS Anxiety score (0-21 scale, higher indicates greater anxiety) - Predict HADS Depression score (0-21 scale, higher indicates greater depression)

  11. [11]

    Justification Using Keywords - Provide keywords from the transcript that influenced each prediction. Transcript: {transcript} Figure 1: Zero-shot prompt used for HADS score estimation, integrating role specification, step-by-step decomposition, score prediction, and keyword justification. Adapted from [26]. 3.1. Intra-model consistency To assess whether re...

  12. [12]

    Experimental Setup 4.1. LLM configurations We evaluate three open-weight instruction-tuned LLMs, selected to span distinct model families, training pipelines, and parameter scales: Phi-4 (14.7B, Microsoft [41]), Gemma-2-9B (9B, Google [42]), and Llama-3.1-8B-Instruct (8B, Meta [43]). Following [26, 27], each model receives the zero-shot prompt (Section ...

  13. [13]

    erm,” false starts) that Phi-4 over-interprets as anxiety features, as confirmed by keyword frequency analysis where “erm

    Experimental Results and Discussion 5.1. Consistency across runs Table 1 reports ICC(3,1) and Friedman p-values across all model–condition–subscale combinations. Phi-4 and Gemma-2-9B show excellent consistency with minimal degradation across ASR conditions. Phi-4’s ICC ranges from 0.890 to 0.925 across all conditions and subscales (∆max = 0.035). Gemma-2-...

  14. [14]

    Conclusions and Future Work We presented the first joint analysis of intra-model consistency, ASR robustness, and keyword evidence faithfulness for LLM-based mental health screening. Phi-4 and Gemma-2-9B demonstrate excellent consistency (ICC>0.89) and stable predictive validity (ρs = 0.38–0.56) across ASR conditions, whereas Llama-3.1-8B exhibits sev...

  15. [15]

    Prevalence of stress, anxiety, depression among the general population during the COVID-19 pandemic: a systematic review and meta-analysis,

    N. Salari, A. Hosseinian-Far, R. Jalali, A. Vaisi-Raygani, S. Rasoulpoor, M. Mohammadi, S. Rasoulpoor, and B. Khaledi-Paveh, “Prevalence of stress, anxiety, depression among the general population during the COVID-19 pandemic: a systematic review and meta-analysis,” Globalization and Health, vol. 16, p. 57, 2020

  16. [16]

    Early intervention—an implementation challenge for 21st century mental health care,

    P. D. McGorry, A. Ratheesh, and B. O’Donoghue, “Early intervention—an implementation challenge for 21st century mental health care,” JAMA Psychiatry, vol. 75, no. 6, pp. 545–546, 2018

  17. [17]

    Anxiety and depression in a primary care clinic,

    M. Von Korff, S. Shapiro, J. D. Burke, M. Teitlebaum, E. A. Skinner, P. German, R. W. Turner, L. Klein, and B. Burns, “Anxiety and depression in a primary care clinic,” Archives of General Psychiatry, vol. 44, no. 2, pp. 152–156, 1987

  18. [18]

    The Hospital Anxiety and Depression Scale,

    A. S. Zigmond and R. P. Snaith, “The Hospital Anxiety and Depression Scale,” Acta Psychiatrica Scandinavica, vol. 67, no. 6, pp. 361–370, 1983

  19. [19]

    The heterogeneity of mental health assessment,

    J. J. Newson, D. Hunter, and T. C. Thiagarajan, “The heterogeneity of mental health assessment,” Frontiers in Psychiatry, vol. 11, p. 76, 2020

  20. [20]

    Language use of depressed and depression-vulnerable college students,

    S. Rude, E.-M. Gortner, and J. Pennebaker, “Language use of depressed and depression-vulnerable college students,” Cognition & Emotion, vol. 18, no. 8, pp. 1121–1133, 2004

  21. [21]

    The psychological meaning of words: LIWC and computerized text analysis methods,

    Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” J. Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010

  22. [22]

    Depression and anxiety have distinct and overlapping language patterns: results from a clinical interview,

    E. C. Stade et al., “Depression and anxiety have distinct and overlapping language patterns: results from a clinical interview,” J. Psychopathology and Clinical Science, 2023

  23. [23]

    Detecting depression and mental illness on social media: an integrative review,

    S. C. Guntuku, D. B. Yaden, M. L. Kern, L. H. Ungar, and J. C. Eichstaedt, “Detecting depression and mental illness on social media: an integrative review,” Current Opinion in Behavioral Sciences, vol. 18, pp. 43–49, 2017

  24. [24]

    Facebook language predicts depression in medical records,

    J. C. Eichstaedt et al., “Facebook language predicts depression in medical records,” Proc. Natl Acad. Sci. USA, vol. 115, no. 44, pp. 11 203–11 208, 2018

  25. [25]

    Deep learning techniques for suicide and depression detection from online social media: a scoping review,

    A. Malhotra and R. Jindal, “Deep learning techniques for suicide and depression detection from online social media: a scoping review,” Applied Soft Computing, vol. 130, p. 109713, 2022

  26. [26]

    Natural language processing in mental health applications using non-clinical texts,

    R. A. Calvo, D. N. Milne, M. S. Hussain, and H. Christensen, “Natural language processing in mental health applications using non-clinical texts,” Natural Language Engineering, vol. 23, no. 5, pp. 649–685, 2017

  27. [27]

    A review of depression and suicide risk assessment using speech analysis,

    N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, 2015

  28. [28]

    Automatic speech emotion recognition using modulation spectral features,

    S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech Communication, vol. 53, no. 5, pp. 768–785, 2011

  29. [29]

    Automated assessment of psychiatric disorders using speech: a systematic review,

    D. M. Low, K. H. Bentley, and S. S. Ghosh, “Automated assessment of psychiatric disorders using speech: a systematic review,” Laryngoscope Investigative Otolaryngology, vol. 5, no. 1, pp. 96–116, 2020

  30. [30]

    Detecting depression with audio/text sequence modeling of interviews,

    T. Al Hanai, M. Ghassemi, and J. Glass, “Detecting depression with audio/text sequence modeling of interviews,” in Proc. Interspeech, 2018, pp. 1716–1720

  31. [31]

    Language models are few-shot learners,

    T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901

  32. [32]

    Depression detection on social media with large language models,

    X. Lan et al., “Depression detection on social media with large language models,” arXiv preprint arXiv:2403.10750, 2024

  33. [33]

    Classifying anxiety and depression through LLMs virtual interactions: a case study with ChatGPT,

    Y. Tao et al., “Classifying anxiety and depression through LLMs virtual interactions: a case study with ChatGPT,” in Proc. IEEE BIBM, 2023, pp. 2259–2264

  34. [34]

    Identifying psychiatric manifestations in outpatients with depression and anxiety: a large language model-based approach,

    S. Xu et al., “Identifying psychiatric manifestations in outpatients with depression and anxiety: a large language model-based approach,” medRxiv, 2025

  35. [35]

    Large language models and text embeddings for detecting depression and suicide in patient narratives,

    S. K. Lho et al., “Large language models and text embeddings for detecting depression and suicide in patient narratives,” JAMA Network Open, vol. 8, no. 5, 2025

  36. [36]

    Large language models for mental health applications: systematic review,

    Z. Guo et al., “Large language models for mental health applications: systematic review,” JMIR Mental Health, vol. 11, no. 1, p. e57400, 2024

  37. [37]

    Enhanced large language models for effective screening of depression and anxiety,

    J. M. Liu et al., “Enhanced large language models for effective screening of depression and anxiety,” arXiv preprint arXiv:2501.08769, 2025

  38. [38]

    Evaluation of ChatGPT for NLP-based mental health applications,

    B. Lamichhane, “Evaluation of ChatGPT for NLP-based mental health applications,” arXiv preprint arXiv:2303.15727, 2023

  39. [39]

    Towards interpretable mental health analysis with large language models,

    K. Yang et al., “Towards interpretable mental health analysis with large language models,” in Proc. EMNLP, 2024

  40. [40]

    Zero-shot speech-based depression and anxiety assessment with LLMs,

    E. Loweimi, S. de la Fuente Garcia, and S. Luz, “Zero-shot speech-based depression and anxiety assessment with LLMs,” in Proc. Interspeech, 2025, pp. 489–493

  41. [41]

    Zero-shot speech-based mental health and affective state assessment using LLMs,

    ——, “Zero-shot speech-based mental health and affective state assessment using LLMs,” IEEE J. Selected Topics in Signal Processing, 2025, under review

  42. [42]

    Large language models for mental health diagnostic assessments: exploring the potential of LLMs for assisting with mental health diagnostic assessments,

    K. Roy, H. Surana, D. Mullen, K. Haut, J. Flint, and J. Baxter, “Large language models for mental health diagnostic assessments: exploring the potential of LLMs for assisting with mental health diagnostic assessments,” arXiv preprint arXiv:2501.01305, 2025

  43. [43]

    Methodological gaps in predicting mental health states from social media,

    S. K. Ernala et al., “Methodological gaps in predicting mental health states from social media,” in Proc. CHI, 2019

  44. [44]

    Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting,

    M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting,” in Proc. NeurIPS, 2024

  45. [45]

    Survey of hallucination in natural language generation,

    Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023

  46. [46]

    Explainable artificial intelligence for mental health through transparency and interpretability for understandability,

    D. W. Joyce, G. Aref-Adib, A. Meyer, B. Shivaprasad, and S. Abrahams, “Explainable artificial intelligence for mental health through transparency and interpretability for understandability,” npj Digital Medicine, vol. 6, no. 6, 2023

  47. [47]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023

  48. [48]

    COVID-19: affect recognition through voice analysis during the winter lockdown in Scotland,

    S. de la Fuente Garcia, F. Haider, and S. Luz, “COVID-19: affect recognition through voice analysis during the winter lockdown in Scotland,” in Proc. IEEE EMBC, 2021, pp. 2326–2329

  49. [49]

    PsyVoiD—investigating the relationship between spontaneous speech features and psychology in the context of the COVID-19 pandemic and lockdown,

    S. de la Fuente Garcia and S. Luz, “PsyVoiD—investigating the relationship between spontaneous speech features and psychology in the context of the COVID-19 pandemic and lockdown,” 2023, dataset, University of Edinburgh

  50. [50]

    The validity of the Hospital Anxiety and Depression Scale: an updated literature review,

    I. Bjelland, A. A. Dahl, T. T. Haug, and D. Neckelmann, “The validity of the Hospital Anxiety and Depression Scale: an updated literature review,” J. Psychosomatic Research, vol. 52, no. 2, pp. 69–77, 2002

  51. [51]

    International experiences with the Hospital Anxiety and Depression Scale—a review of validation data and clinical results,

    C. Herrmann, “International experiences with the Hospital Anxiety and Depression Scale—a review of validation data and clinical results,” J. Psychosomatic Research, vol. 42, no. 1, pp. 17–41, 1997

  52. [52]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022

  53. [53]

    Intraclass correlations: uses in assessing rater reliability,

    P. E. Shrout and J. L. Fleiss, “Intraclass correlations: uses in assessing rater reliability,” Psychological Bulletin, vol. 86, no. 2, pp. 420–428, 1979

  54. [54]

    A guideline of selecting and reporting intraclass correlation coefficients for reliability research,

    T. K. Koo and M. Y. Li, “A guideline of selecting and reporting intraclass correlation coefficients for reliability research,” J. Chiropractic Medicine, vol. 15, no. 2, pp. 155–163, 2016

  55. [55]

    Phi-4 Technical Report

    M. Abdin et al., “Phi-4 technical report,” arXiv preprint arXiv:2412.08905, 2024

  56. [56]

    Gemma 2: Improving Open Language Models at a Practical Size

    Google Team et al., “Gemma 2: improving open language models at a practical size,” arXiv preprint arXiv:2408.00118, 2024

  57. [57]

    The Llama 3 Herd of Models

    A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024