arxiv: 2605.09634 · v1 · submitted 2026-05-10 · 💻 cs.CL

Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness

Erfan Loweimi , Sofia de la Fuente Garcia , Samira Loveymi , Hadi Daneshvar , Saturnino Luz

Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3

keywords: LLMs, HADS, mental health screening, consistency, ASR robustness, evidence faithfulness, zero-shot, reliability

The pith

LLMs differ sharply in how reliably they score HADS from noisy speech transcripts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can reliably assign Hospital Anxiety and Depression Scale scores from spoken participant responses in a zero-shot manner. It checks three reliability dimensions: consistency of scores across repeated runs of the same model, how little those scores change when transcripts contain automatic speech recognition errors, and how closely the scores tie to actual keywords in the input that relate to anxiety or depression items. Three models are run on data from 111 English speakers using both clean transcripts and outputs from three Whisper ASR systems. Phi-4 and Gemma-2-9B keep strong consistency above 0.89 ICC with little ASR impact and over 93 percent keyword grounding, while Llama-3.1-8B drops to 0.36 ICC at 10 percent word error rate and 77-81 percent grounding. The work shows that numeric score agreement can hide differences in how much the output actually rests on the evidence.

Core claim

Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR and keyword groundedness above 93 percent; Llama-3.1-8B shows ASR-fragile consistency with ICC falling from 0.82 to 0.36 at 10 percent WER and groundedness of 77-81 percent, revealing score-evidence dissociation even as predictive validity holds for the stronger models.
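The 10 percent WER figure in the claim above is a word-level edit distance divided by the length of the reference transcript. A minimal sketch of that standard computation follows. The lowercase-and-split tokenization, the function name `word_error_rate`, and the example sentences are assumptions for illustration; the paper's own text normalization is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented ten-word example with one substituted word ("cannot" -> "can").
reference = "i feel worried most of the time and cannot relax"
hypothesis = "i feel worried most of the time and can relax"
wer = word_error_rate(reference, hypothesis)
```

One wrong word in ten already gives the 10 percent level at which Llama-3.1-8B's consistency is reported to collapse, which is why the choice of ASR system matters here.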

What carries the argument

Intra-model consistency via intraclass correlation coefficient across three runs, robustness tested with Whisper Large/Medium/Small at varying word error rates, and keyword groundedness as proxy for evidence faithfulness in zero-shot HADS estimation from speech.
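To make the consistency number concrete, here is a minimal sketch of ICC(3,1) (two-way mixed, single measures, consistency) computed from two-way ANOVA mean squares as (MS_rows - MS_error) / (MS_rows + (k - 1) MS_error), with participants as rows and the k repeated runs as columns. The function name and the toy score matrix are illustrative assumptions, not the authors' code or data.

```python
def icc_3_1(scores):
    """ICC(3,1): two-way mixed, single measures, consistency.

    scores: list of n rows (participants), each holding k values (runs).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between runs
    ss_error = ss_total - ss_rows - ss_cols                 # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Invented scores: 4 participants x 3 runs. Each run differs from the others
# by a constant offset, which a consistency-type ICC ignores.
toy_scores = [[3, 4, 5], [8, 9, 10], [12, 13, 14], [6, 7, 8]]
icc = icc_3_1(toy_scores)
```

Because the consistency form removes the between-run term, a model that is uniformly one point higher on every rerun still scores 1.0 here; only run-to-run reshuffling of participants lowers the value.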

If this is right

Predictive validity of the HADS estimates remains largely preserved under ASR errors for the consistent models.
Inter-model agreement at the score level greatly exceeds agreement on the specific keywords used, indicating divergent reasoning paths.
Models with fragile consistency and lower groundedness may produce numeric outputs unsuitable for clinical settings where evidence matters.
High-groundedness models could enable more interpretable automated voice-based screening without major loss from real-world transcription noise.
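The gap between score-level and keyword-level agreement can be illustrated with a simple set-overlap measure. The sketch below uses the Jaccard index averaged over model pairs; this is an assumed stand-in, since the page does not say which agreement statistic the paper uses, and the model names and keyword lists are invented.

```python
from itertools import combinations


def jaccard(keywords_a, keywords_b):
    """Share of distinct keywords that two lists have in common."""
    set_a = {k.strip().lower() for k in keywords_a}
    set_b = {k.strip().lower() for k in keywords_b}
    union = set_a | set_b
    if not union:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(union)


def mean_pairwise_jaccard(keywords_by_model):
    """Average Jaccard overlap across all unordered pairs of models."""
    pairs = list(combinations(sorted(keywords_by_model), 2))
    if not pairs:
        raise ValueError("need keyword lists from at least two models")
    total = sum(jaccard(keywords_by_model[a], keywords_by_model[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical keyword justifications for one participant.
keywords_by_model = {
    "model_a": ["erm", "worry", "tense"],
    "model_b": ["worry", "tired"],
    "model_c": ["erm", "worry"],
}
agreement = mean_pairwise_jaccard(keywords_by_model)
```

Three models could assign this participant the same anxiety score while sharing well under half of their cited keywords, which is the pattern the dissociation bullet describes.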

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If keyword counts prove too crude a faithfulness check, full human rating of model explanations would be required to confirm the dissociation findings.
The pattern of model differences may extend to other clinical questionnaires or languages, calling for parallel ASR-robustness tests on those tasks.
Prompts that force explicit evidence citation could narrow the score-evidence gap in weaker models like Llama-3.1.
Deployment decisions for speech-based mental health tools should prioritize models that keep both consistency and grounding rather than relying on numeric agreement alone.

Load-bearing premise

That matching HADS-related keywords in the transcript is a sufficient and unbiased way to judge whether the model's assigned score is faithfully supported by the evidence.
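A minimal sketch of what a keyword-matching check might look like helps show why the premise is load-bearing. The version below assumes groundedness is the share of model-cited keywords that occur verbatim (after lowercasing and punctuation stripping) in the transcript; the exact matching rule in the paper is not given here, and the function names and example text are invented.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and keep only word-like tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def groundedness(cited_keywords, transcript):
    """Fraction of cited keywords found as whole words or phrases in the transcript."""
    keywords = [_normalize(k) for k in cited_keywords]
    keywords = [k for k in keywords if k]
    if not keywords:
        raise ValueError("no usable keywords were cited")
    padded = f" {_normalize(transcript)} "
    hits = sum(f" {k} " in padded for k in keywords)
    return hits / len(keywords)


# Invented example: "hopeless" is cited by the model but never spoken.
transcript = "Erm, I can't relax and I worry all the time."
cited = ["erm", "worry", "hopeless", "can't relax"]
score = groundedness(cited, transcript)
```

A check of this kind only verifies that the cited words exist in the input. It says nothing about whether those words drove the numeric score, which is the limitation the premise above points to.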

What would settle it

If independent clinicians review the full model outputs and transcripts and find that high keyword groundedness does not match actual faithful reasoning or that low-groundedness scores are still clinically accurate.

Figures

Figures reproduced from arXiv: 2605.09634 by Erfan Loweimi, Hadi Daneshvar, Samira Loveymi, Saturnino Luz, Sofia de la Fuente Garcia.

**Figure 1.** Zero-shot prompt used for HADS score estimation, integrating role specification, step-by-step decomposition, score prediction, and keyword justification. Adapted from [26].

**Figure 2.** Top: Predictive validity (Spearman ρs with HADS ground truth) vs WER. Bottom: Intra-model consistency (ICC) vs WER. Note the sharp ICC decline for Llama-3.1-8B at higher WER, contrasting the stability of Phi-4 and Gemma-2-9B.

**Figure 3.** Top 15 keywords per model for HADS-Anxiety (GT transcripts). Note the dominance of “erm” across models and divergent keyword vocabularies underlying similar predictions.

Original abstract

LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
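Predictive validity in this paper is reported as a Spearman rank correlation between predicted and self-reported HADS scores (see Figure 2). A minimal sketch of that statistic is below: average ranks for ties, then a Pearson correlation on the ranks. Function names and the toy scores are assumptions for illustration only.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented HADS-style scores on the 0-21 scale for five participants.
self_reported = [2, 5, 9, 12, 17]
predicted = [4, 3, 10, 11, 15]
rho = spearman(self_reported, predicted)
```

Because only the ordering matters, a model can keep a similar rho under ASR noise even when its individual scores shift, which is one reason validity and consistency are tracked separately.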

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Phi-4 and Gemma-2-9B hold consistent HADS scores under ASR noise while Llama-3.1-8B collapses, but the claimed score-evidence dissociation rests on a keyword-count proxy that does not prove grounding.

The paper runs three LLMs on HADS scoring from ground-truth transcripts and three Whisper ASR outputs for 111 participants, with three repeats per condition. It reports ICC for consistency, degradation under increasing WER, preserved predictive validity for the stronger models, and a keyword-overlap measure for how often outputs mention HADS-related terms. The headline result is the split: Phi-4 and Gemma-2-9B stay above 0.89 ICC with minimal ASR impact, Llama-3.1-8B drops from 0.82 to 0.36 at 10% WER, and keyword rates are 93%+ for the first two but 77-81% for Llama, with lower inter-model keyword agreement than score agreement. That dissociation is the concrete new observation.

The repeated runs and multiple ASR conditions give the numbers some weight, and the participant count is reasonable for this kind of comparison. The work is a direct empirical check rather than a new method or theory.

The soft spot is the faithfulness metric. Keyword presence in the output does not show that the numeric score was actually conditioned on transcript content; zero-shot models can emit the terms independently. The paper offers no human faithfulness ratings, no ablation that removes the keywords, and no check that high overlap tracks real evidence use. The dissociation could therefore be an artifact of the proxy. The HADS labels themselves are taken as given without extra validation steps shown.

This is useful for groups already running LLM evaluations on clinical speech tasks or testing robustness to transcription noise. Readers who need numbers on which current models survive realistic ASR will get something from it. It is coherent on its own terms and deserves a serious referee, though the review should press on whether the keyword proxy supports the clinical-interpretability claim.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three LLMs (Phi-4, Gemma-2-9B, Llama-3.1-8B) for zero-shot HADS score estimation from speech on 111 participants. It uses ground-truth transcripts and three Whisper ASR outputs, with three runs per model-condition pair, to assess intra-model consistency (ICC), ASR robustness, predictive validity, and evidence faithfulness via keyword groundedness. Findings show ICC > 0.89 with minimal ASR degradation for Phi-4 and Gemma-2-9B, ICC drop from 0.82 to 0.36 at 10% WER for Llama-3.1-8B, preserved predictive validity for robust models, and groundedness >93% vs. 77-81%, indicating score-evidence dissociation.

Significance. If the results hold, the work supplies useful empirical benchmarks on LLM reliability for mental health screening. The design with repeated runs, multiple ASR conditions, and concrete metrics on 111 participants enables assessment of consistency and robustness. Model-specific differences and the dissociation observation carry implications for clinical interpretability and safe deployment.

major comments (2)

[Results (keyword groundedness)] Results section on keyword groundedness: The evidence faithfulness claim and score-evidence dissociation rest on a keyword-matching proxy that counts pre-defined HADS-related terms in outputs (>93% for Phi-4/Gemma-2-9B, 77-81% for Llama-3.1-8B). This proxy does not establish that numeric scores are conditioned on transcript content rather than priors or prompt artifacts, especially in zero-shot prompting. The manuscript includes no human faithfulness ratings, keyword ablations, or checks correlating groundedness with transcript-grounded reasoning, so the dissociation may be an artifact of the metric.
[Methods] Methods section (experimental setup): The paper does not detail the exact zero-shot prompts, post-processing or exclusion rules for outputs, or validation of HADS ground-truth labels (e.g., self-report reliability). These omissions affect reproducibility of the ICC drops and predictive validity results, and could alter interpretation of ASR fragility for Llama-3.1-8B.

minor comments (2)

[Abstract] Abstract: The statement that 'predictive validity is largely preserved' should include the specific metric and values (e.g., correlation or MAE) for precision.
[Results] Tables/figures: Label all ASR conditions (Large/Medium/Small) and runs clearly, and include confidence intervals for ICC values.
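One way to produce the confidence intervals requested in the last comment is a percentile bootstrap over participants. The sketch below is an assumed approach, not something the paper reports: it resamples participants with replacement, recomputes ICC(3,1) each time, and reads off the 2.5th and 97.5th percentiles. All names and the toy scores are hypothetical.

```python
import random


def icc_3_1(scores):
    """ICC(3,1) from two-way ANOVA mean squares; None if it is undefined."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    ss_rows = k * sum((sum(r) / k - grand) ** 2 for r in scores)
    ss_cols = n * sum((sum(r[j] for r in scores) / n - grand) ** 2 for j in range(k))
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_error
    return None if denom <= 1e-12 else (ms_rows - ms_error) / denom


def bootstrap_icc_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling participants (rows)."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_boot):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        value = icc_3_1(resample)
        if value is not None:  # skip degenerate resamples with no variance
            estimates.append(value)
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lower = estimates[int((alpha / 2) * (len(estimates) - 1))]
    upper = estimates[int((1 - alpha / 2) * (len(estimates) - 1))]
    return lower, upper


# Invented scores: 8 participants x 3 runs.
toy_scores = [[3, 4, 3], [10, 9, 11], [7, 7, 8], [14, 15, 13],
              [5, 7, 5], [12, 11, 12], [8, 9, 9], [2, 2, 4]]
ci = bootstrap_icc_ci(toy_scores, n_boot=500, seed=1)
```

Resampling whole participants keeps each person's three runs together, so the interval reflects sampling of people rather than of individual scores. Analytic F-based intervals are the more common choice for ICC and would be an equally reasonable answer to the referee.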

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify key aspects of our work on LLM reliability for HADS scoring. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: Results section on keyword groundedness: The evidence faithfulness claim and score-evidence dissociation rest on a keyword-matching proxy that counts pre-defined HADS-related terms in outputs (>93% for Phi-4/Gemma-2-9B, 77-81% for Llama-3.1-8B). This proxy does not establish that numeric scores are conditioned on transcript content rather than priors or prompt artifacts, especially in zero-shot prompting. The manuscript includes no human faithfulness ratings, keyword ablations, or checks correlating groundedness with transcript-grounded reasoning, so the dissociation may be an artifact of the metric.

Authors: We agree that keyword groundedness serves as an objective but limited proxy and does not causally establish that scores are conditioned on transcript content versus model priors in zero-shot settings. The dissociation observation is supported by the large gap between high inter-model score agreement and low keyword agreement, which we interpret as evidence that models may converge on scores via divergent internal processes. To strengthen this, the revised manuscript will explicitly describe the metric as a preliminary proxy, add a dedicated limitations paragraph on its shortcomings, and include a new analysis correlating per-sample groundedness with intra-model consistency (ICC). We will also add a keyword ablation experiment (masking HADS terms in transcripts) to test sensitivity. Human ratings are not feasible to add at this stage without new data collection, but the proxy differences remain informative for the reported model-specific patterns. revision: partial
Referee: Methods section (experimental setup): The paper does not detail the exact zero-shot prompts, post-processing or exclusion rules for outputs, or validation of HADS ground-truth labels (e.g., self-report reliability). These omissions affect reproducibility of the ICC drops and predictive validity results, and could alter interpretation of ASR fragility for Llama-3.1-8B.

Authors: We thank the referee for highlighting these gaps. The revised Methods section will include the full zero-shot prompts, a complete description of post-processing steps and any output exclusion rules (e.g., invalid score filtering), and additional details on HADS ground-truth validation, including citations to established psychometric reliability studies for the self-report instrument. These changes will directly support reproducibility of the ICC and predictive validity findings and clarify the interpretation of Llama-3.1-8B's ASR fragility. revision: yes
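The keyword ablation promised in the first response (masking HADS terms in transcripts) could be set up along the following lines. This is a sketch under stated assumptions: `score_fn` is a hypothetical callable standing in for the full prompt-and-parse LLM pipeline, the `[MASK]` token is an arbitrary choice, and `toy_scorer` exists only so the example runs without a model.

```python
import re


def mask_keywords(transcript, keywords, mask="[MASK]"):
    """Replace whole-word, case-insensitive keyword matches with a mask token."""
    if not keywords:
        return transcript
    # Longest first so multi-word phrases win over their component words.
    alternatives = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(lambda _match: mask, transcript)


def score_shift(score_fn, transcript, keywords):
    """Change in predicted score once the cited keywords are masked out."""
    return score_fn(mask_keywords(transcript, keywords)) - score_fn(transcript)


def toy_scorer(text):
    """Stand-in for an LLM scorer: counts whole-word uses of 'worry'."""
    return len(re.findall(r"\bworry\b", text.lower()))


# Invented transcript; "worrying" is left intact because only whole words match.
transcript = "Erm, I worry a lot; the worrying never stops."
masked = mask_keywords(transcript, ["worry", "erm"])
shift = score_shift(toy_scorer, transcript, ["worry", "erm"])
```

If a model's score barely moves when its own cited keywords are masked, that would suggest the keywords were not what the score rested on. A large shift would support the groundedness proxy, at least for that model.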

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on external data

The paper reports direct empirical results from running three LLMs on 111 participants' ground-truth transcripts and ASR outputs, computing ICC for consistency, degradation under WER, and a keyword-count proxy for groundedness. No equations, fitted parameters, predictions, or derivations appear; no self-citations are invoked to justify uniqueness, ansatzes, or load-bearing premises. All quantities are computed from the external transcripts/ASR outputs and model generations without reducing to quantities defined inside the study itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation study with no mathematical derivations, free parameters, or postulated entities; relies on standard statistical tools (ICC) and off-the-shelf ASR systems treated as black-box inputs.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 4 internal anchors

[1] Introduction. Mental health disorders such as anxiety and depression impact millions worldwide each year [1]. Although early detection is critical for effective intervention [2], traditional screening relies on clinical interviews [3] or the Hospital Anxiety and Depression Scale (HADS) [4], approaches that are resource-intensive, subjective, and difficul...

[2] Are LLM predictions stable across repeated runs, and does ASR affect this stability?

[3] Does ASR degrade predictive validity (correlation with HADS ground truth)?

[4] Are LLM-cited keywords grounded in the transcript, and is keyword evidence stable across runs and models?

[5] Are there model-specific vulnerability patterns that inform deployment decisions? Having introduced the dataset, HADS instrument, and prompt design in Section 2, we present the statistical analysis framework underpinning our evaluation in Section 3. Section 4 describes the experimental setup. Section 5 presents resul...

[6] Data, HADS, and Prompt Design. 2.1. The PsyVoiD corpus. We use the PsyVoiD corpus [35], collected in Scotland (UK) during the COVID-19 lockdown to investigate the relationship between spontaneous English speech and psychological traits. The dataset comprises 111 participants (70 female, 41 male), aged 21–86 (median 62), 34 of whom (31%) report a prior clin...

[7] Statistical Analysis Framework. In this section we describe the measures used to evaluate the three reliability dimensions. All statistical tests are non-parametric, reflecting the ordinal nature of HADS scores. You are a clinical psychologist and linguist, analyzing a spontaneous speech transcript recorded during the Covid-19 lockdown. Your task is to es...

[8] Psychological and Emotional Features
- Detect anxiety-related cues (e.g., excessive worry, nervousness).
- Detect depression-related cues (e.g., lack of motivation, hopelessness).

[9] Linguistic and Behavioural Features
- Identify hesitation markers (e.g., erm, uh, pauses) and their frequency.
- Detect negative self-statements (e.g., I do not feel good, I cannot cope).
- Assess certainty level (confident, unsure, detached).

[10] Psychological Score Predictions
- Predict HADS Anxiety score (0-21 scale, higher indicates greater anxiety).
- Predict HADS Depression score (0-21 scale, higher indicates greater depression).

[11] Justification Using Keywords
- Provide keywords from the transcript that influenced each prediction.
Transcript: {transcript}
Figure 1: Zero-shot prompt used for HADS score estimation, integrating role specification, step-by-step decomposition, score prediction, and keyword justification. Adapted from [26]. 3.1. Intra-model consistency. To assess whether re...

[12] Experimental Setup. 4.1. LLM configurations. We evaluate three open-weight instruction-tuned LLMs, selected to span distinct model families, training pipelines, and parameter scales: Phi-4 (14.7B, Microsoft [41]), Gemma-2-9B (9B, Google [42]), and Llama-3.1-8B-Instruct (8B, Meta [43]). Following [26, 27], each model receives the zero-shot prompt (Section ...

[13] Experimental Results and Discussion. 5.1. Consistency across runs. Table 1 reports ICC(3,1) and Friedman p-values across all model–condition–subscale combinations. Phi-4 and Gemma-2-9B show excellent consistency with minimal degradation across ASR conditions. Phi-4’s ICC ranges from 0.890 to 0.925 across all conditions and subscales (∆max = 0.035). Gemma-2-...
... “erm,” false starts) that Phi-4 over-interprets as anxiety features, as confirmed by keyword frequency analysis where “erm ...

[14] Conclusions and Future Work. We presented the first joint analysis of intra-model consistency, ASR robustness, and keyword evidence faithfulness for LLM-based mental health screening. Phi-4 and Gemma-2-9B demonstrate excellent consistency (ICC > 0.89) and stable predictive validity (ρs = 0.38–0.56) across ASR conditions, whereas Llama-3.1-8B exhibits sev...

[15] N. Salari, A. Hosseinian-Far, R. Jalali, A. Vaisi-Raygani, S. Rasoulpoor, M. Mohammadi, S. Rasoulpoor, and B. Khaledi-Paveh, “Prevalence of stress, anxiety, depression among the general population during the COVID-19 pandemic: a systematic review and meta-analysis,” Globalization and Health, vol. 16, p. 57, 2020.

[16] P. D. McGorry, A. Ratheesh, and B. O’Donoghue, “Early intervention—an implementation challenge for 21st century mental health care,” JAMA Psychiatry, vol. 75, no. 6, pp. 545–546, 2018.

[17] M. Von Korff, S. Shapiro, J. D. Burke, M. Teitlebaum, E. A. Skinner, P. German, R. W. Turner, L. Klein, and B. Burns, “Anxiety and depression in a primary care clinic,” Archives of General Psychiatry, vol. 44, no. 2, pp. 152–156, 1987.

[18] A. S. Zigmond and R. P. Snaith, “The Hospital Anxiety and Depression Scale,” Acta Psychiatrica Scandinavica, vol. 67, no. 6, pp. 361–370, 1983.

[19] J. J. Newson, D. Hunter, and T. C. Thiagarajan, “The heterogeneity of mental health assessment,” Frontiers in Psychiatry, vol. 11, p. 76, 2020.

[20] S. Rude, E.-M. Gortner, and J. Pennebaker, “Language use of depressed and depression-vulnerable college students,” Cognition & Emotion, vol. 18, no. 8, pp. 1121–1133, 2004.

[21] Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” J. Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.

[22] E. C. Stade et al., “Depression and anxiety have distinct and overlapping language patterns: results from a clinical interview,” J. Psychopathology and Clinical Science, 2023.

[23] S. C. Guntuku, D. B. Yaden, M. L. Kern, L. H. Ungar, and J. C. Eichstaedt, “Detecting depression and mental illness on social media: an integrative review,” Current Opinion in Behavioral Sciences, vol. 18, pp. 43–49, 2017.

[24] J. C. Eichstaedt et al., “Facebook language predicts depression in medical records,” Proc. Natl Acad. Sci. USA, vol. 115, no. 44, pp. 11 203–11 208, 2018.

[25] A. Malhotra and R. Jindal, “Deep learning techniques for suicide and depression detection from online social media: a scoping review,” Applied Soft Computing, vol. 130, p. 109713, 2022.

[26] R. A. Calvo, D. N. Milne, M. S. Hussain, and H. Christensen, “Natural language processing in mental health applications using non-clinical texts,” Natural Language Engineering, vol. 23, no. 5, pp. 649–685, 2017.

[27] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, 2015.

[28] S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech Communication, vol. 53, no. 5, pp. 768–785, 2011.

[29] D. M. Low, K. H. Bentley, and S. S. Ghosh, “Automated assessment of psychiatric disorders using speech: a systematic review,” Laryngoscope Investigative Otolaryngology, vol. 5, no. 1, pp. 96–116, 2020.

[30] T. Al Hanai, M. Ghassemi, and J. Glass, “Detecting depression with audio/text sequence modeling of interviews,” in Proc. Interspeech, 2018, pp. 1716–1720.

[31] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901.

[32] X. Lan et al., “Depression detection on social media with large language models,” arXiv preprint arXiv:2403.10750, 2024.

[33] Y. Tao et al., “Classifying anxiety and depression through LLMs virtual interactions: a case study with ChatGPT,” in Proc. IEEE BIBM, 2023, pp. 2259–2264.

[34] S. Xu et al., “Identifying psychiatric manifestations in outpatients with depression and anxiety: a large language model-based approach,” medRxiv, 2025.

[35] S. K. Lho et al., “Large language models and text embeddings for detecting depression and suicide in patient narratives,” JAMA Network Open, vol. 8, no. 5, 2025.

[36] Z. Guo et al., “Large language models for mental health applications: systematic review,” JMIR Mental Health, vol. 11, no. 1, p. e57400, 2024.

[37] J. M. Liu et al., “Enhanced large language models for effective screening of depression and anxiety,” arXiv preprint arXiv:2501.08769, 2025.

[38] B. Lamichhane, “Evaluation of ChatGPT for NLP-based mental health applications,” arXiv preprint arXiv:2303.15727, 2023.

[39] K. Yang et al., “Towards interpretable mental health analysis with large language models,” in Proc. EMNLP, 2024.

[40] E. Loweimi, S. de la Fuente Garcia, and S. Luz, “Zero-shot speech-based depression and anxiety assessment with LLMs,” in Proc. Interspeech, 2025, pp. 489–493.

[41] E. Loweimi, S. de la Fuente Garcia, and S. Luz, “Zero-shot speech-based mental health and affective state assessment using LLMs,” IEEE J. Selected Topics in Signal Processing, 2025, under review.

[42] K. Roy, H. Surana, D. Mullen, K. Haut, J. Flint, and J. Baxter, “Large language models for mental health diagnostic assessments: exploring the potential of LLMs for assisting with mental health diagnostic assessments,” arXiv preprint arXiv:2501.01305, 2025.

[43] S. K. Ernala et al., “Methodological gaps in predicting mental health states from social media,” in Proc. CHI, 2019.

[44] M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting,” in Proc. NeurIPS, 2024.

[45] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.

[46] D. W. Joyce, G. Aref-Adib, A. Meyer, B. Shivaprasad, and S. Abrahams, “Explainable artificial intelligence for mental health through transparency and interpretability for understandability,” npj Digital Medicine, vol. 6, no. 6, 2023.

[47] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023.

[48] S. de la Fuente Garcia, F. Haider, and S. Luz, “COVID-19: affect recognition through voice analysis during the winter lockdown in Scotland,” in Proc. IEEE EMBC, 2021, pp. 2326–2329.

[49] S. de la Fuente Garcia and S. Luz, “PsyVoiD—investigating the relationship between spontaneous speech features and psychology in the context of the COVID-19 pandemic and lockdown,” 2023, dataset, University of Edinburgh.

[50] I. Bjelland, A. A. Dahl, T. T. Haug, and D. Neckelmann, “The validity of the Hospital Anxiety and Depression Scale: an updated literature review,” J. Psychosomatic Research, vol. 52, no. 2, pp. 69–77, 2002.

[51] C. Herrmann, “International experiences with the Hospital Anxiety and Depression Scale—a review of validation data and clinical results,” J. Psychosomatic Research, vol. 42, no. 1, pp. 17–41, 1997.

[52] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.

[53] P. E. Shrout and J. L. Fleiss, “Intraclass correlations: uses in assessing rater reliability,” Psychological Bulletin, vol. 86, no. 2, pp. 420–428, 1979.

[54] T. K. Koo and M. Y. Li, “A guideline of selecting and reporting intraclass correlation coefficients for reliability research,” J. Chiropractic Medicine, vol. 15, no. 2, pp. 155–163, 2016.

[55] M. Abdin et al., “Phi-4 technical report,” arXiv preprint arXiv:2412.08905, 2024.

[56] Google Team et al., “Gemma 2: improving open language models at a practical size,” arXiv preprint arXiv:2408.00118, 2024.

[57] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.