When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

Jianfeng Zhu; Karin G. Coifman; Megan Korhummel; Ruoming Jin

arxiv: 2605.23148 · v2 · pith:5N2QGTRXnew · submitted 2026-05-22 · 💻 cs.CL · cs.CY

Pith reviewed 2026-05-25 05:05 UTC · model grok-4.3

keywords large language models, psychiatric screening, evidence weighting, false negatives, anxiety disorder, major depressive disorder, post-traumatic stress disorder, clinical validation

The pith

Large language models discount explicit psychiatric symptoms when patients show preserved functioning or protective context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates five LLMs on a benchmark of 555 SCID-anchored interviews for anxiety, depression, PTSD, and any current mental health disorder using zero-shot prompting. Accuracy ranges from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38, so agreement with the reference labels is modest, and models produce false negatives for anxiety and PTSD even when symptom evidence is present. These errors occur more often when interviews also contain descriptions of intact daily functioning, coping ability, or social support. Adding functional-impairment details shifts outputs toward positive classifications while protective-context details shift them away. The authors conclude that LLMs could enable scalable screening but that this evidence-weighting pattern requires validation prior to clinical use.

Core claim

Using zero-shot task-specific prompting on five state-of-the-art LLMs and a dataset of 555 semi-structured experiential interviews with diagnostic reference labels, false-negative anxiety and PTSD classifications frequently contain explicit symptom evidence accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifts model outputs toward positive classifications, whereas protective-context evidence shifts outputs away from positive classifications.

What carries the argument

The evidence-integration analysis that examines how model classification outputs change when symptom evidence is accompanied by functional-impairment cues versus protective-context cues in the interview transcripts.

If this is right

GPT-4.1 Mini and GPT-5 Mini show the most consistent disorder-specific accuracy across tasks.
Depression classification accuracy is higher for male than female participants.
No consistent age-related accuracy pattern appears across models.
Modest non-uniform accuracy variation occurs across race strata.
LLMs may support scalable psychiatric screening only after the observed evidence-weighting pattern receives clinical validation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weighting pattern could appear in other clinical tasks where overall life adjustment is mentioned alongside specific symptoms.
Prompt engineering that explicitly instructs models to ignore functioning and context might reduce the observed false-negative rate.
If the pattern stems from training data that links diagnoses to impairment, fine-tuning on symptom-only examples could alter model behavior.

Load-bearing premise

Shifts in model outputs can be reliably attributed to differential weighting of symptom, functional-impairment, and protective-context evidence rather than to prompt phrasing, model architecture, or other unmeasured factors.

What would settle it

A controlled test that takes the same interview transcripts, systematically inserts or removes sentences describing functional impairment or protective context, and checks whether the models produce consistent directional changes in positive versus negative classifications.
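The controlled test described above can be sketched as a small perturbation harness. This is a minimal illustration, not the paper's method: `add_cue`, `flip_counts`, and `toy_classifier` are hypothetical names, and the keyword-based `toy_classifier` is only a stand-in so the sketch runs without an LLM. In a real test, `classify` would wrap a model call with the prompt held fixed.

```python
from typing import Callable, Dict, Iterable


def add_cue(transcript: str, cue: str) -> str:
    """Append one cue sentence while leaving the rest of the transcript fixed."""
    return transcript.rstrip() + " " + cue


def flip_counts(
    classify: Callable[[str], bool],
    transcripts: Iterable[str],
    cue: str,
) -> Dict[str, int]:
    """Count how classifications change when the same cue is added to each transcript."""
    counts = {"neg_to_pos": 0, "pos_to_neg": 0, "unchanged": 0}
    for text in transcripts:
        before = classify(text)
        after = classify(add_cue(text, cue))
        if before == after:
            counts["unchanged"] += 1
        elif after:
            counts["neg_to_pos"] += 1
        else:
            counts["pos_to_neg"] += 1
    return counts


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for an LLM: a keyword score, NOT a real screening model."""
    lowered = text.lower()
    score = 0
    if "panic" in lowered or "nightmares" in lowered:
        score += 1  # symptom cue
    if "missed work" in lowered:
        score += 1  # functional-impairment cue
    if "friends support me" in lowered:
        score -= 1  # protective-context cue
    return score >= 1


# Toy transcripts invented for illustration, not drawn from the benchmark.
transcripts = [
    "I have panic attacks most days.",
    "I sleep fine and feel calm.",
]
protective = flip_counts(toy_classifier, transcripts, "My friends support me and I cope well.")
impairment = flip_counts(toy_classifier, transcripts, "I missed work three times this month.")
print("protective cue:", protective)
print("impairment cue:", impairment)
```

Consistent directional flips across many transcripts (impairment cues producing mostly `neg_to_pos`, protective cues mostly `pos_to_neg`) would support the weighting interpretation; mixed or absent flips would point toward other explanations.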

Figures

Figures reproduced from arXiv: 2605.23148 by Jianfeng Zhu, Karin G. Coifman, Megan Korhummel, Ruoming Jin.

**Figure 3.** Age-Stratified Model Accuracy Across Mental Health Screening Tasks. Within age strata, Cochran’s Q tests indicated significant model-level accuracy differences for most screening tasks. Significant differences were observed for anxiety and depression in all three age groups, for PTSD in the 18–44 and 45–64 groups, and for any current mental health disorder in the 18–44 and 45–64 groups. Differences were not…

**Figure 5.** False-negative errors were not uniformly characterized by absence of symptom evidence. For anxiety and PTSD, false-negative cases contained higher symptom-evidence counts than true-positive cases, with the largest difference observed for PTSD. False-negative anxiety cases also contained higher protective-context evidence, suggesting that symptom language was often accompanied by preserved functi…

**Figure 6.** Evidence-Domain Coefficients Predicting Model Outputs and SCID-Derived Labels. From Section 3.2.3, Participant-Level Evidence Contributions: To visualize participant-level evidence contributions, we generated SHAP-style beeswarm plots from standardized logit contributions derived from the fitted logistic regression models. As shown in…
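The Figure 6 caption mentions standardized logit contributions without defining them in this excerpt. One common reading, assumed here, is that each participant's contribution for an evidence domain is the fitted coefficient times that participant's standardized (z-scored) evidence count. The sketch below follows that assumption; the function names, coefficients, and counts are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a list of counts using the population standard deviation."""
    mu = mean(column)
    sd = pstdev(column)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in column]


def logit_contributions(features, coefs):
    """Per-participant contribution to the logit: coefficient * standardized value.

    Assumption: this is one plausible definition of a 'standardized logit
    contribution'; the paper's exact definition is not given in this excerpt.
    """
    z = {name: standardize(values) for name, values in features.items()}
    n = len(next(iter(features.values())))
    return [{name: coefs[name] * z[name][i] for name in features} for i in range(n)]


# Hypothetical evidence counts for three participants.
features = {
    "symptom": [0, 2, 4],
    "impairment": [1, 1, 4],
    "protective": [3, 0, 0],
}
# Hypothetical fitted coefficients (positive pushes toward a positive label).
coefs = {"symptom": 0.5, "impairment": 0.8, "protective": -0.6}

for i, row in enumerate(logit_contributions(features, coefs)):
    print(i, {name: round(value, 3) for name, value in row.items()})
```

Plotting these per-participant values by domain, one point per participant, gives the kind of beeswarm view the caption describes; a negative protective-context coefficient pulls the logit down for participants with above-average protective evidence.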

Original abstract

As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder, and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted outputs away. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
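The abstract reports both accuracy and the Matthews correlation coefficient (MCC). A short sketch of how the two are computed from a binary confusion matrix shows why they can diverge when false negatives dominate. The labels below are toy values invented for illustration, not data from the paper, and the helper names are assumptions.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disorder present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy labels: four true positives in the reference, three of them missed.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(reference, predicted)
print("accuracy:", round(accuracy(*counts), 3))
print("MCC:", round(matthews_corrcoef(*counts), 3))
```

On these toy labels accuracy is 0.70 while MCC is about 0.41, because three of the four positive cases are missed even though every negative is classified correctly.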

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New SCID benchmark of 555 interviews plus error analysis on evidence types is the useful part; the weighting attribution is suggestive but not isolated from other factors.

The main takeaway is a new benchmark of 555 SCID-anchored interviews for anxiety, depression, PTSD, and any mental health disorder, paired with zero-shot evaluations of five LLMs and a breakdown of false negatives by symptom, impairment, and protective-context evidence. Accuracies range from 0.49 to 0.86 with some model consistency on specific disorders, plus subgroup notes such as higher depression accuracy for males. The error patterns show that false negatives often include explicit symptoms alongside preserved functioning or support, and that outputs shift toward positives with functional-impairment cues and away from them with protective cues. That dataset and the targeted error analysis are the concrete additions beyond standard accuracy tables in earlier LLM mental health papers. The numbers and subgroup splits are reported plainly enough to be usable.

The softer spot is the central interpretation. The abstract ties the false negatives to differential weighting, but without visible details on blinded evidence coding, matched prompt variants, or controls for text length, lexical overlap, or other correlations, the shifts could stem from other interview features or prompt effects. The patterns exist in the outputs, yet the causal link to weighting stays preliminary.

This is aimed at researchers building or testing AI screening tools in psychiatry. Someone in that niche would find the benchmark and the cautionary error examples worth checking. It has enough new data and a clear applied question to merit a serious referee who can examine the methods section and ask for tighter controls on the evidence analysis.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a benchmark of 555 SCID-anchored semi-structured experiential interviews with diagnostic labels for anxiety disorder, MDD, PTSD, and any current mental health disorder. Using zero-shot task-specific prompting, it evaluates five LLMs, reporting accuracies of 0.49–0.86 and MCC values of 0.16–0.38 (with GPT-4.1 Mini and GPT-5 Mini most consistent), documents subgroup differences (e.g., higher depression accuracy in males), and analyzes false-negative cases to claim that models discount symptom evidence in the presence of preserved functioning or protective context while functional-impairment cues shift outputs positive and protective cues shift them negative.

Significance. If the central attribution holds, the work supplies a large anchored benchmark and concrete error-pattern findings that illuminate how LLMs integrate (or fail to integrate) symptom versus contextual evidence in psychiatric screening. This has direct relevance for scalable mental-health tools and underscores the need for validation focused on evidence-use patterns rather than aggregate accuracy alone. The SCID grounding and subgroup reporting are clear strengths.

major comments (3)

[Evidence-integration analyses] Evidence-integration analyses (abstract and corresponding results section): the claim that false-negative anxiety/PTSD cases 'often contained explicit symptom evidence' accompanied by preserved functioning or protective context, with functional-impairment shifting outputs positive and protective context shifting them negative, rests on an unelaborated coding procedure. No information is supplied on whether evidence types were identified via blinded independent coders, a pre-specified protocol, or inter-rater reliability metrics; this is load-bearing for the differential-weighting interpretation.
[Methods] Methods (zero-shot prompting and evidence analyses): the manuscript provides no matched prompt variants that hold all other text fixed while varying only one cue type, nor statistical controls for interview length, lexical overlap, or demographic confounders. Without these, observed output shifts cannot be isolated from prompt phrasing or text correlations, undermining the causal link to evidence-weighting.
[Results] Results (subgroup and evidence-integration sections): accuracy differences by sex (and modest race variation) are reported separately from the evidence-weighting patterns; the paper does not cross these analyses, so it remains unclear whether the discounting of symptoms in protective contexts varies systematically across demographic strata.
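The first major comment asks for inter-rater reliability on the evidence coding. A minimal sketch of one standard metric, Cohen's kappa for two coders, is shown below. The segment labels are hypothetical and the function name is an assumption; the manuscript as summarized here reports no such computation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two coders over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical evidence-domain codes for six transcript segments.
coder_a = ["symptom", "symptom", "impairment", "protective", "protective", "symptom"]
coder_b = ["symptom", "impairment", "impairment", "protective", "symptom", "symptom"]

print("kappa:", round(cohens_kappa(coder_a, coder_b), 3))
```

Here the coders agree on 4 of 6 segments (observed agreement 0.667) against a chance agreement of 13/36, giving a kappa of 11/23, or about 0.478. Reporting a value like this per evidence domain would let readers judge how stable the symptom, impairment, and protective-context codes are.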

minor comments (2)

[Abstract] Abstract: model names 'GPT-4.1 Mini' and 'GPT-5 Mini' are non-standard; clarify the exact versions evaluated.
[Abstract] Abstract: the reported accuracy and MCC ranges are aggregate; per-task and per-model breakdowns would improve interpretability of the 'most consistent' claim for GPT-4.1 Mini and GPT-5 Mini.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. We provide point-by-point responses below.

Referee: [Evidence-integration analyses] Evidence-integration analyses (abstract and corresponding results section): the claim that false-negative anxiety/PTSD cases 'often contained explicit symptom evidence' accompanied by preserved functioning or protective context, with functional-impairment shifting outputs positive and protective context shifting them negative, rests on an unelaborated coding procedure. No information is supplied on whether evidence types were identified via blinded independent coders, a pre-specified protocol, or inter-rater reliability metrics; this is load-bearing for the differential-weighting interpretation.

Authors: We agree that the coding procedure for evidence types requires elaboration. The evidence-integration analyses involved systematic manual review of the false-negative interview transcripts by the study authors, using a pre-specified set of criteria for identifying symptom evidence (explicit reports of diagnostic criteria), functional impairment (descriptions of work, social, or daily functioning deficits), and protective context (mentions of coping strategies, social support, or resilience factors). No blinded independent coders were used, and formal inter-rater reliability was not calculated; instead, ambiguous cases were discussed among the team to reach consensus. We will revise the Methods section to fully describe this procedure and acknowledge its limitations. revision: yes

Referee: [Methods] Methods (zero-shot prompting and evidence analyses): the manuscript provides no matched prompt variants that hold all other text fixed while varying only one cue type, nor statistical controls for interview length, lexical overlap, or demographic confounders. Without these, observed output shifts cannot be isolated from prompt phrasing or text correlations, undermining the causal link to evidence-weighting.

Authors: The evidence analyses are observational, based on the natural variation in the interview content rather than experimentally manipulated prompts. We did not generate matched prompt variants or apply statistical controls for the mentioned factors. We will add a dedicated limitations paragraph in the Discussion section explaining that these analyses demonstrate associations in real-world interview data but do not establish causality, and that controlled experiments with matched cues would be a valuable extension. revision: partial

Referee: [Results] Results (subgroup and evidence-integration sections): accuracy differences by sex (and modest race variation) are reported separately from the evidence-weighting patterns; the paper does not cross these analyses, so it remains unclear whether the discounting of symptoms in protective contexts varies systematically across demographic strata.

Authors: We will conduct additional analyses to examine whether the evidence-weighting patterns (e.g., symptom discounting in protective contexts) differ by demographic subgroups such as sex and race. This will be added to the Results section, with appropriate caveats regarding sample sizes in some strata. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation of LLM outputs against fixed labels

The paper conducts an empirical benchmark study: 555 SCID-anchored interviews with diagnostic labels are used to evaluate zero-shot LLM prompting performance (accuracy, MCC), subgroup differences, and post-hoc inspection of false-negative cases for presence of symptom vs. functional/protective evidence. No equations, fitted parameters, derivations, or predictions are defined. No self-citations are invoked to justify uniqueness or load-bearing premises. The central analyses are direct comparisons to external reference labels and manual evidence coding; nothing reduces to its own inputs by construction. This is the most common honest non-finding for purely evaluative empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that SCID labels constitute reliable ground truth and that observed output changes can be interpreted as evidence weighting; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: SCID provides reliable diagnostic reference labels for the benchmark.
The study anchors all evaluations to SCID diagnoses without independent verification of label accuracy.

pith-pipeline@v0.9.0 · 5818 in / 1261 out tokens · 27716 ms · 2026-05-25T05:05:01.191413+00:00 · methodology
