LLAMADRS: Evaluating Open-Source LLMs on Real Clinical Interviews--To Reason or Not to Reason?

Einat Liebenthal; Fernando De la Torre; Gaoussou Youssouf Kebe; Jeffrey M. Girard; Justin Baker; Louis-Philippe Morency

arxiv: 2501.03624 · v2 · submitted 2025-01-07 · 💻 cs.HC · cs.CL

Gaoussou Youssouf Kebe, Jeffrey M. Girard, Einat Liebenthal, Justin Baker, Fernando De la Torre, Louis-Philippe Morency

Pith reviewed 2026-05-23 06:30 UTC · model grok-4.3

keywords: LLMs · clinical assessment · psychiatric interviews · reasoning models · prompt design · item-level accuracy · CAMI corpus · semi-structured prediction

The pith

Strong open-source LLMs reach expert-level accuracy on psychiatric interview scoring when symptoms are rated one at a time before summing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark called LlaMADRS from 541 real psychiatric interview sessions that carry 5,804 expert annotations. It tests 25 open-source models ranging from 0.6B to 400B parameters and shows that the strongest ones produce item-level scores whose remaining errors stay below levels that would matter in clinical practice. Breaking the task into separate model calls for each symptom and then adding the results reduces error compared with asking the model for a total score in one step. Models labeled as reasoning do not outperform standard models once the prompt includes clear structure and examples; instead, gains come from prompt design, model scale, and longer reasoning traces when they appear. The work supplies direct evidence on when decomposition helps and when scale matters for semi-structured clinical prediction.

Core claim

Strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds on the LlaMADRS benchmark. An Item-then-Sum strategy that assesses symptoms individually through discrete LLM calls before synthesizing final scores significantly reduces error relative to Direct Total Score prediction across most model architectures and scales. Performance gains attributed to reasoning depend fundamentally on prompt design, with standard models equipped with structured task definitions and examples matching reasoning-augmented counterparts, while longer reasoning traces and higher model scale reduce error across both types.

What carries the argument

The Item-then-Sum (ItS) strategy, which decomposes the clinical scoring task into separate LLM calls for each symptom before summing, compared against Direct Total Score (DTS) prediction on the LlaMADRS benchmark built from the CAMI corpus.
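The contrast between the two strategies can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the item list assumes the standard ten-item MADRS with each item rated 0 to 6, and `score_item`, `score_total`, and `fake_item_scorer` are hypothetical stand-ins for whatever LLM client and prompts are actually used.

```python
from typing import Callable

# Assumption: the standard ten MADRS items, each rated 0-6 (total 0-60).
MADRS_ITEMS = [
    "apparent sadness", "reported sadness", "inner tension", "reduced sleep",
    "reduced appetite", "concentration difficulties", "lassitude",
    "inability to feel", "pessimistic thoughts", "suicidal thoughts",
]


def clamp(value: int, low: int, high: int) -> int:
    """Keep a model output inside the valid score range."""
    return max(low, min(high, value))


def item_then_sum(transcript: str, score_item: Callable[[str, str], int]) -> int:
    """ItS: one separate call per item, then add the clamped item scores."""
    return sum(clamp(score_item(transcript, item), 0, 6) for item in MADRS_ITEMS)


def direct_total_score(transcript: str, score_total: Callable[[str], int]) -> int:
    """DTS: a single call that is asked for the total score directly."""
    return clamp(score_total(transcript), 0, 6 * len(MADRS_ITEMS))


def fake_item_scorer(transcript: str, item: str) -> int:
    """Hypothetical stub in place of an LLM call: 3 if the item is named, else 1."""
    return 3 if item in transcript.lower() else 1


transcript = "Patient describes reduced sleep and inner tension."
its_total = item_then_sum(transcript, fake_item_scorer)
dts_total = direct_total_score(transcript, lambda text: 75)  # out-of-range stub output
print(its_total, dts_total)
```

Clamping each item before summing is a design choice made here for the sketch; it keeps a single malformed item response from distorting the total, which a one-shot total prediction cannot guard against in the same way.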

If this is right

Strong open-source models can be used for structured clinical assessment from dialogue without exceeding clinically relevant error bounds.
Decomposing symptom assessment into separate calls outperforms asking for a total score directly for most model architectures and sizes.
Prompt structure and examples explain most of the benefit previously credited to reasoning models on this task.
Model scale and longer reasoning traces each contribute measurable error reduction on semi-structured clinical scoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the Item-then-Sum pattern holds for other dialogue-based judgments, clinical systems may be built more reliably by chaining many targeted calls than by one large query.
The same decomposition approach could be tested on non-clinical semi-structured prediction tasks such as legal or educational interviews.
Future work could measure how much the reported thresholds change when the same models are evaluated on interviews from different clinical sites or diagnostic traditions.

Load-bearing premise

The expert annotations in the CAMI corpus serve as reliable and unbiased ground truth for the reported accuracy thresholds.

What would settle it

Independent re-annotation of a held-out subset of the CAMI sessions by new experts, followed by re-running the models and checking whether item-level errors still remain below the stated clinical thresholds under the new labels.
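A re-annotation check of this kind reduces to comparing item-level error under the two label sets against a threshold. The sketch below assumes mean absolute error as the metric and takes the threshold as a parameter, since the summary does not state the clinical threshold; all numbers are made-up toy values, not results from the paper.

```python
from statistics import mean


def item_mae(predictions, labels):
    """Mean absolute error between predicted and annotated item scores."""
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return mean(abs(p - y) for p, y in zip(predictions, labels))


def check_under_relabeling(predictions, original_labels, new_labels, threshold):
    """Report error under both label sets and whether the claim survives."""
    mae_original = item_mae(predictions, original_labels)
    mae_new = item_mae(predictions, new_labels)
    return {
        "mae_original": mae_original,
        "mae_new": mae_new,
        "holds_under_new_labels": mae_new < threshold,
    }


# Toy values only; the threshold of 1.0 is a placeholder, not the paper's figure.
report = check_under_relabeling(
    predictions=[2, 3, 1, 4],
    original_labels=[2, 4, 1, 4],
    new_labels=[3, 4, 1, 3],
    threshold=1.0,
)
print(report)
```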

Original abstract

Large language models (LLMs) excel on many NLP benchmarks, but their behavior on real-world, semi-structured prediction remains underexplored. We present LlaMADRS, a benchmark for structured clinical assessment from dialogue built on the CAMI corpus of psychiatric interviews, comprising 5,804 expert annotations across 541 sessions. We evaluate 25 open-source models (standard and reasoning-augmented; 0.6B--400B parameters) and generate over 400,000 predictions. Our results demonstrate that strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds. Additionally, an Item-then-Sum (ItS) strategy, assessing symptoms individually through discrete LLM calls before synthesizing final scores, significantly reduces error relative to Direct Total Score (DTS) prediction across most model architectures and scales, despite reasoning models attempting similar decomposition in the reasoning traces of their DTS predictions. In fact, we find that performance gains attributed to "reasoning" depend fundamentally on prompt design: standard models equipped with structured task definitions and examples match reasoning-augmented counterparts. Among the latter, longer reasoning traces correlate with reduced error; while higher model scale does across both architectures. Our results clarify when and why reasoning helps and offer actionable guidance for deploying LLMs in semi-structured clinical assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core finding is that item-by-item prompting on real clinical interviews beats direct total-score prediction for open-source LLMs, and that prompt structure explains most of the gains usually credited to reasoning models.

The punchline is that this work gives a large-scale empirical check on LLMs doing structured symptom scoring from actual psychiatric interview transcripts. The authors introduce the LlaMADRS benchmark on the CAMI corpus, run 25 models across sizes and types, and show that breaking the task into separate item calls before summing (ItS) lowers error compared with asking for the total score in one shot (DTS). They also find that standard models with clear prompts and examples match reasoning-augmented ones, and that within the reasoning group longer reasoning traces correlate with lower error. Scale still helps across the board.

That decomposition result and the 400k-prediction sweep are the genuinely new pieces. The evaluation is straightforward and covers enough ground to make the ItS-vs-DTS gap credible on this corpus.

The main soft spot is the ground truth. The claims about residual error falling below clinically meaningful thresholds rest on the expert annotations being low-noise, yet the abstract gives no inter-rater reliability numbers or external validation for those labels. If annotator variability is comparable to the model residuals, the clinical-utility argument weakens. No error bars or split details appear in the summary either, which makes it harder to judge how stable the differences are.

This paper is for people working on clinical NLP applications or anyone testing LLMs on semi-structured extraction tasks. It is worth sending to peer review because the benchmark and the scale of the comparison are useful even if the annotation reliability needs to be tightened in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LlaMADRS, a benchmark for LLM-based structured clinical assessment built on the CAMI corpus of psychiatric interviews, comprising 541 sessions with 5,804 expert annotations. It evaluates 25 open-source models (standard and reasoning-augmented, 0.6B–400B parameters) across >400k predictions, comparing Direct Total Score (DTS) prediction against an Item-then-Sum (ItS) strategy that scores symptoms individually before aggregation. The main claims are that strong open-source LLMs reach item-level accuracy with residual error below clinically substantial thresholds, that ItS significantly outperforms DTS across most architectures and scales, and that apparent benefits of reasoning models are largely attributable to prompt structure rather than internal reasoning traces.

Significance. If the central empirical claims hold after addressing ground-truth validation, the work supplies the largest-scale open evaluation to date of LLMs on real semi-structured clinical dialogue, with direct implications for prompt engineering and model selection in healthcare NLP. The scale of the evaluation and the controlled ItS-vs-DTS comparison constitute a concrete, falsifiable contribution that can guide deployment decisions.

major comments (2)

[Abstract] Abstract: The claim that 'strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds' and that ItS 'significantly reduces error relative to DTS' both rest on the 5,804 CAMI expert annotations serving as low-noise ground truth. No inter-rater reliability statistics (Cohen’s kappa, ICC, or equivalent) or external validation of the annotations are reported, so it is impossible to determine whether annotation variability itself exceeds the reported model residuals or the ItS–DTS gap.
[Abstract] Abstract and methods description: The abstract states that results are supported by '400k predictions' yet provides no error bars, confidence intervals, or statistical significance tests for the accuracy figures or the ItS-vs-DTS comparisons. Without these, the quantitative claims that residuals fall 'below clinically substantial thresholds' and that ItS 'significantly reduces error' cannot be evaluated for robustness.
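For reference, the reliability statistic the first comment asks for is cheap to compute once a second set of ratings exists. The sketch below implements unweighted Cohen's kappa from its definition; the two rating lists are toy values, and for ordinal 0 to 6 item scores a weighted kappa or an ICC would usually be the more appropriate choice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and aligned")
    n = len(rater_a)
    # Fraction of items on which the two raters give the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal score counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always used the same single category
    return (observed - expected) / (1.0 - expected)


# Toy ratings from two hypothetical annotators on six items.
annotator_1 = [0, 1, 2, 2, 3, 3]
annotator_2 = [0, 1, 2, 3, 3, 3]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```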

minor comments (2)

[Abstract] The abstract supplies no details on data splits, session-level partitioning, leakage controls, or how the 541 sessions were divided; this information belongs in the methods section for reproducibility.
Notation for the MADRS items and total-score computation should be defined explicitly on first use rather than assumed from domain knowledge.
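If any sessions are held out (for example, to pick few-shot examples), session-level partitioning means every annotation from one interview lands on the same side of the split. The sketch below shows one common way to do that with a deterministic hash; the `session_id` field name, the split names, and the 20% fraction are assumptions for illustration, not details from the paper.

```python
import hashlib


def assign_split(session_id: str, test_fraction: float = 0.2) -> str:
    """Map a session to a split using a stable hash of its identifier."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "dev"


def split_by_session(records, test_fraction: float = 0.2):
    """Group item-level records so that no session spans both splits."""
    splits = {"dev": [], "test": []}
    for record in records:
        splits[assign_split(record["session_id"], test_fraction)].append(record)
    return splits


# Toy records: 20 hypothetical sessions with 3 item annotations each.
records = [
    {"session_id": f"session-{s:03d}", "item": i, "score": (s + i) % 7}
    for s in range(20)
    for i in range(3)
]
splits = split_by_session(records)
dev_sessions = {r["session_id"] for r in splits["dev"]}
test_sessions = {r["session_id"] for r in splits["test"]}
print(len(dev_sessions), len(test_sessions))
```

Hashing the identifier rather than shuffling keeps the assignment reproducible across runs and machines without storing a split file.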

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on ground-truth validation and statistical reporting. We address each point below.

Referee: [Abstract] Abstract: The claim that 'strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds' and that ItS 'significantly reduces error relative to DTS' both rest on the 5,804 CAMI expert annotations serving as low-noise ground truth. No inter-rater reliability statistics (Cohen’s kappa, ICC, or equivalent) or external validation of the annotations are reported, so it is impossible to determine whether annotation variability itself exceeds the reported model residuals or the ItS–DTS gap.

Authors: We agree that inter-rater reliability would help quantify annotation noise. The CAMI corpus provides expert annotations from the source study but does not include multiple independent ratings per item, precluding computation of Cohen’s kappa or ICC. We will add an explicit limitations paragraph discussing this constraint and its implications for interpreting residual errors and the ItS–DTS gap. (Revision: partial.)

Referee: [Abstract] Abstract and methods description: The abstract states that results are supported by '400k predictions' yet provides no error bars, confidence intervals, or statistical significance tests for the accuracy figures or the ItS-vs-DTS comparisons. Without these, the quantitative claims that residuals fall 'below clinically substantial thresholds' and that ItS 'significantly reduces error' cannot be evaluated for robustness.

Authors: The results section reports error bars on all accuracy metrics and includes statistical tests (paired comparisons with p-values) for ItS versus DTS differences. We will revise the abstract to reference these statistical supports and qualify the claims accordingly. (Revision: yes.)
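One common way to attach an interval to a paired ItS-versus-DTS comparison is a paired bootstrap over sessions. The sketch below is an illustration of that general technique, not the test the paper uses; the per-session absolute errors are toy values and the function name and defaults are assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a minus b)."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("error lists must be non-empty and aligned")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Resample sessions with replacement, keeping each ItS/DTS pair together.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Toy per-session absolute errors on the total score (not the paper's data).
its_errors = [2, 3, 1, 4, 2, 3, 2, 1]
dts_errors = [4, 5, 3, 6, 3, 6, 4, 2]
point, lower, upper = paired_bootstrap_ci(its_errors, dts_errors)
print(point, lower, upper)
```

An interval that lies entirely below zero would indicate lower error for ItS on the resampled sessions; resampling whole sessions rather than individual predictions respects the pairing.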

standing simulated objections not resolved

Inter-rater reliability statistics (Cohen’s kappa, ICC) for the CAMI annotations, which cannot be computed from the published corpus data

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external ground truth

The paper is an empirical evaluation study that generates LLM predictions on the CAMI corpus and compares them directly to 5,804 expert annotations. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential constructions appear in the abstract or described claims. The central results (item-level accuracy, ItS vs DTS error reduction) rest on external expert annotations rather than any internal fit or self-citation chain. This is what one would expect of a benchmark study evaluated against externally produced labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions of clinical NLP evaluation rather than new free parameters or invented entities.

axioms (1)

domain assumption: Expert annotations in the CAMI corpus provide accurate ground truth for MADRS item scores.
All accuracy claims are measured against these annotations; invoked throughout the evaluation description.

