
arxiv: 2501.03624 · v2 · submitted 2025-01-07 · 💻 cs.HC · cs.CL

LLAMADRS: Evaluating Open-Source LLMs on Real Clinical Interviews--To Reason or Not to Reason?

Pith reviewed 2026-05-23 06:30 UTC · model grok-4.3

keywords LLMs · clinical assessment · psychiatric interviews · reasoning models · prompt design · item-level accuracy · CAMI corpus · semi-structured prediction

The pith

Strong open-source LLMs reach expert-level accuracy on psychiatric interview scoring when symptoms are rated one at a time before summing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark called LlaMADRS from 541 real psychiatric interview sessions that carry 5,804 expert annotations. It tests 25 open-source models ranging from 0.6B to 400B parameters and shows that the strongest ones produce item-level scores whose remaining errors stay below levels that would matter in clinical practice. Breaking the task into separate model calls for each symptom and then adding the results reduces error compared with asking the model for a total score in one step. Models labeled as reasoning do not outperform standard models once the prompt includes clear structure and examples; instead, gains come from prompt design, model scale, and longer reasoning traces when they appear. The work supplies direct evidence on when decomposition helps and when scale matters for semi-structured clinical prediction.

Core claim

Strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds on the LlaMADRS benchmark. An Item-then-Sum strategy that assesses symptoms individually through discrete LLM calls before synthesizing final scores significantly reduces error relative to Direct Total Score prediction across most model architectures and scales. Performance gains attributed to reasoning depend fundamentally on prompt design: standard models equipped with structured task definitions and examples match reasoning-augmented counterparts. Among reasoning models, longer reasoning traces correlate with reduced error, and higher model scale correlates with reduced error across both types.

What carries the argument

The Item-then-Sum (ItS) strategy, which decomposes the clinical scoring task into separate LLM calls for each symptom before summing, compared against Direct Total Score (DTS) prediction on the LlaMADRS benchmark built from the CAMI corpus.
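
The difference between the two strategies is mostly a difference in call structure, which a small sketch can make concrete. The code below is an illustration only, not the paper's implementation: the `Scorer` signature, the prompt wording, the item names, and the `make_stub` helper are all hypothetical, and a canned stub stands in for a real model call.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical model interface: (transcript, prompt) -> integer score.
Scorer = Callable[[str, str], int]


def item_then_sum(transcript: str, items: Sequence[str], score: Scorer) -> int:
    """ItS: one scorer call per symptom item, then add the item scores."""
    return sum(score(transcript, f"Rate this single item: {item}") for item in items)


def direct_total_score(transcript: str, items: Sequence[str], score: Scorer) -> int:
    """DTS: a single scorer call that asks for the total over all items."""
    return score(transcript, "Rate the total over: " + ", ".join(items))


def make_stub(item_scores: Dict[str, int], total: int) -> Tuple[Scorer, List[str]]:
    """Build a canned scorer (assumption: replaces a real LLM call) and a call log."""
    calls: List[str] = []

    def score(transcript: str, prompt: str) -> int:
        calls.append(prompt)
        for name, value in item_scores.items():
            if prompt == f"Rate this single item: {name}":
                return value
        return total  # any other prompt is treated as a total-score request

    return score, calls


# Placeholder item names and scores, chosen only for illustration.
stub, call_log = make_stub({"sadness": 3, "sleep": 1}, total=5)
its_total = item_then_sum("...transcript...", ["sadness", "sleep"], stub)
dts_total = direct_total_score("...transcript...", ["sadness", "sleep"], stub)
```

With a real model behind `Scorer`, ItS costs one call per item per session while DTS costs one call per session, so the error reduction reported for ItS comes at a higher inference cost.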

If this is right

  • Strong open-source models can be used for structured clinical assessment from dialogue without exceeding clinically relevant error bounds.
  • Decomposing symptom assessment into separate calls outperforms asking for a total score directly across most model architectures and scales.
  • Prompt structure and examples explain most of the benefit previously credited to reasoning models on this task.
  • Larger model scale, and among reasoning models longer reasoning traces, are each associated with lower error on semi-structured clinical scoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the Item-then-Sum pattern holds for other dialogue-based judgments, clinical systems may be built more reliably by chaining many targeted calls than by one large query.
  • The same decomposition approach could be tested on non-clinical semi-structured prediction tasks such as legal or educational interviews.
  • Future work could measure how much the reported thresholds change when the same models are evaluated on interviews from different clinical sites or diagnostic traditions.

Load-bearing premise

The expert annotations in the CAMI corpus serve as reliable and unbiased ground truth for the reported accuracy thresholds.

What would settle it

Independent re-annotation of a held-out subset of the CAMI sessions by new experts, followed by re-running the models and checking whether item-level errors still remain below the stated clinical thresholds under the new labels.
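
The re-annotation test described above reduces to recomputing an error metric under two label sets and comparing each against a threshold. The sketch below assumes mean absolute error as the metric and takes the threshold as a parameter; the function names, the toy scores, and the threshold value are placeholders, since the paper's actual metric and clinical thresholds are not reproduced here.

```python
from typing import Dict, Sequence


def mean_absolute_error(predictions: Sequence[float], labels: Sequence[float]) -> float:
    """Average absolute difference between predictions and labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)


def holds_under_relabeling(
    predictions: Sequence[float],
    original_labels: Sequence[float],
    new_labels: Sequence[float],
    threshold: float,  # hypothetical clinical threshold, supplied by the caller
) -> Dict[str, object]:
    """Compare error under the original and the re-annotated labels."""
    mae_original = mean_absolute_error(predictions, original_labels)
    mae_relabeled = mean_absolute_error(predictions, new_labels)
    return {
        "mae_original": mae_original,
        "mae_relabeled": mae_relabeled,
        "still_below": mae_relabeled < threshold,
    }


# Toy item scores for four hypothetical sessions.
report = holds_under_relabeling(
    predictions=[2, 3, 1, 4],
    original_labels=[2, 2, 1, 4],
    new_labels=[3, 2, 2, 4],
    threshold=1.0,
)
```

In this toy case the error rises under the new labels but stays under the placeholder threshold; the claim would be in trouble only if `still_below` flipped to `False` on real re-annotated data.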

Original abstract

Large language models (LLMs) excel on many NLP benchmarks, but their behavior on real-world, semi-structured prediction remains underexplored. We present LlaMADRS, a benchmark for structured clinical assessment from dialogue built on the CAMI corpus of psychiatric interviews, comprising 5,804 expert annotations across 541 sessions. We evaluate 25 open-source models (standard and reasoning-augmented; 0.6B--400B parameters) and generate over 400,000 predictions. Our results demonstrate that strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds. Additionally, an Item-then-Sum (ItS) strategy, assessing symptoms individually through discrete LLM calls before synthesizing final scores, significantly reduces error relative to Direct Total Score (DTS) prediction across most model architectures and scales, despite reasoning models attempting similar decomposition in the reasoning traces of their DTS predictions. In fact, we find that performance gains attributed to "reasoning" depend fundamentally on prompt design: standard models equipped with structured task definitions and examples match reasoning-augmented counterparts. Among the latter, longer reasoning traces correlate with reduced error; while higher model scale does across both architectures. Our results clarify when and why reasoning helps and offer actionable guidance for deploying LLMs in semi-structured clinical assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LlaMADRS, a benchmark for LLM-based structured clinical assessment built on the CAMI corpus of 541 psychiatric interviews yielding 5,804 expert annotations. It evaluates 25 open-source models (standard and reasoning-augmented, 0.6B–400B parameters) across >400k predictions, comparing Direct Total Score (DTS) prediction against an Item-then-Sum (ItS) strategy that scores symptoms individually before aggregation. Main claims are that strong open-source LLMs reach item-level accuracy with residual error below clinically substantial thresholds, that ItS significantly outperforms DTS across most architectures and scales, and that apparent benefits of reasoning models are largely attributable to prompt structure rather than internal reasoning traces.

Significance. If the central empirical claims hold after addressing ground-truth validation, the work supplies the largest-scale open evaluation to date of LLMs on real semi-structured clinical dialogue, with direct implications for prompt engineering and model selection in healthcare NLP. The scale of the evaluation and the controlled ItS-vs-DTS comparison constitute a concrete, falsifiable contribution that can guide deployment decisions.

major comments (2)
  1. [Abstract] Abstract: The claim that 'strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds' and that ItS 'significantly reduces error relative to DTS' both rest on the 5,804 CAMI expert annotations serving as low-noise ground truth. No inter-rater reliability statistics (Cohen’s kappa, ICC, or equivalent) or external validation of the annotations are reported, so it is impossible to determine whether annotation variability itself exceeds the reported model residuals or the ItS–DTS gap.
  2. [Abstract] Abstract and methods description: The abstract reports 'over 400,000 predictions' yet provides no error bars, confidence intervals, or statistical significance tests for the accuracy figures or the ItS-vs-DTS comparisons. Without these, the quantitative claims that residuals fall 'below clinically substantial thresholds' and that ItS 'significantly reduces error' cannot be evaluated for robustness.
minor comments (2)
  1. [Abstract] The abstract supplies no details on data splits, session-level partitioning, leakage controls, or how the 541 sessions were divided; this information belongs in the methods section for reproducibility.
  2. Notation for the MADRS items and total-score computation should be defined explicitly on first use rather than assumed from domain knowledge.
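
One way to supply the interval estimates requested in major comment 2 is a paired bootstrap over sessions. The sketch below is a generic illustration, not the authors' analysis: it assumes each session contributes one absolute error under ItS and one under DTS, and the function name, resample count, and toy error values are placeholders.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    its_errors: Sequence[float],
    dts_errors: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean of (ItS error - DTS error).

    Sessions are resampled with replacement, keeping each session's pair
    of errors together so the comparison stays paired.
    """
    if len(its_errors) != len(dts_errors) or not its_errors:
        raise ValueError("error lists must be non-empty and equal in length")
    diffs = [a - b for a, b in zip(its_errors, dts_errors)]
    n = len(diffs)
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Toy per-session absolute errors (placeholders, not results from the paper).
mean_diff, ci_low, ci_high = paired_bootstrap_ci(
    its_errors=[1, 2, 0, 1, 3],
    dts_errors=[3, 4, 2, 2, 5],
)
```

A negative interval that excludes zero would support the claim that ItS reduces error relative to DTS. Resampling at the session level rather than the item level is a deliberate choice here, because items within one interview are unlikely to be independent.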

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on ground-truth validation and statistical reporting. We address each point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'strong open-source LLMs achieve item-level accuracy with residual error below clinically substantial thresholds' and that ItS 'significantly reduces error relative to DTS' both rest on the 5,804 CAMI expert annotations serving as low-noise ground truth. No inter-rater reliability statistics (Cohen’s kappa, ICC, or equivalent) or external validation of the annotations are reported, so it is impossible to determine whether annotation variability itself exceeds the reported model residuals or the ItS–DTS gap.

    Authors: We agree that inter-rater reliability would help quantify annotation noise. The CAMI corpus provides expert annotations from the source study but does not include multiple independent ratings per item, precluding computation of Cohen’s kappa or ICC. We will add an explicit limitations paragraph discussing this constraint and its implications for interpreting residual errors and the ItS–DTS gap. revision: partial

  2. Referee: [Abstract] Abstract and methods description: The abstract reports 'over 400,000 predictions' yet provides no error bars, confidence intervals, or statistical significance tests for the accuracy figures or the ItS-vs-DTS comparisons. Without these, the quantitative claims that residuals fall 'below clinically substantial thresholds' and that ItS 'significantly reduces error' cannot be evaluated for robustness.

    Authors: The results section reports error bars on all accuracy metrics and includes statistical tests (paired comparisons with p-values) for ItS versus DTS differences. We will revise the abstract to reference these statistical supports and qualify the claims accordingly. revision: yes

standing simulated objections not resolved
  • Inter-rater reliability statistics (Cohen’s kappa, ICC) for the CAMI annotations, which cannot be computed from the published corpus data
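
For reference, this is what the missing statistic would look like if a double-rated subset ever became available. The sketch computes unweighted Cohen's kappa from two raters' scores for the same items; the function name and toy ratings are illustrative, and for ordinal scores such as MADRS items a weighted kappa or an ICC would usually be the more informative choice.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and equal in length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same score.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal rates, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy double ratings for four items (placeholders, not CAMI data).
kappa = cohens_kappa([0, 1, 2, 2], [0, 1, 2, 1])
```

Here the raters agree on three of four items, with chance agreement of 5/16, giving kappa = 7/11 (about 0.64).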

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external ground truth

Full rationale

The paper is an empirical evaluation study that generates LLM predictions on the CAMI corpus and compares them directly to 5,804 expert annotations. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential constructions appear in the abstract or described claims. The central results (item-level accuracy, ItS vs DTS error reduction) rest on held-out external annotations rather than any internal fit or self-citation chain. This matches the default expectation for a self-contained benchmark evaluated against external annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions of clinical NLP evaluation rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Expert annotations in the CAMI corpus provide accurate ground truth for MADRS item scores.
    All accuracy claims are measured against these annotations; invoked throughout the evaluation description.

pith-pipeline@v0.9.0 · 5793 in / 1353 out tokens · 22081 ms · 2026-05-23T06:30:02.989750+00:00 · methodology
