ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

URL https://arxiv · 2025 · cs.CL · arXiv 2505.22919

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a clinical workflow, and rely on stylized vignettes rather than real-world clinical documentation. As a result, recent studies have found significant discrepancies between LLM performance on stylized benchmarks derived from medical licensing exams and their performance in real-world prospective studies. To address these limitations, we introduce ER-Reason, a benchmark designed to evaluate LLM reasoning as clinical evidence accumulates across decision-making tasks spanning the full workflow of emergency medicine. ER-Reason comprises 25,174 de-identified clinical notes from 3,437 patients, supporting evaluation across all stages of the emergency department workflow: triage intake, treatment selection, disposition planning, and final diagnosis. Crucially, evaluation in ER-Reason extends beyond diagnostic accuracy to include stepwise Script Concordance Test (SCT)-style questions grounded in real patient cases, which assess whether LLMs update their diagnostic beliefs in the correct direction and magnitude as clinical evidence accumulates, scored against 2,555 emergency physician annotations. We evaluate reasoning and non-reasoning LLMs on ER-Reason, and show that our tasks provide a more nuanced view of how LLM reasoning fails on real patient cases than existing benchmarks allow.

citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

cs.AI · 2026-04-09 · accept · novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

ProMedical builds a 50k preference dataset with fine-grained rubrics and a multi-dimensional reward model that disentangles safety from proficiency, yielding 22.3% accuracy and 21.7% safety gains on Qwen3-8B via GRPO while generalizing to UltraMedical.

Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

cs.CL · 2026-03-29 · unverdicted · novelty 6.0

A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

cs.CL · 2025-08-06 · unverdicted · novelty 6.0

MedCheck is a lifecycle checklist framework that audits 53 existing medical LLM benchmarks and identifies systemic gaps in clinical fidelity, contamination control, and safety metrics.

citing papers explorer

Showing 6 of 6 citing papers.

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency · stat.ML · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs · cs.AI · 2026-04-09 · accept · none · ref 68 · internal anchor
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics · cs.CL · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection · cs.AI · 2026-04-09 · unverdicted · none · ref 4 · internal anchor
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning · cs.CL · 2026-03-29 · unverdicted · none · ref 19 · internal anchor
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models · cs.CL · 2025-08-06 · unverdicted · none · ref 4 · internal anchor
