Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

· 2026 · cs.LG · arXiv 2604.14892

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM-generated and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than the clinician panels' scores; (ii) the LLM jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower for the LLM jury models than for the human expert re-score panels; (iv) the LLM jury shows excellent agreement with the primary expert panels' rankings and, combined with AI model diagnoses, can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model, or by models from the same vendor, more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.
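The abstract names isotonic regression as the calibration step but gives no implementation details. As a rough illustration only, the sketch below fits a non-decreasing map from jury scores to panel scores with the pool-adjacent-violators algorithm in pure Python. The function names (`fit_isotonic`, `calibrate`), the step-function lookup, and the toy scores are all assumptions made for this example; they are not taken from the paper.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(jury_scores, panel_scores):
    """Least-squares non-decreasing fit of panel scores on jury scores.

    Returns (thresholds, values): a jury score s is mapped to values[i]
    for the first i with s <= thresholds[i].
    """
    # Pool items that share the same jury score (hypothetical toy handling of ties).
    grouped = defaultdict(lambda: [0.0, 0])
    for jury, panel in zip(jury_scores, panel_scores):
        grouped[jury][0] += panel
        grouped[jury][1] += 1

    # Each block is [sum of panel scores, item count, largest jury score covered].
    blocks = []
    for jury in sorted(grouped):
        total, count = grouped[jury]
        blocks.append([total, count, jury])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a raw jury score onto the fitted panel scale (clamped at the top)."""
    index = bisect_left(thresholds, score)
    return values[min(index, len(values) - 1)]


# Toy, made-up scores: the jury sits below the panel and one pair is out of order.
toy_jury = [1, 2, 3, 4]
toy_panel = [2, 4, 3, 5]
toy_thresholds, toy_values = fit_isotonic(toy_jury, toy_panel)
```

On the toy data the out-of-order pair (jury scores 2 and 3) is pooled into one block, so both map to the same calibrated value. A production setup would more likely use an established library implementation and fit the map on a held-out set of cases scored by both the jury and the panel.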

representative citing papers

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

The paper formulates LLM-as-judge evaluation as a two-stage missing-data problem and derives sample-size formulas via doubly robust estimators to achieve desired power while allocating more human reviews where LLM predictability is low.

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

cs.LG · 2026-04-18 · unverdicted · novelty 5.0

Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.

citing papers explorer

Showing 2 of 2 citing papers.

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

cs.LG · 2026-05-08 · unverdicted · none · ref 14 · internal anchor

The paper formulates LLM-as-judge evaluation as a two-stage missing-data problem and derives sample-size formulas via doubly robust estimators to achieve desired power while allocating more human reviews where LLM predictability is low.

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

cs.LG · 2026-04-18 · unverdicted · none · ref 20 · internal anchor

Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.
