Human Evaluators vs. LLM-as-a-Judge: Toward Scalable, Real-Time Evaluation of GenAI in Global Health

2025 · DOI 10.1101/2025.10.27.25338910

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

cs.LG · 2026-04-18 · unverdicted · novelty 5.0

Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.

citing papers explorer

Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
cs.LG · 2026-04-16 · unverdicted · none · ref 1
A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
cs.LG · 2026-04-18 · unverdicted · none · ref 16
Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.