A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.
Human Evaluators vs. LLM-as-a-Judge: Toward Scal- able, Real-Time Evaluation of GenAI in Global Health
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.
citing papers explorer
-
Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.
-
Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.