LLM judges achieve only near-chance discrimination (AUC 0.49-0.66) between complete and incomplete medical responses and apply different completeness standards than clinicians.
Explicitly state what the Assistant OMITTED
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CY (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness