Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.
Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.