CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Abstract
LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are often comparable in magnitude to the model differences that benchmarks are designed to detect, making single-judge rankings unreliable. We introduce a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Building on this analysis, we show that CyclicJudge, a round-robin assignment of judges to scenarios, is the optimal strategy for a fixed judge panel and judge-call budget: its score recovers the panel mean exactly while matching the cost of single-judge evaluation. Empirical results on MT-Bench and MindEval confirm that CyclicJudge behaves as predicted in both general-purpose and domain-specific evaluation settings.
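The round-robin scheme is simple enough to sketch. Below is a minimal Python illustration, not the paper's code: the function name, signature, and judge interface are assumptions made for the example. Scenario i is scored by judge i mod J, so the evaluation uses exactly one judge call per scenario (the single-judge budget), and when the scenario count is a multiple of J each judge's systematic bias enters the benchmark average with weight exactly 1/J, which is why the aggregate score recovers the panel mean.

from typing import Callable, Sequence

def cyclic_judge_score(
    scenarios: Sequence[str],
    generations: Sequence[str],
    judges: Sequence[Callable[[str, str], float]],
) -> float:
    """Benchmark score under round-robin (cyclic) judge assignment.

    Uses len(scenarios) judge calls in total -- the same budget as
    single-judge evaluation -- while spreading scenarios evenly over
    the judge panel.
    """
    if not scenarios or not judges:
        raise ValueError("need at least one scenario and one judge")
    scores = []
    for i, (scenario, generation) in enumerate(zip(scenarios, generations)):
        judge = judges[i % len(judges)]  # scenario i -> judge i mod J
        scores.append(judge(scenario, generation))
    return sum(scores) / len(scores)

For instance, with three judges and 300 scenarios, each judge scores 100 scenarios, so a judge whose scores run 0.6 points high shifts the benchmark score by only 0.2, matching in expectation the full panel evaluation that would cost three times as many judge calls.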
Forward citations
Cited by 1 Pith paper
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.