CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Abstract
LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are often comparable in magnitude to the model differences that benchmarks are designed to detect, making single-judge rankings unreliable. We introduce a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Building on this analysis, we show that CyclicJudge, a round-robin assignment of judges to scenarios, is the optimal strategy for a fixed judge panel and judge-call budget: its score recovers the panel mean exactly while matching the cost of single-judge evaluation. Empirical results on MT-Bench and MindEval confirm that CyclicJudge behaves as predicted in both general-purpose and domain-specific evaluation settings.
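The round-robin scheme is simple enough to sketch. Below is a minimal Python illustration, not the paper's code: the function name, signature, and judge interface are assumptions made for the example. Scenario i is scored by judge i mod J, so the evaluation uses exactly one judge call per scenario (the single-judge budget), and when the scenario count is a multiple of J each judge's systematic bias enters the benchmark average with weight exactly 1/J, which is why the aggregate score recovers the panel mean.

from typing import Callable, Sequence

def cyclic_judge_score(
    scenarios: Sequence[str],
    generations: Sequence[str],
    judges: Sequence[Callable[[str, str], float]],
) -> float:
    """Benchmark score under round-robin (cyclic) judge assignment.

    Uses len(scenarios) judge calls in total -- the same budget as
    single-judge evaluation -- while spreading scenarios evenly over
    the judge panel.
    """
    if not scenarios or not judges:
        raise ValueError("need at least one scenario and one judge")
    scores = []
    for i, (scenario, generation) in enumerate(zip(scenarios, generations)):
        judge = judges[i % len(judges)]  # scenario i -> judge i mod J
        scores.append(judge(scenario, generation))
    return sum(scores) / len(scores)

For instance, with three judges and 300 scenarios, each judge scores 100 scenarios, so a judge whose scores run 0.6 points high shifts the benchmark score by only 0.2, matching in expectation the full panel evaluation that would cost three times as many judge calls.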
Forward citations
Cited by 1 Pith paper
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.