arXiv: 2603.01865 · v3 · submitted 2026-03-02 · 💻 cs.CL

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen

Pith reviewed 2026-05-15 18:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluation · judge bias · variance decomposition · round-robin assignment · benchmark scoring · model ranking · systematic bias

The pith

CyclicJudge recovers the exact panel mean score in LLM evaluations at single-judge cost by rotating judges round-robin across scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses systematic biases in LLM judges that distort model rankings in open-ended tasks, with bias size often matching the differences benchmarks aim to detect. It partitions score variance into scenario, generation, judge, and residual components to isolate judge effects. CyclicJudge applies round-robin rotation of a fixed judge panel so each model receives equal exposure to every judge. This balances biases exactly in the averaged score. Tests on MT-Bench and MindEval confirm the rotated scores match the panel mean while using the same total judge calls as a single-judge setup.

Core claim

Based on a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components, CyclicJudge, a round-robin assignment of judges to scenarios, is demonstrated to be the optimal strategy for a fixed judge panel and judge-call budget: the score recovers the panel mean exactly while matching the cost of single-judge evaluation.

What carries the argument

CyclicJudge round-robin assignment of a fixed judge panel across scenarios, derived from the four-way variance decomposition that isolates judge effects.
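
A minimal sketch of what a round-robin schedule could look like. The function name `cyclic_assignment`, the `offset` parameter, and the rule "scenario i goes to judge (i + offset) mod K" are assumptions made for illustration; the page does not spell out the paper's exact rotation rule.

```python
from __future__ import annotations

from collections import Counter


def cyclic_assignment(n_scenarios: int, n_judges: int, offset: int = 0) -> list[int]:
    """Return the judge index used for each scenario under round-robin rotation.

    Assumption (illustrative, not taken from the paper): scenario i is scored
    by judge (i + offset) mod n_judges, so every scenario costs one judge call.
    """
    if n_scenarios < 0 or n_judges < 1:
        raise ValueError("need n_scenarios >= 0 and n_judges >= 1")
    return [(i + offset) % n_judges for i in range(n_scenarios)]


assignment = cyclic_assignment(n_scenarios=12, n_judges=3)
exposure = Counter(assignment)  # how many scenarios each judge scores
print(assignment)
print(exposure)
```

When the number of scenarios is a multiple of the panel size, every judge scores the same number of scenarios. The total number of calls equals the number of scenarios, which is the budget of a single-judge run.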

If this is right

CyclicJudge scores equal the multi-judge panel mean for any fixed panel size.
Total judge calls remain identical to single-judge evaluation.
The method applies equally to general-purpose benchmarks like MT-Bench and domain-specific ones like MindEval.
Single-judge setups produce rankings distorted by bias magnitudes comparable to model gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rotation pattern could extend to human raters or other evaluators with stable individual biases.
If judge preferences shift with scenario difficulty, adaptive cycling schedules might improve recovery further.
The variance breakdown points to possible refinements such as judge-specific weighting once residual terms are estimated.

Load-bearing premise

Judge biases are systematic and stable enough across scenarios that equal rotation cancels them to recover the panel mean without residual effects.

What would settle it

Compute the full multi-judge panel mean by having every judge score every generation, then apply CyclicJudge with the same total calls and check whether the two averages match exactly on the same data.
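
The check described above can be rehearsed on synthetic data before spending real judge calls. The sketch below assumes a purely additive score (scenario effect plus judge bias, one generation per scenario), hypothetical bias values, and the rotation rule "scenario i goes to judge i mod K"; none of these specifics come from the paper.

```python
import random
from statistics import fmean


def simulate_scores(n_scenarios, judge_bias, mu=7.0, noise_sd=0.0, seed=0):
    """Return scores[i][l] = mu + alpha_i + gamma_l + eps (additive model)."""
    rng = random.Random(seed)
    alpha = [rng.gauss(0.0, 1.0) for _ in range(n_scenarios)]  # scenario effects
    return [
        [mu + a + g + rng.gauss(0.0, noise_sd) for g in judge_bias]
        for a in alpha
    ]


def panel_mean(scores):
    """Every judge scores every scenario: n * K calls."""
    return fmean(x for row in scores for x in row)


def single_judge_mean(scores, judge):
    """One fixed judge scores every scenario: n calls."""
    return fmean(row[judge] for row in scores)


def cyclic_mean(scores):
    """Judges rotate across scenarios: n calls."""
    k = len(scores[0])
    return fmean(row[i % k] for i, row in enumerate(scores))


judge_bias = [-0.6, 0.1, 0.5]  # hypothetical biases, chosen to sum to zero
scores = simulate_scores(n_scenarios=30, judge_bias=judge_bias)
print(panel_mean(scores), cyclic_mean(scores))
print([single_judge_mean(scores, l) - panel_mean(scores) for l in range(3)])
```

With `noise_sd=0.0` and a scenario count divisible by the panel size, the cyclic and panel means agree up to floating-point error, while each single-judge mean is off by that judge's bias. Setting `noise_sd` above zero makes the two averages differ in any one run, so on real data the comparison is best read as a check on the judge-bias term rather than on bitwise equality.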

Figures

Figures reproduced from arXiv: 2603.01865 by Alexey Bukhtiyarov, Jinghong Chen, Olivier Tieleman, Ziyi Zhu.

**Figure 1.** Validation of the three allocation strategies via 5,000 subsampling repetitions at each budget level on both benchmarks. Markers show empirical variance; dashed lines show exact predictions from empirical pool variances (Appendix G). CyclicJudge achieves lower variance everywhere on both benchmarks, and predictions match empirical results precisely. At B=5, switching from random to cycling cuts variance by ∼…

**Figure 2.** Estimated judge biases $\hat{\gamma}_\ell = \bar{X}_{\cdot\cdot\ell} - \bar{X}_{\cdot\cdot\cdot}$ for each judge–model pair. […] is smaller than the residual mean square. Because a variance is non-negative by definition, negative estimates are set to zero, the standard truncation convention for ANOVA-based variance component estimation (Searle et al., 1992). These zeros, therefore, indicate that generation-to-generation variability is negligible relative to res…

Original abstract

LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. We introduce a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges to scenarios, is demonstrated to be the optimal strategy for a fixed judge panel and judge-call budget: the score recovers the panel mean exactly while matching the cost of single-judge evaluation. Empirical results on MT-Bench and MindEval validate the effectiveness of CyclicJudge as predicted, across both general-purpose and domain-specific evaluation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CyclicJudge recovers the multi-judge mean exactly via round-robin assignment at single-judge cost, but only if biases are additive with no scenario interactions.

CyclicJudge stands out for turning a round-robin judge schedule into an exact estimator of the panel average using the same number of calls as single-judge evaluation. The variance decomposition into scenario, generation, judge, and residual components is the foundation, and the paper shows how the balanced design cancels the judge term under that model. That is the concrete new piece: a low-cost way to remove systematic bias without extra calls or new judges.

The experiments on MT-Bench and MindEval back the prediction, showing scores that line up with multi-judge averages in both general and domain-specific settings. That practical result is worth noting for anyone running open-ended LLM evaluations on a budget.

The central claim holds inside the stated additive model, and the derivation appears direct rather than fitted. The soft spot is the assumption that judge biases stay constant across scenarios. If judge-by-scenario interactions exist, the recovered score picks up a confounding term driven by the interaction component, and the abstract does not report a test for that or an orthogonality proof. The full paper needs to show either negligible interaction terms in the data or that the design matrix handles them. Without those checks the optimality is conditional.

This paper is for researchers and engineers who already use LLM judges and want a simple adjustment to improve reliability. A reader focused on evaluation protocols gets immediate value from the method and the decomposition. The thinking is clear and it engages the bias literature directly, so the work deserves peer review to verify the derivations and interaction checks.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CyclicJudge, a round-robin assignment of judges to scenarios for LLM-as-judge evaluation. It presents a variance decomposition partitioning benchmark score variance into scenario, generation, judge, and residual components. From this decomposition, the paper derives that CyclicJudge is optimal for a fixed judge panel and judge-call budget: the resulting score recovers the panel mean exactly while incurring the same cost as single-judge evaluation. The approach is validated empirically on MT-Bench and MindEval across general-purpose and domain-specific settings.

Significance. If the optimality result holds, CyclicJudge provides a practical, zero-extra-cost method to eliminate systematic judge bias in LLM evaluations, yielding more reliable model rankings. The four-way variance decomposition itself is a useful analytical contribution that could inform future evaluation protocols. The empirical confirmation on two distinct benchmarks strengthens the case for adoption in open-ended assessment pipelines.

major comments (3)

[Abstract / variance decomposition] Abstract and variance decomposition section: the claim that CyclicJudge 'recovers the panel mean exactly' is derived under an additive model (scenario + generation + judge + residual). The manuscript must demonstrate either (a) that the round-robin design matrix is orthogonal to all judge-by-scenario interaction contrasts or (b) that interaction variance is empirically negligible relative to main effects; otherwise the recovered score retains a confounding term equal to the interaction component.
[Optimality derivation] Optimality derivation: the paper states that the strategy matches single-judge cost while recovering the panel mean, but without the explicit design-matrix algebra or proof of unbiasedness under the stated decomposition, it is unclear whether the result is parameter-free or relies on additional assumptions about judge bias stability across scenarios.
[Empirical validation] Empirical results: the validation on MT-Bench and MindEval is described as confirming the predictions, yet the absence of reported interaction-term tests, confidence intervals on the recovered means, or ablation removing the round-robin structure leaves the exact-recovery claim difficult to verify quantitatively.
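
The interaction concern raised in the major comments can be illustrated with a small simulation. This is a sketch under assumptions that are not in the paper: Gaussian effects with made-up standard deviations, one score per scenario and judge, and the rotation rule "scenario i goes to judge i mod K". With one score per cell, the interaction draw is indistinguishable from residual noise, so the sketch only shows that a non-additive term breaks exact agreement in a given run.

```python
import random
from statistics import fmean


def gap_with_interaction(n_scenarios=30, n_judges=3, interaction_sd=0.5, seed=0):
    """Return (cyclic mean - full panel mean) for one simulated benchmark.

    Each score is alpha_i + gamma_l + delta_il, where delta_il is a
    judge-by-scenario term with standard deviation interaction_sd.
    """
    rng = random.Random(seed)
    gamma = [rng.gauss(0.0, 0.5) for _ in range(n_judges)]  # judge biases
    scores = []
    for _ in range(n_scenarios):
        alpha = rng.gauss(0.0, 1.0)  # scenario effect
        scores.append([alpha + g + rng.gauss(0.0, interaction_sd) for g in gamma])
    full = fmean(x for row in scores for x in row)
    cyclic = fmean(row[i % n_judges] for i, row in enumerate(scores))
    return cyclic - full


print(gap_with_interaction(interaction_sd=0.0))  # additive case
print(gap_with_interaction(interaction_sd=0.5))  # with a judge-by-scenario term
```

In the additive case the gap is zero up to floating-point error. Once the judge-by-scenario term is switched on, the rotated score and the panel mean differ by the average of the interaction terms the rotation happened to sample, minus their panel-wide average.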

minor comments (2)

[Introduction] Notation for the four variance components could be introduced earlier and used consistently when stating the optimality result.
[Abstract] The abstract would benefit from a single quantitative highlight (e.g., bias reduction magnitude or ranking stability improvement) to convey the practical gain.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify places where additional rigor in the presentation of assumptions and proofs would strengthen the manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract / variance decomposition] Abstract and variance decomposition section: the claim that CyclicJudge 'recovers the panel mean exactly' is derived under an additive model (scenario + generation + judge + residual). The manuscript must demonstrate either (a) that the round-robin design matrix is orthogonal to all judge-by-scenario interaction contrasts or (b) that interaction variance is empirically negligible relative to main effects; otherwise the recovered score retains a confounding term equal to the interaction component.

Authors: We agree that the exact-recovery claim is derived under the additive model stated in the variance decomposition. Under this model the round-robin assignment is orthogonal to the judge main-effect contrasts, so the estimator recovers the panel mean exactly. We will add an explicit statement of the no-interaction assumption together with a short proof sketch of orthogonality. In addition, we will report empirical estimates of the judge-by-scenario interaction component on both MT-Bench and MindEval to confirm that it is small relative to the main effects. These changes will be incorporated in the revised manuscript.
revision: partial

Referee: [Optimality derivation] Optimality derivation: the paper states that the strategy matches single-judge cost while recovering the panel mean, but without the explicit design-matrix algebra or proof of unbiasedness under the stated decomposition, it is unclear whether the result is parameter-free or relies on additional assumptions about judge bias stability across scenarios.

Authors: We accept that the current text omits the explicit design-matrix algebra. In the revision we will present the linear-model formulation, show that the CyclicJudge estimator is unbiased for the panel mean under the additive decomposition, and confirm that the result holds for arbitrary fixed judge biases (i.e., it is parameter-free with respect to the judge effects). No additional stability assumptions across scenarios are required beyond the model itself.
revision: yes

Referee: [Empirical validation] Empirical results: the validation on MT-Bench and MindEval is described as confirming the predictions, yet the absence of reported interaction-term tests, confidence intervals on the recovered means, or ablation removing the round-robin structure leaves the exact-recovery claim difficult to verify quantitatively.

Authors: We will add interaction-term tests and bootstrap confidence intervals on the recovered means using the existing evaluation data. The single-judge baseline already provides a direct comparison to the round-robin design; we will augment this with a quantitative decomposition of variance explained by the round-robin structure. These additions can be made from the data already collected.
revision: partial
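
The unbiasedness argument referred to in the second response can be sketched in a few lines. This is an illustration written for this page, not the authors' proof. It assumes the additive model quoted in the Lean-theorem section, $n$ scenarios with $m$ generations each, a panel of $K$ judges with $n = qK$, and the hypothetical rotation rule $\ell(i) = ((i-1) \bmod K) + 1$.

```latex
% Illustrative sketch under the assumptions stated in the lead-in.
% \bar{\alpha} = \frac{1}{n}\sum_i \alpha_i, \quad
% \bar{\beta} = \frac{1}{nm}\sum_{i,j} \beta_{ij}, \quad
% \bar{\gamma} = \frac{1}{K}\sum_\ell \gamma_\ell.
\begin{align*}
X_{ij\ell} &= \mu_\theta + \alpha_i + \beta_{ij} + \gamma_\ell + \varepsilon_{ij\ell}, \\
\bar{X}_{\mathrm{cyc}}
  &= \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} X_{ij\ell(i)}
   = \mu_\theta + \bar{\alpha} + \bar{\beta}
     + \frac{1}{n}\sum_{i=1}^{n}\gamma_{\ell(i)}
     + \bar{\varepsilon}_{\mathrm{cyc}}, \\
\frac{1}{n}\sum_{i=1}^{n}\gamma_{\ell(i)}
  &= \frac{q}{n}\sum_{\ell=1}^{K}\gamma_\ell
   = \frac{1}{K}\sum_{\ell=1}^{K}\gamma_\ell
   = \bar{\gamma}, \\
\bar{X}_{\mathrm{panel}}
  &= \frac{1}{nmK}\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{\ell=1}^{K} X_{ij\ell}
   = \mu_\theta + \bar{\alpha} + \bar{\beta} + \bar{\gamma}
     + \bar{\varepsilon}_{\mathrm{panel}}.
\end{align*}
```

Under these assumptions the two averages share every systematic term, including the full judge-bias average $\bar{\gamma}$, and differ only in their residual averages. That difference has zero expectation when the residuals have zero mean, and the step from the second to the third line uses no property of the $\gamma_\ell$ beyond each judge appearing $q$ times.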

Circularity Check

0 steps flagged

No circularity; optimality follows mathematically from explicit additive variance model

The paper introduces a four-way variance decomposition (scenario + generation + judge + residual) and derives the exact panel-mean recovery property of round-robin assignment as a direct algebraic consequence of that additive structure. No parameter is fitted on data and then relabeled as a prediction, no self-citation chain justifies the uniqueness of the design, and the central equality is not smuggled in via prior work. The result is self-contained within the stated model assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that judge biases are systematic and stable enough to be isolated by the four-way variance decomposition; no free parameters or new invented entities are mentioned.

axioms (1)

domain assumption: LLM judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations.
Explicitly stated as the starting observation in the abstract.

pith-pipeline@v0.9.0 · 5447 in / 1276 out tokens · 57459 ms · 2026-05-15T18:07:14.011992+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

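We model each score as: $X_{ij\ell} = \mu_\theta + \alpha_i + \beta_{ij} + \gamma_\ell + \varepsilon_{ij\ell}$ ... $\mathrm{Var}(\bar{X}) = \dfrac{\sigma^2_\alpha}{n} + \dfrac{\sigma^2_\beta}{nm} + \dfrac{\sigma^2_\varepsilon}{nmK} + \dfrac{\sigma^2_\gamma}{K} \cdot \dfrac{K_{\mathrm{tot}} - K}{K_{\mathrm{tot}} - 1}$
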
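
The quoted variance expression can be evaluated numerically to see how the judge term behaves as more of the panel is used. The sketch below is a hedged reading of the passage: it assumes `n` counts scenarios, `m` generations per scenario, `k` judges used out of a panel of `k_tot`, and that "σ²β/nm" and "σ²ε/nmK" mean division by the full products. The function name and the handling of a one-judge panel are illustrative choices, not from the paper.

```python
def variance_of_mean(var_alpha, var_beta, var_eps, var_gamma, n, m, k, k_tot):
    """Evaluate the quoted Var(X-bar) expression term by term.

    Assumed reading: n scenarios, m generations per scenario,
    k judges used out of a panel of k_tot.
    """
    if n < 1 or m < 1 or not (1 <= k <= k_tot):
        raise ValueError("need n >= 1, m >= 1 and 1 <= k <= k_tot")
    # Finite-population factor on the judge term. For a one-judge panel the
    # quoted ratio is 0/0; treating it as 0 is a choice made in this sketch.
    fpc = 0.0 if k_tot == 1 else (k_tot - k) / (k_tot - 1)
    return (
        var_alpha / n
        + var_beta / (n * m)
        + var_eps / (n * m * k)
        + (var_gamma / k) * fpc
    )


# Hypothetical variance components, for illustration only.
for k in (1, 2, 3):
    print(k, variance_of_mean(1.0, 0.2, 0.5, 0.4, n=80, m=2, k=k, k_tot=3))
```

With the whole panel in use (`k == k_tot`) the judge term drops out of the expression, which matches the page's description of the rotated score carrying no leftover judge bias.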
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

CyclicJudge ... recovers the panel mean exactly while matching the cost of single-judge evaluation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
cs.LG 2026-05 unverdicted novelty 5.0

A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
