Recognition: 2 theorem links · Lean Theorem
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Pith reviewed 2026-05-15 18:07 UTC · model grok-4.3
The pith
CyclicJudge recovers the exact panel mean score in LLM evaluations at single-judge cost by rotating judges round-robin across scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Based on a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components, CyclicJudge, a round-robin assignment of judges to scenarios, is demonstrated to be the optimal strategy for a fixed judge panel and judge-call budget: the score recovers the panel mean exactly while matching the cost of single-judge evaluation.
What carries the argument
CyclicJudge round-robin assignment of a fixed judge panel across scenarios, derived from the four-way variance decomposition that isolates judge effects.
If this is right
- CyclicJudge scores equal the multi-judge panel mean for any fixed panel size.
- Total judge calls remain identical to single-judge evaluation.
- The method applies equally to general-purpose benchmarks like MT-Bench and domain-specific ones like MindEval.
- Single-judge setups can distort rankings, because judge biases are comparable in magnitude to the gaps between models.
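A minimal sketch of the round-robin idea behind these points, assuming one generation per scenario and judges indexed from 0 to K-1. The function name `cyclic_assignment` and the example sizes are illustrative and not taken from the paper.

```python
from collections import Counter


def cyclic_assignment(num_scenarios: int, num_judges: int) -> list[int]:
    """Return the judge index for each scenario: scenario i goes to judge i mod K."""
    if num_judges <= 0:
        raise ValueError("num_judges must be positive")
    return [i % num_judges for i in range(num_scenarios)]


# Hypothetical sizes: 12 scenarios, a panel of 3 judges.
assignment = cyclic_assignment(num_scenarios=12, num_judges=3)

# One judge call per scenario, the same budget as a single-judge run.
total_calls = len(assignment)

# Each judge is used equally often when the scenario count is a multiple of K.
calls_per_judge = Counter(assignment)
```

When the number of scenarios is not a multiple of the panel size, some judges receive one extra scenario, so the balance (and hence the cancellation of judge effects) is only approximate in this sketch.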
Where Pith is reading between the lines
- The rotation pattern could extend to human raters or other evaluators with stable individual biases.
- If judge preferences shift with scenario difficulty, adaptive cycling schedules might improve recovery further.
- The variance breakdown points to possible refinements such as judge-specific weighting once residual terms are estimated.
Load-bearing premise
Judge biases are systematic and stable enough across scenarios that equal rotation cancels them to recover the panel mean without residual effects.
What would settle it
Compute the full multi-judge panel mean by having every judge score every generation, then apply CyclicJudge with the same total calls and check whether the two averages match exactly on the same data.
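The check described here can be sketched on simulated data. This is an illustration under a purely additive model (scenario effect plus judge effect, no interaction, residual noise switched off by default), not the paper's experiment; all names, the base score of 5.0, and the effect sizes are assumptions.

```python
import random


def simulate_scores(n, K, noise_sd=0.0, seed=0):
    """Score matrix scores[i][l] = mu + scenario effect + judge effect (+ noise)."""
    rng = random.Random(seed)
    mu = 5.0  # hypothetical base score
    alpha = [rng.gauss(0, 1.0) for _ in range(n)]   # scenario effects
    gamma = [rng.gauss(0, 0.5) for _ in range(K)]   # judge biases
    return [[mu + alpha[i] + gamma[l] + rng.gauss(0, noise_sd) for l in range(K)]
            for i in range(n)]


def full_panel_mean(scores):
    """Every judge scores every scenario: n * K judge calls."""
    n, K = len(scores), len(scores[0])
    return sum(sum(row) for row in scores) / (n * K)


def cyclic_mean(scores):
    """Scenario i is scored only by judge i mod K: n judge calls."""
    K = len(scores[0])
    return sum(row[i % K] for i, row in enumerate(scores)) / len(scores)


def single_judge_mean(scores, judge):
    """One fixed judge scores every scenario: n judge calls."""
    return sum(row[judge] for row in scores) / len(scores)


scores = simulate_scores(n=12, K=3)
panel = full_panel_mean(scores)
gap_cyclic = abs(cyclic_mean(scores) - panel)
gap_single = max(abs(single_judge_mean(scores, l) - panel) for l in range(3))
```

With noise off and the scenario count a multiple of the panel size, `gap_cyclic` is zero up to floating-point error, while `gap_single` equals the largest judge bias relative to the panel average. Setting `noise_sd` above zero shows how residual noise makes the match approximate on any single dataset.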
Original abstract
LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. We introduce a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges to scenarios, is demonstrated to be the optimal strategy for a fixed judge panel and judge-call budget: the score recovers the panel mean exactly while matching the cost of single-judge evaluation. Empirical results on MT-Bench and MindEval validate the effectiveness of CyclicJudge as predicted, across both general-purpose and domain-specific evaluation settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CyclicJudge, a round-robin assignment of judges to scenarios for LLM-as-judge evaluation. It presents a variance decomposition partitioning benchmark score variance into scenario, generation, judge, and residual components. From this decomposition, the paper derives that CyclicJudge is optimal for a fixed judge panel and judge-call budget: the resulting score recovers the panel mean exactly while incurring the same cost as single-judge evaluation. The approach is validated empirically on MT-Bench and MindEval across general-purpose and domain-specific settings.
Significance. If the optimality result holds, CyclicJudge provides a practical, zero-extra-cost method to eliminate systematic judge bias in LLM evaluations, yielding more reliable model rankings. The four-way variance decomposition itself is a useful analytical contribution that could inform future evaluation protocols. The empirical confirmation on two distinct benchmarks strengthens the case for adoption in open-ended assessment pipelines.
major comments (3)
- [Abstract / variance decomposition] Abstract and variance decomposition section: the claim that CyclicJudge 'recovers the panel mean exactly' is derived under an additive model (scenario + generation + judge + residual). The manuscript must demonstrate either (a) that the round-robin design matrix is orthogonal to all judge-by-scenario interaction contrasts or (b) that interaction variance is empirically negligible relative to main effects; otherwise the recovered score retains a confounding term equal to the interaction component.
- [Optimality derivation] Optimality derivation: the paper states that the strategy matches single-judge cost while recovering the panel mean, but without the explicit design-matrix algebra or proof of unbiasedness under the stated decomposition, it is unclear whether the result is parameter-free or relies on additional assumptions about judge bias stability across scenarios.
- [Empirical validation] Empirical results: the validation on MT-Bench and MindEval is described as confirming the predictions, yet the absence of reported interaction-term tests, confidence intervals on the recovered means, or ablation removing the round-robin structure leaves the exact-recovery claim difficult to verify quantitatively.
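The confounding term raised in the first major comment can be written out explicitly. The following is a worked sketch, not taken from the manuscript: it assumes one generation per scenario, no residual noise, a scenario count $n$ that is a multiple of the panel size $K$, and a hypothetical judge-by-scenario interaction term $\delta_{i\ell}$ added to the additive model.

```latex
% Assumed model with interaction (delta is a hypothetical symbol):
X_{i\ell} = \mu_\theta + \alpha_i + \gamma_\ell + \delta_{i\ell}

% Round-robin estimator with judge \ell(i) = i \bmod K, and the full panel mean:
\bar{X}_{\mathrm{cyc}} = \frac{1}{n} \sum_{i=1}^{n} X_{i,\ell(i)},
\qquad
\bar{X}_{\mathrm{panel}} = \frac{1}{nK} \sum_{i=1}^{n} \sum_{\ell=1}^{K} X_{i\ell}

% Each judge appears n/K times, so the judge main effects cancel:
\frac{1}{n} \sum_{i=1}^{n} \gamma_{\ell(i)} = \frac{1}{K} \sum_{\ell=1}^{K} \gamma_\ell

% What remains is the interaction contrast:
\bar{X}_{\mathrm{cyc}} - \bar{X}_{\mathrm{panel}}
  = \frac{1}{n} \sum_{i=1}^{n} \left( \delta_{i,\ell(i)} - \bar{\delta}_{i\cdot} \right),
\qquad
\bar{\delta}_{i\cdot} = \frac{1}{K} \sum_{\ell=1}^{K} \delta_{i\ell}
```

Under these assumptions the difference vanishes when every $\delta_{i\ell}$ is zero, which is the additive case; otherwise its size depends on how large the interaction terms are relative to the main effects.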
minor comments (2)
- [Introduction] Notation for the four variance components could be introduced earlier and used consistently when stating the optimality result.
- [Abstract] The abstract would benefit from a single quantitative highlight (e.g., bias reduction magnitude or ranking stability improvement) to convey the practical gain.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify places where additional rigor in the presentation of assumptions and proofs would strengthen the manuscript. We address each major comment below and indicate the revisions we will make.
- Referee: [Abstract / variance decomposition] Abstract and variance decomposition section: the claim that CyclicJudge 'recovers the panel mean exactly' is derived under an additive model (scenario + generation + judge + residual). The manuscript must demonstrate either (a) that the round-robin design matrix is orthogonal to all judge-by-scenario interaction contrasts or (b) that interaction variance is empirically negligible relative to main effects; otherwise the recovered score retains a confounding term equal to the interaction component.
  Authors: We agree that the exact-recovery claim is derived under the additive model stated in the variance decomposition. Under this model the round-robin assignment is orthogonal to the judge main-effect contrasts, so the estimator recovers the panel mean exactly. We will add an explicit statement of the no-interaction assumption together with a short proof sketch of orthogonality. In addition, we will report empirical estimates of the judge-by-scenario interaction component on both MT-Bench and MindEval to confirm that it is small relative to the main effects. These changes will be incorporated in the revised manuscript.
  Revision: partial
- Referee: [Optimality derivation] Optimality derivation: the paper states that the strategy matches single-judge cost while recovering the panel mean, but without the explicit design-matrix algebra or proof of unbiasedness under the stated decomposition, it is unclear whether the result is parameter-free or relies on additional assumptions about judge bias stability across scenarios.
  Authors: We accept that the current text omits the explicit design-matrix algebra. In the revision we will present the linear-model formulation, show that the CyclicJudge estimator is unbiased for the panel mean under the additive decomposition, and confirm that the result holds for arbitrary fixed judge biases (i.e., it is parameter-free with respect to the judge effects). No additional stability assumptions across scenarios are required beyond the model itself.
  Revision: yes
- Referee: [Empirical validation] Empirical results: the validation on MT-Bench and MindEval is described as confirming the predictions, yet the absence of reported interaction-term tests, confidence intervals on the recovered means, or ablation removing the round-robin structure leaves the exact-recovery claim difficult to verify quantitatively.
  Authors: We will add interaction-term tests and bootstrap confidence intervals on the recovered means using the existing evaluation data. The single-judge baseline already provides a direct comparison to the round-robin design; we will augment this with a quantitative decomposition of variance explained by the round-robin structure. These additions can be made from the data already collected.
  Revision: partial
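One way the promised bootstrap intervals could be computed is sketched below. This is an assumption about the procedure, not the authors' method: it uses a percentile bootstrap that resamples scenarios within each judge's stratum so the round-robin balance is preserved. The function name and the toy scores are hypothetical.

```python
import random


def bootstrap_ci(scores_by_judge, num_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-judge means.

    scores_by_judge[l] holds the scores of the scenarios assigned to judge l.
    Resampling is done within each judge's stratum (an assumed design choice).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(num_resamples):
        judge_means = []
        for scores in scores_by_judge:
            sample = [rng.choice(scores) for _ in scores]
            judge_means.append(sum(sample) / len(sample))
        means.append(sum(judge_means) / len(judge_means))
    means.sort()
    lo_idx = int((1 - level) / 2 * num_resamples)
    hi_idx = int((1 + level) / 2 * num_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Toy data: 3 judges, 4 scenarios each (illustrative values only).
toy_scores = [[6, 7, 5, 8], [5, 6, 4, 7], [7, 8, 6, 9]]
point_estimate = sum(sum(s) / len(s) for s in toy_scores) / len(toy_scores)
ci_low, ci_high = bootstrap_ci(toy_scores)
```

Resampling scenarios without regard to judge would unbalance the panel in each replicate and mix judge bias back into the interval, which is why this sketch stratifies by judge.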
Circularity Check
No circularity; optimality follows mathematically from explicit additive variance model
The paper introduces a four-way variance decomposition (scenario + generation + judge + residual) and derives the exact panel-mean recovery property of round-robin assignment as a direct algebraic consequence of that additive structure. No parameter is fitted on data and then relabeled as a prediction, no self-citation chain justifies the uniqueness of the design, and the central equality is not smuggled in via prior work. The result is self-contained within the stated model assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We model each score as: $X_{ij\ell} = \mu_\theta + \alpha_i + \beta_{ij} + \gamma_\ell + \varepsilon_{ij\ell}$ ... $\mathrm{Var}(\bar{X}) = \frac{\sigma_\alpha^2}{n} + \frac{\sigma_\beta^2}{nm} + \frac{\sigma_\varepsilon^2}{nmK} + \frac{\sigma_\gamma^2}{K} \cdot \frac{K_{\mathrm{tot}} - K}{K_{\mathrm{tot}} - 1}$
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  CyclicJudge ... recovers the panel mean exactly while matching the cost of single-judge evaluation
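The variance expression quoted in the first entry above can be evaluated numerically to see why using the whole panel removes the judge term. The sketch below transcribes that expression directly; reading n as the number of scenarios, m as generations per scenario, K as judges used, and K_tot as the panel size is an assumption, as are the function name and the example variance values.

```python
def mean_score_variance(var_scenario, var_generation, var_judge, var_residual,
                        n, m, K, K_tot):
    """Evaluate the quoted expression for Var(X-bar).

    The judge term carries the factor (K_tot - K) / (K_tot - 1),
    which is zero when all K_tot panel judges are used.
    """
    if not 1 <= K <= K_tot:
        raise ValueError("K must satisfy 1 <= K <= K_tot")
    correction = (K_tot - K) / (K_tot - 1) if K_tot > 1 else 0.0
    return (var_scenario / n
            + var_generation / (n * m)
            + var_residual / (n * m * K)
            + (var_judge / K) * correction)


# Hypothetical variance components and sizes, for illustration only.
components = dict(var_scenario=1.0, var_generation=0.5,
                  var_judge=0.4, var_residual=0.2)
v_single = mean_score_variance(**components, n=10, m=2, K=1, K_tot=3)
v_full = mean_score_variance(**components, n=10, m=2, K=3, K_tot=3)
```

With K = 1 the judge variance enters undiminished no matter how large n and m become, which matches the abstract's point that judge bias cannot be averaged out by adding scenarios or generations.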
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
  A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.