pith. machine review for the scientific record.

arxiv: 2603.01865 · v3 · submitted 2026-03-02 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation


Pith reviewed 2026-05-15 18:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · judge bias · variance decomposition · round-robin assignment · benchmark scoring · model ranking · systematic bias

The pith

CyclicJudge recovers the exact panel mean score in LLM evaluations at single-judge cost by rotating judges round-robin across scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses systematic biases in LLM judges that distort model rankings in open-ended tasks, with bias size often matching the differences benchmarks aim to detect. It partitions score variance into scenario, generation, judge, and residual components to isolate judge effects. CyclicJudge applies round-robin rotation of a fixed judge panel so each model receives equal exposure to every judge. This balances biases exactly in the averaged score. Tests on MT-Bench and MindEval confirm the rotated scores match the panel mean while using the same total judge calls as a single-judge setup.
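The rotation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `cyclic_assignment`, the `offset` parameter, and the judge labels are hypothetical, and it assumes one judge call per scenario.

```python
from collections import Counter

def cyclic_assignment(num_scenarios, judges, offset=0):
    """Assign one judge per scenario by rotating through the panel."""
    k = len(judges)
    return [judges[(i + offset) % k] for i in range(num_scenarios)]

judges = ["judge_a", "judge_b", "judge_c"]  # hypothetical panel
assignment = cyclic_assignment(6, judges)

# Exposure is equal whenever the scenario count is a multiple of the panel size.
exposure = Counter(assignment)
print(assignment)
print(exposure)
```

When the number of scenarios is not a multiple of the panel size, the leading judges in the rotation receive one extra scenario each, so the balance is only approximate in that case.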

Core claim

Based on a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components, CyclicJudge, a round-robin assignment of judges to scenarios, is demonstrated to be the optimal strategy for a fixed judge panel and judge-call budget: the score recovers the panel mean exactly while matching the cost of single-judge evaluation.
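One way to see why the judge term cancels is to write the additive model out. The notation below is an assumption made for illustration (the symbols $s_i$, $g_{ij}$, $\varepsilon_{ij\ell}$, and the rotation rule $\ell(i)$ are not taken from the paper), with one generation per scenario and the number of scenarios $N$ a multiple of the panel size $K$.

```latex
% Assumed notation: i = scenario, j = generation, \ell = judge.
X_{ij\ell} = \mu + s_i + g_{ij} + \gamma_\ell + \varepsilon_{ij\ell},
\qquad i = 1,\dots,N,\quad \ell = 1,\dots,K .

% Full panel: every judge scores every generation (N K calls).
\bar{X}_{\cdot\cdot\cdot}
  = \mu + \bar{s} + \bar{g} + \frac{1}{K}\sum_{\ell=1}^{K}\gamma_\ell
    + \bar{\varepsilon}_{\text{panel}} .

% Cyclic: scenario i is scored only by judge \ell(i) = ((i-1) \bmod K) + 1, with N = mK (N calls).
\hat{\mu}_{\text{cyc}}
  = \frac{1}{N}\sum_{i=1}^{N} X_{i1\ell(i)}
  = \mu + \bar{s} + \bar{g} + \frac{m}{N}\sum_{\ell=1}^{K}\gamma_\ell
    + \bar{\varepsilon}_{\text{cyc}}
  = \mu + \bar{s} + \bar{g} + \frac{1}{K}\sum_{\ell=1}^{K}\gamma_\ell
    + \bar{\varepsilon}_{\text{cyc}} .
```

Under these assumptions the judge contribution is identical in both expressions, so the two estimators differ only through their residual averages. Whether the paper states "exact" recovery in this sense or a stronger one is not settled by this sketch.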

What carries the argument

CyclicJudge round-robin assignment of a fixed judge panel across scenarios, derived from the four-way variance decomposition that isolates judge effects.

If this is right

  • CyclicJudge scores equal the multi-judge panel mean for any fixed panel size.
  • Total judge calls remain identical to single-judge evaluation.
  • The method applies equally to general-purpose benchmarks like MT-Bench and domain-specific ones like MindEval.
  • Single-judge setups produce rankings distorted by bias magnitudes comparable to model gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rotation pattern could extend to human raters or other evaluators with stable individual biases.
  • If judge preferences shift with scenario difficulty, adaptive cycling schedules might improve recovery further.
  • The variance breakdown points to possible refinements such as judge-specific weighting once residual terms are estimated.

Load-bearing premise

Judge biases are systematic and stable enough across scenarios that equal rotation cancels them to recover the panel mean without residual effects.

What would settle it

Compute the full multi-judge panel mean by having every judge score every generation, then apply CyclicJudge with the same total calls and check whether the two averages match exactly on the same data.
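The check described above can be rehearsed on synthetic data before running it on a real benchmark. The sketch below assumes a purely additive score model with hypothetical judge biases and a baseline of 7.0; none of the numbers come from the paper.

```python
import random

def simulate_scores(num_scenarios, judge_bias, noise_sd, seed=0):
    """Return scores[i][l] under an additive model (an assumption of this sketch)."""
    rng = random.Random(seed)
    scenario_effect = [rng.gauss(0.0, 1.0) for _ in range(num_scenarios)]
    return [
        [7.0 + s + b + rng.gauss(0.0, noise_sd) for b in judge_bias]
        for s in scenario_effect
    ]

def panel_mean(scores):
    """Every judge scores every scenario: N * K judge calls."""
    return sum(sum(row) for row in scores) / (len(scores) * len(scores[0]))

def cyclic_mean(scores):
    """Judge (i mod K) scores scenario i: N judge calls."""
    k = len(scores[0])
    return sum(row[i % k] for i, row in enumerate(scores)) / len(scores)

def single_judge_mean(scores, judge):
    """One fixed judge scores every scenario: N judge calls."""
    return sum(row[judge] for row in scores) / len(scores)

judge_bias = [-0.6, 0.1, 0.5]  # hypothetical biases that average to zero
noiseless = simulate_scores(12, judge_bias, noise_sd=0.0)
noisy = simulate_scores(12, judge_bias, noise_sd=0.3)

gap_noiseless = abs(cyclic_mean(noiseless) - panel_mean(noiseless))
gap_noisy = abs(cyclic_mean(noisy) - panel_mean(noisy))
gap_single = abs(single_judge_mean(noiseless, 0) - panel_mean(noiseless))
print(gap_noiseless, gap_noisy, gap_single)
```

With residual noise switched off, the cyclic and panel means agree up to floating-point error while the single-judge mean is off by that judge's bias. With noise switched on, a gap generally remains, which is the part of the test that real data would have to adjudicate.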

Figures

Figures reproduced from arXiv: 2603.01865 by Alexey Bukhtiyarov, Jinghong Chen, Olivier Tieleman, Ziyi Zhu.

Figure 1
Figure 1. Validates the three allocation strategies via 5,000 subsampling repetitions at each budget level on both benchmarks. Markers show empirical variance; dashed lines show exact predictions from empirical pool variances (Appendix G). CyclicJudge achieves lower variance everywhere on both benchmarks, and predictions match empirical results precisely. At B=5, switching from random to cycling cuts variance by ∼…
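The subsampling comparison in the Figure 1 excerpt can be imitated on synthetic data. The sketch below compares only random and cyclic judge assignment, uses a made-up pool of 60 scenarios with hypothetical judge biases, and picks a budget that is a multiple of the panel size; it does not reproduce the paper's data, its third strategy, or its variance predictions.

```python
import random
import statistics

def estimate(pool, num_judges, budget, strategy, rng):
    """Mean score over `budget` sampled scenarios, one judge call each."""
    picked = rng.sample(range(len(pool)), budget)
    total = 0.0
    for pos, i in enumerate(picked):
        # Cyclic rotates through the panel; random draws a judge independently.
        judge = pos % num_judges if strategy == "cyclic" else rng.randrange(num_judges)
        total += pool[i][judge]
    return total / budget

rng = random.Random(0)
judge_bias = [-1.0, 0.0, 1.0]  # hypothetical judge biases
pool = []
for _ in range(60):
    s = rng.gauss(0.0, 0.5)  # scenario effect
    pool.append([7.0 + s + b + rng.gauss(0.0, 0.1) for b in judge_bias])

reps = 2000
var = {}
for strategy in ("random", "cyclic"):
    draws = [estimate(pool, len(judge_bias), 6, strategy, rng) for _ in range(reps)]
    var[strategy] = statistics.pvariance(draws)
print(var)
```

In this toy setting the random strategy carries an extra variance term from which judges happen to be drawn, while the cyclic strategy removes it because every judge appears equally often in each estimate.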
Figure 2
Figure 2. Estimated judge biases $\hat{\gamma}_\ell = \bar{X}_{\cdot\cdot\ell} - \bar{X}_{\cdot\cdot\cdot}$ for each judge–model pair. … is smaller than the residual mean square. Because a variance is non-negative by definition, negative estimates are set to zero, the standard truncation convention for ANOVA-based variance component estimation (Searle et al., 1992). These zeros, therefore, indicate that generation-to-generation variability is negligible relative to res…
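The bias estimate in the Figure 2 caption is the per-judge mean minus the grand mean, and the truncation rule clips negative variance estimates at zero. A minimal sketch follows, assuming a fully crossed nested list `scores[i][j][l]` (scenario, generation, judge); the function names and toy numbers are hypothetical.

```python
def judge_bias_estimates(scores):
    """Per-judge mean minus grand mean for scores[i][j][l] (fully crossed)."""
    num_judges = len(scores[0][0])
    # Flatten scenarios and generations; each cell holds one score per judge.
    cells = [gen for scenario in scores for gen in scenario]
    grand = sum(sum(cell) for cell in cells) / (len(cells) * num_judges)
    return [
        sum(cell[l] for cell in cells) / len(cells) - grand
        for l in range(num_judges)
    ]

def truncate_variance(estimate):
    """Clip a negative variance-component estimate to zero."""
    return max(0.0, estimate)

# Toy data: 2 scenarios x 2 generations x 3 judges.
scores = [
    [[6.0, 7.0, 8.0], [5.0, 6.0, 7.0]],
    [[7.0, 8.0, 9.0], [6.0, 7.0, 8.0]],
]
biases = judge_bias_estimates(scores)
print(biases)
print(truncate_variance(-0.02))
```

By construction the estimated biases sum to zero across the panel, so they describe each judge relative to the panel mean rather than relative to any external ground truth.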
Original abstract

LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. We introduce a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges to scenarios, is demonstrated to be the optimal strategy for a fixed judge panel and judge-call budget: the score recovers the panel mean exactly while matching the cost of single-judge evaluation. Empirical results on MT-Bench and MindEval validate the effectiveness of CyclicJudge as predicted, across both general-purpose and domain-specific evaluation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CyclicJudge, a round-robin assignment of judges to scenarios for LLM-as-judge evaluation. It presents a variance decomposition partitioning benchmark score variance into scenario, generation, judge, and residual components. From this decomposition, the paper derives that CyclicJudge is optimal for a fixed judge panel and judge-call budget: the resulting score recovers the panel mean exactly while incurring the same cost as single-judge evaluation. The approach is validated empirically on MT-Bench and MindEval across general-purpose and domain-specific settings.

Significance. If the optimality result holds, CyclicJudge provides a practical, zero-extra-cost method to eliminate systematic judge bias in LLM evaluations, yielding more reliable model rankings. The four-way variance decomposition itself is a useful analytical contribution that could inform future evaluation protocols. The empirical confirmation on two distinct benchmarks strengthens the case for adoption in open-ended assessment pipelines.

major comments (3)
  1. [Abstract / variance decomposition] Abstract and variance decomposition section: the claim that CyclicJudge 'recovers the panel mean exactly' is derived under an additive model (scenario + generation + judge + residual). The manuscript must demonstrate either (a) that the round-robin design matrix is orthogonal to all judge-by-scenario interaction contrasts or (b) that interaction variance is empirically negligible relative to main effects; otherwise the recovered score retains a confounding term equal to the interaction component.
  2. [Optimality derivation] Optimality derivation: the paper states that the strategy matches single-judge cost while recovering the panel mean, but without the explicit design-matrix algebra or proof of unbiasedness under the stated decomposition, it is unclear whether the result is parameter-free or relies on additional assumptions about judge bias stability across scenarios.
  3. [Empirical validation] Empirical results: the validation on MT-Bench and MindEval is described as confirming the predictions, yet the absence of reported interaction-term tests, confidence intervals on the recovered means, or ablation removing the round-robin structure leaves the exact-recovery claim difficult to verify quantitatively.
minor comments (2)
  1. [Introduction] Notation for the four variance components could be introduced earlier and used consistently when stating the optimality result.
  2. [Abstract] The abstract would benefit from a single quantitative highlight (e.g., bias reduction magnitude or ranking stability improvement) to convey the practical gain.
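The first major comment can be made concrete by adding a judge-by-scenario interaction to the additive model and tracking what survives the rotation. The notation is assumed for illustration (generation index dropped, $N = mK$ scenarios, judge $\ell(i)$ assigned to scenario $i$ by rotation) and is not taken from the manuscript.

```latex
% Additive model extended with an interaction term (s\gamma)_{i\ell}.
X_{i\ell} = \mu + s_i + \gamma_\ell + (s\gamma)_{i\ell} + \varepsilon_{i\ell} .

% Difference between the cyclic estimator and the full panel mean.
\hat{\mu}_{\text{cyc}} - \bar{X}_{\cdot\cdot}
  = \frac{1}{N}\sum_{i=1}^{N} (s\gamma)_{i\ell(i)}
  - \frac{1}{NK}\sum_{i=1}^{N}\sum_{\ell=1}^{K} (s\gamma)_{i\ell}
  + \left(\bar{\varepsilon}_{\text{cyc}} - \bar{\varepsilon}_{\text{panel}}\right) .
```

The main effects $\mu$, $\bar{s}$, and $\bar{\gamma}$ cancel, but the first term averages only the $N$ interaction cells the rotation happens to visit. If the interactions are constrained to sum to zero over judges within each scenario, the second term vanishes and the first remains as the confounding component the referee describes.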

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify places where additional rigor in the presentation of assumptions and proofs would strengthen the manuscript. We address each major comment below and indicate the revisions we will make.

  1. Referee: [Abstract / variance decomposition] Abstract and variance decomposition section: the claim that CyclicJudge 'recovers the panel mean exactly' is derived under an additive model (scenario + generation + judge + residual). The manuscript must demonstrate either (a) that the round-robin design matrix is orthogonal to all judge-by-scenario interaction contrasts or (b) that interaction variance is empirically negligible relative to main effects; otherwise the recovered score retains a confounding term equal to the interaction component.

    Authors: We agree that the exact-recovery claim is derived under the additive model stated in the variance decomposition. Under this model the round-robin assignment is orthogonal to the judge main-effect contrasts, so the estimator recovers the panel mean exactly. We will add an explicit statement of the no-interaction assumption together with a short proof sketch of orthogonality. In addition, we will report empirical estimates of the judge-by-scenario interaction component on both MT-Bench and MindEval to confirm that it is small relative to the main effects. These changes will be incorporated in the revised manuscript. revision: partial

  2. Referee: [Optimality derivation] Optimality derivation: the paper states that the strategy matches single-judge cost while recovering the panel mean, but without the explicit design-matrix algebra or proof of unbiasedness under the stated decomposition, it is unclear whether the result is parameter-free or relies on additional assumptions about judge bias stability across scenarios.

    Authors: We accept that the current text omits the explicit design-matrix algebra. In the revision we will present the linear-model formulation, show that the CyclicJudge estimator is unbiased for the panel mean under the additive decomposition, and confirm that the result holds for arbitrary fixed judge biases (i.e., it is parameter-free with respect to the judge effects). No additional stability assumptions across scenarios are required beyond the model itself. revision: yes

  3. Referee: [Empirical validation] Empirical results: the validation on MT-Bench and MindEval is described as confirming the predictions, yet the absence of reported interaction-term tests, confidence intervals on the recovered means, or ablation removing the round-robin structure leaves the exact-recovery claim difficult to verify quantitatively.

    Authors: We will add interaction-term tests and bootstrap confidence intervals on the recovered means using the existing evaluation data. The single-judge baseline already provides a direct comparison to the round-robin design; we will augment this with a quantitative decomposition of variance explained by the round-robin structure. These additions can be made from the data already collected. revision: partial
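A bootstrap interval of the kind promised here could look roughly like the following. This is a generic percentile bootstrap, not the authors' procedure: it assumes the resampling unit is one full rotation through the panel (so each resample stays judge-balanced), and the `cycle_means` values are invented for illustration.

```python
import random

def bootstrap_ci(values, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi

# Hypothetical per-cycle means: each entry averages K scenarios, one per judge.
cycle_means = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.7]
point = sum(cycle_means) / len(cycle_means)
lo, hi = bootstrap_ci(cycle_means)
print(point, (lo, hi))
```

Resampling whole rotations rather than individual scenarios is a design choice made in this sketch: it keeps every bootstrap replicate balanced across judges, at the cost of fewer resampling units.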

Circularity Check

0 steps flagged

No circularity; optimality follows mathematically from the explicit additive variance model


The paper introduces a four-way variance decomposition (scenario + generation + judge + residual) and derives the exact panel-mean recovery property of round-robin assignment as a direct algebraic consequence of that additive structure. No parameter is fitted on data and then relabeled as a prediction, no self-citation chain justifies the uniqueness of the design, and the central equality is not smuggled in via prior work. The result is self-contained within the stated model assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that judge biases are systematic and stable enough to be isolated by the four-way variance decomposition; no free parameters or new invented entities are mentioned.

axioms (1)
  • domain assumption: LLM judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations.
    Explicitly stated as the starting observation in the abstract.

pith-pipeline@v0.9.0 · 5447 in / 1276 out tokens · 57459 ms · 2026-05-15T18:07:14.011992+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
