pith. machine review for the scientific record. sign in

arxiv: 2605.06939 · v1 · submitted 2026-05-07 · 💻 cs.LG · stat.ME· stat.ML

Recognition: no theorem link

Bias and Uncertainty in LLM-as-a-Judge Estimation

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.LG stat.MEstat.ML
keywords LLM-as-a-Judgebias correctionmodel evaluationcalibration instabilityMMLU-Procomparison estimation
0
0 comments X

The pith

Sharing calibration across models in LLM-as-a-Judge evaluations can reverse the apparent winner with high apparent confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that raw outputs from an LLM acting as a judge are systematically biased when used to estimate model performance. Existing correction methods aim to remove this bias but still depend on the judge being accurate and on calibration remaining consistent when comparing multiple models. Sharing a single calibration set across the models being compared, while convenient, can distort the corrected estimate enough to indicate the opposite ranking from the truth. The authors derive analytical bias formulas, run simulations that vary judge quality and calibration drift, and demonstrate sign reversal in a real MMLU-Pro case study. They introduce two simple diagnostics, J and ΔJ, to flag when corrected estimates are likely to be unreliable.

Core claim

LLM-as-a-Judge evaluation using the naive estimator of raw judge outputs is systematically biased. Bias-corrected estimators remain unreliable for model comparisons when calibration is shared across models, producing estimates that can point in the wrong direction with high apparent confidence. Analytical bias expressions, simulations over judge quality J and cross-model calibration instability ΔJ, and an MMLU-Pro case study with observed sign reversal establish this failure mode and motivate reporting J and ΔJ as reliability diagnostics.

What carries the argument

Analytical expressions and simulations for bias and uncertainty in terms of judge quality J and cross-model calibration instability ΔJ, which quantify distortion in shared-calibration comparisons.

If this is right

  • The naive estimator from raw judge outputs is systematically biased.
  • Corrected estimators require both high judge quality and stable calibration across compared models to avoid distortion.
  • Shared calibration, though practical, risks estimates that reverse the actual performance order.
  • Reporting J and ΔJ lets users assess when LaaJ results are likely to be invalid.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate calibration sets per model may be necessary despite higher collection cost to prevent sign reversal.
  • The same shared-calibration risk could appear in any pairwise comparison that reuses a single judge calibration set.
  • The J and ΔJ diagnostics could be computed automatically in evaluation pipelines to warn users before results are trusted.

Load-bearing premise

The analytical bias expressions and simulation results generalize to real LLM judges and evaluation tasks beyond the MMLU-Pro case study examined.

What would settle it

A controlled experiment on additional benchmarks with known true model accuracies in which the shared-calibration corrected estimate matches the true ordering even when measured ΔJ is large.

Figures

Figures reproduced from arXiv: 2605.06939 by James Fiedler.

Figure 1
Figure 1. Figure 1: Simulation results under q0 = q1 with true comparison effect δ = 0.05. Dotted horizontal line in coverage panels marks nominal 95%. Left column sweeps JA with ∆J = 0.05 fixed; right column sweeps ∆J with JA = 0.3 fixed. Subject Judge JˆGemma 3 JˆQwen 2.5 ∆dJ ∆dJ plotted Math Mistral Large 0.397 0.516 −0.119 [0.297, 0.494] [0.418, 0.612] [−0.253, 0.016] Gemma 4 31B 0.642 0.671 −0.030 [0.551, 0.728] [0.582, … view at source ↗
Figure 2
Figure 2. Figure 2: Bootstrap medians and 95% confidence intervals for per-model accuracy [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bootstrap medians and 95% confidence intervals for per-model accuracy [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
read the original abstract

LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ($J$) and cross-model calibration instability ($\Delta J$), and a real-data MMLU-Pro case study with sign reversal. We propose $J$ and $\Delta J$ as diagnostics for when corrected estimates, especially shared-calibration comparisons, are likely unreliable, and provide reporting guidance for LaaJ evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript analyzes biases in LLM-as-a-Judge (LaaJ) evaluations for model performance assessment. It argues that bias-corrected estimators can still produce severe bias when calibration is shared across compared models, including cases of sign-reversed comparison estimates with high apparent confidence. The authors derive analytical bias expressions, perform simulations sweeping judge quality J and cross-model calibration instability ΔJ, demonstrate sign reversal on a real MMLU-Pro dataset, and propose J and ΔJ as diagnostics with reporting guidance for LaaJ evaluations.

Significance. If the central findings hold, the work is significant for the rapidly growing use of LLM judges in scalable ML evaluation, as it identifies a practical and previously under-emphasized failure mode in calibration sharing that can invert conclusions. Strengths include the closed-form analytical derivations, systematic simulation sweeps, and the real-data existence proof of reversal, which together provide both theoretical insight and a concrete cautionary example. The proposed diagnostics offer a constructive path forward for practitioners.

major comments (3)
  1. [§3] §3 (Analytical derivations): The closed-form bias expressions for shared-calibration comparisons rest on specific assumptions about judge error distributions and the parametric form of ΔJ. These assumptions are not shown to be robust to common real-LLM phenomena such as heavy-tailed errors, position biases, or task-dependent calibration shifts, which directly affects whether the quantitative severity predictions generalize beyond the MMLU-Pro example.
  2. [§5] §5 (MMLU-Pro case study): The sign-reversal demonstration is valuable as an existence proof, yet the section provides limited detail on the number of model pairs, statistical power, or controls for other confounding factors in the judge outputs. This makes it difficult to evaluate how representative the observed bias magnitude and confidence levels are for the broader claim.
  3. [§4] §4 (Simulations and diagnostics): While the sweeps over J and ΔJ are comprehensive, the manuscript does not fully specify how practitioners would estimate these quantities from real judge outputs on new tasks. Without this mapping, the proposed diagnostics remain difficult to apply, weakening their utility for the recommended reporting guidance.
minor comments (2)
  1. [Abstract] Notation for J and ΔJ is introduced clearly in the body but could benefit from a brief intuitive definition when first mentioned in the abstract.
  2. [Figures in §4] Simulation figure captions should explicitly restate the assumed judge error model and parameter ranges to improve standalone readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Analytical derivations): The closed-form bias expressions for shared-calibration comparisons rest on specific assumptions about judge error distributions and the parametric form of ΔJ. These assumptions are not shown to be robust to common real-LLM phenomena such as heavy-tailed errors, position biases, or task-dependent calibration shifts, which directly affects whether the quantitative severity predictions generalize beyond the MMLU-Pro example.

    Authors: We agree that the closed-form derivations rely on specific assumptions regarding judge error distributions and the parametric form of ΔJ. The MMLU-Pro case study provides an empirical illustration under real judge outputs that are likely to exhibit some of these phenomena. To address the concern directly, we will add a dedicated limitations subsection discussing the sensitivity of the bias expressions to heavy-tailed errors, position biases, and task-dependent shifts, including qualitative analysis of how violations would affect the sign-reversal predictions. We will also include a small set of additional simulation results under alternative error distributions where feasible within space constraints. This is a partial revision, as the core analytical results remain valid under the stated modeling assumptions. revision: partial

  2. Referee: [§5] §5 (MMLU-Pro case study): The sign-reversal demonstration is valuable as an existence proof, yet the section provides limited detail on the number of model pairs, statistical power, or controls for other confounding factors in the judge outputs. This makes it difficult to evaluate how representative the observed bias magnitude and confidence levels are for the broader claim.

    Authors: We thank the referee for highlighting this. In the revised manuscript we will expand §5 to report the exact number of model pairs evaluated, the statistical power calculations or confidence intervals used for the comparisons, and the controls applied for confounding factors such as prompt formatting variations and position biases in the judge outputs. These additions will better situate the observed reversal magnitudes and apparent confidence levels within the broader claim. revision: yes

  3. Referee: [§4] §4 (Simulations and diagnostics): While the sweeps over J and ΔJ are comprehensive, the manuscript does not fully specify how practitioners would estimate these quantities from real judge outputs on new tasks. Without this mapping, the proposed diagnostics remain difficult to apply, weakening their utility for the recommended reporting guidance.

    Authors: We acknowledge this practical gap. We will revise the diagnostics section to include explicit, step-by-step procedures for estimating J (judge quality) and ΔJ (cross-model calibration instability) from real judge outputs on new tasks. These will be based on held-out calibration sets or cross-validation approaches using the same judge model, together with guidance on sample sizes needed for stable estimates. This will directly support the recommended reporting practices. revision: yes

Circularity Check

0 steps flagged

No circularity: derivations are independent analytical results under stated assumptions

full rationale

The paper derives closed-form bias expressions from explicit assumptions on judge output distributions and calibration instability ΔJ, then validates via parameter sweeps on synthetic judges and one external MMLU-Pro case study. No self-definitional steps, no fitted parameters renamed as predictions, and no load-bearing self-citations appear in the provided abstract or derivation outline. The central claims about bias severity and sign reversal are direct consequences of the stated error model rather than reductions to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of free parameters or invented entities; standard statistical assumptions about judge error distributions are implicit but not detailed.

pith-pipeline@v0.9.0 · 5440 in / 1151 out tokens · 33000 ms · 2026-05-11T00:57:28.956573+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    URLhttps://www.science.org/doi/abs/10.1126/science.adi6000

    doi: 10.1126/science.adi6000. URLhttps://www.science.org/doi/abs/10.1126/science.adi6000. Anastasios N. Angelopoulos, John C. Duchi, and Tijana Zrnic. Ppi++: Efficient prediction-powered inference,

  2. [2]

    URLhttps://arxiv.org/abs/2311.01453. Yiqun T. Chen, Sizhu Lu, Sijia Li, Moran Guo, and Shengyi Li. Efficient inference for noisy llm-as-a-judge evaluation,

  3. [3]

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E

    URLhttps://arxiv.org/abs/2601.05420. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, N...

  4. [4]

    ISBN 979-8-89176-384-5

    Association for Computational Linguistics. ISBN 979-8-89176-384-5. doi: 10.18653/v1/2026.eacl-industry.69. URLhttps://aclanthology.org/2026.eacl-industry.69/. Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto. Alpacafarm: A simulation framework for methods that learn from ...

  5. [5]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    URLhttps://openreview.net/forum?id=4hturzLcKX. Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475,

  6. [6]

    An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4

    Hui Huang, Xingyuan Bu, Hongli Zhou, Yingqi Qu, Jing Liu, Muyun Yang, Bing Xu, and Tiejun Zhao. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge model is not a general substitute for GPT-4. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguist...

  7. [7]

    ISBN 979- 8-89176-256-5

    Association for Computational Linguistics. ISBN 979- 8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.306. URL https://aclanthology.org/ 2025.findings-acl.306/. Zsolt Lang and Jen˝o Reiczigel. Confidence limits for prevalence of disease adjusted for estimated sensitivity and specificity.Preventive V eterinary Medicine, 113(1):13–22, 01

  8. [8]

    Quantum processor-inspired machine learning in the biomedical sciences

    doi: 10.1016/j. prevetmed.2013.09.015. LangChain. How to calibrate llm-as-a-judge with human corrections, n.d. URL https://www. langchain.com/articles/llm-as-a-judge. Accessed: 2026-04-29. Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy yong Sohn, and Kangwook Lee. How to correctly report llm-as-a-judge evaluations,

  9. [9]

    How to correctly report llm-as-a-judge evaluations.arXiv preprint arXiv:2511.21140, 2025

    URLhttps://arxiv.org/abs/2511.21140. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December

  10. [10]

    G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

    Association for Computational Lin- guistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023. emnlp-main.153/. 10 Microsoft Azure AI Foundry Blog. Evaluating ai agents: Can llm- as-a-judge evaluators be trusted?, January

  11. [11]

    Accessed: 2026-04-29

    URL https: //techcommunity.microsoft.com/blog/azure-ai-foundry-blog/ evaluating-ai-agents-can-llm-as-a-judge-evaluators-be-trusted/4480110 . Accessed: 2026-04-29. Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations,

  12. [12]

    URLhttps://arxiv.org/abs/2411.00640. C. S. Peirce. The numerical measure of the success of predictions.Science, ns-4(93):453–454,

  13. [13]

    Walter J

    doi: 10.1126/science.ns-4.93.453-a. Walter J. Rogan and Beth Gladen. Estimating prevalence from the results of a screening test.American Journal of Epidemiology, 107(1):71–76,

  14. [14]

    Rogan and Beth Gladen

    doi: 10.1093/oxfordjournals.aje.a112510. Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. Judgebench: A benchmark for evaluating LLM-based judges. InThe Thirteenth International Conference on Learning Representations,

  15. [15]

    In Ku, L.-W., Martins, A

    URL https://openreview.net/forum?id=G0dksFayVq. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computation...

  16. [16]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica

    doi: 10.1002/ 1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt...

  17. [17]

    Yes” if the student’s reasoning is accurate and sufficient to arrive at the correct answer, “No

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf. 11 A Calibration structure, RG vs PPI++ A.1 Judge-centric vs model-specific A key structural difference between RG and PPI++ is the role of the calibration set. In RG, calibration isjudge-centric: the calibration set consists...