arxiv: 2511.21140 · v4 · pith:4C34I5XWnew · submitted 2025-11-26 · 💻 cs.LG · cs.CL · stat.AP · stat.ML

How to Correctly Report LLM-as-a-Judge Evaluations

classification 💻 cs.LG · cs.CL · stat.AP · stat.ML
keywords evaluation · calibration · framework · bias · dataset · intervals · sensitivity · specificity

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
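
To make the bias described in the abstract concrete, here is a minimal sketch of a plug-in correction for binary judge verdicts. It assumes the standard misclassification-correction form (often called the Rogan-Gladen estimator), which may differ in detail from the estimator the paper actually uses. The function names `estimate_judge_rates` and `corrected_score` are hypothetical, and the sketch covers only the point estimate, not the confidence intervals or the adaptive allocation of calibration samples.

```python
def estimate_judge_rates(human_labels, judge_labels):
    """Estimate judge sensitivity and specificity from a calibration set.

    Both inputs are equal-length sequences of 0/1 labels, where the human
    label is treated as ground truth (an assumption of this sketch).
    """
    on_true_pos = [j for h, j in zip(human_labels, judge_labels) if h == 1]
    on_true_neg = [j for h, j in zip(human_labels, judge_labels) if h == 0]
    if not on_true_pos or not on_true_neg:
        raise ValueError("calibration set needs both positive and negative examples")
    # Sensitivity: P(judge says 1 | human says 1)
    sensitivity = sum(on_true_pos) / len(on_true_pos)
    # Specificity: P(judge says 0 | human says 0)
    specificity = sum(1 - j for j in on_true_neg) / len(on_true_neg)
    return sensitivity, specificity


def corrected_score(judge_positive_rate, sensitivity, specificity):
    """Plug-in bias correction of the naive judge score.

    The naive score satisfies
        p = theta * sensitivity + (1 - theta) * (1 - specificity),
    so solving for theta gives
        theta = (p + specificity - 1) / (sensitivity + specificity - 1).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must be better than chance (sensitivity + specificity > 1)")
    theta = (judge_positive_rate + specificity - 1.0) / denom
    # Clip to [0, 1], since sampling noise can push the estimate outside the valid range.
    return min(1.0, max(0.0, theta))


# Usage: a small hypothetical calibration set, then a naive test-set score of 0.6.
human = [1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = estimate_judge_rates(human, judge)
theta_hat = corrected_score(0.6, sens, spec)
```

In this toy example the judge's sensitivity and specificity are both 0.75, so the naive score of 0.6 understates the corrected score. The closer the judge is to chance level, the smaller the denominator becomes and the more the correction amplifies noise, which is one reason uncertainty from the calibration set has to be accounted for.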

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Instance-Optimal Estimation with Multiple LLM Judges on a Budget

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.

  2. Uncertainty Propagation in LLM-Based Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

  3. Open-Ended Task Discovery via Bayesian Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.

  4. Bias and Uncertainty in LLM-as-a-Judge Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimate...

  5. AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

    cs.CL 2026-04 unverdicted novelty 6.0

    AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

  6. MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

    cs.LG 2025-12 unverdicted novelty 5.0

    MaxShapley computes fair document attributions in generative QA by reducing Shapley value calculation to polynomial time via a max-sum utility, matching exact Shapley quality on HotPotQA, MuSiQUE, and MS MARCO while u...