LLM-as-Judge on a Budget

Aadirupa Saha; Aniket Wagde; Branislav Kveton

arxiv: 2602.15481 · v2 · submitted 2026-02-17 · 💻 cs.LG

Pith reviewed 2026-05-15 21:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM-as-Judge · budget allocation · variance adaptive · score estimation · multi-armed bandit · evaluation efficiency · judgment stochasticity

The pith

A variance-adaptive algorithm allocates a fixed LLM query budget across prompt-response pairs to reach near-optimal worst-case estimation error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for distributing a fixed number of LLM queries across many prompt-response pairs: sample each pair a few times to estimate its score variance, then spend the rest of the budget on the pairs with higher estimated variance. This matters because LLM judgments are noisy. Uniform allocation over-samples stable pairs and leaves high-uncertainty pairs poorly estimated, which inflates the worst-case error of a model evaluation. The resulting error bound scales as the square root of the summed variances divided by the budget and matches, up to logarithmic factors, an oracle that knows the variances in advance. Experiments on summarization and preference datasets report lower error than uniform splitting at the same total query count.
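The two-phase idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's algorithm: the function name `two_phase_allocate`, the `sample_score` callable, the `warmup` parameter, and the rounding rule are all assumptions made for the example.

```python
import random
import statistics


def two_phase_allocate(sample_score, num_pairs, budget, warmup=5):
    """Spend `budget` judge calls and return (mean_estimates, counts).

    `sample_score(i)` is a hypothetical callable returning one noisy judge
    score for pair i. `warmup` (at least 2) is an assumed tuning knob.
    """
    if warmup < 2 or budget < warmup * num_pairs:
        raise ValueError("need warmup >= 2 and budget >= warmup * num_pairs")

    # Phase 1: sample every pair `warmup` times to estimate its variance.
    scores = [[sample_score(i) for _ in range(warmup)] for i in range(num_pairs)]
    variances = [statistics.variance(s) for s in scores]

    # Phase 2: split the remaining budget proportionally to estimated variance.
    remaining = budget - warmup * num_pairs
    total_var = sum(variances)
    if total_var == 0:
        shares = [1.0 / num_pairs] * num_pairs  # fall back to a uniform split
    else:
        shares = [v / total_var for v in variances]
    extra = [int(remaining * s) for s in shares]

    # Queries lost to rounding go to the highest-variance pairs first.
    leftovers = max(0, remaining - sum(extra))
    ranked = sorted(range(num_pairs), key=lambda j: -variances[j])
    for i in ranked[:leftovers]:
        extra[i] += 1

    for i in range(num_pairs):
        scores[i].extend(sample_score(i) for _ in range(extra[i]))
    return [statistics.fmean(s) for s in scores], [len(s) for s in scores]


# Synthetic stand-in for a judge: Gaussian noise with per-pair spread.
random.seed(0)
sigmas = [0.1, 1.0, 2.0]
means, counts = two_phase_allocate(
    lambda i: random.gauss(0.0, sigmas[i]), num_pairs=3, budget=300, warmup=10
)
print(counts)
```

In this toy run the noisiest pair receives most of the budget, while the warm-up guarantees every pair a minimum number of queries.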

Core claim

We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of Õ(√(∑_{i=1}^K σ_i² / B)), σ_i² being the unknown score variance for pair i ∈ [K] with near-optimal budget allocation.
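To see where the rate comes from, the following worked calculation treats the oracle case. It assumes the variances are known and strictly positive and that the query counts may be fractional; the symbols $n_i^\star$ (oracle query count) and $\hat\mu_i$ (empirical mean score of pair $i$) are notation introduced here, not taken from the paper.

```latex
\begin{align*}
n_i^\star &= B\,\frac{\sigma_i^2}{\sum_{j=1}^K \sigma_j^2}
\quad\Longrightarrow\quad
\operatorname{Var}(\hat\mu_i) = \frac{\sigma_i^2}{n_i^\star}
= \frac{\sum_{j=1}^K \sigma_j^2}{B}
\quad\text{for every } i \in [K],\\
\max_{i \in [K]} \sqrt{\operatorname{Var}(\hat\mu_i)}
&= \sqrt{\frac{\sum_{j=1}^K \sigma_j^2}{B}}
\qquad\text{(oracle allocation)},\\
\max_{i \in [K]} \sqrt{\frac{\sigma_i^2}{B/K}}
&= \sqrt{\frac{K \max_{i} \sigma_i^2}{B}}
\qquad\text{(uniform allocation, } n_i = B/K\text{)}.
\end{align*}
```

Because $K \max_i \sigma_i^2 \ge \sum_j \sigma_j^2$, the oracle's worst-case standard error is never larger than the uniform one, and the gap grows as the variances become more heterogeneous. The claimed bound says the adaptive method attains the oracle rate up to logarithmic factors.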

What carries the argument

Variance-adaptive allocation that estimates per-pair score variances from initial samples and then allocates remaining queries proportionally to those estimates using concentration inequalities.

If this is right

The error scales as the square root of the summed variances divided by the budget, matching an oracle allocator up to logarithmic factors.
The method reduces worst-case estimation error relative to uniform allocation on the same total query count.
It supports reliable large-scale automated LLM assessment without raising total computational cost.
The bound holds for any fixed variance profile once initial estimates are obtained.
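A small numeric illustration of the first two points, under the idealized assumption that the variances are known: it compares the worst-case standard error of a uniform split with that of a variance-proportional split. The function names and the example variance profile are hypothetical.

```python
import math


def uniform_worst_se(sigmas, budget):
    """Worst-case standard error when every pair gets budget / K queries."""
    per_pair = budget / len(sigmas)
    return max(s / math.sqrt(per_pair) for s in sigmas)


def oracle_worst_se(sigmas, budget):
    """Worst-case standard error when n_i is proportional to sigma_i^2.

    With that split every pair has the same standard error,
    sqrt(sum(sigma_j^2) / budget).
    """
    return math.sqrt(sum(s * s for s in sigmas) / budget)


# Three stable pairs and one noisy pair (made-up values).
sigmas = [0.1, 0.1, 0.1, 2.0]
budget = 400
print(uniform_worst_se(sigmas, budget))  # 0.2
print(oracle_worst_se(sigmas, budget))   # about 0.1004
```

In this example the proportional split roughly halves the worst-case standard error at the same budget, because almost all queries move to the single noisy pair.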

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same variance-driven splitting could apply directly to allocating human annotation effort in preference datasets.
Periodic re-estimation of variances would be needed if score noise drifts with model updates or new prompt distributions.
A hybrid version that predicts variances from prompt features could shrink the initial sampling cost further.

Load-bearing premise

Score variances for each pair can be estimated accurately enough from a modest number of initial samples to drive the adaptive allocation without introducing substantial bias or extra overhead that would invalidate the error bound.

What would settle it

Run the adaptive allocator versus uniform allocation on a held-out dataset with measured heterogeneous variances; the claim fails if the adaptive worst-case error does not stay within the predicted Õ bound or fails to beat uniform allocation by a clear margin at the same total B.
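A toy version of that experiment can be run on synthetic data before touching a real dataset. The sketch below assumes Gaussian judge noise with a made-up variance profile and uses a simple two-phase allocator as a stand-in for the paper's method; every name and constant in it is an assumption for illustration.

```python
import random
import statistics


def draw(rng, mu, sigma, n):
    """Draw n synthetic judge scores for one pair."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def worst_error_uniform(rng, mus, sigmas, budget):
    n = budget // len(mus)
    return max(
        abs(statistics.fmean(draw(rng, m, s, n)) - m) for m, s in zip(mus, sigmas)
    )


def worst_error_adaptive(rng, mus, sigmas, budget, warmup):
    k = len(mus)
    # Warm-up: estimate each pair's variance from `warmup` samples.
    samples = [draw(rng, m, s, warmup) for m, s in zip(mus, sigmas)]
    est = [statistics.variance(x) for x in samples]
    remaining = budget - warmup * k
    total = sum(est)
    for i in range(k):
        # Rounding down means slightly fewer than `budget` queries are used.
        extra = int(remaining * est[i] / total) if total > 0 else remaining // k
        samples[i].extend(draw(rng, mus[i], sigmas[i], extra))
    return max(abs(statistics.fmean(x) - m) for x, m in zip(samples, mus))


def compare(trials=100, seed=0):
    """Average worst-case absolute error of both allocators over many trials."""
    rng = random.Random(seed)
    mus = [0.0] * 10
    sigmas = [0.1] * 9 + [3.0]  # nine stable pairs, one noisy pair
    budget, warmup = 1000, 10
    uni = statistics.fmean(
        worst_error_uniform(rng, mus, sigmas, budget) for _ in range(trials)
    )
    ada = statistics.fmean(
        worst_error_adaptive(rng, mus, sigmas, budget, warmup) for _ in range(trials)
    )
    return uni, ada


uniform_err, adaptive_err = compare()
print(f"uniform: {uniform_err:.3f}  adaptive: {adaptive_err:.3f}")
```

A real test would replace `draw` with repeated judge calls on held-out pairs and use a large reference sample per pair as the ground-truth mean.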

Original abstract

LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how to optimally allocate queries across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, $\sigma_i^2$ being the unknown score variance for pair $i \in [K]$ with near-optimal budget allocation. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies multi-armed bandit variance-adaptive allocation to LLM-as-judge query budgeting and reports better worst-case error than uniform allocation on two datasets, but the claimed Õ bound looks fragile once variance-estimation error is considered.

The core idea is to treat each prompt-response pair as an arm and use estimated score variances to decide how many times to query each one under a fixed total budget B. They start with a small number of samples per pair, then shift the remaining queries toward higher-variance pairs. The abstract states a worst-case error of Õ(√(∑ σ_i² / B)) and shows the method beats uniform allocation on Summarize-From-Feedback and HelpSteer2.

Referee Report

3 major / 2 minor

Summary. The paper proposes a variance-adaptive algorithm, grounded in multi-armed bandit theory and concentration inequalities, for allocating a fixed query budget B across K prompt-response pairs when using LLMs as judges. It claims a worst-case score-estimation error bound of Õ(√(∑_{i=1}^K σ_i² / B)) under near-optimal allocation and reports empirical improvements over uniform allocation on the Summarize-From-Feedback and HelpSteer2 datasets.

Significance. If the adaptive bound holds after accounting for variance estimation from initial samples, the work would supply a practical, theoretically grounded method for reducing evaluation costs in LLM-as-a-judge pipelines while preserving worst-case error guarantees. The empirical gains on two standard datasets are a positive signal, but the result's impact hinges on whether the analysis absorbs the adaptivity and estimation error without additional assumptions that weaken the Õ rate.

major comments (3)

[§4] §4 (Theoretical Analysis), main theorem: the claimed Õ(√(∑ σ_i² / B)) bound is stated for near-optimal allocation, yet the algorithm description indicates that variances are estimated from a fixed initial sample size m per pair. No martingale or doubling argument is referenced to control the deviation between empirical and true allocation when some σ_i are small; this leaves the worst-case guarantee unproven under the stated adaptive procedure.
[§3.2] §3.2 (Algorithm), line describing initial sampling: the procedure allocates a modest fixed m queries upfront to estimate σ_i² before reallocating the remainder. When K is large or several σ_i approach zero, the relative error Θ(1/√m) in the sample variance can produce allocations n_i far from the oracle B·σ_i²/∑σ_j², violating the premise needed for the uniform error bound to hold.
[Experiments] Experiments section, Table 1/2: the reported reductions in worst-case error are shown only for the two datasets with unspecified values of K and B; without an ablation on the initial-sample fraction m/B or on pairs with near-zero variance, it is impossible to verify that the observed gains survive the regime where the theoretical concern is most acute.
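The Θ(1/√m) concern in the second major comment can be made concrete with a short calculation. It assumes Gaussian scores, which the report does not state; $s_i^2$ denotes the sample variance of pair $i$ computed from $m$ initial samples, notation introduced here for illustration.

```latex
\begin{align*}
\frac{(m-1)\,s_i^2}{\sigma_i^2} \sim \chi^2_{m-1}
\quad&\Longrightarrow\quad
\operatorname{Var}(s_i^2) = \frac{2\,\sigma_i^4}{m-1},\\
\frac{\sqrt{\operatorname{Var}(s_i^2)}}{\sigma_i^2}
&= \sqrt{\frac{2}{m-1}}
\approx 0.47 \text{ for } m = 10,
\qquad 0.20 \text{ for } m = 51.
\end{align*}
```

Under this assumption, ten initial samples leave the variance estimate off by roughly half its value in a typical run, so an allocation proportional to $s_i^2$ can differ noticeably from the oracle allocation for that pair.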

minor comments (2)

[§2] Notation: σ_i² is introduced as the unknown score variance, but the text does not explicitly state whether the concentration inequality is applied to the final empirical means after adaptive allocation or to a surrogate quantity.
[Abstract] The abstract states the bound with near-optimal allocation, but the manuscript never quantifies how close the realized n_i must be to the oracle allocation for the Õ notation to remain valid.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the theoretical analysis and experimental validation of our variance-adaptive allocation algorithm. We address each major comment below and indicate the revisions made to the manuscript.

Referee: [§4] §4 (Theoretical Analysis), main theorem: the claimed Õ(√(∑ σ_i² / B)) bound is stated for near-optimal allocation, yet the algorithm description indicates that variances are estimated from a fixed initial sample size m per pair. No martingale or doubling argument is referenced to control the deviation between empirical and true allocation when some σ_i are small; this leaves the worst-case guarantee unproven under the stated adaptive procedure.

Authors: We agree that the original analysis assumes an oracle near-optimal allocation. To address the estimation error from the initial m samples, we have revised Section 4 to include a doubling trick: after every doubling of the allocated queries, we re-estimate the variances and reallocate the remaining budget. Using martingale concentration inequalities, we show that the allocation stays within a (1+o(1)) factor of the optimal with high probability, preserving the Õ(√(∑ σ_i² / B)) bound. The revised theorem statement now explicitly accounts for this adaptive procedure.
revision: yes

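The rebuttal does not spell out the doubling schedule, so the following is only one plausible reading: each epoch spends as many queries as have been spent so far, split according to the variances estimated at the start of the epoch. The function name, the `warmup` default, and the randomized split via `random.choices` are assumptions for this sketch.

```python
import random
import statistics


def doubling_allocate(sample_score, num_pairs, budget, warmup=4):
    """Re-estimate variances each time the total number of queries doubles.

    `sample_score(i)` is a hypothetical callable returning one noisy judge
    score for pair i. Returns (mean_estimates, counts).
    """
    if warmup < 2 or budget < warmup * num_pairs:
        raise ValueError("need warmup >= 2 and budget >= warmup * num_pairs")

    scores = [[sample_score(i) for _ in range(warmup)] for i in range(num_pairs)]
    spent = warmup * num_pairs

    while spent < budget:
        # Each epoch doubles the total spent so far, capped by what is left.
        epoch = min(spent, budget - spent)

        # Re-estimate variances from everything observed up to now.
        variances = [statistics.variance(s) for s in scores]
        total = sum(variances)
        if total > 0:
            weights = [v / total for v in variances]
        else:
            weights = [1.0 / num_pairs] * num_pairs

        # Randomized split: each query goes to pair i with probability
        # proportional to its current variance estimate.
        for i in random.choices(range(num_pairs), weights=weights, k=epoch):
            scores[i].append(sample_score(i))
        spent += epoch

    return [statistics.fmean(s) for s in scores], [len(s) for s in scores]


random.seed(1)
sigmas = [0.1, 0.1, 2.0]
means, counts = doubling_allocate(
    lambda i: random.gauss(0.0, sigmas[i]), num_pairs=3, budget=500
)
print(counts)
```

Because early epochs are short, a poor initial variance estimate misdirects only a small part of the budget before it is corrected.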
Referee: [§3.2] §3.2 (Algorithm), line describing initial sampling: the procedure allocates a modest fixed m queries upfront to estimate σ_i² before reallocating the remainder. When K is large or several σ_i approach zero, the relative error Θ(1/√m) in the sample variance can produce allocations n_i far from the oracle B·σ_i²/∑σ_j², violating the premise needed for the uniform error bound to hold.

Authors: This concern is valid when m is too small relative to K. In the revised manuscript, we have added a guideline that m = Ω(log K) suffices to ensure, via Hoeffding's inequality on the variance estimates, that the empirical allocation deviates by at most a constant factor from the oracle allocation with probability 1 − 1/K. For pairs with σ_i near zero, we introduce a small regularization term ε in the allocation formula n_i ∝ max(σ_i², ε), which does not affect the overall bound since their contribution to the sum is negligible. We have updated the algorithm description in §3.2 accordingly.
revision: yes

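The floored rule n_i ∝ max(σ_i², ε) can be sketched as follows. The function name, the largest-remainder rounding, and the example values are assumptions; the rebuttal gives only the proportionality.

```python
def floored_allocation(variances, budget, eps=1e-3):
    """Split `budget` queries with n_i proportional to max(variance_i, eps)."""
    if eps <= 0:
        raise ValueError("eps must be positive")

    floored = [max(v, eps) for v in variances]
    total = sum(floored)
    raw = [budget * v / total for v in floored]

    # Round down, then give leftover queries to the largest fractional parts.
    counts = [int(r) for r in raw]
    leftover = budget - sum(counts)
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# A pair with zero estimated variance still receives a few queries.
print(floored_allocation([0.0, 1.0, 3.0], budget=1000, eps=0.05))  # [12, 247, 741]
```

Without the floor the first pair would receive no queries at all, so a variance that was underestimated as zero could never be corrected.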
Referee: Experiments section, Table 1/2: the reported reductions in worst-case error are shown only for the two datasets with unspecified values of K and B; without an ablation on the initial-sample fraction m/B or on pairs with near-zero variance, it is impossible to verify that the observed gains survive the regime where the theoretical concern is most acute.

Authors: We acknowledge the need for more comprehensive experiments. We have added new experiments in the revised manuscript, including an ablation study varying m/B from 1% to 20% on both datasets, and a synthetic dataset with 20% of pairs having σ_i = 0. The results confirm that performance gains over uniform allocation hold for m/B ≥ 5%, with only minor degradation in the zero-variance regime due to the regularization. Updated tables and figures are included in the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: bound derived from standard bandit concentration

The claimed worst-case error bound Õ(√(∑σ_i²/B)) is obtained by applying known multi-armed bandit concentration inequalities to a variance-adaptive allocation rule. The bound is expressed directly in terms of the unknown population variances and does not reduce to any quantity fitted or defined inside the paper. No self-citation is used to justify the central guarantee, and the derivation chain remains independent of the paper's own data or ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard concentration inequalities from the multi-armed bandit literature applied to stochastic LLM judgments; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

standard math · Standard multi-armed bandit concentration inequalities apply to the stochastic LLM judgment scores.
Invoked to obtain the worst-case error bound Õ(√(∑ σ_i² / B)).

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: Our algorithm is shown to achieve a worst-case score-estimation error of Õ(√(∑_{i=1}^K σ_i² / B)) ... dynamically allocates queries based on estimated score variances

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relation to the paper passage: unclear

Paper passage: ROBIN ... argmax σ_i² / n_i(t−1) ... ROBIN-HOOD ... V̄_i(t) UCB variance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.