
arXiv: 2605.16354 · v1 · pith:4QDH3RXSnew · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL · cs.HC · stat.ML

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

Pith reviewed 2026-05-20 22:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.HC · stat.ML
keywords LLM evaluation · human ratings · doubly robust estimation · two-stage sampling · sample size determination · missing data · statistical power · AI benchmarks

The pith

A doubly robust estimator in two-stage sampling determines the number of human reviews needed to supplement LLM judges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM judges should be treated as an auxiliary source rather than a direct substitute for human evaluation. It sets up the problem as a two-stage sampling design in which every item receives an LLM rating and a random subset also receives a human rating. Because the probability of receiving a human rating is known by design, a doubly robust estimator can be used that stays consistent even if the model predicting human ratings from LLM ratings is incorrect. This estimator's variance formula then gives the sample sizes needed for a target level of statistical power and guides how to assign more human ratings to evaluation types that LLMs handle less well. A sympathetic reader would care because the method replaces ad-hoc checks with a principled statistical plan for balancing cost and reliability in AI system evaluation.

Core claim

The central claim is that the LLM judge paradigm can be recast as augmenting human evaluation through two-stage sampling, where LLM ratings are obtained for the full sample and human ratings for a subsample. The doubly robust estimator exploits the known sampling probabilities to remain robust to misspecification of the LLM prediction model. Its asymptotic variance supplies the basis for choosing the numbers of human and LLM ratings required to reach a desired power, and the design can be made more efficient by directing extra human ratings toward evaluation categories with lower LLM predictability.
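
For orientation, the standard augmented inverse-probability-weighted form of a doubly robust mean estimator from the missing-data literature can be written as below. The notation is an assumption made here for illustration and may differ from the paper's: $Y_i$ is the human rating, $X_i$ the LLM rating, $R_i$ indicates selection for human review, $\pi(X_i)$ is the known selection probability, and $m(\cdot)$ is any fixed working prediction of $Y$ from $X$.

```latex
\hat{\mu}_{\mathrm{DR}}
  = \frac{1}{N}\sum_{i=1}^{N}
    \left[\, m(X_i) + \frac{R_i}{\pi(X_i)}\bigl(Y_i - m(X_i)\bigr) \right]

% Selection depends only on the design, so the weight has conditional mean one:
\mathbb{E}\!\left[\frac{R_i}{\pi(X_i)} \,\middle|\, X_i, Y_i\right] = 1
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\hat{\mu}_{\mathrm{DR}}\bigr] = \mathbb{E}[Y]
\quad\text{for any choice of } m.

% Variance for a fixed m and independent selection:
\mathrm{Var}\bigl(\hat{\mu}_{\mathrm{DR}}\bigr)
  = \frac{1}{N}\left\{ \mathrm{Var}(Y)
    + \mathbb{E}\!\left[\frac{1-\pi(X)}{\pi(X)}\bigl(Y - m(X)\bigr)^{2}\right] \right\}
```

The second term shrinks as $m$ predicts $Y$ better, which is why better LLM predictability translates into fewer required human ratings.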

What carries the argument

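A doubly robust estimator of the population mean human rating under two-stage sampling. It combines the known inclusion probabilities with an outcome model that predicts human ratings from LLM ratings; because the inclusion probabilities are correct by design, the estimator remains consistent even when the outcome model is wrong.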
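
A minimal sketch of such an estimator in Python, assuming the augmented inverse-probability-weighted form that is standard in the missing-data literature. The function name `dr_mean` and its arguments are hypothetical and chosen for illustration; the paper's own implementation is not shown in the source.

```python
import numpy as np

def dr_mean(y_human, m_pred, sampled, pi):
    """Doubly robust (augmented IPW) estimate of the mean human rating.

    y_human : human ratings; entries for unsampled items are ignored
    m_pred  : predicted human rating for every item (from the LLM rating)
    sampled : 1 if the item received a human rating, else 0
    pi      : known probability that each item is sent to human review
    """
    # Replace unobserved human ratings (e.g. NaN) by a placeholder;
    # they are multiplied by zero below and never affect the result.
    y = np.where(sampled == 1, y_human, 0.0)
    resid = sampled * (y - m_pred) / pi   # weighted residual, zero if unsampled
    return float(np.mean(m_pred + resid))

# Toy data: four items, two of them human-rated with known probability 0.5.
y_human = np.array([4.0, np.nan, 2.0, np.nan])
sampled = np.array([1, 0, 1, 0])
pi = np.full(4, 0.5)

# Two different (constant) prediction models give the same estimate here.
est_a = dr_mean(y_human, np.full(4, 3.0), sampled, pi)
est_b = dr_mean(y_human, np.zeros(4), sampled, pi)
print(est_a, est_b)
```

In this toy case both calls return 3.0, the average of the two observed human ratings, which illustrates how the weighted residual term corrects for a poor prediction model.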

If this is right

  • The required number of human ratings can be calculated from pilot estimates of how well LLM ratings predict human ones.
  • Human effort can be concentrated on the subsets of evaluations where LLM predictions are weakest.
  • Statistical power calculations become available for hybrid human-LLM evaluation studies.
  • Benchmarks can be validated with a quantifiable minimum level of human involvement.
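
The first and third bullets can be sketched as a small calculation. This is an illustration under stated assumptions, not the paper's formula: it assumes simple random subsampling of `n` items out of `n_llm`, a working model that explains a share `r2` of the variance in human ratings, and the resulting variance `sigma^2 * (r2 / n_llm + (1 - r2) / n)`. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def effective_n_for_power(effect_size, alpha=0.05, power=0.8):
    """Fully human-rated sample size for a two-sided one-sample z-test
    to detect a standardized effect of the given size."""
    z = NormalDist()
    return ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / effect_size) ** 2

def humans_needed(n_eff, n_llm, r2):
    """Human ratings needed so that the hybrid design is as precise as
    n_eff fully human-rated items (assumed variance formula, see lead-in)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    if n_eff > n_llm:
        raise ValueError("target unreachable: collect more LLM ratings")
    return ceil((1.0 - r2) / (1.0 / n_eff - r2 / n_llm))

# Pilot estimate of predictability -> required human ratings.
print(humans_needed(n_eff=200, n_llm=1000, r2=0.8))   # strong LLM predictability
print(humans_needed(n_eff=200, n_llm=1000, r2=0.3))   # weak LLM predictability
print(round(effective_n_for_power(0.2), 1))           # target effective size for d = 0.2
```

Under these assumptions the required human count falls quickly as `r2` rises, so a pilot estimate of predictability is the key planning input.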

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling and estimation approach could apply when the auxiliary judge is any cheaper but imperfect predictor, not only LLMs.
  • One could test the method by varying the fraction of human reviews in an existing large evaluation set and checking whether power behaves as predicted.
  • Connections exist to optimal design in survey sampling where some units are measured with expensive instruments and others with cheap proxies.

Load-bearing premise

The probabilities with which each item is selected for human review are fixed in advance and known exactly.

What would settle it

Apply the two-stage sampling to a dataset with a known overall mean human rating, use a deliberately incorrect prediction model for the LLM ratings, and check whether the estimator still recovers the true mean and whether its observed variance matches the formula used for sample-size planning.
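
A check of this kind could look like the following sketch on synthetic data. Everything here is an assumption made for illustration: the data-generating model, the deliberately wrong prediction `m_wrong`, the 20% review rate, and the variance formula for a fixed working model with known selection probability. It is not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
N, pi, reps = 1000, 0.2, 2000   # items, known human-review probability, replications
true_mean = 3.0                 # known mean human rating in this synthetic setup

estimates = np.empty(reps)
for b in range(reps):
    llm = rng.normal(0.0, 1.0, N)                            # LLM rating for every item
    human = true_mean + 0.8 * llm + rng.normal(0.0, 0.6, N)  # human rating
    sampled = rng.random(N) < pi                             # selection fixed by design
    m_wrong = 1.0 - 0.5 * llm                                # deliberately wrong model
    # Prediction plus inverse-probability-weighted residual; human ratings of
    # unsampled items are multiplied by zero and never enter the estimate.
    estimates[b] = np.mean(m_wrong + sampled * (human - m_wrong) / pi)

emp_mean = estimates.mean()
emp_var = estimates.var(ddof=1)

# Assumed variance formula for a fixed model m and known pi:
#   (1/N) * [ Var(Y) + (1 - pi)/pi * E{(Y - m)^2} ]
var_y = 0.8**2 + 0.6**2          # variance of the human rating
msr = 2.0**2 + 1.3**2 + 0.6**2   # E{(Y - m)^2}, since Y - m = 2 + 1.3*llm + noise
var_formula = (var_y + (1 - pi) / pi * msr) / N

print(f"mean of estimates: {emp_mean:.3f} (truth {true_mean})")
print(f"empirical variance: {emp_var:.5f}, formula: {var_formula:.5f}")
```

If the estimator behaves as claimed, the mean of the estimates should sit close to the true mean despite the wrong model, and the empirical variance should be close to the formula value.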

Figures

Figures reproduced from arXiv: 2605.16354 by Jane Paik Kim.

Figure 1: Proposed formulation of a two-stage design: At the first stage, LLM ratings are collected…
Figure 2: Number of human reviews as a function of total LLM evaluation, for a given target effective…
Figure 3: (Left): Number of human reviews as a function of…
Figure 4: Valid $(p_1, p_2)$ pairs satisfying the variance equation at $n^* = 200$, for $R_1^2 = 0.8$, $R_2^2 = 0.3$, $N = 1000$. The budget line at $n_{\text{budget}} = 100$ reflects the maximum budget allowed by the investigator. The minimum-cost design is shown by the green line.
… because stratum 1 has high predictive power ($R_1^2 = 0.8$), and the minimum-cost design exploits this by allocating fewer human reviews where the LLM is most reli…
Figure 5: Reduction in total human samples when comparing two-stratum vs. uniform sampling.
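
The two-stratum comparison described in the Figure 4 and Figure 5 captions can be sketched as follows. This is a hedged illustration, not the paper's derivation: it assumes equal-size strata (the stratum sizes are not given in the source), equal rating variance in both strata, and a variance constraint of the form `sum_h w_h * (1 - R2_h) / p_h = N / n_eff - 1 + sum_h w_h * (1 - R2_h)`. Minimizing the total number of human reviews under that constraint gives sampling rates proportional to `sqrt(1 - R2_h)`. The function name `allocation` is hypothetical.

```python
import numpy as np

def allocation(n_eff, n_llm, r2, weights):
    """Return (minimum-cost per-stratum rates, single uniform rate) that
    both meet the assumed variance target."""
    c = 1.0 - np.asarray(r2, dtype=float)    # unexplained variance share per stratum
    w = np.asarray(weights, dtype=float)     # stratum shares of the LLM-rated items
    k = n_llm / n_eff - 1.0 + np.sum(w * c)  # right-hand side of the constraint
    p_opt = np.sqrt(c) * np.sum(w * np.sqrt(c)) / k   # Lagrange-multiplier solution
    p_unif = np.sum(w * c) / k                        # same rate in every stratum
    if np.any(p_opt > 1.0):
        raise ValueError("closed form invalid: a stratum would need a rate above 1")
    return p_opt, p_unif

N, n_eff = 1000, 200
w = np.array([0.5, 0.5])   # assumption: equal-size strata
p_opt, p_unif = allocation(n_eff, N, [0.8, 0.3], w)

humans_opt = N * np.sum(w * p_opt)   # total human reviews, two-rate design
humans_unif = N * p_unif             # total human reviews, uniform design
print(p_opt, humans_opt, humans_unif)
```

Under these assumed inputs the sketch gives roughly 93 human reviews for the two-rate design against roughly 101 for a single uniform rate, with the lower rate assigned to the stratum where the LLM is more predictive. These numbers come from the assumed formula, not from the paper.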
Abstract

Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper frames LLM-as-judge evaluation as a two-stage sampling design in which LLM ratings are collected on the full sample and human ratings on a known-probability subsample. It proposes a doubly robust estimator drawn from the missing-data literature that exploits the fact that the missingness (propensity) model is known by design, derives sample-size formulas from the asymptotic variance of the influence function, and gives an allocation rule that assigns more human ratings to evaluation types where LLM predictability is low.

Significance. If the finite-sample behavior of the estimator and the accuracy of the power formulas hold for realistic LLM-human rating pairs, the work supplies a principled, efficiency-oriented alternative to ad-hoc agreement checks and could reduce the human annotation burden in benchmark validation. The explicit use of a known propensity and the resulting robustness property are clear strengths; however, the absence of any simulation or real-data verification leaves the practical utility untested.

major comments (2)
  1. [Abstract] Abstract (paragraph describing the estimator): the claim that the doubly robust estimator is robust against misspecification of the LLM-to-human prediction model rests entirely on the missingness model being known by design; the manuscript provides no simulation or empirical check that this robustness translates to acceptable bias or coverage in the finite samples (typically a few hundred to a few thousand items) common in evaluation studies.
  2. [Abstract] Abstract (sample-size and power section): the asymptotic variance formulas are invoked to determine required human and LLM sample sizes, yet no Monte Carlo study or analytic finite-sample correction is presented to confirm that these formulas yield accurate power or type-I error control under the correlation structures actually observed between LLM and human ratings.
minor comments (2)
  1. Notation for the outcome regression and propensity score should be introduced with explicit symbols and distinguished from the target parameter (the mean human rating) to avoid reader confusion.
  2. The paper should cite the specific missing-data references (e.g., the original doubly robust papers) rather than only alluding to “the missing data literature.”

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for acknowledging the strengths of framing LLM-as-judge evaluation as a two-stage design with a known propensity score and doubly robust estimation. We agree that finite-sample validation of the estimator and power formulas is important for practical adoption. We will incorporate Monte Carlo simulation studies in the revised manuscript to address these points. Below we respond to each major comment.

  1. Referee: [Abstract] Abstract (paragraph describing the estimator): the claim that the doubly robust estimator is robust against misspecification of the LLM-to-human prediction model rests entirely on the missingness model being known by design; the manuscript provides no simulation or empirical check that this robustness translates to acceptable bias or coverage in the finite samples (typically a few hundred to a few thousand items) common in evaluation studies.

    Authors: We thank the referee for this observation. The double-robustness property follows directly from the fact that the propensity (missingness) model is known by design in the two-stage sampling scheme; consistency holds for the target parameter even under misspecification of the LLM-to-human outcome model. We acknowledge, however, that the current manuscript contains no finite-sample Monte Carlo experiments demonstrating bias, variance, or coverage for the sample sizes typical in evaluation studies. In the revision we will add a simulation study that generates synthetic LLM-human rating pairs under a range of realistic correlation structures and evaluates the estimator at n = 200 to 5000 items. revision: yes

  2. Referee: [Abstract] Abstract (sample-size and power section): the asymptotic variance formulas are invoked to determine required human and LLM sample sizes, yet no Monte Carlo study or analytic finite-sample correction is presented to confirm that these formulas yield accurate power or type-I error control under the correlation structures actually observed between LLM and human ratings.

    Authors: We appreciate the referee's emphasis on validating the power calculations. The sample-size formulas are obtained from the asymptotic variance of the influence function of the doubly robust estimator. While such asymptotic derivations are standard, we agree that their accuracy should be checked in finite samples. The revised manuscript will include Monte Carlo experiments that compare the asymptotic power and type-I error rates against empirical rejection rates under correlation levels observed in real LLM-human rating data, for the sample sizes that arise in benchmark validation. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified


The paper frames LLM-as-judge evaluation as a two-stage sampling design in which LLM ratings are obtained for the full sample and human ratings are obtained on a known-probability subsample. It directly imports the doubly robust estimator and its asymptotic variance from the established missing-data literature. Because the missingness (propensity) model is fixed by the experimental design rather than estimated from data, the robustness property holds under standard conditions without requiring the LLM-to-human prediction model to be correctly specified. Sample-size formulas and allocation rules are then obtained from that variance expression. No equation or claim reduces the target result to a fitted parameter or self-defined quantity internal to the paper; the derivation remains self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework applies standard missing-data theory to the LLM-judge setting. Free parameters are researcher-chosen targets rather than data-fitted constants. No new entities are postulated.

free parameters (2)
  • target power level
    Chosen by the user to set the desired statistical power for sample-size formulas.
  • LLM predictability per evaluation type
    Used to decide differential allocation of human ratings; treated as an input that can be estimated or assumed.
axioms (2)
  • [standard math] Asymptotic normality and consistency of the doubly robust estimator under a known missingness mechanism
    Invoked to derive variance formulas for power calculations (abstract description of estimator).
  • [domain assumption] Two-stage sampling design with known selection probabilities for the human subsample
    Enables the robustness property against the LLM prediction model.



