Principled Evaluation with Human Labels: One Rater at a Time and Rater Equivalence

Grant Schoenebeck; Paul Resnick; Tim Weninger; Yuqing Kong

arXiv: 2106.01254 · v3 · submitted 2021-06-02 · cs.LG · cs.HC · cs.MA

Pith reviewed 2026-05-24 12:34 UTC · model grok-4.3

classification cs.LG · cs.HC · cs.MA

keywords human evaluation · classifier scoring · rater equivalence · majority vote · inter-rater disagreement · utility model · benchmark panels

The pith

When human raters disagree, scoring classifiers against one rater at a time and averaging those scores is more principled than majority vote under a utility model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In classification tasks without definitive ground truth, human raters often disagree. The paper argues that majority-vote scoring of classifiers is not justified when raters lack either objectivity or equanimity. Instead, a utility model supports scoring each classifier against individual raters and averaging those scores. It also defines rater equivalence as the smallest number of raters whose combined judgment matches a classifier's performance and gives an optimal way to combine their labels for benchmarks. This matters because evaluation practices directly affect which classifiers are selected and deployed in real systems where human judgment is the only standard.

Core claim

The paper claims that under a utility model appropriate for settings where human judgments disagree, scoring against one rater at a time and averaging the scores is more principled than majority vote when objectivity or equanimity fails. It introduces rater equivalence, the smallest number of human raters whose combined judgment matches the classifier's performance, and provides a provably optimal algorithm for combining benchmark panel labels.
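
The two scoring procedures contrasted above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and plain label agreement stands in for the utility model, whose functional form is not given in the abstract.

```python
from collections import Counter
from statistics import mean


def agreement(preds, labels):
    """Fraction of items on which the prediction equals the reference label."""
    return mean([1.0 if p == y else 0.0 for p, y in zip(preds, labels)])


def majority_vote_score(preds, panel):
    """Score the classifier against the per-item majority label of the panel.

    panel[r][i] is rater r's label for item i. Ties go to the label seen first.
    """
    majority = [
        Counter(rater[i] for rater in panel).most_common(1)[0][0]
        for i in range(len(preds))
    ]
    return agreement(preds, majority)


def one_rater_at_a_time_score(preds, panel):
    """Score the classifier against each rater separately, then average."""
    return mean([agreement(preds, rater) for rater in panel])


# Toy data (made up for illustration): 3 raters, 4 items, binary labels.
panel = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
preds = [1, 1, 0, 0]

mv = majority_vote_score(preds, panel)          # 1.0
avg = one_rater_at_a_time_score(preds, panel)   # 0.75
```

On this toy panel the classifier matches the majority label on every item, yet it agrees with a randomly drawn rater only three times in four. The averaged score exposes the disagreement that majority voting hides.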

What carries the argument

Rater equivalence defined as the smallest number of human raters whose combined judgment matches the classifier's performance, together with the one-rater-at-a-time scoring method under the utility model.
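
A rough sketch of how rater equivalence might be computed from a labeled panel follows. Several pieces are assumptions made for illustration: the names are hypothetical, label agreement stands in for the scoring function, a simple majority vote stands in for the paper's optimal combination algorithm, and the result is restricted to whole numbers of raters.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def agreement(preds, labels):
    """Fraction of items on which two label sequences coincide."""
    return mean([1.0 if p == y else 0.0 for p, y in zip(preds, labels)])


def combine_by_majority(raters):
    """Stand-in combiner (NOT the paper's optimal algorithm): per-item majority.

    Ties go to the label seen first, so odd panel sizes are the cleaner case.
    """
    n_items = len(raters[0])
    return [Counter(r[i] for r in raters).most_common(1)[0][0] for i in range(n_items)]


def panel_score(panel, k):
    """Mean score of all k-rater panels, each scored against one held-out rater."""
    scores = []
    for ref_idx, reference in enumerate(panel):
        others = [r for j, r in enumerate(panel) if j != ref_idx]
        for subset in combinations(others, k):
            scores.append(agreement(combine_by_majority(list(subset)), reference))
    return mean(scores)


def rater_equivalence(classifier_score, panel, max_k):
    """Smallest k whose k-rater panel score reaches the classifier's score.

    Returns None if no panel of size up to max_k reaches it.
    """
    for k in range(1, max_k + 1):
        if panel_score(panel, k) >= classifier_score:
            return k
    return None


# Toy data (made up for illustration): 4 raters, 4 items.
panel = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
```

In this sketch the classifier score passed to `rater_equivalence` is assumed to be measured the same way as the panel scores, that is, averaged over single held-out raters, so that the two numbers are comparable.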

If this is right

Evaluations of classifiers in subjective tasks should shift from majority vote to averaged single-rater scores.
Benchmark panels can be reduced to the size given by rater equivalence without losing matching power.
The optimal combination algorithm allows fairer comparisons between human panels and classifiers.
Decisions about deploying classifiers can be based on utility rather than agreement counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the utility model holds, many existing benchmark results in subjective labeling tasks may need re-evaluation using the single-rater method.
The rater equivalence idea could apply to preference aggregation in ranking or recommendation systems.
Empirical tests on tasks with measurable downstream utility would show whether the one-rater approach changes model selection.

Load-bearing premise

That a utility model is appropriate for the evaluation setting, and that the failure of objectivity or equanimity is the condition that invalidates majority-vote scoring.

What would settle it

A controlled comparison where the ordering of classifiers by majority-vote scores differs from the ordering by averaged single-rater scores, and where an external measure of realized utility confirms which ordering is better.
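
The first half of such a comparison, checking whether the two scoring rules order classifiers differently, can be sketched on toy data. The data, the names, and the use of label agreement as the score are all assumptions for illustration; the external measure of realized utility that would decide between the orderings is not modeled here.

```python
from collections import Counter
from statistics import mean


def agreement(preds, labels):
    """Fraction of items on which the prediction equals the reference label."""
    return mean([1.0 if p == y else 0.0 for p, y in zip(preds, labels)])


def majority_labels(panel):
    """Per-item majority label of the panel (ties go to the label seen first)."""
    n_items = len(panel[0])
    return [Counter(r[i] for r in panel).most_common(1)[0][0] for i in range(n_items)]


def rank(classifiers, score_fn):
    """Classifier names ordered from best to worst under score_fn."""
    return sorted(classifiers, key=lambda name: score_fn(classifiers[name]), reverse=True)


# Item 0 is unanimous; items 1 and 2 are 2-to-1 splits. Majority label is 1 everywhere.
panel = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
classifiers = {
    "A": [0, 1, 1],  # wrong only on the unanimous item
    "B": [1, 0, 0],  # wrong only on the two contested items
}

by_majority = rank(classifiers, lambda p: agreement(p, majority_labels(panel)))
by_single_rater = rank(classifiers, lambda p: mean([agreement(p, r) for r in panel]))
```

Majority-vote scoring prefers A (2/3 against 1/3), while averaged single-rater scoring prefers B (5/9 against 4/9), because B's errors fall on items where a third of the raters side with it.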

Original abstract

In many classification tasks, there is no definitive ground truth, only human judgments that may disagree. We address two challenges that arise in such settings: (1) how to use human raters to score classifiers, and (2) how to use them for comparison benchmarks. For the first, the common practice is to score classifiers against the majority vote of an evaluation panel of several human raters. We argue that this is not justified when either of two properties fails: objectivity or equanimity. Instead, under a utility model appropriate for such settings, scoring against one rater at a time and averaging the scores across raters is a more principled approach. For the second, we introduce the concept of rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. We provide a provably optimal algorithm for combining benchmark panel labels, and demonstrate the framework through case studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces rater equivalence and argues for single-rater scoring plus averaging over majority vote under a utility model when objectivity or equanimity fails, but the abstract leaves the derivations unshown.

This paper's main new pieces are the definition of rater equivalence (the smallest number of raters whose combined labels match a classifier's performance) and the claim that scoring against one rater at a time then averaging is preferable to majority vote when a utility model fits and either objectivity or equanimity does not hold. It supplies a provably optimal algorithm for the benchmark-panel case and illustrates with case studies. Those elements are not standard in the prior literature on human-labeled evaluation, so the framing is fresh for tasks like content moderation or medical labeling where disagreement is routine. The case studies give a concrete sense of how the framework changes numbers in practice.

The central soft spot is exactly the one the stress-test flags: the abstract invokes a utility model and an optimality proof but does not display the functional form, the formal definitions of the two properties, or the inequality that shows why majority vote fails. Without those steps visible, the recommendation to replace common practice rests on an assumption rather than a derived result. If the full paper contains the missing derivations and they are tight, the contribution strengthens; if they are special-case or rest on unstated restrictions, the practical advice weakens.

The work is aimed at researchers who build or audit benchmarks that rely on human raters. Anyone who cares about how disagreement affects model selection or leaderboard construction would find the framework worth examining, provided the formal parts check out. It deserves peer review so referees can inspect the proofs and test whether the utility argument generalizes beyond the chosen model.

Referee Report

2 major / 1 minor

Summary. The paper addresses evaluation of classifiers in settings without definitive ground truth, where human raters may disagree. It argues that the common practice of scoring against majority vote is not justified when objectivity or equanimity fails, and proposes instead scoring against one rater at a time and averaging the scores under an appropriate utility model. The paper introduces the concept of rater equivalence, defined as the smallest number of human raters whose combined judgment matches the classifier's performance, and provides a provably optimal algorithm for combining benchmark panel labels, demonstrated through case studies.

Significance. If the utility model and the optimality of the algorithm hold, this could significantly impact how classifiers are evaluated in domains with subjective human labels, such as natural language processing and computer vision, by providing a more principled alternative to majority voting and a new way to determine benchmark panel sizes. The work highlights conditions under which standard practices may be invalid.

major comments (2)

[Abstract] The abstract states that there is a 'utility model appropriate for such settings' under which per-rater scoring is more principled when objectivity or equanimity fails, but does not provide the functional form of the utility, formal definitions of objectivity and equanimity, or a derivation showing that majority vote is invalidated under these conditions. This is load-bearing for the central claim replacing common practice.
[Abstract] The abstract claims a 'provably optimal algorithm' for combining benchmark panel labels but provides no proof sketch, algorithm description, or reference to the section containing the proof and analysis. The central claims cannot be verified from the provided text alone.

minor comments (1)

The abstract mentions 'case studies' but does not specify the domains or key findings from them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We respond to each major comment below, providing clarification and indicating where revisions will be made to improve the abstract.

Referee: [Abstract] The abstract states that there is a 'utility model appropriate for such settings' under which per-rater scoring is more principled when objectivity or equanimity fails, but does not provide the functional form of the utility, formal definitions of objectivity and equanimity, or a derivation showing that majority vote is invalidated under these conditions. This is load-bearing for the central claim replacing common practice.

Authors: The abstract is a high-level summary by design. The utility model (with explicit functional form as the expected value of per-rater agreement utilities) is defined in Section 3. Formal definitions of objectivity (consistency with an objective ground truth when it exists) and equanimity (symmetric treatment of raters without bias) appear in Section 2. The derivation that majority voting is suboptimal under the utility model when either property fails is given in Proposition 2 of Section 4. We will revise the abstract to include parenthetical references to these sections.
Revision: yes

Referee: [Abstract] The abstract claims a 'provably optimal algorithm' for combining benchmark panel labels but provides no proof sketch, algorithm description, or reference to the section containing the proof and analysis. The central claims cannot be verified from the provided text alone.

Authors: The abstract summarizes the contribution concisely. The algorithm for optimal panel combination is described in Section 5, and the proof of optimality (via dynamic programming on a submodular objective) is provided in Theorem 3 of Section 6, including a proof sketch. We will revise the abstract to add a reference to Section 5.
Revision: yes

Circularity Check

0 steps flagged

No circularity; new framework proposed without reduction to inputs or self-citations

The paper introduces a utility model to argue that per-rater scoring (then averaging) is preferable to majority vote when objectivity or equanimity fails, and defines rater equivalence with a provably optimal combination algorithm. No equations, fitted parameters, or derivations appear in the abstract or description that reduce the central claims to the inputs by construction (e.g., no self-definitional loops, no fitted input renamed as prediction, no load-bearing self-citation chains). The framework is presented as a self-contained proposal with new concepts rather than a derivation that assumes its own conclusion. External benchmarks or independent verification are not required here because the work does not claim to derive results from prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are specified.

pith-pipeline@v0.9.0 · 5701 in / 1089 out tokens · 34247 ms · 2026-05-24T12:34:27.794587+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

scoring against one rater at a time and averaging the scores across raters is a more principled approach under a utility model... survey equivalence is the smallest number of human raters whose combined judgment matches the classifier's performance

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

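Theorem 1 (ABC+CE estimates mutual information)... $\lim_{|I| \to \infty} \mathrm{SPC}(W, \mathrm{ABC}(\cdot), \mathrm{CE}(\cdot))_k - \mathrm{SPC}(W, \mathrm{ABC}(\cdot), \mathrm{CE}(\cdot))_0 = \mathrm{MI}(Y_{k+1}; Y_1, \ldots, Y_k)$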

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.