STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

Akash Bonagiri; Angelina Lai; Devang Borkar; Gerard Janno Anderias; Gezheng Kang; Houman Homayoun; Ishant Gandhi; Saee Patil; Setareh Rafatirad

arxiv: 2605.02122 · v2 · pith:65R7KA66new · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-09 16:49 UTC · model grok-4.3

keywords AI evaluation · annotator disagreement · ranking stability · probabilistic modeling · human annotation · system ranking · disagreement-aware evaluation · majority vote

The pith

STABLEVAL models latent item correctness and annotator confusion to produce stable AI system rankings where majority vote fails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that human evaluations of AI systems suffer from unstable rankings when using simple majority vote, because that method ignores differences in annotator reliability and item ambiguity. STABLEVAL instead builds a probabilistic model of latent correctness for each item and confusion patterns specific to each annotator, then computes posterior expected credits and calibrated system scores with explicit uncertainty. A sympathetic reader would care because AI progress still depends on human judgment, and fragile rankings make it hard to know whether one system truly outperforms another across repeated evaluations. The authors support the claim with synthetic experiments that vary heterogeneity and noise plus several real human-annotated benchmarks, where majority vote degrades while STABLEVAL remains steadier.

Core claim

STABLEVAL is a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. It treats ranking stability as a first-class objective and shows that this approach preserves underlying annotator behavior better than majority vote or label-denoising methods such as Dawid-Skene, resulting in lower score error and more consistent system orderings under controlled heterogeneity and adversarial noise.

What carries the argument

The probabilistic model of latent item correctness together with annotator-specific confusion patterns, which generates posterior expected credits and calibrated scores rather than hard labels.
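As a rough illustration of this kind of machinery, the sketch below fits a generic two-class Dawid-Skene-style model by EM and returns a soft posterior per item instead of a hard label. It is not the paper's implementation: the function names `posterior_expected_credit` and `agent_scores`, the `(item, annotator, label)` triple format, the binary label space, and the additive smoothing are all assumptions made for the example.

```python
from collections import defaultdict


def posterior_expected_credit(labels, n_iter=50, smooth=1.0):
    """labels: list of (item_id, annotator_id, label), label in {0, 1}.

    Returns {item_id: P(item is correct | labels)} under a two-class
    Dawid-Skene-style model fit by EM (hypothetical sketch).
    """
    items = sorted({i for i, _, _ in labels})
    annotators = sorted({a for _, a, _ in labels})
    by_item = defaultdict(list)
    for i, a, l in labels:
        by_item[i].append((a, l))

    # Initialise each posterior with the item's vote fraction (soft majority vote).
    post = {i: sum(l for _, l in by_item[i]) / len(by_item[i]) for i in items}

    for _ in range(n_iter):
        # M-step: class prior and per-annotator confusion, with additive smoothing.
        prior = (sum(post.values()) + smooth) / (len(items) + 2 * smooth)
        num = {a: [0.0, 0.0] for a in annotators}
        den = {a: [0.0, 0.0] for a in annotators}
        for i in items:
            for a, l in by_item[i]:
                for z, w in ((1, post[i]), (0, 1.0 - post[i])):
                    num[a][z] += w * l
                    den[a][z] += w
        # conf[a][z] = P(annotator a answers 1 | latent class z)
        conf = {
            a: [(num[a][z] + smooth) / (den[a][z] + 2 * smooth) for z in (0, 1)]
            for a in annotators
        }
        # E-step: posterior that each item is correct given all of its labels.
        for i in items:
            p1, p0 = prior, 1.0 - prior
            for a, l in by_item[i]:
                p1 *= conf[a][1] if l == 1 else 1.0 - conf[a][1]
                p0 *= conf[a][0] if l == 1 else 1.0 - conf[a][0]
            post[i] = p1 / (p1 + p0)
    return post


def agent_scores(credit, item_agent):
    """Average the soft credit over the items produced by each agent."""
    per_agent = defaultdict(list)
    for item, agent in item_agent.items():
        per_agent[agent].append(credit[item])
    return {agent: sum(v) / len(v) for agent, v in per_agent.items()}
```

Keeping the posterior as a number in [0, 1] rather than thresholding it is what separates expected credit from hard label recovery: an ambiguous item contributes partial credit to its agent's score instead of being forced to 0 or 1.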

If this is right

Majority vote exhibits increasing score error and ranking instability as annotator heterogeneity and adversarial noise grow.
STABLEVAL produces lower error and more stable system rankings across the same conditions.
Ranking stability must be treated as an explicit goal separate from recovering individual hard labels.
Disagreement modeling improves reproducibility of AI evaluations on both synthetic and real human-annotated data.
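The first point can be probed with a toy simulation. The sketch below is not the paper's synthetic pipeline: the function name, the annotator pool size, the honest accuracy of 0.9, and the choice to make adversarial annotators invert that accuracy are all assumptions picked for illustration. It measures how far a majority-vote score drifts from the true fraction of correct items as the adversarial share of the pool grows.

```python
import random


def majority_vote_score_error(adv_frac, quality=0.85, n_items=2000,
                              n_annotators=20, labels_per_item=5,
                              honest_acc=0.9, seed=0):
    """Absolute gap between the majority-vote score and the true score.

    Hypothetical setup: annotators 0..n_adv-1 are adversarial and report the
    correct label only when an honest annotator would have been wrong.
    labels_per_item should be odd so that votes never tie.
    """
    rng = random.Random(seed)
    n_adv = round(adv_frac * n_annotators)
    truth_sum = mv_sum = 0
    for _ in range(n_items):
        truth = 1 if rng.random() < quality else 0
        votes = 0
        for a in rng.sample(range(n_annotators), labels_per_item):
            correct = rng.random() < honest_acc
            if a < n_adv:
                correct = not correct  # adversarial annotator flips the outcome
            votes += truth if correct else 1 - truth
        truth_sum += truth
        mv_sum += 1 if 2 * votes > labels_per_item else 0
    return abs(mv_sum - truth_sum) / n_items


if __name__ == "__main__":
    for frac in (0.0, 0.2, 0.4):
        print(frac, majority_vote_score_error(frac))
```

The numbers this prints depend entirely on the assumed parameters and should not be read as reproducing the paper's figures.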

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modeling approach could be applied to other subjective ranking tasks such as content moderation or creative evaluation to reduce dependence on single annotator pools.
Quantifying the amount of disagreement that still allows reliable rankings might let practitioners decide when additional annotators are worth the cost.
If the posteriors prove reliable, evaluation pipelines could report confidence intervals on system scores instead of point estimates.
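One simple way to attach an interval to a system score is a percentile bootstrap over items. The sketch below assumes per-item credits in [0, 1] are already available for one system; `bootstrap_ci` and its parameters are hypothetical names, and the paper may quantify uncertainty differently (for example, directly from the model posterior).

```python
import random


def bootstrap_ci(credits, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for the mean per-item credit of one system.

    Percentile bootstrap: resample items with replacement, recompute the mean,
    and read the interval off the sorted resampled means.
    """
    rng = random.Random(seed)
    n = len(credits)
    means = sorted(
        sum(rng.choice(credits) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int(round((alpha / 2) * n_boot))]
    upper = means[int(round((1 - alpha / 2) * n_boot)) - 1]
    return sum(credits) / n, lower, upper


if __name__ == "__main__":
    mean, lower, upper = bootstrap_ci([0.9, 0.8, 0.95, 0.2, 0.7, 0.85, 0.6, 0.9])
    print(f"score {mean:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Two systems whose intervals overlap heavily are a signal that their relative order may not survive a fresh annotation round.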

Load-bearing premise

The chosen probabilistic model of latent item correctness and annotator confusion patterns will produce posteriors that genuinely reflect real-world stability rather than artifacts of the modeling assumptions.

What would settle it

Run the same set of items through multiple independent annotator groups and check whether STABLEVAL system rankings remain consistent across groups while majority-vote rankings flip; reversal of that pattern would falsify the stability advantage.
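A check of this kind needs a way to compare the rankings that different annotator groups produce. The sketch below uses Kendall's tau (the tau-a variant, which ignores ties) averaged over all pairs of groups; the function names and the dictionary-of-scores input are assumptions, and the paper may define its stability metric differently.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {system: score} dicts over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        d = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(group_scores):
    """Average tau over every pair of annotator groups (1.0 = identical order)."""
    pairs = list(combinations(group_scores, 2))
    return sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)
```

Computing `mean_pairwise_tau` once with scores from each aggregation method, over the same annotator groups, gives the side-by-side comparison the test calls for.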

Figures

Figures reproduced from arXiv: 2605.02122 by Akash Bonagiri, Angelina Lai, Devang Borkar, Gerard Janno Anderias, Gezheng Kang, Houman Homayoun, Ishant Gandhi, Saee Patil, Setareh Rafatirad.

**Figure 1.** Synthetic Evaluation Pipeline. Starting from a base configuration, we systematically vary six ablation parameters: adversarial fraction, strict and lenient annotator fractions, hard item probability, labels per item, and agent quality gaps. For each configuration, we generate observed labels, fit three aggregation methods (Majority Vote, Dawid-Skene (Hard), and Posterior Expected Credit), and compute evalua…

**Figure 2.** Real dataset evaluation pipeline. Four benchmark datasets (MT-Bench, ConvAbuse, QAGS, MSLR) with collected human labels are aggregated using three methods: Majority Vote, Dawid–Skene (Hard), and Posterior Expected Credit. Agent scores are computed and evaluated across four metrics: Agent Scores, Ranking Stability, Item Ambiguity, and Annotator Diagnostics.

**Figure 4.** MSE across varying proportions of biased annotators. Top: Strict annotators (0–40%). Bottom: Lenient annotators (0–40%). Dawid–Skene achieves the lowest error across all configurations.

**Figure 5.** Ranking Accuracy vs Adversarial Fraction. Ranking accuracy (with 95% confidence intervals) as the fraction of adversarial annotators increases from 0% to 40%. Posterior Expected Credit maintains near-perfect accuracy across all fractions. Majority Vote drops from 0.998 to 0.988 at 40% adversarial fraction.

**Figure 8.** MSE Across Agent Quality Configurations. MSE (with 95% confidence intervals) comparing aggregation methods under tight and wide quality gaps among agents. The tight configuration uses agent qualities [0.85, 0.80, 0.70, 0.55, 0.35, 0.20]; the wide configuration uses [0.75, 0.70, 0.65, 0.60, 0.55, 0.50]. Dawid–Skene achieves the lowest error in the tight configuration (0.00047). Majority Vote error increase…

**Figure 9.** Ranking Accuracy vs Agent Gap Type. Ranking accuracy (with 95% confidence intervals) comparing aggregation methods under tight and wide quality gaps among agents. Dawid–Skene and Posterior Expected Credit converge near 1.000 in the wide configuration. Majority Vote increases from 0.9684 in the tight configuration to 0.9982 in the wide configuration, trailing the other methods by approximately 0.003 in the …

**Figure 12.** MSE Across Varying Numbers of Labels Per Item. MSE (with 95% confidence intervals) comparing aggregation methods as the number of labels per item increases from 3 to 9. Dawid–Skene achieves the lowest error across all configurations. Majority Vote error decreases from 0.00478 with 3 labels to 0.00110 with 9 labels. Posterior Expected Credit error decreases from 0.00560 to 0.00098 across the same range.

**Figure 13.** Ranking Accuracy vs Labels Per Item. Ranking accuracy (with 95% confidence intervals) comparing aggregation methods as the number of labels per item increases from 3 to 9. Majority Vote improves monotonically from 0.9948 with 3 labels to 1.0000 with 9 labels. Dawid–Skene and Posterior Expected Credit show non-monotonic behavior, peaking near 5 labels before declining slightly, then recovering at 9 labels…

**Figure 16.** Agent scores comparison across methods on QAGS for two summarization models (CNN, XSUM). Scores shown for Majority Vote (green), Dawid-Skene Hard (blue), and Posterior Expected Credit (purple).

**Figure 17.** Agent scores comparison across three evaluation methods on MSLR for six agents: PX7SGV, 8FWF5T, SPNXTA, AQ85CE, VNCH8M, and JB6Z8F. Scores shown for Majority Vote (green), Dawid-Skene Hard (blue), and Posterior Expected Credit (purple).

Abstract

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STABLEVAL targets ranking stability in human evaluation of AI systems by modeling annotator confusion rather than only recovering labels, but the gains look tied to its own generative assumptions.

The core idea is to treat stable system rankings as the main goal rather than trying to recover hidden true labels. STABLEVAL builds a probabilistic model of item correctness and per-annotator confusion, then uses posterior expected credits and calibrated scores to rank systems while keeping uncertainty visible. This is a clear shift from Dawid-Skene style methods, which focus on denoising labels first.

The paper shows majority vote producing rising error and rank flips as annotator heterogeneity or adversarial noise increases, while STABLEVAL stays steadier on both synthetic data and several real human-annotated benchmarks. That analysis of how aggregation choices distort behavior is useful and directly addresses a practical pain point in current leaderboards. The experiments are run on controlled synthetics plus multiple real datasets, which gives some breadth. The distinction from prior label-aggregation work is stated cleanly in the abstract and holds up in the framing.

The main weakness is that the synthetic results are generated from a model close to the one being fit, so the reported stability improvements could be partly circular. Real-world benchmarks lack an external ground-truth measure of ranking stability, so it is hard to tell whether the posteriors are capturing actual robustness or just the model's preferred output. The paper would be stronger with more ablations on the confusion parameterization and explicit checks for correlated annotator biases that the current model may miss.

This work is aimed at researchers who run or maintain human evaluations for AI systems and at anyone who relies on benchmark rankings for decisions. Readers who care about reducing noise in leaderboards will find the stability objective and the majority-vote failure cases worth reading. The paper deserves a serious referee because the problem it attacks is central to reproducible AI evaluation and the proposed objective is well-defined, even though the empirical support needs tightening on the real-data side. I would send it out for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. It claims that this leads to more stable and statistically grounded system rankings compared to majority vote, as shown in synthetic experiments and real-world human-annotated benchmarks.

Significance. If the empirical findings are robust, STABLEVAL could improve the reliability of human evaluations in AI, addressing a key challenge in reproducible research. The emphasis on ranking stability as a primary objective is a notable contribution to the field of evaluation methodologies.

major comments (2)

[Synthetic Experiments] Synthetic Experiments section: The synthetic data appears to be generated from a latent model similar to the one used in STABLEVAL, raising the possibility that the reported improvements in stability are due to model alignment rather than general applicability. This is load-bearing for the claim of robustness under annotator heterogeneity.
[Real-world Benchmarks] Real-world Benchmarks section: There is no independent ground-truth measure of ranking stability provided for the human-annotated datasets, making it challenging to verify that the reductions in score error are not artifacts of the probabilistic modeling assumptions.

minor comments (2)

[Abstract] Abstract: The phrase 'statistically grounded system rankings' should be clarified with specific statistical measures or tests used to support the claims.
[Related Work] Related Work: Consider adding a more detailed comparison table with Dawid-Skene and other label aggregation methods to highlight the differences in objectives.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating planned revisions where appropriate to improve clarity and robustness.

Referee: [Synthetic Experiments] Synthetic Experiments section: The synthetic data appears to be generated from a latent model similar to the one used in STABLEVAL, raising the possibility that the reported improvements in stability are due to model alignment rather than general applicability. This is load-bearing for the claim of robustness under annotator heterogeneity.

Authors: We agree that the synthetic data generation shares structural elements with STABLEVAL to enable controlled simulation of annotator confusion and heterogeneity with known ground truth. This design choice isolates the impact of aggregation methods rather than testing recovery of the exact generative process. To strengthen the claim, we will add experiments using synthetic data generated from alternative models (e.g., independent per-annotator error rates without shared latent structure and non-probabilistic noise models) and report results in a revised Synthetic Experiments section.
revision: partial

Referee: [Real-world Benchmarks] Real-world Benchmarks section: There is no independent ground-truth measure of ranking stability provided for the human-annotated datasets, making it challenging to verify that the reductions in score error are not artifacts of the probabilistic modeling assumptions.

Authors: We acknowledge that real-world human annotations lack direct ground truth for system rankings, as item correctness is latent by nature. Stability is assessed via proxies including ranking variance across random annotator subsets and degradation under injected adversarial noise, which are standard for evaluating robustness in the absence of oracle labels. We will revise the Real-world Benchmarks section to more explicitly describe these proxies, include sensitivity checks to modeling assumptions, and discuss their limitations as indirect measures.
revision: partial
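The subset-resampling proxy mentioned in this response can be sketched as follows. This is a minimal illustration, not the authors' procedure: `subset_rank_stability`, the `(item, annotator, label)` triple format, and the default mean-label scorer are assumptions, and any other aggregation method can be passed in through `score_fn`.

```python
import random
from collections import defaultdict


def mean_label_scores(labels, item_agent):
    """Baseline scorer: an agent's score is the mean of all labels on its items."""
    sums, counts = defaultdict(float), defaultdict(int)
    for item, _, label in labels:
        agent = item_agent[item]
        sums[agent] += label
        counts[agent] += 1
    return {agent: sums[agent] / counts[agent] for agent in counts}


def ranking(scores):
    """Agents ordered best-first; ties broken by name so the order is deterministic."""
    return tuple(sorted(scores, key=lambda a: (-scores[a], a)))


def subset_rank_stability(labels, item_agent, k, score_fn=mean_label_scores,
                          n_trials=200, seed=0):
    """Fraction of random k-annotator subsets that reproduce the full-pool ranking.

    A subset that leaves some agent with no labels yields a shorter ranking
    and is counted as a mismatch.
    """
    rng = random.Random(seed)
    annotators = sorted({a for _, a, _ in labels})
    reference = ranking(score_fn(labels, item_agent))
    matches = 0
    for _ in range(n_trials):
        keep = set(rng.sample(annotators, k))
        subset = [row for row in labels if row[1] in keep]
        if ranking(score_fn(subset, item_agent)) == reference:
            matches += 1
    return matches / n_trials
```

Because the full-pool ranking serves as the reference, this measures self-consistency rather than correctness, which is the limitation the referee raises about indirect measures.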

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of a distinct modeling framework

The paper introduces STABLEVAL as a new disagreement-aware framework that models latent item correctness and annotator confusion patterns to produce posterior expected credits and calibrated scores, explicitly distinguishing it from label-recovery methods like Dawid-Skene. It formalizes ranking stability as an objective and supports claims via controlled synthetic experiments plus real-world human-annotated benchmarks showing reduced score error and instability under heterogeneity. No equations, derivations, or self-citations are shown that reduce outputs to inputs by construction, fitted parameters renamed as predictions, or ansatz smuggling. The central results depend on external benchmark comparisons rather than internal definitional equivalence or load-bearing self-references, making the derivation self-contained against the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard domain assumptions about annotator behavior that are common in crowdsourcing literature but not independently validated here.

axioms (2)

Domain assumption: Annotator responses arise from latent item correctness combined with annotator-specific confusion patterns. (Core modeling premise stated in the abstract for producing posterior expected item credit.)
Domain assumption: Modeling disagreement explicitly improves ranking stability over majority vote. (Claimed outcome of the framework that underpins the comparison to baseline aggregation.)
