
arxiv: 2605.02122 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

Pith reviewed 2026-05-09 16:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: AI evaluation, annotator disagreement, ranking stability, probabilistic modeling, human annotation, system ranking, disagreement-aware evaluation, majority vote

The pith

STABLEVAL models latent item correctness and annotator confusion to produce stable AI system rankings where majority vote fails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that human evaluations of AI systems suffer from unstable rankings when using simple majority vote, because that method ignores differences in annotator reliability and item ambiguity. STABLEVAL instead builds a probabilistic model of latent correctness for each item and confusion patterns specific to each annotator, then computes posterior expected credits and calibrated system scores with explicit uncertainty. A sympathetic reader would care because AI progress still depends on human judgment, and fragile rankings make it hard to know whether one system truly outperforms another across repeated evaluations. The authors support the claim with synthetic experiments that vary heterogeneity and noise plus several real human-annotated benchmarks, where majority vote degrades while STABLEVAL remains steadier.

Core claim

STABLEVAL is a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. It treats ranking stability as a first-class objective and shows that this approach preserves underlying annotator behavior better than majority vote or label-denoising methods such as Dawid-Skene, resulting in lower score error and more consistent system orderings under controlled heterogeneity and adversarial noise.

What carries the argument

The probabilistic model of latent item correctness together with annotator-specific confusion patterns, which generates posterior expected credits and calibrated scores rather than hard labels.
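
To make the mechanism concrete, here is a minimal sketch of a binary Dawid-Skene-style ("two-coin") model fitted with EM, where the per-item posterior is read as the expected credit and averaged per agent. This is an illustration under stated assumptions, not the paper's implementation: the function names `fit_two_coin_em` and `agent_scores`, the binary label space, the add-one smoothing, and the initialisation from vote fractions are all choices made here for the example.

```python
import math


def fit_two_coin_em(labels, n_iter=50, smooth=1.0):
    """Fit a binary Dawid-Skene-style model with EM (illustrative sketch).

    labels maps (item, annotator) -> 0 or 1 (1 = "response judged correct").
    Returns (credit, sens, spec), where credit[item] = P(z_item = 1 | labels).
    """
    annotators = sorted({a for _, a in labels})
    by_item = {}
    for (i, a), y in labels.items():
        by_item.setdefault(i, []).append((a, y))
    # Initialise each posterior with the item's fraction of positive votes.
    credit = {i: sum(y for _, y in v) / len(v) for i, v in by_item.items()}
    for _ in range(n_iter):
        # M-step: class prior and smoothed per-annotator confusion rates.
        prior = (sum(credit.values()) + smooth) / (len(credit) + 2 * smooth)
        hit1 = {a: smooth for a in annotators}      # soft count: y=1 and z=1
        all1 = {a: 2 * smooth for a in annotators}  # soft count: z=1
        hit0 = {a: smooth for a in annotators}      # soft count: y=0 and z=0
        all0 = {a: 2 * smooth for a in annotators}  # soft count: z=0
        for (i, a), y in labels.items():
            all1[a] += credit[i]
            all0[a] += 1.0 - credit[i]
            if y == 1:
                hit1[a] += credit[i]
            else:
                hit0[a] += 1.0 - credit[i]
        sens = {a: hit1[a] / all1[a] for a in annotators}
        spec = {a: hit0[a] / all0[a] for a in annotators}
        # E-step: posterior of z=1 per item, accumulated in log space.
        for i, votes in by_item.items():
            log1, log0 = math.log(prior), math.log(1.0 - prior)
            for a, y in votes:
                log1 += math.log(sens[a] if y == 1 else 1.0 - sens[a])
                log0 += math.log(spec[a] if y == 0 else 1.0 - spec[a])
            credit[i] = 1.0 / (1.0 + math.exp(min(log0 - log1, 700.0)))
    return credit, sens, spec


def agent_scores(credit, item_agent):
    """Average posterior credit over the items produced by each agent."""
    totals, counts = {}, {}
    for item, c in credit.items():
        agent = item_agent[item]
        totals[agent] = totals.get(agent, 0.0) + c
        counts[agent] = counts.get(agent, 0) + 1
    return {agent: totals[agent] / counts[agent] for agent in totals}
```

The design point the sketch tries to capture is that the score is an average of soft posteriors rather than of thresholded hard labels, so ambiguous items contribute partial credit instead of being forced to 0 or 1.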

If this is right

  • Majority vote exhibits increasing score error and ranking instability as annotator heterogeneity and adversarial noise grow.
  • STABLEVAL produces lower error and more stable system rankings across the same conditions.
  • Ranking stability must be treated as an explicit goal separate from recovering individual hard labels.
  • Disagreement modeling improves reproducibility of AI evaluations on both synthetic and real human-annotated data.
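
The first bullet can be illustrated with an exact calculation rather than a simulation. The sketch below computes the probability that a majority vote over independent binary annotators gets an item wrong, and shows how one might compare pools with more or fewer adversarial annotators. The accuracies 0.8 (honest) and 0.2 (adversarial), the independence assumption, and the helper names are illustrative assumptions, not values or code from the paper.

```python
from itertools import product


def majority_error_prob(accuracies):
    """Exact probability that a majority of independent binary votes is wrong.

    accuracies[j] is the probability that annotator j labels the item correctly.
    """
    n = len(accuracies)
    if n % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    wrong = 0.0
    # Enumerate every pattern of correct/incorrect votes and sum the
    # probability of the patterns in which correct votes are a minority.
    for outcome in product([True, False], repeat=n):
        p = 1.0
        for ok, acc in zip(outcome, accuracies):
            p *= acc if ok else 1.0 - acc
        if sum(outcome) * 2 < n:
            wrong += p
    return wrong


def pool(n_total, n_adversarial, honest_acc=0.8, adversarial_acc=0.2):
    """Hypothetical annotator pool: adversaries are right less than half the time."""
    return [adversarial_acc] * n_adversarial + [honest_acc] * (n_total - n_adversarial)


if __name__ == "__main__":
    for k in range(3):
        print(k, "adversarial of 5:", majority_error_prob(pool(5, k)))
```

Because majority vote weights every annotator equally, each honest annotator replaced by an adversarial one strictly raises the per-item error probability under these assumptions; a reliability-aware model can instead down-weight or invert such votes.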

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modeling approach could be applied to other subjective ranking tasks such as content moderation or creative evaluation to reduce dependence on single annotator pools.
  • Quantifying the amount of disagreement that still allows reliable rankings might let practitioners decide when additional annotators are worth the cost.
  • If the posteriors prove reliable, evaluation pipelines could report confidence intervals on system scores instead of point estimates.
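
As a rough sketch of the last point, one way to attach an interval to a system score is a percentile bootstrap over items. The function name `bootstrap_score_interval` and the example credits are hypothetical, and the approach is an assumption of this note rather than the paper's procedure; in particular it captures item-sampling variability only and ignores uncertainty in the fitted annotator parameters.

```python
import random


def bootstrap_score_interval(credits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item credits.

    credits is a list of per-item posterior credits in [0, 1].
    Returns (mean, lower, upper).
    """
    rng = random.Random(seed)
    n = len(credits)
    means = []
    for _ in range(n_boot):
        # Resample items with replacement and record the resampled mean.
        resample = [credits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(credits) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical posterior credits for one system's items.
    example = [0.9, 0.8, 0.95, 0.4, 0.7, 0.85, 0.6, 0.99]
    print(bootstrap_score_interval(example))
```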

Load-bearing premise

The chosen probabilistic model of latent item correctness and annotator confusion patterns will produce posteriors that genuinely reflect real-world stability rather than artifacts of the modeling assumptions.

What would settle it

Run the same set of items through multiple independent annotator groups and check whether STABLEVAL system rankings remain consistent across groups while majority-vote rankings flip; reversal of that pattern would falsify the stability advantage.
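
The check described above can be sketched as a rank-agreement computation: score every system once per annotator group, then measure how well the resulting orderings agree. The sketch uses Kendall's tau-a, which is one reasonable agreement statistic and not necessarily the stability metric the paper defines; the function names and the input format (one dict of system scores per group) are assumptions for illustration.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        da = scores_a[s] - scores_a[t]
        db = scores_b[s] - scores_b[t]
        if da * db > 0:
            concordant += 1      # both groups order the pair the same way
        elif da * db < 0:
            discordant += 1      # the two groups flip the pair
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(group_scores):
    """Average tau-a over all pairs of annotator groups (1.0 = identical rankings)."""
    taus = [kendall_tau_a(a, b) for a, b in combinations(group_scores, 2)]
    return sum(taus) / len(taus)
```

Running `mean_pairwise_tau` once on scores from the disagreement-aware method and once on majority-vote scores, for the same annotator groups, gives two numbers that can be compared directly; the stability claim predicts the first is at least as high as the second.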

Figures

Figures reproduced from arXiv: 2605.02122 by Akash Bonagiri, Angelina Lai, Devang Borkar, Gerard Janno Anderias, Gezheng Kang, Houman Homayoun, Ishant Gandhi, Saee Patil, Setareh Rafatirad.

Figure 1: Synthetic Evaluation Pipeline. Starting from a base configuration, we systematically vary six ablation parameters: adversarial fraction, strict and lenient annotator fractions, hard item probability, labels per item, and agent quality gaps. For each configuration, we generate observed labels, fit three aggregation methods (Majority Vote, Dawid-Skene (Hard), and Posterior Expected Credit), and compute evalua…
Figure 2: Real dataset evaluation pipeline. Four benchmark datasets (MT-Bench, ConvAbuse, QAGS, MSLR) with collected human labels are aggregated using three methods: Majority Vote, Dawid–Skene (Hard), and Posterior Expected Credit. Agent scores are computed and evaluated across four metrics: Agent Scores, Ranking Stability, Item Ambiguity, and Annotator Diagnostics.
Figure 4: MSE across varying proportions of biased annotators. Top: Strict annotators (0–40%). Bottom: Lenient annotators (0–40%). Dawid–Skene achieves the lowest error across all configurations.
Figure 5: Ranking Accuracy vs Adversarial Fraction. Ranking accuracy (with 95% confidence intervals) as the fraction of adversarial annotators increases from 0% to 40%. Posterior Expected Credit maintains near-perfect accuracy across all fractions. Majority Vote drops from 0.998 to 0.988 at 40% adversarial fraction.
Figure 8: MSE Across Agent Quality Configurations. MSE (with 95% confidence intervals) comparing aggregation methods under tight and wide quality gaps among agents. The tight configuration uses agent qualities [0.85, 0.80, 0.70, 0.55, 0.35, 0.20]; the wide configuration uses [0.75, 0.70, 0.65, 0.60, 0.55, 0.50]. Dawid–Skene achieves the lowest error in the tight configuration (0.00047). Majority Vote error increase…
Figure 9: Ranking Accuracy vs Agent Gap Type. Ranking accuracy (with 95% confidence intervals) comparing aggregation methods under tight and wide quality gaps among agents. Dawid–Skene and Posterior Expected Credit converge near 1.000 in the wide configuration. Majority Vote increases from 0.9684 in the tight configuration to 0.9982 in the wide configuration, trailing the other methods by approximately 0.003 in the …
Figure 12: MSE Across Varying Numbers of Labels Per Item. MSE (with 95% confidence intervals) comparing aggregation methods as the number of labels per item increases from 3 to 9. Dawid–Skene achieves the lowest error across all configurations. Majority Vote error decreases from 0.00478 with 3 labels to 0.00110 with 9 labels. Posterior Expected Credit error decreases from 0.00560 to 0.00098 across the same range.
Figure 13: Ranking Accuracy vs Labels Per Item. Ranking accuracy (with 95% confidence intervals) comparing aggregation methods as the number of labels per item increases from 3 to 9. Majority Vote improves monotonically from 0.9948 with 3 labels to 1.0000 with 9 labels. Dawid–Skene and Posterior Expected Credit show non-monotonic behavior, peaking near 5 labels before declining slightly, then recovering at 9 labels…
Figure 16: Agent scores comparison across methods on QAGS for two summarization models (CNN, XSUM). Scores shown for Majority Vote (green), Dawid-Skene Hard (blue), and Posterior Expected Credit (purple).
Figure 17: Agent scores comparison across three evaluation methods on MSLR for six agents: PX7SGV, 8FWF5T, SPNXTA, AQ85CE, VNCH8M, and JB6Z8F. Scores shown for Majority Vote (green), Dawid-Skene Hard (blue), and Posterior Expected Credit (purple).
Original abstract

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. It claims that this leads to more stable and statistically grounded system rankings compared to majority vote, as shown in synthetic experiments and real-world human-annotated benchmarks.

Significance. If the empirical findings are robust, STABLEVAL could improve the reliability of human evaluations in AI, addressing a key challenge in reproducible research. The emphasis on ranking stability as a primary objective is a notable contribution to the field of evaluation methodologies.

major comments (2)
  1. [Synthetic Experiments] Synthetic Experiments section: The synthetic data appears to be generated from a latent model similar to the one used in STABLEVAL, raising the possibility that the reported improvements in stability are due to model alignment rather than general applicability. This is load-bearing for the claim of robustness under annotator heterogeneity.
  2. [Real-world Benchmarks] Real-world Benchmarks section: There is no independent ground-truth measure of ranking stability provided for the human-annotated datasets, making it challenging to verify that the reductions in score error are not artifacts of the probabilistic modeling assumptions.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'statistically grounded system rankings' should be clarified with specific statistical measures or tests used to support the claims.
  2. [Related Work] Related Work: Consider adding a more detailed comparison table with Dawid-Skene and other label aggregation methods to highlight the differences in objectives.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating planned revisions where appropriate to improve clarity and robustness.

Point-by-point responses
  1. Referee: [Synthetic Experiments] Synthetic Experiments section: The synthetic data appears to be generated from a latent model similar to the one used in STABLEVAL, raising the possibility that the reported improvements in stability are due to model alignment rather than general applicability. This is load-bearing for the claim of robustness under annotator heterogeneity.

    Authors: We agree that the synthetic data generation shares structural elements with STABLEVAL to enable controlled simulation of annotator confusion and heterogeneity with known ground truth. This design choice isolates the impact of aggregation methods rather than testing recovery of the exact generative process. To strengthen the claim, we will add experiments using synthetic data generated from alternative models (e.g., independent per-annotator error rates without shared latent structure and non-probabilistic noise models) and report results in a revised Synthetic Experiments section. revision: partial

  2. Referee: [Real-world Benchmarks] Real-world Benchmarks section: There is no independent ground-truth measure of ranking stability provided for the human-annotated datasets, making it challenging to verify that the reductions in score error are not artifacts of the probabilistic modeling assumptions.

    Authors: We acknowledge that real-world human annotations lack direct ground truth for system rankings, as item correctness is latent by nature. Stability is assessed via proxies including ranking variance across random annotator subsets and degradation under injected adversarial noise, which are standard for evaluating robustness in the absence of oracle labels. We will revise the Real-world Benchmarks section to more explicitly describe these proxies, include sensitivity checks to modeling assumptions, and discuss their limitations as indirect measures. revision: partial
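
The two stability proxies named in the second response (random annotator subsets and injected adversarial noise) can be sketched as simple perturbations of a label table. The helper names `flip_annotators` and `annotator_subset`, the binary labels, and the `(item, annotator) -> label` dictionary layout are assumptions made for this illustration; the paper's actual perturbation protocol is not shown in this page.

```python
import random


def flip_annotators(labels, fraction, seed=0):
    """Inject adversarial noise by inverting every label of a random
    fraction of annotators. Returns (noisy_labels, flipped_annotators)."""
    rng = random.Random(seed)
    annotators = sorted({a for _, a in labels})
    n_flip = round(fraction * len(annotators))
    flipped = set(rng.sample(annotators, n_flip))
    noisy = {
        (i, a): (1 - y if a in flipped else y)
        for (i, a), y in labels.items()
    }
    return noisy, flipped


def annotator_subset(labels, k, seed=0):
    """Keep only the labels from k randomly chosen annotators."""
    rng = random.Random(seed)
    annotators = sorted({a for _, a in labels})
    keep = set(rng.sample(annotators, k))
    return {(i, a): y for (i, a), y in labels.items() if a in keep}
```

A stability study would re-run each aggregation method on many such perturbed tables (different seeds) and compare the resulting system rankings with the ranking obtained from the unperturbed labels.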

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of a distinct modeling framework

full rationale

The paper introduces STABLEVAL as a new disagreement-aware framework that models latent item correctness and annotator confusion patterns to produce posterior expected credits and calibrated scores, explicitly distinguishing it from label-recovery methods like Dawid-Skene. It formalizes ranking stability as an objective and supports its claims via controlled synthetic experiments plus real-world human-annotated benchmarks showing reduced score error and instability under heterogeneity. The provided text shows no equations, derivations, or self-citations that reduce outputs to inputs by construction, rename fitted parameters as predictions, or smuggle in an ansatz. The central results depend on external benchmark comparisons rather than on internal definitional equivalence or load-bearing self-references, so the argument is self-contained as far as the provided text shows.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard domain assumptions about annotator behavior that are common in crowdsourcing literature but not independently validated here.

axioms (2)
  • domain assumption: Annotator responses arise from latent item correctness combined with annotator-specific confusion patterns
    Core modeling premise stated in the abstract for producing posterior expected item credit
  • domain assumption: Modeling disagreement explicitly improves ranking stability over majority vote
    Claimed outcome of the framework that underpins the comparison to baseline aggregation
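
For readers who want the first axiom in symbols, one standard way to write such a model is the binary Dawid-Skene form below. This is a hedged reconstruction from the abstract's description, not the paper's stated likelihood: the symbols $z_i$ (latent correctness of item $i$), $y_{ij}$ (label from annotator $j$), $\pi$ (prior), $\theta^{(j)}_{kl}$ (confusion entries), $c_i$ (posterior expected credit), and $s_a$ (score of agent $a$ over its item set $I_a$) are all introduced here for illustration.

```latex
z_i \sim \mathrm{Bernoulli}(\pi), \qquad
P(y_{ij} = l \mid z_i = k) = \theta^{(j)}_{kl}, \quad k, l \in \{0, 1\}

c_i = P\bigl(z_i = 1 \mid \{y_{ij}\}_j\bigr)
    = \frac{\pi \prod_j \theta^{(j)}_{1,\,y_{ij}}}
           {\pi \prod_j \theta^{(j)}_{1,\,y_{ij}} + (1 - \pi) \prod_j \theta^{(j)}_{0,\,y_{ij}}}

s_a = \frac{1}{|I_a|} \sum_{i \in I_a} c_i
```

Under this reading, the first axiom is the conditional-independence structure in the first line, and the second axiom is the empirical claim that ranking systems by $s_a$ is more stable than ranking them by the fraction of items whose majority label is positive.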

pith-pipeline@v0.9.0 · 5510 in / 1392 out tokens · 34108 ms · 2026-05-09T16:49:03.882664+00:00 · methodology

