We Need to Consider Disagreement in Evaluation

Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, Alexandra Uma · 2021 · DOI 10.18653/v1/2021.bppf-1.3

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

Large-scale statistical analysis of four harmful language datasets reveals that interactions between annotator characteristics and linguistic cues drive annotation variation, with lexical features and attitudes prominent but patterns varying by dataset.

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.

Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Disagreement in health-literacy annotations is driven by conceptual task difficulty rather than annotator differences, with social effects varying or reversing by agreement level, making perspectivist modeling necessary.

IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

Automated hate speech detectors show poor alignment with heterogeneous in-group judgments on reclaimed slur usage, driven by low inter-annotator agreement and contextual features like derogatory intent.

citing papers explorer

Showing 4 of 4 citing papers.

Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation
cs.CL · 2026-05-07 · unverdicted · none · ref 63
Large-scale statistical analysis of four harmful language datasets reveals that interactions between annotator characteristics and linguistic cues drive annotation variation, with lexical features and attitudes prominent but patterns varying by dataset.

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
cs.LG · 2026-05-13 · unverdicted · none · ref 6
Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.

Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference
cs.CL · 2026-04-21 · unverdicted · none · ref 28
Disagreement in health-literacy annotations is driven by conceptual task difficulty rather than annotator differences, with social effects varying or reversing by agreement level, making perspectivist modeling necessary.

IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
cs.CL · 2026-04-17 · unverdicted · none · ref 10
Automated hate speech detectors show poor alignment with heterogeneous in-group judgments on reclaimed slur usage, driven by low inter-annotator agreement and contextual features like derogatory intent.
