arXiv: 2604.07591 · v1 · submitted 2026-04-08 · stat.ME · cs.AI · cs.CL · cs.LG · stat.ML

From Ground Truth to Measurement: A Statistical Framework for Human Labeling

Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3

classification: stat.ME · cs.AI · cs.CL · cs.LG · stat.ML
keywords: human labeling, measurement error, label variation, annotator bias, instance difficulty, natural language inference, statistical modeling, data quality

The pith

A statistical framework decomposes human labeling into four distinct sources of variation rather than treating all disagreement as noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that supervised machine learning should treat labeled data as the outcome of a measurement process instead of assuming labels directly reflect an accurate ground truth. It introduces a model that separates labeling variation into instance difficulty, annotator bias, situational noise, and relational alignment among annotators. This setup extends classical measurement-error models so they can handle both a single shared truth and cases where different annotators hold valid but differing interpretations. The framework also supplies a diagnostic tool to determine which of those two regimes better describes a particular annotation task. When applied to a multi-annotator natural language inference dataset, the model detects evidence of all four variation components.

Core claim

The central claim is that labeling outcomes can be decomposed into four interpretable components—instance difficulty, annotator bias, situational noise, and relational alignment—within a statistical model that extends classical measurement-error models to accommodate both shared and individualized notions of truth, together with a diagnostic for deciding which regime fits a given task.

What carries the argument

The four-component statistical decomposition of labeling variation inside an extended classical measurement-error model.
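
Read together with the figure captions reproduced further down, the decomposition can be pictured as an additive model on a latent scale. The block below is a sketch under stated assumptions, not the paper's exact specification. The symbols β_j (instance-level term), ρ_i (annotator-level term), δ_ij (annotator-by-instance term), the latent propensity Y*_ijt, and the link Φ appear in the captions; the intercept μ, the noise symbol ε_ijt, the restriction to a binary label, and the normal noise distribution are assumptions introduced here for illustration. The indices are read as i for annotators, j for instances, and t for repeated trials.

```latex
% Hypothetical binary-label version. \mu, \varepsilon_{ijt}, and the
% standard-normal noise are illustrative assumptions, not taken from the paper.
\begin{align}
  Y^{*}_{ijt} &= \mu + \beta_j + \rho_i + \delta_{ij} + \varepsilon_{ijt},
    \qquad \varepsilon_{ijt} \sim \mathcal{N}(0, 1), \\
  Y_{ijt} &= \mathbb{1}\{\, Y^{*}_{ijt} > 0 \,\}, \\
  \Pr(Y_{ijt} = 1 \mid \beta_j, \rho_i, \delta_{ij})
    &= \Phi(\mu + \beta_j + \rho_i + \delta_{ij}).
\end{align}
```

Under these assumptions Φ is the standard normal CDF, and the last line follows from the symmetry of the noise distribution. Because the terms add on the latent scale, their effects on label probabilities are not additive, which is consistent with the Figure 2 caption's remark about the probability/odds scale.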

If this is right

  • Label disagreement can be diagnosed as arising from item difficulty, annotator bias, random noise, or alignment differences rather than lumped together as uniform error.
  • A given annotation task can be classified as closer to a traditional shared-truth regime or to an individualized-interpretation regime.
  • Data collection and quality control can be targeted at the specific variation components that dominate in a task.
  • Machine-learning pipelines can move from treating all labels as equally reliable toward accounting for the identified sources of measurement variation.
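
To make the contrast between the two regimes concrete, the following toy simulation may help. It is not the paper's diagnostic. It is a minimal sketch that assumes a binary label generated by thresholding the sum of an item effect, an annotator effect, an annotator-by-item effect, and fresh per-trial noise, and that each annotator labels each item twice. All function names, parameter names, and numeric settings are hypothetical. The idea is that when annotators hold stable but differing readings of an item, an annotator tends to agree with their own earlier label more often than with other annotators; when only a shared item effect and noise are present, the two agreement rates should be close.

```python
import random


def simulate_labels(n_annotators, n_items, sd_item, sd_annotator,
                    sd_interaction, sd_noise, seed=0):
    """Simulate two labeling trials per (annotator, item) pair.

    Hypothetical generative sketch: a label is 1 when the sum of an item
    effect, an annotator effect, an annotator-by-item effect, and fresh
    per-trial noise is positive.
    """
    rng = random.Random(seed)
    item = [rng.gauss(0, sd_item) for _ in range(n_items)]
    annot = [rng.gauss(0, sd_annotator) for _ in range(n_annotators)]
    labels = []  # labels[i][j] = (label at trial 0, label at trial 1)
    for i in range(n_annotators):
        row = []
        for j in range(n_items):
            # Drawn once per pair, so it is shared by both trials.
            stable = item[j] + annot[i] + rng.gauss(0, sd_interaction)
            # Noise is redrawn for every trial.
            row.append(tuple(int(stable + rng.gauss(0, sd_noise) > 0)
                             for _ in range(2)))
        labels.append(row)
    return labels


def within_annotator_agreement(labels):
    """Share of (annotator, item) pairs labeled identically on both trials."""
    pairs = [cell for row in labels for cell in row]
    return sum(a == b for a, b in pairs) / len(pairs)


def between_annotator_agreement(labels):
    """Share of annotator pairs that agree on an item, using trial 0 only."""
    n_annotators, n_items = len(labels), len(labels[0])
    agree = total = 0
    for j in range(n_items):
        ones = sum(labels[i][j][0] for i in range(n_annotators))
        zeros = n_annotators - ones
        # Count agreeing pairs: both labeled 1, or both labeled 0.
        agree += ones * (ones - 1) // 2 + zeros * (zeros - 1) // 2
        total += n_annotators * (n_annotators - 1) // 2
    return agree / total


# Scenario 1: only a shared item effect plus noise.
shared = simulate_labels(20, 300, sd_item=1.5, sd_annotator=0.0,
                         sd_interaction=0.0, sd_noise=1.0, seed=1)
# Scenario 2: stable annotator and annotator-by-item effects are present.
individual = simulate_labels(20, 300, sd_item=1.0, sd_annotator=0.5,
                             sd_interaction=1.2, sd_noise=1.0, seed=1)

for name, data in [("shared", shared), ("individualized", individual)]:
    print(name,
          round(within_annotator_agreement(data), 3),
          round(between_annotator_agreement(data), 3))
```

The comparison only works because each annotator labels an item more than once. With a single label per annotator and item, a stable annotator-by-item effect and one-off noise would look the same in this sketch.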

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition could be applied to image, audio, or medical annotation tasks to reveal whether their dominant variation sources differ systematically from text tasks.
  • Active learning or crowdsourcing platforms might use the estimated annotator-bias and alignment parameters to assign items to specific workers.
  • The diagnostic could be used to decide when to invest in clearer annotation guidelines versus when to accept and model multiple valid labels.

Load-bearing premise

The four sources of variation can be statistically identified and cleanly separated from ordinary multi-annotator data without the split being dictated by the particular model structure or fitting method chosen.

What would settle it

Fitting the model to the same multi-annotator dataset under two different estimation procedures or model structures and finding that the estimated sizes or identities of the four components change substantially or become statistically indistinguishable.

Figures

Figures reproduced from arXiv: 2604.07591 by Christoph Kern, Frauke Kreuter, Robert Chew, Stephanie Eckman.

Figure 1: Illustration of error components in the Global Ground Truth model. Each row demonstrates a different source of annotation error using a claim verification task. Instance-level error (β_j): Both annotators make the same mistake on an ambiguous claim where multiple interpretations are justifiable. Between-person error (ρ_i): Annotators differ consistently across trials due to stable individual characteristi…
Figure 2: Measurement model structure. The conceptual error decomposition (top) is operationalized as a latent propensity Y*_ijt (middle), which is transformed via link function Φ to generate observed categorical labels (bottom). The additive structure on the latent scale implies multiplicative effects on the probability/odds scale. Following the tradition of Item Response Theory (Lord and Novick, 2008) and general…
Figure 3: Illustration of variation components in the Individual Ground Truth model. Interpretive variation (δ_ij): Both annotators are internally consistent but apply different offensiveness thresholds. Instance ambiguity (β_j): Context-dependence creates genuine uncertainty; neither annotator can consistently self-validate any interpretation across trials, demonstrating that some instances resist stable judgment e…

Original abstract

Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a statistical framework for decomposing human annotation outcomes into four interpretable sources of variation—instance difficulty, annotator bias, situational noise, and relational alignment—by extending classical measurement-error models. The framework accommodates both shared and individualized notions of truth and includes a diagnostic to assess which regime better characterizes a labeling task. Application to a multi-annotator NLI dataset is reported to show empirical support for all four components, with implications for data-centric machine learning.

Significance. If the decomposition is statistically identifiable and the diagnostic is robust to modeling choices, the work could meaningfully advance annotation science in ML by moving beyond treating disagreement as undifferentiated noise. The explicit linkage to measurement theory and the empirical demonstration on NLI data are strengths that could guide more systematic data curation and evaluation practices.

major comments (2)
  1. [§3] §3 (model formulation): The central claim that the four variation components can be separated from multi-annotator labels requires an identifiability analysis or recovery simulation. Different parameter configurations must produce observably distinct label distributions; absent this, the diagnostic for shared vs. individualized truth regimes risks being an artifact of the likelihood structure or fitting constraints rather than a data-driven property.
  2. [§5] §5 (empirical results): The NLI application reports that all four components appear, but without reported parameter recovery experiments under controlled misspecification or sensitivity checks on the decomposition, it is unclear whether the separation is driven by the data or by the model specification (e.g., orthogonality assumptions or scale constraints on the four components).
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the quantitative criteria used for the diagnostic (e.g., thresholds on component variances or model comparison metrics) to make the regime-assessment procedure reproducible.
  2. [Notation] Notation for the four components should be introduced once with a clear table or equation reference and then used consistently to improve readability across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript's potential contribution and for the detailed, constructive comments on identifiability and empirical validation. We address each major comment below and have revised the manuscript to incorporate additional analyses that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [§3] §3 (model formulation): The central claim that the four variation components can be separated from multi-annotator labels requires an identifiability analysis or recovery simulation. Different parameter configurations must produce observably distinct label distributions; absent this, the diagnostic for shared vs. individualized truth regimes risks being an artifact of the likelihood structure or fitting constraints rather than a data-driven property.

    Authors: We agree that a rigorous demonstration of identifiability is necessary to support the central claims. In the revised manuscript we have added a new subsection to §3 that provides a formal identifiability analysis. We derive sufficient conditions (including rank conditions on the design matrix of annotator-item interactions and non-degeneracy of the noise distributions) under which the four components are uniquely recoverable from the observed label distributions. We further include Monte Carlo recovery simulations that generate synthetic data from known parameter configurations and show that the model recovers the true values with low bias and that the shared-versus-individualized diagnostic recovers the ground-truth regime at high accuracy. These additions establish that distinct parameter configurations produce observably different label distributions and that the diagnostic is not an artifact of the likelihood or fitting constraints. revision: yes

  2. Referee: [§5] §5 (empirical results): The NLI application reports that all four components appear, but without reported parameter recovery experiments under controlled misspecification or sensitivity checks on the decomposition, it is unclear whether the separation is driven by the data or by the model specification (e.g., orthogonality assumptions or scale constraints on the four components).

    Authors: We acknowledge that additional checks are needed to confirm that the observed decomposition is data-driven. The revised §5 now contains two new sets of experiments. First, we report parameter-recovery simulations on synthetic data whose marginal statistics match those of the NLI corpus, including controlled misspecification of the noise model. Second, we present sensitivity analyses that systematically vary the orthogonality constraints between components and the scale-normalization choices, re-fitting the model on the real NLI data under each variant. The results show that the estimated magnitudes of all four components remain stable and that the conclusion of non-negligible contributions from each source is robust to these modeling choices. We have also added a brief discussion of the identifiability conditions that justify the chosen constraints. revision: yes
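
The kind of recovery check discussed in this exchange can be illustrated with a small simulation. The sketch below is not the authors' procedure and does not fit the paper's categorical model. It assumes, for simplicity, that continuous scores are observed directly and that every annotator scores every item the same number of times, then applies a textbook method-of-moments estimator (two-way random-effects ANOVA with replication) to see whether known variance components are recovered. The function names, the dictionary keys, and the variance values are hypothetical.

```python
import random
from statistics import fmean


def simulate_scores(n_annotators, n_items, n_trials, var, seed=0):
    """Draw y[i][j][t] = item_j + annotator_i + interaction_ij + noise_ijt."""
    rng = random.Random(seed)
    sd = {name: value ** 0.5 for name, value in var.items()}
    item = [rng.gauss(0, sd["item"]) for _ in range(n_items)]
    annot = [rng.gauss(0, sd["annotator"]) for _ in range(n_annotators)]
    y = []
    for i in range(n_annotators):
        row = []
        for j in range(n_items):
            # Stable part of the score for this annotator-item pair.
            stable = item[j] + annot[i] + rng.gauss(0, sd["interaction"])
            row.append([stable + rng.gauss(0, sd["noise"])
                        for _ in range(n_trials)])
        y.append(row)
    return y


def estimate_components(y):
    """Method-of-moments variance estimates for a balanced crossed design."""
    I, J, T = len(y), len(y[0]), len(y[0][0])
    cell = [[fmean(y[i][j]) for j in range(J)] for i in range(I)]
    annot_mean = [fmean(cell[i]) for i in range(I)]
    item_mean = [fmean(cell[i][j] for i in range(I)) for j in range(J)]
    grand = fmean(annot_mean)

    # Mean squares for replication noise, interaction, annotators, and items.
    ms_noise = sum((v - cell[i][j]) ** 2
                   for i in range(I) for j in range(J)
                   for v in y[i][j]) / (I * J * (T - 1))
    ms_inter = T * sum((cell[i][j] - annot_mean[i] - item_mean[j] + grand) ** 2
                       for i in range(I)
                       for j in range(J)) / ((I - 1) * (J - 1))
    ms_annot = J * T * sum((m - grand) ** 2 for m in annot_mean) / (I - 1)
    ms_item = I * T * sum((m - grand) ** 2 for m in item_mean) / (J - 1)

    # Solve the expected-mean-square equations for each variance component.
    return {
        "noise": ms_noise,
        "interaction": (ms_inter - ms_noise) / T,
        "annotator": (ms_annot - ms_inter) / (J * T),
        "item": (ms_item - ms_inter) / (I * T),
    }


truth = {"item": 1.0, "annotator": 0.5, "interaction": 0.3, "noise": 1.0}
scores = simulate_scores(60, 150, 3, truth, seed=0)
estimates = estimate_components(scores)
for name in truth:
    print(name, "true:", truth[name], "estimated:", round(estimates[name], 3))
```

In this toy setting the interaction and noise terms can be told apart only because each annotator scores each item more than once (n_trials > 1). With categorical labels and a nonlinear link, recovery would have to be checked with the actual fitting procedure, which is the point the referee raises.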

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a new statistical framework that extends classical measurement-error models by introducing four interpretable components of labeling variation and a diagnostic for shared vs. individualized truth. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed known results; the decomposition is defined via the model structure and applied empirically to data. The abstract and description contain no equations or claims where a 'prediction' or uniqueness result is forced by the inputs themselves. This is the common case of an honest modeling contribution whose validity rests on external identifiability and recovery properties rather than tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is inferred from the described extension of classical measurement-error models. The framework likely introduces parameters for each of the four variation sources and assumes the components are identifiable from annotation patterns.

free parameters (1)
  • parameters for instance difficulty, annotator bias, situational noise, and relational alignment
    The model must estimate separate variance or effect terms for each of the four sources to decompose the observed labels.
axioms (1)
  • domain assumption: Classical measurement-error models can be extended to labeling tasks with both shared and individualized truth
    The paper states that the framework extends these models to accommodate both interpretations of error.

pith-pipeline@v0.9.0 · 5491 in / 1257 out tokens · 37327 ms · 2026-05-10T17:04:39.599873+00:00 · methodology

