The “problem” of human label variation: On ground truth in data, modeling and evaluation

Plank, B · 2022 · arXiv 2211.02570

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Implicit Humanization in Everyday LLM Moral Judgments

cs.CY · 2026-03-23 · unverdicted · novelty 7.0

LLM responses to moral judgment queries reinforce implicit humanization, potentially exacerbating overreliance and misplaced trust.

Understanding Annotator Safety Policy with Interpretability

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

A framework treating clinician overrides as implicit preferences to jointly train reward and capability models for clinical AI, with a taxonomy and alternating optimization to prevent suppression bias.

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.

From Ground Truth to Measurement: A Statistical Framework for Human Labeling

stat.ME · 2026-04-08 · unverdicted · novelty 6.0

A statistical framework decomposes human annotation outcomes into four interpretable variation sources and extends classical measurement-error models to handle both shared and individualized notions of truth.

Calibrating Model-Based Evaluation Metrics for Summarization

cs.CL · 2026-04-19 · unverdicted · novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

citing papers explorer

Showing 7 of 7 citing papers.

Implicit Humanization in Everyday LLM Moral Judgments
cs.CY · 2026-03-23 · unverdicted · none · ref 37

Understanding Annotator Safety Policy with Interpretability
cs.AI · 2026-05-06 · unverdicted · none · ref 57

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care
cs.LG · 2026-04-30 · unverdicted · none · ref 17 · 2 links

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
cs.CL · 2026-04-22 · unverdicted · none · ref 156

Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
cs.CL · 2026-04-19 · unverdicted · none · ref 13

From Ground Truth to Measurement: A Statistical Framework for Human Labeling
stat.ME · 2026-04-08 · unverdicted · none · ref 17

Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL · 2026-04-19 · unverdicted · none · ref 126
