Stop measuring calibration when humans disagree

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernandez · 2022 · DOI 10.18653/v1/2022.emnlp-main.124

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · method 1

citation-polarity summary

background 2 · use method 1

representative citing papers

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Controlled experiments on MNIST show human soft-labels act as a regularizer that improves calibration on hard samples and aligns model uncertainty with humans, beyond accuracy gains from correcting mislabels.

NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.

Quantifying and Predicting Disagreement in Graded Human Ratings

cs.CL · 2026-05-01 · unverdicted · novelty 5.0

Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.

Modeling Human Perspectives with Socio-Demographic Representations

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

Socio-Contrastive Learning jointly learns socio-demographic representations and textual features via contrastive objectives to predict annotator perspectives more accurately than concatenation baselines.

IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

Automated hate speech detectors show poor alignment with heterogeneous in-group judgments on reclaimed slur usage, driven by low inter-annotator agreement and contextual features like derogatory intent.

citing papers explorer

Showing 5 of 5 citing papers.

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration · cs.LG · 2026-05-18 · unverdicted · none · ref 5
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics · cs.CV · 2026-05-08 · unverdicted · none · ref 10
Quantifying and Predicting Disagreement in Graded Human Ratings · cs.CL · 2026-05-01 · unverdicted · none · ref 3
Modeling Human Perspectives with Socio-Demographic Representations · cs.CL · 2026-04-20 · unverdicted · none · ref 118
IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language · cs.CL · 2026-04-17 · unverdicted · none · ref 8
