A framework jointly models annotator-specific NLI labels and explanations using conditioned representations and two explainer architectures, improving predictive performance over baselines.
Toward a perspectivist turn in ground truthing for predictive computing
8 Pith papers cite this work.
Years: 2026 (8)
Verdicts: UNVERDICTED (8)
Representative citing papers
Demographic-conditioned fusion embeddings improve prediction of perspectivist social meaning interpretations by 5.9-6.5% relative macro PR-AUC over text-only baselines, with ablations confirming demographic signal.
The Ghost Annotator framework applies conformal prediction and collaborative filtering representations to measure LLM divergence from human annotations across four models and datasets, revealing higher confidence in misaligned cases and consistent demographic misalignment.
Agreement-based clustering of annotators improves performance on subjective NLP tasks by capturing diverse perspectives better than majority voting or per-annotator modeling.
Large-scale statistical analysis of four harmful language datasets reveals that interactions between annotator characteristics and linguistic cues drive annotation variation, with lexical features and attitudes prominent but patterns varying by dataset.
Extending language models with annotator-specific layers improves individual moral annotation predictions and reveals perspective variations hidden by label aggregation.
A domain-agnostic framework extracts perspectives from book reviews, showing that LLMs underrepresent rarer viewpoints relative to human text.
Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.
citing papers explorer
- Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.