Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
DiADEM learns a demographic importance vector to predict which annotators will disagree on subjective items rather than defaulting to majority labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiADEM encodes annotators through per-demographic projections scaled by a learned importance vector α, fuses these representations with item content using both concatenation and Hadamard (element-wise product) interactions, and trains end-to-end with an item-level disagreement loss that penalizes mispredicted annotation variance. Evaluated on the DICES conversational-safety and VOICED political-offense datasets, the model exceeds LLM-as-a-judge and neural baselines on both standard metrics and perspectivist disagreement tracking, reaching a disagreement correlation of r=0.75 on DICES. The learned α weights identify race and age as the most influential demographic factors on both benchmarks.
What carries the argument
The learned importance vector α that governs per-demographic projections to model annotator disagreement distributions.
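To make the mechanism concrete, here is a minimal Python sketch of α-weighted annotator encoding followed by the two fusion interactions. It is an illustration under stated assumptions, not the authors' implementation: the softmax normalisation of α, the summation of the weighted projections, the one-hot demographic inputs, the toy matrices, and the names `encode_annotator` and `fuse` are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def encode_annotator(demo_onehots, projections, alpha_logits):
    """Sum per-demographic projections, each scaled by its importance weight.

    Assumption: alpha is obtained from free logits via softmax; the paper's
    exact parameterisation is not given in this summary.
    """
    alpha = softmax(alpha_logits)
    hidden_dim = len(projections[0])
    encoded = [0.0] * hidden_dim
    for weight, matrix, onehot in zip(alpha, projections, demo_onehots):
        projected = matvec(matrix, onehot)
        encoded = [e + weight * p for e, p in zip(encoded, projected)]
    return encoded

def fuse(annotator_vec, item_vec):
    """Concatenate both vectors with their Hadamard (element-wise) product."""
    hadamard = [a * b for a, b in zip(annotator_vec, item_vec)]
    return annotator_vec + item_vec + hadamard

# Toy setup (hypothetical): two demographic axes, hidden size 2.
W_race = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # projects a 3-way one-hot
W_age = [[1.0, 0.0], [0.0, 1.0]]             # projects a 2-way one-hot
race, age = [0.0, 1.0, 0.0], [1.0, 0.0]

annotator_vec = encode_annotator([race, age], [W_race, W_age], [0.0, 0.0])
fused = fuse(annotator_vec, [2.0, 4.0])

# Raising the race logit pulls the encoding toward the race projection.
race_heavy = encode_annotator([race, age], [W_race, W_age], [2.0, 0.0])
```

With equal logits both axes contribute equally; in `race_heavy` the second coordinate (the race projection in this toy setup) dominates, which is the sense in which α expresses how much each axis matters.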
If this is right
- Explicit demographic modeling produces higher-fidelity representations of annotator variance than majority aggregation or prompted LLMs.
- Race and age receive the highest learned importance for disagreement prediction in both conversational safety and political offense tasks.
- Item-level disagreement losses enable direct optimization for variance rather than point estimates.
- The same architecture yields consistent demographic insights across two independent subjective benchmarks.
- NLP systems that incorporate who the annotators are can preserve interpretive diversity instead of erasing it.
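The point about optimizing variance rather than point estimates can be illustrated with a small Python sketch of an item-level disagreement loss. The exact loss in the paper is not reproduced in this summary, so the form below (squared gap between predicted and observed per-item variance, averaged over items) is an assumption, as are the names `variance` and `disagreement_loss` and the toy data.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def disagreement_loss(pred_by_item, obs_by_item):
    """Mean squared gap between predicted and observed per-item variance.

    Assumption: each item maps to one score per annotator, in [0, 1].
    """
    gaps = []
    for item_id, observed in obs_by_item.items():
        predicted = pred_by_item[item_id]
        gaps.append((variance(predicted) - variance(observed)) ** 2)
    return sum(gaps) / len(gaps)

# Toy data (hypothetical): item "a" splits annotators, item "b" does not.
observed = {"a": [0, 1, 0, 1], "b": [1, 1, 1, 1]}

# A model that predicts the same score for every annotator has zero
# predicted variance on "a" and is penalized for it.
flat_predictions = {"a": [0.5, 0.5, 0.5, 0.5], "b": [1, 1, 1, 1]}
loss_flat = disagreement_loss(flat_predictions, observed)

# Predictions that reproduce the observed split incur no penalty.
loss_matched = disagreement_loss(observed, observed)
```

In this toy case the flat predictor is correct on average for item "a" yet still incurs a loss, because the term measures spread across annotators rather than the mean label.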
Where Pith is reading between the lines
- The method could be tested on intersectional demographic combinations to check whether combined factors like race and gender interact beyond single-axis weights.
- Similar importance weighting might reduce over-reliance on majority labels in content moderation pipelines where perspective differences affect safety judgments.
- Collecting finer-grained or self-reported lived-experience data instead of standard demographic categories could strengthen the proxy assumption in future datasets.
Load-bearing premise
Recorded demographic attributes serve as accurate and sufficient proxies for the lived experiences that actually determine annotator perspectives, and the learned weights apply to new annotators and items.
What would settle it
DiADEM's disagreement correlation falling below that of majority-vote or LLM baselines on a held-out subjective labeling dataset whose demographic labels show no systematic link to observed variance.
Original abstract
When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiADEM, a neural architecture that learns a demographic importance vector α to encode annotators via per-demographic projections, fuses representations using complementary concatenation and Hadamard interactions, and trains with an item-level disagreement loss. It claims to substantially outperform LLM-as-a-judge and neural baselines on the DICES and VOICED benchmarks, achieving r=0.75 disagreement correlation on DICES, while the learned α consistently identifies race and age as the most influential factors driving annotator disagreement.
Significance. If the empirical results hold under more rigorous validation, the work is significant for perspectivist NLP and annotator modeling: it moves beyond majority-vote flattening or generic LLM prompting by explicitly learning which demographic axes matter for disagreement prediction. The data-driven learning of the importance vector and the direct optimization of annotation variance are strengths that could influence dataset curation and evaluation practices in subjective content labeling.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the central claim of substantial outperformance (including r=0.75 on DICES) is presented without reported baseline implementation details, statistical significance tests, data-split protocols, or ablation results on the α vector; this directly affects verifiability of the headline result.
- [§5] §5 (Results and Analysis): no cross-dataset or out-of-distribution evaluation is reported for the learned α (e.g., applying the model to new annotators whose demographics are drawn from a shifted distribution), so the claim that race and age are consistently the most influential factors rests on only two benchmarks and cannot yet be treated as general.
- [§3] §3 (Method): the architecture assumes the recorded demographic fields are accurate, complete, and causally primary proxies for the perspectives that drive disagreement; the paper provides no sensitivity analysis or controls for potential confounders (education, region, ideology) that could make the fitted α an artifact of the particular dataset encoding.
minor comments (2)
- [§3] Notation: ensure the importance vector is consistently denoted as bold α throughout and that the Hadamard product is explicitly defined when first introduced.
- [Tables] Table 1 or equivalent: add a row or column reporting the number of annotators per demographic category to allow readers to assess sparsity in the learned weights.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of verifiability, generalizability, and methodological assumptions that we address point by point below. We have outlined specific revisions to strengthen the manuscript while preserving the integrity of the reported results.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim of substantial outperformance (including r=0.75 on DICES) is presented without reported baseline implementation details, statistical significance tests, data-split protocols, or ablation results on the α vector; this directly affects verifiability of the headline result.
  Authors: We agree that additional implementation details are required for full reproducibility and verification. In the revised manuscript, we will expand §4 with: complete descriptions of baseline implementations (including exact LLM prompts and neural model hyperparameters), explicit data-split protocols (annotator-level versus item-level), statistical significance tests (e.g., bootstrap confidence intervals or paired tests on the correlation metrics), and ablation results on the α vector (uniform weights, per-demographic removals). These additions will directly support the reported performance, including the r=0.75 disagreement correlation on DICES.
  Revision: yes
- Referee: [§5] §5 (Results and Analysis): no cross-dataset or out-of-distribution evaluation is reported for the learned α (e.g., applying the model to new annotators whose demographics are drawn from a shifted distribution), so the claim that race and age are consistently the most influential factors driving annotator disagreement rests on only two benchmarks and cannot yet be treated as general.
  Authors: We acknowledge that the consistency of race and age as top-weighted factors in α is observed only across the two benchmarks used. Cross-dataset and out-of-distribution evaluations for α are not feasible without additional datasets containing compatible demographic annotations and annotator distributions, which are unavailable. In the revision, we will update §5 to qualify the language, present the finding as specific to DICES and VOICED, and add a limitations discussion calling for future multi-benchmark validation. This preserves the observed pattern without overstating generality.
  Revision: partial
- Referee: [§3] §3 (Method): the architecture assumes the recorded demographic fields are accurate, complete, and causally primary proxies for the perspectives that drive disagreement; the paper provides no sensitivity analysis or controls for potential confounders (education, region, ideology) that could make the fitted α an artifact of the particular dataset encoding.
  Authors: This is a valid observation on the scope of the demographic proxies. DiADEM uses the self-reported fields exactly as provided in the source datasets and does not assert causal primacy. The benchmarks lack metadata on potential confounders such as education, region, or ideology, precluding sensitivity analyses. In the revised §3 and a new limitations subsection, we will explicitly state this assumption, clarify that α captures associations within the given encodings, and recommend richer annotator profiles for future work. These clarifications qualify the interpretation without altering the method or results.
  Revision: yes
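The bootstrap confidence intervals proposed in the first response could be computed along the following lines. This is a hedged sketch, not the authors' protocol: it assumes the disagreement correlation is a Pearson r between predicted and observed per-item variance, resamples items with replacement, and uses a percentile interval. The names `pearson` and `bootstrap_ci` and the toy variances are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5

def bootstrap_ci(pred_var, obs_var, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for r, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(pred_var)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([pred_var[i] for i in idx], [obs_var[i] for i in idx])
        if r == r:  # drop NaN from degenerate resamples
            stats.append(r)
    if not stats:
        raise ValueError("all resamples were degenerate")
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 + level) / 2 * len(stats)))]
    return lower, upper

# Toy per-item variances (hypothetical numbers, not results from the paper).
pred_var = [0.02, 0.10, 0.20, 0.05, 0.24, 0.15, 0.01, 0.18]
obs_var = [0.00, 0.12, 0.25, 0.09, 0.21, 0.16, 0.04, 0.19]

point_estimate = pearson(pred_var, obs_var)
ci_low, ci_high = bootstrap_ci(pred_var, obs_var)
```

Resampling at the item level treats items as the unit of analysis; under an annotator-level split, resampling annotators instead may be the more appropriate choice.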
Circularity Check
No significant circularity; DiADEM performance and α weights are direct empirical outputs
The paper defines a neural model that learns a demographic importance vector α from training data, fuses representations, and optimizes a disagreement loss; it then reports empirical outperformance on DICES and VOICED benchmarks plus the resulting α ranking. These quantities are produced by standard supervised training and evaluation on held-out annotations rather than by any equation that reduces the claimed result to a re-expression of its own fitted parameters. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described architecture. The central claims remain falsifiable against external data and do not collapse by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- α (demographic importance vector)
axioms (1)
- domain assumption: Neural networks with the described fusion operations can learn to predict annotator label distributions from demographic and item features.