Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

Christopher M. Homan; Deepak Pandita; Samay U. Shetty; Tharindu Cyril Weerasooriya

arXiv: 2604.08425 · v1 · submitted 2026-04-09 · 💻 cs.AI · cs.CL


Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3


keywords: annotator disagreement · demographic importance weighting · perspectivist NLP · subjective labeling · disagreement modeling · DICES benchmark · VOICED benchmark · neural architecture


The pith

DiADEM learns a demographic importance vector to predict which annotators will disagree on subjective items rather than defaulting to majority labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that disagreement among human annotators on subjective content arises from real differences in perspective tied to social identities, not random error. It demonstrates that prompted large language models cannot recover the patterns of this disagreement even with reasoning steps. DiADEM addresses this by learning how much each demographic attribute should shape predictions of annotator distributions through a dedicated importance vector. The approach combines annotator and item features with a custom loss that targets variance directly and is shown to outperform standard baselines on two benchmarks. A sympathetic reader would care because many NLP tasks involve subjective judgments where flattening to one label erases meaningful diversity.

Core claim

DiADEM encodes annotators through per-demographic projections scaled by a learned importance vector α, fuses these representations with item content using both concatenation and Hadamard interactions, and trains end-to-end with an item-level disagreement loss that penalizes incorrect variance predictions. Evaluated on the DICES conversational-safety and VOICED political-offense datasets, the model exceeds LLM-as-a-judge and neural baselines on both standard metrics and perspectivist disagreement tracking, reaching r=0.75 correlation on DICES. The resulting α weights identify race and age as the most influential demographic factors across both benchmarks.
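
The fusion step described here can be pictured with a minimal sketch. It assumes the annotator and item vectors share a dimension so that an elementwise (Hadamard) product is defined. The function name `fuse` and the order of the concatenated parts are illustrative choices, not details taken from the paper.

```python
from typing import List


def fuse(annotator_vec: List[float], item_vec: List[float]) -> List[float]:
    """Build [annotator ; item ; annotator * item] (hypothetical layout).

    The first two parts are plain concatenation; the third is the Hadamard
    product, which exposes per-dimension annotator-item interactions.
    """
    if len(annotator_vec) != len(item_vec):
        raise ValueError("Hadamard product needs equal-length vectors")
    hadamard = [a * x for a, x in zip(annotator_vec, item_vec)]
    return annotator_vec + item_vec + hadamard


fused = fuse([1.0, 2.0], [3.0, -1.0])
print(fused)  # [1.0, 2.0, 3.0, -1.0, 3.0, -2.0]
```

Concatenation keeps each representation intact, while the product term lets a downstream layer see where the two vectors agree or conflict in sign and magnitude.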

What carries the argument

The learned importance vector α that governs per-demographic projections to model annotator disagreement distributions.
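
One way to read this mechanism is sketched below, using the softmax-normalized α from the Figure 1 caption. The sketch assumes the annotator embedding is the α-weighted sum of the per-demographic projections; the caption is truncated in this page, so that combination rule is an assumption, as are the names `encode_annotator`, `encodings`, and `projections`.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(raw: Vector) -> Vector:
    """Numerically stable softmax: outputs are positive and sum to 1."""
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def encode_annotator(
    encodings: List[Vector], projections: List[Matrix], alpha_raw: Vector
) -> Tuple[Vector, Vector]:
    """Return (embedding, alpha).

    ASSUMPTION: embedding = sum over d of alpha_d * (a_d^T W_d).
    """
    alpha = softmax(alpha_raw)
    dim = len(projections[0][0])
    embedding = [0.0] * dim
    for a_d, w_d, weight in zip(encodings, projections, alpha):
        # Project this demographic's encoding into the shared space.
        projected = [
            sum(a_d[i] * w_d[i][j] for i in range(len(a_d))) for j in range(dim)
        ]
        embedding = [e + weight * p for e, p in zip(embedding, projected)]
    return embedding, alpha


# Toy example: two demographic axes with 2 and 3 categories (made-up numbers).
encodings = [[1.0, 0.0], [0.0, 1.0, 0.0]]
projections = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
]
embedding, alpha = encode_annotator(encodings, projections, alpha_raw=[0.0, 0.0])
print(alpha, embedding)  # equal raw scores give equal weights
```

Because α sums to one, raising the weight of one demographic axis necessarily lowers the others, which is what makes the learned values readable as relative importance.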

If this is right

Explicit demographic modeling produces higher-fidelity representations of annotator variance than majority aggregation or prompted LLMs.
Race and age receive the highest learned importance for disagreement prediction in both conversational safety and political offense tasks.
Item-level disagreement losses enable direct optimization for variance rather than point estimates.
The same architecture yields consistent demographic insights across two independent subjective benchmarks.
NLP systems that incorporate who the annotators are can preserve interpretive diversity instead of erasing it.
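
The item-level disagreement loss mentioned in the list above is not spelled out on this page. As a rough illustration only, the sketch below assumes it compares the variance of predicted annotator labels with the variance of observed labels for each item and averages the squared gap. The function name and the dictionary layout are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def item_disagreement_loss(
    observed: Dict[str, List[float]], predicted: Dict[str, List[float]]
) -> float:
    """Mean squared gap between predicted and observed per-item label variance.

    ASSUMPTION: this form is a stand-in; the paper's exact loss may differ.
    """
    gaps = []
    for item_id, labels in observed.items():
        gap = pvariance(predicted[item_id]) - pvariance(labels)
        gaps.append(gap ** 2)
    return sum(gaps) / len(gaps)


observed = {"item_a": [0.0, 1.0, 0.0, 1.0], "item_b": [1.0, 1.0, 1.0, 1.0]}
# A model that predicts the same score for every annotator has zero variance,
# so it is penalized on item_a, where annotators actually split evenly.
flattened = {"item_a": [0.5, 0.5, 0.5, 0.5], "item_b": [1.0, 1.0, 1.0, 1.0]}
print(item_disagreement_loss(observed, flattened))
```

A term like this rewards a model for reproducing how much annotators split on an item, which per-annotator accuracy alone does not capture.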

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on intersectional demographic combinations to check whether combined factors like race and gender interact beyond single-axis weights.
Similar importance weighting might reduce over-reliance on majority labels in content moderation pipelines where perspective differences affect safety judgments.
Collecting finer-grained or self-reported lived-experience data instead of standard demographic categories could strengthen the proxy assumption in future datasets.

Load-bearing premise

Recorded demographic attributes serve as accurate and sufficient proxies for the lived experiences that actually determine annotator perspectives, and the learned weights apply to new annotators and items.

What would settle it

DiADEM's disagreement correlation falling below that of majority-vote or LLM baselines on a held-out subjective labeling dataset whose demographic labels show no systematic link to observed variance.
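
To make the test concrete, the sketch below computes a Pearson correlation between predicted and observed per-item disagreement scores. Treating the reported r as a Pearson coefficient over per-item values is an assumption; the page only says "disagreement correlation". The numbers are made up.

```python
import math
from typing import List


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical per-item disagreement scores (e.g. label variance).
observed_disagreement = [0.25, 0.00, 0.19, 0.24, 0.05]
model_disagreement = [0.20, 0.02, 0.15, 0.22, 0.10]
print(round(pearson_r(observed_disagreement, model_disagreement), 3))
```

Under this reading, the settling experiment would compare this number for DiADEM against the same number for each baseline on the held-out dataset.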

Figures

Figures reproduced from arXiv: 2604.08425 by Christopher M. Homan, Deepak Pandita, Samay U. Shetty, Tharindu Cyril Weerasooriya.

**Figure 1.** Block diagram of the DiADEM encoder and decoder architecture, where: • $D$ is the number of demographic features (e.g., gender, race, age, education, locale) • $a_d$ is the one-hot or dense encoding for demographic $d$ • $W_d \in \mathbb{R}^{|a_d| \times d_a}$ is the learnable projection matrix for demographic $d$ • $\boldsymbol{\alpha} = \mathrm{softmax}(\boldsymbol{\alpha}_{\text{raw}}) \in \mathbb{R}^{D}$ are normalized importance weights satisfying $\sum_d \alpha_d = 1$. These weights $\boldsymbol{\alpha}$ are learned end to end via backp…

**Figure 2.** DICES item split: confusion matrix and per-class F1/support.

**Figure 3.** DICES item split: disagreement calibration (variance and entropy).

**Figure 4.** DICES annotator split: confusion matrix and per-class F1/support.

**Figure 5.** DICES annotator split: disagreement calibration (variance and entropy).

**Figure 6.** VOICED item split: confusion matrix and per-class F1/support.

**Figure 7.** VOICED item split: disagreement calibration (variance and entropy).

**Figure 8.** VOICED annotator split: confusion matrix and per-class F1/support.

**Figure 9.** VOICED annotator split: disagreement calibration (variance and entropy).
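
The calibration figures refer to variance and entropy as disagreement measures. A minimal sketch of both, computed over the labels one item received, is given below. The label names and the 0/1 numeric coding are illustrative assumptions, not the datasets' actual schemes.

```python
import math
from collections import Counter
from typing import Hashable, List


def label_entropy(labels: List[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def label_variance(numeric_labels: List[float]) -> float:
    """Population variance of numerically coded labels."""
    mean = sum(numeric_labels) / len(numeric_labels)
    return sum((x - mean) ** 2 for x in numeric_labels) / len(numeric_labels)


split_item = ["safe", "unsafe", "safe", "unsafe"]   # annotators split evenly
agreed_item = ["safe", "safe", "safe", "safe"]      # full agreement

print(label_entropy(split_item))             # 1.0 bit for an even two-way split
print(label_entropy(agreed_item) == 0.0)     # True: no disagreement
print(label_variance([0.0, 1.0, 0.0, 1.0]))  # 0.25
```

Entropy works for unordered categories, whereas variance needs a numeric coding and is sensitive to how far apart the chosen labels are.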

Original abstract

When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DiADEM learns demographic importance weights to model annotator disagreement and reports better tracking than baselines on two datasets, but the proxy quality of those demographics remains an untested assumption.


The main takeaway is that DiADEM uses a learned vector alpha to weight per-demographic projections when predicting who will disagree on subjective items, and it reaches 0.75 correlation on disagreement for DICES while beating the LLM and neural baselines they report. The architecture adds an item-level loss that directly targets variance instead of just label accuracy, plus a fusion step with complementary concatenation and Hadamard products. That combination is not a routine extension of earlier perspectivist work, and the consistent top ranking of race and age in alpha across both DICES and VOICED is a concrete output worth noting. The paper shows the model can recover more of the structure of human disagreement than prompted LLMs with chain-of-thought, which is a useful empirical point for tasks where flattening to majority labels loses information. The results line up with the goal of representing interpretive diversity rather than treating disagreement as noise.

The central vulnerability is that everything hinges on the recorded demographic fields being accurate and primary drivers of perspective. If those fields are noisy, incomplete, or stand in for unmeasured factors like education or ideology, then the learned alpha may reflect dataset encoding more than a stable discovery. The paper tests only on DICES and VOICED, with no check on whether alpha stays stable for new annotators or shifted distributions, so the generalization claim stays provisional. The abstract also skips ablations, significance tests, and baseline implementation details, which makes it harder to judge how much of the reported gains are robust.

This work is aimed at researchers handling subjective annotation in NLP, especially fairness or moderation settings. It has enough of a distinct method and empirical results to deserve a serious referee, though the proxy and generalization issues will need attention in review.

Referee Report

3 major / 2 minor

Summary. The paper introduces DiADEM, a neural architecture that learns a demographic importance vector α to encode annotators via per-demographic projections, fuses representations using complementary concatenation and Hadamard interactions, and trains with an item-level disagreement loss. It claims to substantially outperform LLM-as-a-judge and neural baselines on the DICES and VOICED benchmarks, achieving r=0.75 disagreement correlation on DICES, while the learned α consistently identifies race and age as the most influential factors driving annotator disagreement.

Significance. If the empirical results hold under more rigorous validation, the work is significant for perspectivist NLP and annotator modeling: it moves beyond majority-vote flattening or generic LLM prompting by explicitly learning which demographic axes matter for disagreement prediction. The end-to-end learning of the importance vector and the direct optimization of annotation variance are strengths that could influence dataset curation and evaluation practices in subjective content labeling.

major comments (3)

[Abstract and §4, Experiments] The central claim of substantial outperformance (including r=0.75 on DICES) is presented without reported baseline implementation details, statistical significance tests, data-split protocols, or ablation results on the α vector; this directly affects verifiability of the headline result.
[§5, Results and Analysis] No cross-dataset or out-of-distribution evaluation is reported for the learned α (e.g., applying the model to new annotators whose demographics are drawn from a shifted distribution), so the claim that race and age are consistently the most influential factors rests on only two benchmarks and cannot yet be treated as general.
[§3, Method] The architecture assumes the recorded demographic fields are accurate, complete, and causally primary proxies for the perspectives that drive disagreement; the paper provides no sensitivity analysis or controls for potential confounders (education, region, ideology) that could make the fitted α an artifact of the particular dataset encoding.
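
The figures distinguish an "item split" from an "annotator split", and the first comment asks for the protocol behind them. A minimal sketch of what such group-wise splits usually look like follows. The record layout `(annotator_id, item_id, label)`, the function name, and the 20% held-out fraction are assumptions, not the paper's reported settings.

```python
import random
from typing import List, Tuple

Annotation = Tuple[str, str, int]  # (annotator_id, item_id, label), hypothetical


def split_by(
    annotations: List[Annotation],
    key_index: int,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Annotation], List[Annotation]]:
    """Hold out whole groups: key_index=0 for annotators, 1 for items."""
    keys = sorted({row[key_index] for row in annotations})
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    held_out = set(keys[:n_test])
    train = [row for row in annotations if row[key_index] not in held_out]
    test = [row for row in annotations if row[key_index] in held_out]
    return train, test


annotations = [
    ("a1", "i1", 0), ("a1", "i2", 1), ("a2", "i1", 1),
    ("a2", "i2", 1), ("a3", "i1", 0), ("a3", "i3", 0),
]
item_train, item_test = split_by(annotations, key_index=1)
annot_train, annot_test = split_by(annotations, key_index=0)
print(len(item_train), len(item_test), len(annot_train), len(annot_test))
```

An item split tests generalization to unseen content from known annotators, while an annotator split tests whether demographic features alone carry over to people the model has never seen.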

minor comments (2)

[§3] Notation: ensure the importance vector is consistently denoted as bold α throughout and that the Hadamard product is explicitly defined when first introduced.
[Tables] Table 1 or equivalent: add a row or column reporting the number of annotators per demographic category to allow readers to assess sparsity in the learned weights.
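
The sparsity check requested in the last comment amounts to a per-category head count. The sketch below assumes one record per annotator with demographic fields stored as dictionary keys; the field names and category values are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def counts_per_category(annotators: List[Dict[str, str]], attribute: str) -> Counter:
    """Count annotators in each category of one demographic attribute."""
    return Counter(person[attribute] for person in annotators)


# Hypothetical annotator pool, one record per annotator.
annotators = [
    {"id": "a1", "age": "18-29", "gender": "woman"},
    {"id": "a2", "age": "18-29", "gender": "man"},
    {"id": "a3", "age": "50+", "gender": "woman"},
]
age_counts = counts_per_category(annotators, "age")
print(age_counts.most_common())  # [('18-29', 2), ('50+', 1)]
```

Categories with very few annotators are where a learned weight is most likely to reflect a handful of individuals instead of a group-level pattern.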

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of verifiability, generalizability, and methodological assumptions that we address point by point below. We have outlined specific revisions to strengthen the manuscript while preserving the integrity of the reported results.


Referee: [Abstract and §4, Experiments] The central claim of substantial outperformance (including r=0.75 on DICES) is presented without reported baseline implementation details, statistical significance tests, data-split protocols, or ablation results on the α vector; this directly affects verifiability of the headline result.

Authors: We agree that additional implementation details are required for full reproducibility and verification. In the revised manuscript, we will expand §4 with: complete descriptions of baseline implementations (including exact LLM prompts and neural model hyperparameters), explicit data-split protocols (annotator-level versus item-level), statistical significance tests (e.g., bootstrap confidence intervals or paired tests on the correlation metrics), and ablation results on the α vector (uniform weights, per-demographic removals). These additions will directly support the reported performance, including the r=0.75 disagreement correlation on DICES.

Revision: yes

Referee: [§5, Results and Analysis] No cross-dataset or out-of-distribution evaluation is reported for the learned α (e.g., applying the model to new annotators whose demographics are drawn from a shifted distribution), so the claim that race and age are consistently the most influential factors driving annotator disagreement rests on only two benchmarks and cannot yet be treated as general.

Authors: We acknowledge that the consistency of race and age as top-weighted factors in α is observed only across the two benchmarks used. Cross-dataset and out-of-distribution evaluations for α are not feasible without additional datasets containing compatible demographic annotations and annotator distributions, which are unavailable. In the revision, we will update §5 to qualify the language, present the finding as specific to DICES and VOICED, and add a limitations discussion calling for future multi-benchmark validation. This preserves the observed pattern without overstating generality.

Revision: partial

Referee: [§3, Method] The architecture assumes the recorded demographic fields are accurate, complete, and causally primary proxies for the perspectives that drive disagreement; the paper provides no sensitivity analysis or controls for potential confounders (education, region, ideology) that could make the fitted α an artifact of the particular dataset encoding.

Authors: This is a valid observation on the scope of the demographic proxies. DiADEM uses the self-reported fields exactly as provided in the source datasets and does not assert causal primacy. The benchmarks lack metadata on potential confounders such as education, region, or ideology, precluding sensitivity analyses. In the revised §3 and a new limitations subsection, we will explicitly state this assumption, clarify that α captures associations within the given encodings, and recommend richer annotator profiles for future work. These clarifications qualify the interpretation without altering the method or results.

Revision: yes
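
The bootstrap confidence intervals proposed in the first response could look roughly like the percentile bootstrap below, applied to a per-item correlation. This is a generic sketch under stated assumptions: the resample count, the seed, the helper names, and the use of Pearson correlation are all illustrative, and the rebuttal does not specify a procedure.

```python
import math
import random
from typing import List, Tuple


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def bootstrap_ci(
    xs: List[float], ys: List[float],
    n_resamples: int = 1000, level: float = 0.95, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for r, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples with no spread
    if not stats:
        raise ValueError("no usable resamples")
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 + level) / 2 * len(stats)))]
    return lower, upper


# Made-up per-item disagreement scores (observed vs. model).
observed = [0.25, 0.00, 0.19, 0.24, 0.05, 0.12, 0.21, 0.03]
model = [0.20, 0.02, 0.15, 0.22, 0.10, 0.09, 0.18, 0.07]
low, high = bootstrap_ci(observed, model)
print(f"r = {pearson_r(observed, model):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling items, not individual annotations, keeps each item's annotator pool intact, which matters when the statistic is itself computed per item.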

Circularity Check

0 steps flagged

No significant circularity; DiADEM performance and α weights are direct empirical outputs


The paper defines a neural model that learns a demographic importance vector α from training data, fuses representations, and optimizes a disagreement loss; it then reports empirical outperformance on DICES and VOICED benchmarks plus the resulting α ranking. These quantities are produced by standard supervised training and evaluation on held-out annotations rather than by any equation that reduces the claimed result to a re-expression of its own fitted parameters. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described architecture. The central claims remain falsifiable against external data and do not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a learned importance vector α fitted to benchmark data and on the assumption that standard neural-network components can capture demographic effects on annotation variance.

free parameters (1)

α (demographic importance vector)
Learned weights that scale the contribution of each demographic axis; fitted during training.

axioms (1)

Domain assumption: Neural networks with the described fusion operations can learn to predict annotator label distributions from demographic and item features.
Invoked by the model architecture and training procedure.

