Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

arxiv: 2605.06318 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.CY

Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

Maximilian Maurer , Maximilian Linde , Gabriella Lapesa This is my paper

Pith reviewed 2026-05-08 10:17 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords annotation variationharmful language detectionlinguistic featuresannotator attitudesinteraction effectsintersectionalityNLP data quality

0 comments p. Extension

The pith

Annotation variation in harmful language detection stems primarily from interactions between linguistic cues in the text and annotator attitudes rather than from either factor alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes four established datasets for harmful language detection to determine how linguistic properties of the items and characteristics of the annotators jointly produce differences in labels. It applies statistical models that include interaction terms to test whether these two sources combine in ways previous studies treated separately. The results indicate that such interactions, especially involving lexical cues and annotator attitudes, account for much of the observed variation and reveal intersectional patterns. At the same time the specific directions and strengths of the effects change from dataset to dataset. If this account is correct, current practices of collecting large numbers of annotators and releasing disaggregated labels will only be useful if models and guidelines explicitly treat annotation as the joint outcome of text and person.

Core claim

Our analysis of four reference datasets shows that the interplay between linguistic features of the text and annotator characteristics is essential for explaining label variation in harmful language detection. Interactions uncover intersectional effects that single-factor approaches miss, with lexical cues and annotator attitudes emerging as particularly influential. Effect patterns nevertheless differ substantially across the datasets, which limits generalization and transfer.

What carries the argument

Multivariate statistical models that incorporate interaction terms between linguistic features of the items and annotator characteristics such as attitudes and demographics.

Load-bearing premise

The linguistic features and annotator characteristics measured in the study, together with the statistical models applied, are sufficient to capture the main sources of annotation variation without important omitted factors or dataset-specific artifacts.

What would settle it

A follow-up analysis on new harmful language datasets or with additional linguistic and annotator variables that finds statistically insignificant interaction effects or highly consistent patterns across all datasets would undermine the central claim.

Figures

Figures reproduced from arXiv: 2605.06318 by Gabriella Lapesa, Maximilian Linde, Maximilian Maurer.

**Figure 1.** Figure 1: Cross-classified data structure for ordinal text view at source ↗

**Figure 2.** Figure 2: Posterior estimates for the surviving effects for view at source ↗

**Figure 3.** Figure 3: Model predictions for the interaction age:n_hateful (POPQUORN). Labels ( 1 , 0 , -1 ) refer to SD from mean (0) for n_hateful. The dots represent the mean posterior estimates, and vertical bars represent the 95% highest density interval. Inspection reveals that items with such tokens often are about the author’s opposing views on certain positions on moral grounds or are ironic6 . We find two surviving in… view at source ↗

**Figure 4.** Figure 4: Posterior estimates for the surviving effects view at source ↗

**Figure 5.** Figure 5: Posterior estimates for the surviving effects view at source ↗

**Figure 6.** Figure 6: Posterior estimates for surviving effects of view at source ↗

**Figure 8.** Figure 8: Example items from POPQUORN containing a relatively high number of words related to moral/behavioral deficiencies (n_dmc colored in cyan ). Don’t worry. Israel has already told the UN there will be no investigation. Gotta love that jew privilege. (a) Congrats on the 1:30 Israeli / Palestinian casualty ratio. Hamas must be patting themselves on their backs and looking for a repeat of that success. (b) A l… view at source ↗

**Figure 7.** Figure 7: Posterior estimates for surviving effects of view at source ↗

**Figure 10.** Figure 10: Example items from MHS containing a rel view at source ↗

**Figure 11.** Figure 11: Example items from MHS containing a rel view at source ↗

**Figure 12.** Figure 12: Example: Cluster 6 in the linguistic feature view at source ↗

**Figure 14.** Figure 14: age:n_hateful_all_lexicons (POPQUORN) 0.5 0.6 0.7 0.8 extremely_conservative conservative slightly_conservative neutral slightly_liberal liberal extremely_liberal ideology annotation age 18−24 30−34 35−39 40−44 45−49 50−54 55−59 60−64 >65 view at source ↗

**Figure 15.** Figure 15: ideology:age (MHS) 24 view at source ↗

read the original abstract

Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity. While this resulted in rich information about \textit{who} annotated (sociodemographics, attitudes, etc.), the \textit{what} (e.g., linguistic properties of items), and their interplay has received little attention. We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture. We find that interactions are crucial, revealing intersectional effects ignored in previous work, and that a strong role is played by lexical cues and annotator attitudes. Effect patterns, however, vary considerably across datasets. This urges caution about generalization and transferability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main contribution is showing that annotator attitudes and lexical features interact in annotation decisions for harmful language, with those patterns shifting across the four datasets examined.

read the letter

The new piece here is the explicit joint modeling of annotator traits and linguistic properties on the same items, rather than treating the two separately as most prior work has done. They pull together four existing harmful language datasets and report that interactions matter, that lexical cues and attitudes come through strongly, and that the specific patterns are not stable from one dataset to the next. That last observation is useful because it already pushes back against easy generalization or transfer of annotation guidelines. The analysis is empirical and stays close to the data, which keeps it grounded. The abstract's caution about transferability is a plus. The soft spot is that the abstract gives almost no information on the actual regression specifications, controls for dataset identity, or checks for multiple testing. If dataset-specific annotation protocols or label distributions are correlated with the chosen features or traits, the reported cross-dataset differences could partly reflect those artifacts rather than true variation in how people interpret the same linguistic signals. Without seeing the model details and any robustness checks, it is difficult to judge how much weight the interaction terms should carry. This work is aimed at researchers who collect or use subjective annotations, especially in content moderation or toxicity detection. It is incremental rather than foundational, but the scale and the focus on interactions give it enough substance to warrant referee time. I would send it out for review with a request for clearer methods and any available sensitivity analyses.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the first large-scale analysis of four reference datasets for harmful language detection. It integrates annotator characteristics (sociodemographics and attitudes), linguistic properties of the items, and their interactions within statistical models to examine sources of annotation variation. The central claims are that interactions are crucial (revealing intersectional effects ignored in prior work), that lexical cues and annotator attitudes play strong roles, and that effect patterns vary considerably across datasets, urging caution about generalization and transferability.

Significance. If the statistical findings prove robust, the work would advance NLP research on subjective annotation tasks by demonstrating that isolated analyses of annotator traits or item features are insufficient and that modeling their interactions is necessary to capture intersectional effects. This could influence data collection practices and model development for harmful language detection by highlighting dataset-specific patterns and the risks of overgeneralization.

major comments (2)

[Methods] Methods: The manuscript provides no details on the regression specifications used to assess interactions (e.g., logistic vs. linear mixed-effects models, exact terms for annotator-linguistic interactions, inclusion of random effects for annotators/items, or controls for dataset as a factor). Without these, the claim that interactions are 'crucial' cannot be evaluated for robustness against omitted-variable bias or dataset artifacts.
[Results] Results: The assertion that 'effect patterns vary considerably across datasets' is presented without quantitative support such as tests for coefficient heterogeneity, cross-dataset interaction significance, or formal comparisons of model fits. This weakens the argument that the variation undermines generalization.

minor comments (1)

[Abstract] Abstract: The phrase 'statistically informed picture' is vague; a one-sentence summary of the modeling approach (e.g., 'via mixed-effects regressions with interaction terms') would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the paper to provide the requested details and quantitative support.

read point-by-point responses

Referee: [Methods] Methods: The manuscript provides no details on the regression specifications used to assess interactions (e.g., logistic vs. linear mixed-effects models, exact terms for annotator-linguistic interactions, inclusion of random effects for annotators/items, or controls for dataset as a factor). Without these, the claim that interactions are 'crucial' cannot be evaluated for robustness against omitted-variable bias or dataset artifacts.

Authors: We acknowledge that the original manuscript did not include sufficient detail on the regression specifications. In the revised version, we have added a new subsection titled 'Statistical Analysis' under Methods. We employed logistic mixed-effects models (implemented in R using the lme4 package) with the binary annotation label (harmful vs. non-harmful) as the dependent variable. Fixed effects comprised main effects for annotator characteristics (sociodemographics and attitudes), linguistic features (lexical, syntactic, and semantic cues extracted via standard NLP pipelines), and all two-way interaction terms between annotator traits and linguistic features. Random intercepts were specified for both annotators and items to account for repeated measures and individual variability. Dataset was included as a fixed factor, with additional interactions to permit dataset-specific effects. These specifications directly mitigate concerns about omitted-variable bias and enable evaluation of the robustness of the interaction effects. revision: yes
Referee: [Results] Results: The assertion that 'effect patterns vary considerably across datasets' is presented without quantitative support such as tests for coefficient heterogeneity, cross-dataset interaction significance, or formal comparisons of model fits. This weakens the argument that the variation undermines generalization.

Authors: We agree that additional quantitative evidence would strengthen this claim. The revised manuscript now includes a combined multi-dataset model with three-way interaction terms (annotator characteristic × linguistic feature × dataset). We report results from likelihood ratio tests comparing nested models with and without the dataset interactions, as well as Wald tests for pairwise differences in key interaction coefficients across datasets. Several interactions (particularly those involving annotator attitudes and lexical cues) show statistically significant heterogeneity (p < 0.05 after correction). We also added a supplementary table with model fit comparisons (AIC/BIC) between pooled and dataset-specific models. These additions provide formal support for the observed variation and the associated caution regarding generalization and transferability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical regression analysis on existing datasets

full rationale

The paper conducts statistical analysis (regressions on linguistic features, annotator traits, and interactions) across four pre-existing harmful language datasets. No derivations, predictions, or results reduce to inputs by construction; all claims are observational patterns extracted from fitted models on the data. No self-citations support load-bearing uniqueness theorems, ansatzes, or self-definitions. The work is self-contained empirical analysis without closed theoretical loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis relies on standard statistical assumptions about feature independence and model linearity but introduces no new free parameters or invented entities beyond those in the reference datasets and chosen linguistic/annotator variables.

axioms (1)

domain assumption Linguistic features and annotator characteristics can be treated as measurable, independent inputs to regression or similar models without substantial measurement error.
Invoked implicitly when combining the two sources of variation in a single analysis framework.

pith-pipeline@v0.9.0 · 5466 in / 1191 out tokens · 21320 ms · 2026-05-08T10:17:43.445450+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

86 extracted references · 37 canonical work pages

[1]

2020 , url=

Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane , doi =. 2020 , url=

2020
[2]

2024 , url =

Polars , title =. 2024 , url =

2024
[3]

1967 , publisher=

Automated Readability Index , author=. 1967 , publisher=

1967
[4]

2014 , publisher=

Brysbaert, Marc and Warriner, Amy Beth and Kuperman, Victor , journal=. 2014 , publisher=

2014
[5]

2019 , publisher=

Brysbaert, Marc and Mandera, Pawe. 2019 , publisher=

2019
[6]

2012 , publisher=

Kuperman, Victor and Stadthagen-Gonzalez, Hans and Brysbaert, Marc , journal=. 2012 , publisher=

2012
[7]

and Binney, Richard J

Diveica, Veronica and Pexman, Penny M. and Binney, Richard J. , journal=. 2023 , publisher=

2023
[8]

2024 , publisher=

Winter, Bodo and Lupyan, Gary and Perry, Lynn K and Dingemanse, Mark and Perlman, Marcus , journal=. 2024 , publisher=

2024
[9]

2020 , publisher=

Lynott, Dermot and Connell, Louise and Brysbaert, Marc and Brand, James and Carney, James , journal=. 2020 , publisher=

2020
[10]

Certain Language Skills in Children: Their Development and Interrelationships

MILDRED C. TEMPLIN , edition =. "Certain Language Skills in Children: Their Development and Interrelationships" , urldate =
[11]

Sur quoi se fonde la notion d'etendue theoratique du vocabulaire?

Dugast, Daniel. Sur quoi se fonde la notion d'etendue theoratique du vocabulaire?. Le francais Modern. 1978

1978
[12]

1972 , publisher=

Mass, Heinz-Dieter , journal=. 1972 , publisher=

1972
[13]

Sichel , title =

Herbert S. Sichel , title =. Journal of the American Statistical Association , volume =. 1975 , publisher =. doi:10.1080/01621459.1975.10482469 , URL =

work page doi:10.1080/01621459.1975.10482469 1975
[14]

1944 , publisher=

The statistical study of literary vocabulary , author=. 1944 , publisher=

1944
[15]

, address =

Guiraud, Pierre. , address =. Les caracte\`eres statistiques du vocabulaire : essai de m\'ethodologie , year =. Les caracte\`eres statistiques du vocabulaire : essai de m\'ethodologie , keywords =
[16]

Language and Thought , year =

John Bissell Carroll , editor =. Language and Thought , year =
[17]

1964 , publisher=

Quantitative Linguistics , author=. 1964 , publisher=

1964
[18]

1955 , publisher=

Herdan, Gustav , journal=. 1955 , publisher=

1955
[19]

, journal=

Simpson, Edward H. , journal=. 1949 , url=

1949
[20]

1997 , publisher=

Quantifying lexical diversity in the study of language development , author=. 1997 , publisher=

1997
[21]

Covington and Joe D

Michael A. Covington and Joe D. McFall and , title =. Journal of Quantitative Linguistics , volume =. 2010 , publisher =. doi:10.1080/09296171003643098 , URL =

work page doi:10.1080/09296171003643098 2010
[22]

McCarthy and Scott Jarvis , title =

Philip M. McCarthy and Scott Jarvis , title =. Language Testing , volume =. 2007 , doi =

2007
[23]

and Jarvis, Scott , journal=

McCarthy, Philip M. and Jarvis, Scott , journal=. 2010 , publisher=

2010
[24]

Studies in Second Language Acquisition , year=

Lexis in composition: a performance analysis of Swedish learners' written English , author=. Studies in Second Language Acquisition , year=
[25]

Peter and Fishburne, Robert P

Kincaid, J. Peter and Fishburne, Robert P. Jr. and Rogers, Richard L. and Chissom, Brad S. , institution=. 1975 , url=

1975
[26]

, author=

A computer readability formula designed for machine scoring. , author=. Journal of Applied Psychology , volume=. 1975 , publisher=

1975
[27]

Journal of reading , volume=

SMOG grading-a new readability formula , author=. Journal of reading , volume=. 1969 , publisher=

1969
[28]

Bj. L. 1968 , publisher=

1968
[29]

Seventh Australian Reading Association Conference , pages=

Anderson, Jonathan , year=. Seventh Australian Reading Association Conference , pages=
[30]

and Turney, Peter D

Mohammad, Saif M. and Turney, Peter D. , title =. Computational Intelligence , volume =. doi:https://doi.org/10.1111/j.1467-8640.2012.00460.x , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8640.2012.00460.x , abstract =

work page doi:10.1111/j.1467-8640.2012.00460.x 2012
[31]

CEUR Workshop proceedings , volume=

Hurtlex: A multilingual lexicon of words to hurt , author=. CEUR Workshop proceedings , volume=. 2018 , organization=

2018
[32]

and Sheth, Amit , title =

Rezvan, Mohammadreza and Shekarpour, Saeedeh and Balasuriya, Lakshika and Thirunarayan, Krishnaprasad and Shalin, Valerie L. and Sheth, Amit , title =. Proceedings of the 10th ACM Conference on Web Science , pages =. 2018 , isbn =. doi:10.1145/3201064.3201103 , abstract =

work page doi:10.1145/3201064.3201103 2018
[33]

Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021) , year =

Deepak Kumar and Patrick Gage Kelley and Sunny Consolvo and Joshua Mason and Elie Bursztein and Zakir Durumeric and Kurt Thomas and Michael Bailey , title =. Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021) , year =

2021
[34]

Behavior research methods, instruments, & computers , volume=

Coh-Metrix: Analysis of text on cohesion and language , author=. Behavior research methods, instruments, & computers , volume=. 2004 , publisher=

2004
[35]

Data Protection and Privacy , volume=

The dataset nutrition label , author=. Data Protection and Privacy , volume=. 2020 , publisher=

2020
[36]

Smith, Nicole DeCario, and Will Buchanan

Pushkarna, Mahima and Zaldivar, Andrew and Kjartansson, Oddur , title =. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , pages =. 2022 , isbn =. doi:10.1145/3531146.3533231 , abstract =

work page doi:10.1145/3531146.3533231 2022
[37]

Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , journal=. The
[38]

Measuring Massive Multitask Language Understanding , author=
[39]

2025 , eprint=

Are We Done with MMLU? , author=. 2025 , eprint=

2025
[41]

Toward a perspectivist turn in ground truthing for predictive computing

Toward a Perspectivist Turn in Ground Truthing for Predictive Computing , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2023 , month=. doi:10.1609/aaai.v37i6.25840 , abstractNote=

work page doi:10.1609/aaai.v37i6.25840 2023
[42]

The 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates , author=. The 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

2024
[43]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Everyone’s voice matters: Quantifying annotation disagreement using demographic information , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2023 , url=

2023
[44]

2311.04345 , archivePrefix=

Wenbo Zhang and Hangzhi Guo and Ian D Kivlichan and Vinodkumar Prabhakaran and Davis Yadav and Amulya Yadav , year=. 2311.04345 , archivePrefix=

work page arXiv
[45]

Learning to

Liu, Tong and Venkatachalam, Akash and Sanjay Bongale, Pratik and Homan, Christopher , date =. Learning to. Companion. doi:10.1145/3308560.3317082 , url =

work page doi:10.1145/3308560.3317082
[46]

and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo , title =

Uma, Alexandra N. and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo , date =. Learning from. 2021 , journal =. doi:10.1613/jair.1.12752 , url =

work page doi:10.1613/jair.1.12752 2021
[47]

and Sanderson, Mark , date =

Hettiachchi, Danula and Holcombe-James, Indigo and Livingstone, Stephanie and Silva, Anjalee de and Lease, Matthew and Salim, Flora D. and Sanderson, Mark , date =. How. 2023 , pages =. doi:10.1609/hcomp.v11i1.27546 , url =

work page doi:10.1609/hcomp.v11i1.27546 2023
[48]

1982 , publisher =

Attitudes Towards Language Variation: Social and Applied Contexts , series =. 1982 , publisher =

1982
[49]

Kircher, Ruth and Zipp, Lena , editor =. An. Research. 2022 , pages =. doi:10.1017/9781108867788.002 , url =

work page doi:10.1017/9781108867788.002 2022
[50]

Ordinal Regression Models in Psychology: A Tutorial , shorttitle =

B. Ordinal Regression Models in Psychology: A Tutorial , shorttitle =. 2019 , journal =

2019
[51]

and Polson, Nicholas G

Carvalho, Carlos M. and Polson, Nicholas G. and Scott, James G. , year =. Handling. Proceedings of the
[52]

2017 , journal =

Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors , author =. 2017 , journal =

2017
[53]

2017.brms: AnRPackage for Bayesian Multilevel Models UsingStan.Journal of Statistical Software80, 1 (2017)

Paul-Christian Bürkner , journal =. 2017 , volume =. doi:10.18637/jss.v080.i01 , encoding =

work page doi:10.18637/jss.v080.i01 2017
[54]

2019 , journal =

Shrinkage Priors for. 2019 , journal =

2019
[55]

elfen: A Python Package for Efficient Linguistic Feature Extraction for Natural Language Datasets

Maurer, Maximilian. elfen: A Python Package for Efficient Linguistic Feature Extraction for Natural Language Datasets. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 3: System Demonstrations). 2026. doi:10.18653/v1/2026.eacl-demo.5

work page doi:10.18653/v1/2026.eacl-demo.5 2026
[56]

2024 , url =

R: A Language and Environment for Statistical Computing , author =. 2024 , url =

2024
[57]

The effect of rating scale format on response styles: The number of response categories and response category labels , journal =

Bert Weijters and Elke Cabooter and Niels Schillewaert , keywords =. The effect of rating scale format on response styles: The number of response categories and response category labels , journal =. 2010 , issn =. doi:https://doi.org/10.1016/j.ijresmar.2010.02.004 , url =

work page doi:10.1016/j.ijresmar.2010.02.004 2010
[58]

Frontiers in psychology , volume=

Linguistically modulated perception and cognition: The label-feedback hypothesis , author=. Frontiers in psychology , volume=. 2012 , publisher=

2012
[59]

2015 , publisher =

ISCED 2011 Operational Manual: Guidelines for Classifying National Education Programmes and Related Qualifications , author =. 2015 , publisher =. doi:10.1787/9789264228368-en , url =

work page doi:10.1787/9789264228368-en 2011
[60]

Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach , journal =

Jan Kocoń and Alicja Figas and Marcin Gruza and Daria Puchalska and Tomasz Kajdanowicz and Przemysław Kazienko , keywords =. Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.ipm.2021.102643 , url =

work page doi:10.1016/j.ipm.2021.102643 2021
[61]

Wojcik and Peter H

Jesse Graham and Jonathan Haidt and Sena Koleva and Matt Motyl and Ravi Iyer and Sean P. Wojcik and Peter H. Ditto , keywords =. Moral Foundations Theory: The Pragmatic Validity of Moral Pluralism , editor =. Advances in Experimental Social Psychology , publisher =. 2013 , issn =. doi:https://doi.org/10.1016/B978-0-12-407236-7.00002-4 , url =

work page doi:10.1016/b978-0-12-407236-7.00002-4 2013
[62]

In: Proc

Abercrombie, Gavin and Hovy, Dirk and Prabhakaran, Vinodkumar. Temporal and Second Language Influence on Intra-Annotator Agreement and Stability in Hate Speech Labelling. Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII). 2023. doi:10.18653/v1/2023.law-1.10

work page doi:10.18653/v1/2023.law-1.10 2023
[63]

We need to consider disagreement in evaluation

Basile, Valerio and Fell, Michael and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo and Uma, Alexandra. We Need to Consider Disagreement in Evaluation. Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future. 2021. doi:10.18653/v1/2021.bppf-1.3

work page doi:10.18653/v1/2021.bppf-1.3 2021
[64]

Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity

Beck, Jacob and Eckman, Stephanie and Ma, Bolei and Chew, Rob and Kreuter, Frauke. Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity. Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024). 2024. doi:10.18653/v1/2024.uncertainlp-1.8

work page doi:10.18653/v1/2024.uncertainlp-1.8 2024
[65]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Beck, Tilman and Schuff, Hendrik and Lauscher, Anne and Gurevych, Iryna. Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.159

work page doi:10.18653/v1/2024.eacl-long.159 2024
[66]

D 3 CODE : Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation

Davani, Aida and D \'i az, Mark and Baker, Dylan and Prabhakaran, Vinodkumar. D 3 CODE : Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1029

work page doi:10.18653/v1/2024.emnlp-main.1029 2024
[67]

When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Fleisig, Eve and Abebe, Rediet and Klein, Dan. When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.415

work page doi:10.18653/v1/2023.emnlp-main.415 2023
[68]

Intersectionality in AI Safety: Using Multilevel Models to Understand Diverse Perceptions of Safety in Conversational AI

Homan, Christopher and Serapio-Garcia, Gregory and Aroyo, Lora and Diaz, Mark and Parrish, Alicia and Prabhakaran, Vinodkumar and Taylor, Alex and Wang, Ding. Intersectionality in AI Safety: Using Multilevel Models to Understand Diverse Perceptions of Safety in Conversational AI. Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspe...

2024
[69]

Annotation Sensitivity: Training Data Collection Methods Affect Model Performance

Kern, Christoph and Eckman, Stephanie and Beck, Jacob and Chew, Rob and Ma, Bolei and Kreuter, Frauke. Annotation Sensitivity: Training Data Collection Methods Affect Model Performance. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.992

work page doi:10.18653/v1/2023.findings-emnlp.992 2023
[70]

Reconsidering Annotator Disagreement about Racist Language: Noise or Signal?

Larimore, Savannah and Kennedy, Ian and Haskett, Breon and Arseniev-Koehler, Alina. Reconsidering Annotator Disagreement about Racist Language: Noise or Signal?. Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media. 2021. doi:10.18653/v1/2021.socialnlp-1.7

work page doi:10.18653/v1/2021.socialnlp-1.7 2021
[71]

and Nivre, Joakim and Zeman, Daniel

de Marneffe, Marie-Catherine and Manning, Christopher D. and Nivre, Joakim and Zeman, Daniel. U niversal D ependencies. Computational Linguistics. 2021. doi:10.1162/coli_a_00402

work page doi:10.1162/coli_a_00402 2021
[72]

Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 E nglish Words

Mohammad, Saif. Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 E nglish Words. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1017

work page doi:10.18653/v1/p18-1017 2018
[73]

Word Affect Intensities

Mohammad, Saif. Word Affect Intensities. Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018). 2018

2018
[74]

Emotions Evoked by Common Words and Phrases: Using M echanical T urk to Create an Emotion Lexicon

Mohammad, Saif and Turney, Peter. Emotions Evoked by Common Words and Phrases: Using M echanical T urk to Create an Emotion Lexicon. Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. 2010

2010
[75]

Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Davani, Aida and D \'i az, Mark and Prabhakaran, Vinodkumar. Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00449

work page doi:10.1162/tacl_a_00449 2022
[76]

Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions

Orlikowski, Matthias and Pei, Jiaxin and R. Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.104

work page doi:10.18653/v1/2025.acl-long.104 2025
[77]

The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics

Orlikowski, Matthias and R. The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.88

work page doi:10.18653/v1/2023.acl-short.88 2023
[78]

When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset

Pei, Jiaxin and Jurgens, David. When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset. Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII). 2023. doi:10.18653/v1/2023.law-1.25

work page doi:10.18653/v1/2023.law-1.25 2023
[79]

Is a bunch of words enough to detect disagreement in hateful content?

Rizzi, Giulia and Rosso, Paolo and Fersini, Elisabetta. Is a bunch of words enough to detect disagreement in hateful content?. Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation. 2025

2025
[80]

The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism

Sachdeva, Pratik and Barreto, Renata and Bacon, Geoff and Sahn, Alexander and von Vacano, Claudia and Kennedy, Chris. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism. Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022. 2022

2022
[81]

doi: 10.18653/v1/2022.naacl-main.431

Sap, Maarten and Swayamdipta, Swabha and Vianna, Laura and Zhou, Xuhui and Choi, Yejin and Smith, Noah A. Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10...

work page doi:10.18653/v1/2022.naacl-main.431 2022

Showing first 80 references.

[1] [1]

2020 , url=

Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane , doi =. 2020 , url=

2020

[2] [2]

2024 , url =

Polars , title =. 2024 , url =

2024

[3] [3]

1967 , publisher=

Automated Readability Index , author=. 1967 , publisher=

1967

[4] [4]

2014 , publisher=

Brysbaert, Marc and Warriner, Amy Beth and Kuperman, Victor , journal=. 2014 , publisher=

2014

[5] [5]

2019 , publisher=

Brysbaert, Marc and Mandera, Pawe. 2019 , publisher=

2019

[6] [6]

2012 , publisher=

Kuperman, Victor and Stadthagen-Gonzalez, Hans and Brysbaert, Marc , journal=. 2012 , publisher=

2012

[7] [7]

and Binney, Richard J

Diveica, Veronica and Pexman, Penny M. and Binney, Richard J. , journal=. 2023 , publisher=

2023

[8] [8]

2024 , publisher=

Winter, Bodo and Lupyan, Gary and Perry, Lynn K and Dingemanse, Mark and Perlman, Marcus , journal=. 2024 , publisher=

2024

[9] [9]

2020 , publisher=

Lynott, Dermot and Connell, Louise and Brysbaert, Marc and Brand, James and Carney, James , journal=. 2020 , publisher=

2020

[10] [10]

Certain Language Skills in Children: Their Development and Interrelationships

MILDRED C. TEMPLIN , edition =. "Certain Language Skills in Children: Their Development and Interrelationships" , urldate =

[11] [11]

Sur quoi se fonde la notion d'etendue theoratique du vocabulaire?

Dugast, Daniel. Sur quoi se fonde la notion d'etendue theoratique du vocabulaire?. Le francais Modern. 1978

1978

[12] [12]

1972 , publisher=

Mass, Heinz-Dieter , journal=. 1972 , publisher=

1972

[13] [13]

Sichel , title =

Herbert S. Sichel , title =. Journal of the American Statistical Association , volume =. 1975 , publisher =. doi:10.1080/01621459.1975.10482469 , URL =

work page doi:10.1080/01621459.1975.10482469 1975

[14] [14]

1944 , publisher=

The statistical study of literary vocabulary , author=. 1944 , publisher=

1944

[15] [15]

, address =

Guiraud, Pierre. , address =. Les caracte\`eres statistiques du vocabulaire : essai de m\'ethodologie , year =. Les caracte\`eres statistiques du vocabulaire : essai de m\'ethodologie , keywords =

[16] [16]

Language and Thought , year =

John Bissell Carroll , editor =. Language and Thought , year =

[17] [17]

1964 , publisher=

Quantitative Linguistics , author=. 1964 , publisher=

1964

[18] [18]

1955 , publisher=

Herdan, Gustav , journal=. 1955 , publisher=

1955

[19] [19]

, journal=

Simpson, Edward H. , journal=. 1949 , url=

1949

[20] [20]

1997 , publisher=

Quantifying lexical diversity in the study of language development , author=. 1997 , publisher=

1997

[21] [21]

Covington and Joe D

Michael A. Covington and Joe D. McFall and , title =. Journal of Quantitative Linguistics , volume =. 2010 , publisher =. doi:10.1080/09296171003643098 , URL =

work page doi:10.1080/09296171003643098 2010

[22] [22]

McCarthy and Scott Jarvis , title =

Philip M. McCarthy and Scott Jarvis , title =. Language Testing , volume =. 2007 , doi =

2007

[23] [23]

and Jarvis, Scott , journal=

McCarthy, Philip M. and Jarvis, Scott , journal=. 2010 , publisher=

2010

[24] [24]

Studies in Second Language Acquisition , year=

Lexis in composition: a performance analysis of Swedish learners' written English , author=. Studies in Second Language Acquisition , year=

[25] [25]

Peter and Fishburne, Robert P

Kincaid, J. Peter and Fishburne, Robert P. Jr. and Rogers, Richard L. and Chissom, Brad S. , institution=. 1975 , url=

1975

[26] [26]

, author=

A computer readability formula designed for machine scoring. , author=. Journal of Applied Psychology , volume=. 1975 , publisher=

1975

[27] [27]

Journal of reading , volume=

SMOG grading-a new readability formula , author=. Journal of reading , volume=. 1969 , publisher=

1969

[28] [28]

Bj. L. 1968 , publisher=

1968

[29] [29]

Seventh Australian Reading Association Conference , pages=

Anderson, Jonathan , year=. Seventh Australian Reading Association Conference , pages=

[30] [30]

and Turney, Peter D

Mohammad, Saif M. and Turney, Peter D. , title =. Computational Intelligence , volume =. doi:https://doi.org/10.1111/j.1467-8640.2012.00460.x , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8640.2012.00460.x , abstract =

work page doi:10.1111/j.1467-8640.2012.00460.x 2012

[31] [31]

CEUR Workshop proceedings , volume=

Hurtlex: A multilingual lexicon of words to hurt , author=. CEUR Workshop proceedings , volume=. 2018 , organization=

2018

[32] [32]

and Sheth, Amit , title =

Rezvan, Mohammadreza and Shekarpour, Saeedeh and Balasuriya, Lakshika and Thirunarayan, Krishnaprasad and Shalin, Valerie L. and Sheth, Amit , title =. Proceedings of the 10th ACM Conference on Web Science , pages =. 2018 , isbn =. doi:10.1145/3201064.3201103 , abstract =

work page doi:10.1145/3201064.3201103 2018

[33] [33]

Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021) , year =

Deepak Kumar and Patrick Gage Kelley and Sunny Consolvo and Joshua Mason and Elie Bursztein and Zakir Durumeric and Kurt Thomas and Michael Bailey , title =. Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021) , year =

2021

[34] [34]

Behavior research methods, instruments, & computers , volume=

Coh-Metrix: Analysis of text on cohesion and language , author=. Behavior research methods, instruments, & computers , volume=. 2004 , publisher=

2004

[35] [35]

Data Protection and Privacy , volume=

The dataset nutrition label , author=. Data Protection and Privacy , volume=. 2020 , publisher=

2020

[36] [36]

Smith, Nicole DeCario, and Will Buchanan

Pushkarna, Mahima and Zaldivar, Andrew and Kjartansson, Oddur , title =. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , pages =. 2022 , isbn =. doi:10.1145/3531146.3533231 , abstract =

work page doi:10.1145/3531146.3533231 2022

[37] [37]

Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor , journal=. The

[38] [38]

Measuring Massive Multitask Language Understanding , author=

[39] [39]

2025 , eprint=

Are We Done with MMLU? , author=. 2025 , eprint=

2025

[40] [41]

Toward a perspectivist turn in ground truthing for predictive computing

Toward a Perspectivist Turn in Ground Truthing for Predictive Computing , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2023 , month=. doi:10.1609/aaai.v37i6.25840 , abstractNote=

work page doi:10.1609/aaai.v37i6.25840 2023

[41] [42]

The 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates , author=. The 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

2024

[42] [43]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Everyone’s voice matters: Quantifying annotation disagreement using demographic information , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2023 , url=

2023

[43] [44]

2311.04345 , archivePrefix=

Wenbo Zhang and Hangzhi Guo and Ian D Kivlichan and Vinodkumar Prabhakaran and Davis Yadav and Amulya Yadav , year=. 2311.04345 , archivePrefix=

work page arXiv

[44] [45]

Learning to

Liu, Tong and Venkatachalam, Akash and Sanjay Bongale, Pratik and Homan, Christopher , date =. Learning to. Companion. doi:10.1145/3308560.3317082 , url =

work page doi:10.1145/3308560.3317082

[45] [46]

and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo , title =

Uma, Alexandra N. and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo , date =. Learning from. 2021 , journal =. doi:10.1613/jair.1.12752 , url =

work page doi:10.1613/jair.1.12752 2021

[46] [47]

and Sanderson, Mark , date =

Hettiachchi, Danula and Holcombe-James, Indigo and Livingstone, Stephanie and Silva, Anjalee de and Lease, Matthew and Salim, Flora D. and Sanderson, Mark , date =. How. 2023 , pages =. doi:10.1609/hcomp.v11i1.27546 , url =

work page doi:10.1609/hcomp.v11i1.27546 2023

[47] [48]

1982 , publisher =

Attitudes Towards Language Variation: Social and Applied Contexts , series =. 1982 , publisher =

1982

[48] [49]

Kircher, Ruth and Zipp, Lena , editor =. An. Research. 2022 , pages =. doi:10.1017/9781108867788.002 , url =

work page doi:10.1017/9781108867788.002 2022

[49] [50]

Ordinal Regression Models in Psychology: A Tutorial , shorttitle =

B. Ordinal Regression Models in Psychology: A Tutorial , shorttitle =. 2019 , journal =

2019

[50] [51]

and Polson, Nicholas G

Carvalho, Carlos M. and Polson, Nicholas G. and Scott, James G. , year =. Handling. Proceedings of the

[51] [52]

2017 , journal =

Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors , author =. 2017 , journal =

2017

[52] [53]

2017.brms: AnRPackage for Bayesian Multilevel Models UsingStan.Journal of Statistical Software80, 1 (2017)

Paul-Christian Bürkner , journal =. 2017 , volume =. doi:10.18637/jss.v080.i01 , encoding =

work page doi:10.18637/jss.v080.i01 2017

[53] [54]

2019 , journal =

Shrinkage Priors for. 2019 , journal =

2019

[54] [55]

elfen: A Python Package for Efficient Linguistic Feature Extraction for Natural Language Datasets

Maurer, Maximilian. elfen: A Python Package for Efficient Linguistic Feature Extraction for Natural Language Datasets. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 3: System Demonstrations). 2026. doi:10.18653/v1/2026.eacl-demo.5

work page doi:10.18653/v1/2026.eacl-demo.5 2026

[55] [56]

2024 , url =

R: A Language and Environment for Statistical Computing , author =. 2024 , url =

2024

[56] [57]

The effect of rating scale format on response styles: The number of response categories and response category labels , journal =

Bert Weijters and Elke Cabooter and Niels Schillewaert , keywords =. The effect of rating scale format on response styles: The number of response categories and response category labels , journal =. 2010 , issn =. doi:https://doi.org/10.1016/j.ijresmar.2010.02.004 , url =

work page doi:10.1016/j.ijresmar.2010.02.004 2010

[57] [58]

Frontiers in psychology , volume=

Linguistically modulated perception and cognition: The label-feedback hypothesis , author=. Frontiers in psychology , volume=. 2012 , publisher=

2012

[58] [59]

2015 , publisher =

ISCED 2011 Operational Manual: Guidelines for Classifying National Education Programmes and Related Qualifications , author =. 2015 , publisher =. doi:10.1787/9789264228368-en , url =

work page doi:10.1787/9789264228368-en 2011

[59] [60]

Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach , journal =

Jan Kocoń and Alicja Figas and Marcin Gruza and Daria Puchalska and Tomasz Kajdanowicz and Przemysław Kazienko , keywords =. Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.ipm.2021.102643 , url =

work page doi:10.1016/j.ipm.2021.102643 2021

[60] [61]

Wojcik and Peter H

Jesse Graham and Jonathan Haidt and Sena Koleva and Matt Motyl and Ravi Iyer and Sean P. Wojcik and Peter H. Ditto , keywords =. Moral Foundations Theory: The Pragmatic Validity of Moral Pluralism , editor =. Advances in Experimental Social Psychology , publisher =. 2013 , issn =. doi:https://doi.org/10.1016/B978-0-12-407236-7.00002-4 , url =

work page doi:10.1016/b978-0-12-407236-7.00002-4 2013

[61] [62]

In: Proc

Abercrombie, Gavin and Hovy, Dirk and Prabhakaran, Vinodkumar. Temporal and Second Language Influence on Intra-Annotator Agreement and Stability in Hate Speech Labelling. Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII). 2023. doi:10.18653/v1/2023.law-1.10

work page doi:10.18653/v1/2023.law-1.10 2023

[62] [63]

We need to consider disagreement in evaluation

Basile, Valerio and Fell, Michael and Fornaciari, Tommaso and Hovy, Dirk and Paun, Silviu and Plank, Barbara and Poesio, Massimo and Uma, Alexandra. We Need to Consider Disagreement in Evaluation. Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future. 2021. doi:10.18653/v1/2021.bppf-1.3

work page doi:10.18653/v1/2021.bppf-1.3 2021

[63] [64]

Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity

Beck, Jacob and Eckman, Stephanie and Ma, Bolei and Chew, Rob and Kreuter, Frauke. Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity. Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024). 2024. doi:10.18653/v1/2024.uncertainlp-1.8

work page doi:10.18653/v1/2024.uncertainlp-1.8 2024

[64] [65]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Beck, Tilman and Schuff, Hendrik and Lauscher, Anne and Gurevych, Iryna. Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.159

work page doi:10.18653/v1/2024.eacl-long.159 2024

[65] [66]

D 3 CODE : Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation

Davani, Aida and D \'i az, Mark and Baker, Dylan and Prabhakaran, Vinodkumar. D 3 CODE : Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1029

work page doi:10.18653/v1/2024.emnlp-main.1029 2024

[66] [67]

When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks

Fleisig, Eve and Abebe, Rediet and Klein, Dan. When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.415

work page doi:10.18653/v1/2023.emnlp-main.415 2023

[67] [68]

Intersectionality in AI Safety: Using Multilevel Models to Understand Diverse Perceptions of Safety in Conversational AI

Homan, Christopher and Serapio-Garcia, Gregory and Aroyo, Lora and Diaz, Mark and Parrish, Alicia and Prabhakaran, Vinodkumar and Taylor, Alex and Wang, Ding. Intersectionality in AI Safety: Using Multilevel Models to Understand Diverse Perceptions of Safety in Conversational AI. Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspe...

2024

[68] [69]

Annotation Sensitivity: Training Data Collection Methods Affect Model Performance

Kern, Christoph and Eckman, Stephanie and Beck, Jacob and Chew, Rob and Ma, Bolei and Kreuter, Frauke. Annotation Sensitivity: Training Data Collection Methods Affect Model Performance. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.992

work page doi:10.18653/v1/2023.findings-emnlp.992 2023

[69] [70]

Reconsidering Annotator Disagreement about Racist Language: Noise or Signal?

Larimore, Savannah and Kennedy, Ian and Haskett, Breon and Arseniev-Koehler, Alina. Reconsidering Annotator Disagreement about Racist Language: Noise or Signal?. Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media. 2021. doi:10.18653/v1/2021.socialnlp-1.7

work page doi:10.18653/v1/2021.socialnlp-1.7 2021

[70] [71]

and Nivre, Joakim and Zeman, Daniel

de Marneffe, Marie-Catherine and Manning, Christopher D. and Nivre, Joakim and Zeman, Daniel. U niversal D ependencies. Computational Linguistics. 2021. doi:10.1162/coli_a_00402

work page doi:10.1162/coli_a_00402 2021

[71] [72]

Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 E nglish Words

Mohammad, Saif. Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 E nglish Words. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1017

work page doi:10.18653/v1/p18-1017 2018

[72] [73]

Word Affect Intensities

Mohammad, Saif. Word Affect Intensities. Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018). 2018

2018

[73] [74]

Emotions Evoked by Common Words and Phrases: Using M echanical T urk to Create an Emotion Lexicon

Mohammad, Saif and Turney, Peter. Emotions Evoked by Common Words and Phrases: Using M echanical T urk to Create an Emotion Lexicon. Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. 2010

2010

[74] [75]

Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Davani, Aida and D \'i az, Mark and Prabhakaran, Vinodkumar. Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00449

work page doi:10.1162/tacl_a_00449 2022

[75] [76]

Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions

Orlikowski, Matthias and Pei, Jiaxin and R. Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.104

work page doi:10.18653/v1/2025.acl-long.104 2025

[76] [77]

The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics

Orlikowski, Matthias and R. The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.88

work page doi:10.18653/v1/2023.acl-short.88 2023

[77] [78]

When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset

Pei, Jiaxin and Jurgens, David. When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset. Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII). 2023. doi:10.18653/v1/2023.law-1.25

work page doi:10.18653/v1/2023.law-1.25 2023

[78] [79]

Is a bunch of words enough to detect disagreement in hateful content?

Rizzi, Giulia and Rosso, Paolo and Fersini, Elisabetta. Is a bunch of words enough to detect disagreement in hateful content?. Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation. 2025

2025

[79] [80]

The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism

Sachdeva, Pratik and Barreto, Renata and Bacon, Geoff and Sahn, Alexander and von Vacano, Claudia and Kennedy, Chris. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism. Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022. 2022

2022

[80] [81]

doi: 10.18653/v1/2022.naacl-main.431

Sap, Maarten and Swayamdipta, Swabha and Vianna, Laura and Zhou, Xuhui and Choi, Yejin and Smith, Noah A. Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10...

work page doi:10.18653/v1/2022.naacl-main.431 2022