pith. machine review for the scientific record.

arxiv: 2602.11318 · v3 · submitted 2026-02-11 · 💻 cs.AI · cs.CL · cs.CY

Recognition: 2 theorem links · Lean Theorem

The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:06 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CY
keywords data annotation · ground truth · subjectivity · consensus bias · pluralistic annotation · anchoring bias · human-AI collaboration · AI ethics

The pith

Human disagreement in data annotation is a signal of cultural diversity, not noise to be eliminated by consensus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the standard practice of seeking a single ground truth in machine learning datasets treats human subjectivity as error, which leads to biased models. Through a review of recent literature, it identifies how model-mediated annotations create anchoring bias and silence diverse human perspectives. Geographic and economic pressures further enforce Western norms. The authors call for annotation systems that map disagreement instead of suppressing it to build more culturally aware AI.

Core claim

The foundational ground truth paradigm in machine learning rests on a positivistic fallacy that mischaracterizes human disagreement as technical noise rather than a vital sociotechnical signal. Systemic failures in positional legibility and the shift to human-as-verifier models with model-mediated annotations introduce anchoring bias that removes human voices from the loop. Geographic hegemony imposes Western norms, enforced by precarious workers who comply to avoid penalties. Disagreement should be reclaimed as a high-fidelity signal for culturally competent models.

What carries the argument

The consensus trap, where the drive for agreement in annotation practices combined with model mediation enforces a singular truth and discards subjective diversity.

If this is right

  • Annotation processes that prioritize consensus will produce datasets lacking representation of non-Western perspectives.
  • Models trained on such data will underperform on culturally diverse tasks due to embedded biases.
  • Precarious data workers will continue to suppress their own subjectivity to meet requester expectations.
  • Reclaiming disagreement requires new infrastructures that value pluralistic responses over singular labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Implementing pluralistic annotation could improve fairness in AI systems deployed globally.
  • Future annotation tools might use disagreement metrics as a quality indicator rather than an error rate (a minimal sketch follows this list).
  • Research into model-mediated annotation should test for anchoring effects in controlled experiments.

Load-bearing premise

The analysis of papers from only seven specific venues between 2020 and 2025 fully represents the mechanisms in all data annotation practices.

What would settle it

A study that applies the same reflexive thematic analysis to papers from additional venues outside the seven selected and finds no evidence of positional legibility failures or model-mediated anchoring bias.

Figures

Figures reproduced from arXiv: 2602.11318 by Benjamin Mah, Ding Wang, Edith Law, Julian Posada, Krisha Kalsi, Sheza Munir, Shivani Kapania, Syed Ishtiaque Ahmed.

Figure 1
Figure 1. Overview of the methodology. view at source ↗
Figure 2
Figure 2. PRISMA flow diagram of the systematic review, showing record identification, keyword filtration, and screening. view at source ↗
read the original abstract

In machine learning, "ground truth" refers to the assumed correct labels used to train and evaluate models. However, the foundational "ground truth" paradigm rests on a positivistic fallacy that treats human disagreement as technical noise rather than a vital sociotechnical signal. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues: ACL, AIES, CHI, CSCW, EAAMO, FAccT, and NeurIPS, investigating the mechanisms in data annotation practices that facilitate this "consensus trap". Our reflexive thematic analysis of 346 papers reveals that systemic failures in positional legibility, combined with the recent architectural shift toward human-as-verifier models, specifically the reliance on model-mediated annotations, introduce deep-seated anchoring bias and effectively remove human voices from the loop. We further demonstrate how geographic hegemony imposes Western norms as universal benchmarks, often enforced by the performative alignment of precarious data workers who prioritize requester compliance over honest subjectivity to avoid economic penalties. Critiquing the "noisy sensor" fallacy, where statistical models misdiagnose pluralism as error, we argue for reclaiming disagreement as a high-fidelity signal essential for building culturally competent models. To address these systemic tensions, we propose a roadmap for pluralistic annotation infrastructures that shift the objective from discovering a singular "right" answer to mapping the diversity of human experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a reflexive thematic analysis of 346 papers (2020–2025) from seven venues (ACL, AIES, CHI, CSCW, EAAMO, FAccT, NeurIPS) to argue that the 'ground truth' paradigm in ML data annotation rests on a positivistic fallacy that treats disagreement as technical noise. It identifies systemic failures in positional legibility and the shift to human-as-verifier / model-mediated annotation as sources of anchoring bias that suppress subjective human voices, enforce Western geographic hegemony, and penalize precarious workers for non-compliance; it critiques the 'noisy sensor' statistical framing and proposes pluralistic annotation infrastructures that map diversity rather than seek singular consensus.

Significance. If the thematic findings prove robust, the work offers a timely reframing of disagreement as high-fidelity signal rather than error, with direct implications for culturally competent model development and responsible AI data practices. The explicit roadmap for pluralistic infrastructures and the systematic scope across multiple venues constitute concrete strengths that could influence both research and industry annotation pipelines.

major comments (2)
  1. [Methods] Methods section: the reflexive thematic analysis provides no exact search strings, Boolean queries, inclusion/exclusion criteria, or inter-coder reliability metrics for the 346 papers. Without these details the reproducibility of the extracted themes (positional legibility failures, anchoring bias, geographic hegemony) cannot be evaluated, directly weakening support for the systemic claims.
  2. [Abstract and §4] Abstract and §4 (Findings): the diagnosis of a field-wide 'consensus trap' rests on literature drawn exclusively from seven venues that skew toward critical/sociotechnical scholarship. No evidence is presented that the identified mechanisms dominate in computer-vision pipelines, large-scale industry datasets, or non-Western annotation communities; this selection limits the warrant for generalizing to 'systemic' failures across all data annotation.
minor comments (2)
  1. [Abstract] Abstract: the venue list and paper count appear late; moving the scope statement earlier would improve immediate clarity.
  2. [Introduction] The term 'positional legibility' is introduced without an explicit definition or citation on first use, requiring readers to infer its meaning from later examples.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while incorporating revisions for improved transparency and scope qualification where warranted.

read point-by-point responses
  1. Referee: [Methods] Methods section: the reflexive thematic analysis provides no exact search strings, Boolean queries, inclusion/exclusion criteria, or inter-coder reliability metrics for the 346 papers. Without these details the reproducibility of the extracted themes (positional legibility failures, anchoring bias, geographic hegemony) cannot be evaluated, directly weakening support for the systemic claims.

    Authors: We acknowledge that the original Methods section omitted explicit search strings, Boolean queries, and inclusion/exclusion criteria, which limits immediate reproducibility. Reflexive thematic analysis (per Braun & Clarke) is interpretive and does not use inter-coder reliability metrics, as themes emerge through iterative researcher engagement rather than consensus coding. In the revised manuscript we have added a new subsection with the exact search strings (e.g., combinations of 'data annotation' OR 'ground truth' OR 'labeling' AND 'disagreement' OR 'subjectivity' OR 'consensus'), Boolean operators applied to the seven venues' 2020–2025 proceedings, inclusion criteria (papers addressing sociotechnical aspects of annotation), and exclusion criteria (purely technical ML papers without human-centered analysis). This addition directly addresses the concern while preserving the reflexive stance (a minimal sketch of this filtration step appears after these responses). revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (Findings): the diagnosis of a field-wide 'consensus trap' rests on literature drawn exclusively from seven venues that skew toward critical/sociotechnical scholarship. No evidence is presented that the identified mechanisms dominate in computer-vision pipelines, large-scale industry datasets, or non-Western annotation communities; this selection limits the warrant for generalizing to 'systemic' failures across all data annotation.

    Authors: The seven venues were deliberately chosen because they constitute the primary academic outlets where sociotechnical critiques of data annotation, 'ground truth,' and related fairness issues are most extensively developed; the paper's scope is therefore the discourse within these venues rather than a claim of universality across all ML subfields. We do not present evidence that the mechanisms dominate computer-vision pipelines or non-Western industry settings, as that lies outside the sampled literature. In revision we have updated the abstract and §4 to explicitly qualify all claims as pertaining to the analyzed venues, added a dedicated limitations paragraph acknowledging the critical-scholarship skew, and included a forward-looking statement calling for complementary empirical work in industry and non-Western annotation communities. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external literature synthesis, not self-referential reduction

full rationale

The paper conducts a reflexive thematic analysis of 346 papers drawn from seven external venues (ACL, AIES, CHI, CSCW, EAAMO, FAccT, NeurIPS). Its core claims—positional legibility failures, anchoring bias from human-as-verifier models, geographic hegemony, and the noisy-sensor fallacy—are presented as interpretive findings from that corpus rather than as outputs of any fitted parameters, self-defined equations, or load-bearing self-citations that collapse back into the paper’s own inputs. No derivation step equates a “prediction” to a quantity constructed from the authors’ prior work or from the analysis itself by definition. The roadmap for pluralistic annotation infrastructures is offered as a forward proposal, not a retrofitted restatement of the reviewed material. The analysis therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on qualitative assumptions about the nature of disagreement and bias rather than quantitative parameters or new entities.

axioms (2)
  • domain assumption: Human disagreement during annotation constitutes a high-fidelity signal of diversity rather than technical noise or error.
    Invoked throughout the critique of the 'noisy sensor' fallacy and the call to reclaim disagreement.
  • domain assumption: Papers published 2020–2025 in the seven listed premier venues are representative of broader data annotation practices.
    Basis for the systematic review scope and the generalizability of findings.

pith-pipeline@v0.9.0 · 5576 in / 1391 out tokens · 65197 ms · 2026-05-16T05:06:32.652853+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages
