arxiv: 2606.05936 · v1 · pith:HLTOVTEGnew · submitted 2026-06-04 · 💻 cs.CL

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

Pith reviewed 2026-06-28 02:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords epistemic injustice · language models · pretraining filters · guardrails · epistemic erasure · content moderation · marginalized groups · representational harms

The pith

Pretraining filters and guardrails disproportionately remove mentions of marginalized groups from language model data and outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits four pretraining filters and three inference-time guardrails on Common Crawl sentences that mention gender and regional origins, along with a manually annotated subset of 500 sentences. Automated decisions track closely with simple blocklist word cues and miss much private information or explicit hate speech. The same systems flag content about transgender people, women, and Central Americans at significantly higher rates than other content. Human annotators would retain 88.5 percent of the filter-flagged sentences and 91.3 percent of the guardrail-flagged sentences, citing representational harms that the automated rules overlook. The combined effect is a double removal of marginalized perspectives before training and again at inference time.

Core claim

Filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.

What carries the argument

The audit that compares automated filter and guardrail outputs against human retention judgments on sentences containing gender and regional-origin mentions.

If this is right

  • Mentions of marginalized groups are removed at higher rates before pretraining than other content.
  • The same mentions are suppressed again by guardrails at inference time.
  • Decisions rest on blocklist lexical cues rather than detection of private information or hate speech.
  • Human judges identify representational harms in content that the automated systems remove.
  • The pattern produces epistemic erasure through repeated suppression of marginalized perspectives.
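The reliance on lexical cues described above can be illustrated with a minimal sketch. The blocklist terms, the `blocklist_flags` helper, and the example sentences are all hypothetical and are not taken from any of the audited filters; the point is only to show how a word-match rule can flag a benign identity mention while passing a sentence that contains private information.

```python
import re

# Hypothetical blocklist for illustration only; not the word list of any audited filter.
BLOCKLIST = {"sex", "transgender", "lesbian"}

def blocklist_flags(sentence):
    """Return True if any lowercase word token of the sentence is in the blocklist."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok in BLOCKLIST for tok in tokens)

# Hypothetical sentences: one benign identity mention, one leaking private details.
examples = [
    "The clinic offers support groups for transgender teenagers.",
    "Call Maria at 555-0142, she lives at 12 Elm Street.",
]

flags = [blocklist_flags(s) for s in examples]
for sentence, flagged in zip(examples, flags):
    print(f"{'FLAG' if flagged else 'KEEP'}: {sentence}")
```

Under these assumptions the rule removes the first sentence and keeps the second, which is the mismatch the audit attributes to blocklist-driven decisions.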

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could test whether adding human review for identity-related edge cases reduces the observed over-flagging.
  • The same lexical-cue reliance may produce similar erasure on other identity categories not tested in this audit.
  • Replacing blocklists with criteria that better match human retention judgments would change which content reaches training data.
  • Downstream models may inherit reduced coverage of marginalized experiences as a direct result of these upstream choices.

Load-bearing premise

The 500 manually annotated sentences form a representative sample and human annotators' retention judgments accurately capture representational harms that automated systems miss.
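How much weight the retention rates can bear depends on how many flagged sentences the annotators actually judged, a count this summary does not report. The sketch below computes a Wilson score interval around an 88.5% retention rate for several hypothetical denominators; the sample sizes are assumptions chosen for illustration, not figures from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical numbers of flagged sentences; the real denominator is not given here.
for n_flagged in (50, 200, 1000):
    retained = round(0.885 * n_flagged)
    low, high = wilson_interval(retained, n_flagged)
    print(f"n={n_flagged}: retained={retained}, 95% interval=({low:.3f}, {high:.3f})")
```

The interval narrows as the number of judged sentences grows, which is why the sampling frame and per-category counts matter for the headline percentages.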

What would settle it

A larger or differently sampled audit that finds no over-flagging of marginalized-group mentions by the same filters and guardrails.
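One way such a replication could check for over-flagging is to compare the flag rate for sentences mentioning a marginalized group with the rate for a comparison group. The sketch below uses a pooled two-proportion z-test; the counts are invented for illustration, and the paper's own statistical procedure may differ.

```python
import math

def two_proportion_z(flag_a, n_a, flag_b, n_b):
    """Return (rate ratio, z statistic, two-sided p-value) for two flag rates."""
    p_a, p_b = flag_a / n_a, flag_b / n_b
    pooled = (flag_a + flag_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a / p_b, z, p_value

# Hypothetical counts: 120 of 1000 group-A sentences flagged vs 60 of 1000 group-B.
ratio, z, p_value = two_proportion_z(120, 1000, 60, 1000)
print(f"rate ratio={ratio:.2f}, z={z:.2f}, p={p_value:.2g}")
```

A replication that found rate ratios near 1 with non-significant differences across the same systems would count against the over-flagging claim.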

Figures

Figures reproduced from arXiv: 2606.05936 by Anne Lauscher, A Pranav, Christian Hardmeier, Marco Antonio Stranisci, Rossana Damiano.

Figure 1
Figure 1: Pairwise Cohen’s κ between filter and guardrails. Bold marks q < .05; underline marks q < .01.
Figure 3
Figure 3: Flag rates for the five world regions with …
Figure 4
Figure 4: Annotators disagree with most system ver…
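Figure 1 reports pairwise Cohen's κ between systems. As a minimal sketch of that statistic for two systems making binary flag decisions on the same sentences, the function below uses the standard definition κ = (p_o − p_e) / (1 − p_e). The flag vectors are hypothetical, and the q-value correction mentioned in the caption is not reproduced here.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) decisions."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each system's marginal flag rate.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both systems are constant and identical
    return (observed - expected) / (1 - expected)

# Hypothetical decisions on eight sentences (1 = flagged, 0 = kept).
filter_flags = [1, 1, 0, 0, 1, 0, 0, 0]
guardrail_flags = [1, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa(filter_flags, guardrail_flags)
print(f"kappa = {kappa:.3f}")
```

Values near 0 indicate agreement no better than chance, so low pairwise κ between a filter and a guardrail would mean the two systems are removing largely different content.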
Original abstract

Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper audits four pretraining filters and three inference-time guardrails applied to Common Crawl sentences containing gender and regional-origin mentions. Using a manually annotated subset of 500 sentences, it reports that filtering and guardrail decisions correlate strongly with blocklist lexical cues, often miss private information or explicit hate speech, and disproportionately flag content mentioning marginalized groups (especially transgender people, women, and Central Americans). Human annotators would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged items, leading to the claim that these systems produce epistemic erasure by removing mentions of marginalized groups both before pretraining and at inference time.

Significance. If the sampling, annotation, and statistical associations hold, the audit supplies concrete evidence that automated content-moderation stages in language-model pipelines can systematically suppress representation of certain demographic groups while failing to address other harms, revealing a mismatch between system behavior and human judgments of representational value. Such findings could guide the design of more transparent and less biased filtering mechanisms.

major comments (1)
  1. [Abstract / annotation description] Abstract and the section describing the 500-sentence annotation (implied in the methods and results): the central claims of disproportionate over-flagging of marginalized groups and of human retention rates (88.5% and 91.3%) rest entirely on this manually annotated subset. The manuscript supplies no sampling frame, stratification by demographic category, selection criteria from the larger Common Crawl corpus, annotation protocol, inter-annotator agreement statistics, or operational definition of what counts as a retention-worthy sentence versus one exhibiting representational harm. Without these details the quantitative associations cannot support the epistemic-erasure conclusion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater methodological transparency in our annotation procedure. We address the comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Abstract / annotation description] Abstract and the section describing the 500-sentence annotation (implied in the methods and results): the central claims of disproportionate over-flagging of marginalized groups and of human retention rates (88.5% and 91.3%) rest entirely on this manually annotated subset. The manuscript supplies no sampling frame, stratification by demographic category, selection criteria from the larger Common Crawl corpus, annotation protocol, inter-annotator agreement statistics, or operational definition of what counts as a retention-worthy sentence versus one exhibiting representational harm. Without these details the quantitative associations cannot support the epistemic-erasure conclusion.

    Authors: We agree that the current manuscript provides insufficient detail on the 500-sentence annotation to fully substantiate the reported associations and human retention rates. The manuscript does not include a sampling frame, stratification details, explicit selection criteria, annotation protocol, inter-annotator agreement statistics, or operational definitions for retention decisions. In the revised version we will expand the Methods section with a dedicated subsection that supplies: (1) the sampling frame and selection criteria from the larger Common Crawl corpus, including any stratification by gender or regional-origin categories; (2) the complete annotation protocol and guidelines given to annotators; (3) inter-annotator agreement statistics; and (4) operational definitions distinguishing retention-worthy sentences from those exhibiting representational harm. These additions will strengthen the evidential basis for the epistemic-erasure claims without altering the core quantitative findings.

    Revision: yes

Circularity Check

0 steps flagged

Empirical audit with no derivations or self-referential reductions

The paper is an empirical audit of pretraining filters and guardrails on Common Crawl sentences, reporting observed associations between lexical cues, demographic mentions, and flagging rates, plus human retention judgments on a 500-sentence subset. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. Central claims rest on data associations and external human annotations rather than any step that reduces by construction to its own inputs or prior self-citations. This is a standard empirical study whose reasoning chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on treating human retention judgments as a valid external benchmark and on the sampled sentences being representative of Common Crawl content containing group mentions.

axioms (1)
  • domain assumption: Human annotators' decisions on content retention provide a reliable ground truth for evaluating representational harms.
    Paper contrasts automated flags with human retention rates (88.5% and 91.3%) to argue systems over-flag.
