Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

Anne Lauscher; A Pranav; Christian Hardmeier; Marco Antonio Stranisci; Rossana Damiano

arxiv: 2606.05936 · v1 · pith:HLTOVTEGnew · submitted 2026-06-04 · 💻 cs.CL

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

Marco Antonio Stranisci , A Pranav , Rossana Damiano , Christian Hardmeier , Anne Lauscher This is my paper

Pith reviewed 2026-06-28 02:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords epistemic injusticelanguage modelspretraining filtersguardrailsepistemic erasurecontent moderationmarginalized groupsrepresentational harms

0 comments

The pith

Pretraining filters and guardrails disproportionately remove mentions of marginalized groups from language model data and outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits four pretraining filters and three inference-time guardrails on Common Crawl sentences that mention gender and regional origins, along with a manually annotated subset of 500 sentences. Automated decisions track closely with simple blocklist word cues and miss much private information or explicit hate speech. The same systems flag content about transgender people, women, and Central Americans at significantly higher rates than other content. Human annotators would retain 88.5 percent of the filter-flagged sentences and 91.3 percent of the guardrail-flagged sentences, citing representational harms that the automated rules overlook. The combined effect is a double removal of marginalized perspectives before training and again at inference time.

Core claim

Filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately rem

What carries the argument

The audit that compares automated filter and guardrail outputs against human retention judgments on sentences containing gender and regional-origin mentions.

If this is right

Mentions of marginalized groups are removed at higher rates before pretraining than other content.
The same mentions are suppressed again by guardrails at inference time.
Decisions rest on blocklist lexical cues rather than detection of private information or hate speech.
Human judges identify representational harms in content that the automated systems remove.
The pattern produces epistemic erasure through repeated suppression of marginalized perspectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could test whether adding human review for identity-related edge cases reduces the observed over-flagging.
The same lexical-cue reliance may produce similar erasure on other identity categories not tested in this audit.
Replacing blocklists with criteria that better match human retention judgments would change which content reaches training data.
Downstream models may inherit reduced coverage of marginalized experiences as a direct result of these upstream choices.

Load-bearing premise

The 500 manually annotated sentences form a representative sample and human annotators' retention judgments accurately capture representational harms that automated systems miss.

What would settle it

A larger or differently sampled audit that finds no over-flagging of marginalized-group mentions by the same filters and guardrails.

Figures

Figures reproduced from arXiv: 2606.05936 by Anne Lauscher, A Pranav, Christian Hardmeier, Marco Antonio Stranisci, Rossana Damiano.

**Figure 1.** Figure 1: Pairwise Cohen’s κ between filter and guardrails. Bold marks q < .05; underline marks q < .01. 4 Findings In this section, we show our findings on agreement audits between filters and guardrails, show the account of epistemic erasure of the identities and give a walkthrough of tensions in content moderation with examples. 4.1 Study of disagreement of filters and guardrails In this section we show how the … view at source ↗

**Figure 3.** Figure 3: Flag rates for the five world regions with [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Annotators disagree with most system ver [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5\% of filter-flagged and 91.3\% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Audit flags over-removal of marginalized mentions but the 500-sentence sample has no visible sampling or annotation details, so the disproportion claim stays provisional.

read the letter

The paper audits four pretraining filters and three guardrails on Common Crawl sentences with gender and regional mentions. It reports that decisions track blocklist lexical cues, that systems miss private information and hate speech, and that transgender, women, and Central American mentions get flagged more often than others. Human annotators would keep 88.5-91.3% of the flagged items. That human contrast and the cross-system comparison on concrete categories are the concrete pieces of new work here.

The approach is straightforward and extends existing auditing literature by adding the retention judgments. The lexical-cue association and the specific failure modes on hate speech are useful observations.

The soft spot is the 500-sentence subset that carries the disproportion and retention numbers. The abstract supplies no sampling frame, no stratification by category, no selection criteria, no annotation protocol, and no inter-annotator numbers. The stress-test concern therefore lands: without those details the retention percentages cannot reliably show that the systems are erasing content humans would keep, nor that the over-flagging is disproportionate in a representative way. The central epistemic-erasure claim rests on that unverified base.

This is observational work with no fitted parameters or derivations, so evidence quality is the whole game. Readers working on content-moderation pipelines or fairness audits will find the failure cases worth checking once the methods are filled in. The topic is timely enough that a serious editor should send it to referees rather than desk-reject, with the explicit request that the sampling and annotation sections be expanded before any stronger claims are made.

Referee Report

1 major / 0 minor

Summary. The paper audits four pretraining filters and three inference-time guardrails applied to Common Crawl sentences containing gender and regional-origin mentions. Using a manually annotated subset of 500 sentences, it reports that filtering and guardrail decisions correlate strongly with blocklist lexical cues, often miss private information or explicit hate speech, and disproportionately flag content mentioning marginalized groups (especially transgender people, women, and Central Americans). Human annotators would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged items, leading to the claim that these systems produce epistemic erasure by removing mentions of marginalized groups both before pretraining and at inference time.

Significance. If the sampling, annotation, and statistical associations hold, the audit supplies concrete evidence that automated content-moderation pipelines in language-model pipelines can systematically suppress representation of certain demographic groups while failing to address other harms, revealing a mismatch between system behavior and human judgments of representational value. Such findings could guide the design of more transparent and less biased filtering mechanisms.

major comments (1)

[Abstract / annotation description] Abstract and the section describing the 500-sentence annotation (implied in the methods and results): the central claims of disproportionate over-flagging of marginalized groups and of human retention rates (88.5% and 91.3%) rest entirely on this manually annotated subset. The manuscript supplies no sampling frame, stratification by demographic category, selection criteria from the larger Common Crawl corpus, annotation protocol, inter-annotator agreement statistics, or operational definition of what counts as a retention-worthy sentence versus one exhibiting representational harm. Without these details the quantitative associations cannot support the epistemic-erasure conclusion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater methodological transparency in our annotation procedure. We address the comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Abstract / annotation description] Abstract and the section describing the 500-sentence annotation (implied in the methods and results): the central claims of disproportionate over-flagging of marginalized groups and of human retention rates (88.5% and 91.3%) rest entirely on this manually annotated subset. The manuscript supplies no sampling frame, stratification by demographic category, selection criteria from the larger Common Crawl corpus, annotation protocol, inter-annotator agreement statistics, or operational definition of what counts as a retention-worthy sentence versus one exhibiting representational harm. Without these details the quantitative associations cannot support the epistemic-erasure conclusion.

Authors: We agree that the current manuscript provides insufficient detail on the 500-sentence annotation to fully substantiate the reported associations and human retention rates. The manuscript does not include a sampling frame, stratification details, explicit selection criteria, annotation protocol, inter-annotator agreement statistics, or operational definitions for retention decisions. In the revised version we will expand the Methods section with a dedicated subsection that supplies: (1) the sampling frame and selection criteria from the larger Common Crawl corpus, including any stratification by gender or regional-origin categories; (2) the complete annotation protocol and guidelines given to annotators; (3) inter-annotator agreement statistics; and (4) operational definitions distinguishing retention-worthy sentences from those exhibiting representational harm. These additions will strengthen the evidential basis for the epistemic-erasure claims without altering the core quantitative findings. revision: yes

Circularity Check

0 steps flagged

Empirical audit with no derivations or self-referential reductions

full rationale

The paper is an empirical audit of pretraining filters and guardrails on Common Crawl sentences, reporting observed associations between lexical cues, demographic mentions, and flagging rates, plus human retention judgments on a 500-sentence subset. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. Central claims rest on data associations and external human annotations rather than any step that reduces by construction to its own inputs or prior self-citations. This is a standard empirical study whose reasoning chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on treating human retention judgments as a valid external benchmark and on the sampled sentences being representative of Common Crawl content containing group mentions.

axioms (1)

domain assumption Human annotators' decisions on content retention provide a reliable ground truth for evaluating representational harms
Paper contrasts automated flags with human retention rates (88.5% and 91.3%) to argue systems over-flag.

pith-pipeline@v0.9.1-grok · 5752 in / 1182 out tokens · 38942 ms · 2026-06-28T02:04:03.138710+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 16 canonical work pages · 6 internal anchors

[1]

Emily M Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell

A survey on data selection for language models.arXiv preprint arXiv:2402.16827. Emily M Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell

work page arXiv
[2]

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach

On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM confer- ence on fairness, accountability, and transparency, pages 610–623. Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach

2021
[3]

org/abs/2012.07805

Extracting training data from large language models. Preprint, arXiv:2012.07805. Tommaso Caselli, Valerio Basile, Jelena Mitrovi´c, and Michael Granitzer

work page arXiv 2012
[4]

In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 17–25, Online

HateBERT: Retraining BERT for abusive language detection in English. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 17–25, Online. As- sociation for Computational Linguistics. Aida Davani, Mark Díaz, Dylan Baker, and Vinodkumar Prabhakaran

2021
[5]

InPro- ceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 2007–2021

Disentangling perceptions of of- fensiveness: Cultural and moral correlates. InPro- ceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 2007–2021. Aida Mostafazadeh Davani, Mohammad Atari, Bren- dan Kennedy, and Morteza Dehghani

2024
[6]

Preprint, arXiv:2110.14839

Hate speech classifiers learn human-like social stereotypes. Preprint, arXiv:2110.14839. Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber

work page arXiv
[7]

Automated Hate Speech Detection and the Problem of Offensive Language

Automated hate speech detec- tion and the problem of offensive language.Preprint, arXiv:1703.04009. Jesse Dodge, Maarten Sap, Ana Marasovi ´c, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner

work page internal anchor Pith review Pith/arXiv arXiv
[8]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Online and Punta Cana, Dominican Republic

Documenting large webtext corpora: A case study on the colos- sal clean crawled corpus. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Rebecca Dorn, Lee Kezar, Fred Morstatter, and Kristina Lerman

2021
[9]

InProceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 335–342

Reclaim project: Exploring italian slurs reappropriation with large lan- guage models. InProceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 335–342. 9 Fredo Erxleben, Michael Günther, Markus Krötzsch, Ju- lian Mendez, and Denny Vrandeˇci´c

2024
[10]

InThe Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23,

Introduc- ing wikidata to the linked data web. InThe Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23,

2014
[11]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Red teaming language models to reduce harms: Meth- ods, scaling behaviors, and lessons learned.Preprint, arXiv:2209.07858. Leo Gao, Stella Biderman, Sid Black, Laurence Gold- ing, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy

work page internal anchor Pith review Pith/arXiv arXiv
[12]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The pile: An 800gb dataset of diverse text for language modeling. Preprint, arXiv:2101.00027. Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien

work page internal anchor Pith review Pith/arXiv arXiv
[13]

André Belchior Gomes and Aysel Sultan

Aegis: Online adaptive ai content safety moderation with ensemble of llm experts.Preprint, arXiv:2404.05993. André Belchior Gomes and Aysel Sultan

work page arXiv
[14]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Llama guard: Llm-based input-output safeguard for human-ai conversations. Preprint, arXiv:2312.06674. Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Eun Seo Jo and Timnit Gebru

Beavertails: To- wards improved safety alignment of llm via a human- preference dataset.Preprint, arXiv:2307.04657. Eun Seo Jo and Timnit Gebru

work page arXiv
[16]

InProceedings of the 2020 con- ference on fairness, accountability, and transparency, pages 306–316

Lessons from archives: Strategies for collecting sociocultural data in machine learning. InProceedings of the 2020 con- ference on fairness, accountability, and transparency, pages 306–316. Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, and Solon Barocas

2020
[17]

Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo V olpin, Frederic A

Generalization through memorization: Nearest neighbor language models.Preprint, arXiv:1911.00172. Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo V olpin, Frederic A. Dreyer, Aleksandar Sht- edritski, and Yuki M. Asano

work page arXiv 1911
[18]

Preprint, arXiv:2102.04130

Bias out-of-the- box: An empirical analysis of intersectional occupa- tional biases in popular generative language models. Preprint, arXiv:2102.04130. Hadas Kotek, Rikker Dockum, and David Sun

work page arXiv
[19]

InProceedings of The ACM Collective Intelligence Conference, CI ’23, page 12–24, New York, NY , USA

Gender bias and stereotypes in large language models. InProceedings of The ACM Collective Intelligence Conference, CI ’23, page 12–24, New York, NY , USA. Association for Computing Machinery. Tahu Kukutai and John Taylor. 2016.Indigenous data sovereignty: Toward an agenda, volume

2016
[20]

Exploring cross-cultural differences in english hate speech annotations: From dataset construction to analysis. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4205–

2024
[21]

InFind- ings of the Association for Computational Linguistics: ACL 2024, pages 3923–3954, Bangkok, Thailand

SALAD-bench: A hierarchical and comprehensive safety benchmark for large language models. InFind- ings of the Association for Computational Linguistics: ACL 2024, pages 3923–3954, Bangkok, Thailand. As- sociation for Computational Linguistics. Shayne Longpre, Gregory Yauney, Emily Reif, Kather- ine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Ke...

2024
[22]

A pretrainer’s guide to train- ing data: Measuring the effects of data age, domain coverage, quality, & toxicity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: Long Pa- pers), pages 3245–3276, Mexico City, Mexico. Asso- ciation for Computational...

2024
[23]

United Nations

Towards safer pretraining: Analyzing and filtering harmful content in webscale datasets for responsible llms.Preprint, arXiv:2505.02009. United Nations. Statistical Office. 1982.Standard coun- try or area codes for statistical use

work page arXiv 1982
[24]

InProceedings of the 2025 ACM Conference on Fair- ness, Accountability, and Transparency, FAccT ’25, page 3094–3105, New York, NY , USA

The root shapes the fruit: On the persistence of gender-exclusive harms in aligned language models. InProceedings of the 2025 ACM Conference on Fair- ness, Accountability, and Transparency, FAccT ’25, page 3094–3105, New York, NY , USA. Association for Computing Machinery. Guilherme Penedo, Hynek Kydlí ˇcek, Loubna Ben al- lal, Anton Lozhkov, Margaret Mit...

2025
[25]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

The fineweb datasets: Decanting the web for the finest text data at scale.Preprint, arXiv:2406.17557. Rida Qadri, Aida M. Davani, Kevin Robinson, and Vinodkumar Prabhakaran

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Organizers of QueerInAI, A Pranav, MaryLena Bleile, Arjun Subramonian, Luca Soldaini, Danica J

Risks of cul- tural erasure in large language models.Preprint, arXiv:2501.01056. Organizers of QueerInAI, A Pranav, MaryLena Bleile, Arjun Subramonian, Luca Soldaini, Danica J. Suther- land, Sabine Weber, and Pan Xu

work page arXiv
[27]

InPro- ceedings of the 2021 Workshop on Widening NLP, Punta Cana, Dominican Republic

How to make virtual conferences queer-friendly: A guide. InPro- ceedings of the 2021 Workshop on Widening NLP, Punta Cana, Dominican Republic. Conference on Empirical Methods in Natural Language Processing. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu

2021
[28]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith

Shaping capa- bilities with token-level data filtering.Preprint, arXiv:2601.21571. Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith

work page arXiv
[29]

InFind- ings of the Association for Computational Linguis- tics: EMNLP 2023, pages 12310–12324, Singapore

Ge- ographical erasure in language generation. InFind- ings of the Association for Computational Linguis- tics: EMNLP 2023, pages 12310–12324, Singapore. Association for Computational Linguistics. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng

2023
[30]

The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 3407– 3412, Hong Kong, China. Association for Computa- tional Linguistics. Luca Soldaini, Rodney Kinn...

2019
[31]

A Roadmap to Pluralistic Alignment

A roadmap to pluralistic alignment. Preprint, arXiv:2402.05070. Marco Antonio Stranisci and Christian Hardmeier

work page internal anchor Pith review Pith/arXiv arXiv
[32]

InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 2390–2397

Detoxi- fying language models risks marginalizing minority voices. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 2390–2397. 12 A Detailed results on study of epistemic erasure of marginalised identities Tables 4, 5, and 6 report per-system flag rate...

2021

[1] [1]

Emily M Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell

A survey on data selection for language models.arXiv preprint arXiv:2402.16827. Emily M Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell

work page arXiv

[2] [2]

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach

On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM confer- ence on fairness, accountability, and transparency, pages 610–623. Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach

2021

[3] [3]

org/abs/2012.07805

Extracting training data from large language models. Preprint, arXiv:2012.07805. Tommaso Caselli, Valerio Basile, Jelena Mitrovi´c, and Michael Granitzer

work page arXiv 2012

[4] [4]

In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 17–25, Online

HateBERT: Retraining BERT for abusive language detection in English. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 17–25, Online. As- sociation for Computational Linguistics. Aida Davani, Mark Díaz, Dylan Baker, and Vinodkumar Prabhakaran

2021

[5] [5]

InPro- ceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 2007–2021

Disentangling perceptions of of- fensiveness: Cultural and moral correlates. InPro- ceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 2007–2021. Aida Mostafazadeh Davani, Mohammad Atari, Bren- dan Kennedy, and Morteza Dehghani

2024

[6] [6]

Preprint, arXiv:2110.14839

Hate speech classifiers learn human-like social stereotypes. Preprint, arXiv:2110.14839. Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber

work page arXiv

[7] [7]

Automated Hate Speech Detection and the Problem of Offensive Language

Automated hate speech detec- tion and the problem of offensive language.Preprint, arXiv:1703.04009. Jesse Dodge, Maarten Sap, Ana Marasovi ´c, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Online and Punta Cana, Dominican Republic

Documenting large webtext corpora: A case study on the colos- sal clean crawled corpus. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Rebecca Dorn, Lee Kezar, Fred Morstatter, and Kristina Lerman

2021

[9] [9]

InProceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 335–342

Reclaim project: Exploring italian slurs reappropriation with large lan- guage models. InProceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 335–342. 9 Fredo Erxleben, Michael Günther, Markus Krötzsch, Ju- lian Mendez, and Denny Vrandeˇci´c

2024

[10] [10]

InThe Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23,

Introduc- ing wikidata to the linked data web. InThe Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23,

2014

[11] [11]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Red teaming language models to reduce harms: Meth- ods, scaling behaviors, and lessons learned.Preprint, arXiv:2209.07858. Leo Gao, Stella Biderman, Sid Black, Laurence Gold- ing, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The pile: An 800gb dataset of diverse text for language modeling. Preprint, arXiv:2101.00027. Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

André Belchior Gomes and Aysel Sultan

Aegis: Online adaptive ai content safety moderation with ensemble of llm experts.Preprint, arXiv:2404.05993. André Belchior Gomes and Aysel Sultan

work page arXiv

[14] [14]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Llama guard: Llm-based input-output safeguard for human-ai conversations. Preprint, arXiv:2312.06674. Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Eun Seo Jo and Timnit Gebru

Beavertails: To- wards improved safety alignment of llm via a human- preference dataset.Preprint, arXiv:2307.04657. Eun Seo Jo and Timnit Gebru

work page arXiv

[16] [16]

InProceedings of the 2020 con- ference on fairness, accountability, and transparency, pages 306–316

Lessons from archives: Strategies for collecting sociocultural data in machine learning. InProceedings of the 2020 con- ference on fairness, accountability, and transparency, pages 306–316. Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, and Solon Barocas

2020

[17] [17]

Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo V olpin, Frederic A

Generalization through memorization: Nearest neighbor language models.Preprint, arXiv:1911.00172. Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo V olpin, Frederic A. Dreyer, Aleksandar Sht- edritski, and Yuki M. Asano

work page arXiv 1911

[18] [18]

Preprint, arXiv:2102.04130

Bias out-of-the- box: An empirical analysis of intersectional occupa- tional biases in popular generative language models. Preprint, arXiv:2102.04130. Hadas Kotek, Rikker Dockum, and David Sun

work page arXiv

[19] [19]

InProceedings of The ACM Collective Intelligence Conference, CI ’23, page 12–24, New York, NY , USA

Gender bias and stereotypes in large language models. InProceedings of The ACM Collective Intelligence Conference, CI ’23, page 12–24, New York, NY , USA. Association for Computing Machinery. Tahu Kukutai and John Taylor. 2016.Indigenous data sovereignty: Toward an agenda, volume

2016

[20] [20]

Exploring cross-cultural differences in english hate speech annotations: From dataset construction to analysis. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4205–

2024

[21] [21]

InFind- ings of the Association for Computational Linguistics: ACL 2024, pages 3923–3954, Bangkok, Thailand

SALAD-bench: A hierarchical and comprehensive safety benchmark for large language models. InFind- ings of the Association for Computational Linguistics: ACL 2024, pages 3923–3954, Bangkok, Thailand. As- sociation for Computational Linguistics. Shayne Longpre, Gregory Yauney, Emily Reif, Kather- ine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Ke...

2024

[22] [22]

A pretrainer’s guide to train- ing data: Measuring the effects of data age, domain coverage, quality, & toxicity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: Long Pa- pers), pages 3245–3276, Mexico City, Mexico. Asso- ciation for Computational...

2024

[23] [23]

United Nations

Towards safer pretraining: Analyzing and filtering harmful content in webscale datasets for responsible llms.Preprint, arXiv:2505.02009. United Nations. Statistical Office. 1982.Standard coun- try or area codes for statistical use

work page arXiv 1982

[24] [24]

InProceedings of the 2025 ACM Conference on Fair- ness, Accountability, and Transparency, FAccT ’25, page 3094–3105, New York, NY , USA

The root shapes the fruit: On the persistence of gender-exclusive harms in aligned language models. InProceedings of the 2025 ACM Conference on Fair- ness, Accountability, and Transparency, FAccT ’25, page 3094–3105, New York, NY , USA. Association for Computing Machinery. Guilherme Penedo, Hynek Kydlí ˇcek, Loubna Ben al- lal, Anton Lozhkov, Margaret Mit...

2025

[25] [25]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

The fineweb datasets: Decanting the web for the finest text data at scale.Preprint, arXiv:2406.17557. Rida Qadri, Aida M. Davani, Kevin Robinson, and Vinodkumar Prabhakaran

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Organizers of QueerInAI, A Pranav, MaryLena Bleile, Arjun Subramonian, Luca Soldaini, Danica J

Risks of cul- tural erasure in large language models.Preprint, arXiv:2501.01056. Organizers of QueerInAI, A Pranav, MaryLena Bleile, Arjun Subramonian, Luca Soldaini, Danica J. Suther- land, Sabine Weber, and Pan Xu

work page arXiv

[27] [27]

InPro- ceedings of the 2021 Workshop on Widening NLP, Punta Cana, Dominican Republic

How to make virtual conferences queer-friendly: A guide. InPro- ceedings of the 2021 Workshop on Widening NLP, Punta Cana, Dominican Republic. Conference on Empirical Methods in Natural Language Processing. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu

2021

[28] [28]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith

Shaping capa- bilities with token-level data filtering.Preprint, arXiv:2601.21571. Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith

work page arXiv

[29] [29]

InFind- ings of the Association for Computational Linguis- tics: EMNLP 2023, pages 12310–12324, Singapore

Ge- ographical erasure in language generation. InFind- ings of the Association for Computational Linguis- tics: EMNLP 2023, pages 12310–12324, Singapore. Association for Computational Linguistics. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng

2023

[30] [30]

The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 3407– 3412, Hong Kong, China. Association for Computa- tional Linguistics. Luca Soldaini, Rodney Kinn...

2019

[31] [31]

A Roadmap to Pluralistic Alignment

A roadmap to pluralistic alignment. Preprint, arXiv:2402.05070. Marco Antonio Stranisci and Christian Hardmeier

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 2390–2397

Detoxi- fying language models risks marginalizing minority voices. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 2390–2397. 12 A Detailed results on study of epistemic erasure of marginalised identities Tables 4, 5, and 6 report per-system flag rate...

2021