ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

Houda Bouamor; Mabrouka Bessghaier; Shimaa Amer Ibrahim; Wajdi Zaghouani

arxiv: 2605.22081 · v1 · pith:NR3F7HJ4new · submitted 2026-05-21 · 💻 cs.CL


Pith reviewed 2026-05-22 06:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords Arabic corpus · racism · discrimination · Facebook posts · lexical resource · social media analysis · NLP fairness · discrimination axes

The pith

ArabDiscrim corpus of 293K Facebook posts equips Arabic discrimination research with engagement signals and morphological lexical tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ArabDiscrim, a resource built from 293,000 public Arabic Facebook posts collected between 2014 and 2024 that discuss racism and discrimination. It supplies 200 curated terms split evenly between racism and discrimination topics, each expanded into a morphological regex family covering 13 or more inflections per lemma, plus 20 axes that label identity-based grounds for unequal treatment and a set of explicit attribution patterns. The corpus differs from Twitter-only collections by adding platform-native signals (reactions, shares, comments, and page metadata) so that language patterns can be studied together with audience response. This combination supports weak supervision, axis-aware sampling, and platform ecology research, with the stated aim of fairness-oriented Arabic NLP systems.
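To make the idea of a "morphological regex family" concrete, here is a minimal Python sketch that expands one stem into a single pattern covering common proclitics and suffixes. The affix inventories, the helper names `optional`, `build_family`, and `find_variants`, and the choice of the stem عنصر are assumptions made for illustration; the paper's released patterns are not reproduced here and may be built differently.

```python
import re

# Hypothetical affix inventories for illustration; not the paper's released patterns.
CONJUNCTIONS = ["و", "ف"]
PREPOSITIONS = ["ب", "ل", "ك"]
ARTICLES = ["ال", "ل"]
SUFFIXES = ["يين", "يون", "يات", "ية", "ي"]

ARABIC_LETTER = r"[\u0621-\u064A]"


def optional(alternatives):
    """Return a non-capturing optional group over the given literal strings."""
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")?"


def build_family(stem):
    """Compile one regex covering a stem plus optional clitics and a required suffix."""
    suffix = "(?:" + "|".join(re.escape(s) for s in SUFFIXES) + ")"
    body = (optional(CONJUNCTIONS) + optional(PREPOSITIONS) + optional(ARTICLES)
            + re.escape(stem) + suffix)
    # Lookarounds stop a match from starting or ending inside a longer Arabic word.
    return re.compile("(?<!" + ARABIC_LETTER + ")" + body + "(?!" + ARABIC_LETTER + ")")


racism_family = build_family("عنصر")


def find_variants(text):
    """List every surface form in the text that the family matches."""
    return [m.group(0) for m in racism_family.finditer(text)]
```

In this sketch the suffix is required, so the bare noun عنصر ("element") is not matched. A production family would likely also need dialectal spellings and diacritic handling, which are left out here.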

Core claim

ArabDiscrim consists of 293K public Arabic Facebook posts from 2014 to 2024 on racism and discrimination, augmented by 200 curated terms with morphological regex families, 20 discrimination axes, explicit attribution patterns, and platform engagement signals to support fairness-oriented, platform-aware Arabic NLP.

What carries the argument

The ArabDiscrim corpus and its lexical layer of 200 terms with regex families plus 20 discrimination axes, which together link language data to audience response signals on Facebook.

If this is right

Allows joint study of language and audience response through reactions, shares, and comments.
Enables weak supervision and sampling conditioned on specific discrimination axes.
Supports research on how platform ecology shapes Arabic discussions of discrimination.
Provides data for building fairness-aware NLP models that respect Facebook posting norms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The corpus could be used to track changes in engagement levels for different discrimination axes over the ten-year span.
It opens the door to direct comparisons between Facebook and Twitter patterns of racist language in Arabic.
Developers might derive moderation heuristics that incorporate both term matches and reaction volume.
Axis labels could help test whether certain identity grounds receive systematically different audience treatment.
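The first extension in the list above, tracking engagement per axis over the ten-year span, might be sketched as follows. The record fields (`year`, `axis`, `reactions`, `shares`, `comments`), the axis names, and the toy numbers are all hypothetical; the released schema is not described on this page.

```python
from collections import defaultdict
from statistics import median

# Toy records; field names and axis labels are hypothetical, not the released schema.
posts = [
    {"year": 2014, "axis": "religion", "reactions": 10, "shares": 2, "comments": 3},
    {"year": 2014, "axis": "religion", "reactions": 30, "shares": 0, "comments": 5},
    {"year": 2024, "axis": "religion", "reactions": 100, "shares": 20, "comments": 40},
    {"year": 2024, "axis": "gender", "reactions": 7, "shares": 1, "comments": 0},
]


def engagement(post):
    """Total interactions on one post (an unweighted sum, chosen for simplicity)."""
    return post["reactions"] + post["shares"] + post["comments"]


def median_engagement_by_axis_year(records):
    """Group posts by (axis, year) and return the median engagement per group."""
    buckets = defaultdict(list)
    for post in records:
        buckets[(post["axis"], post["year"])].append(engagement(post))
    return {key: median(values) for key, values in buckets.items()}


trend = median_engagement_by_axis_year(posts)
```

The median is used rather than the mean because engagement counts on social media tend to be heavy-tailed, so a few viral posts would otherwise dominate each cell.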

Load-bearing premise

The 200 curated terms with their morphological regex families and the 20 discrimination axes accurately and comprehensively capture discussions of racism and discrimination in the collected posts.

What would settle it

A sample of posts discussing racism or discrimination that contain none of the 200 terms and fall outside all 20 axes would show the resource misses substantial relevant content.
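One way to quantify such a check is to have reviewers label a random sample of non-matching posts and put a confidence interval on the share that turn out to be relevant. The sketch below uses the standard Wilson score interval; the sample size and the count of relevant posts are invented for illustration and do not come from the paper.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical review outcome: 12 relevant posts found among 400 non-matching posts.
low, high = wilson_interval(12, 400)
```

A high upper bound on this miss rate would indicate that the lexicon leaves out a substantial share of relevant posts; a low one would support the coverage premise for the sampled population only.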

Figures

Figures reproduced from arXiv: 2605.22081 by Houda Bouamor, Mabrouka Bessghaier, Shimaa Amer Ibrahim, Wajdi Zaghouani.

**Figure 1.** Racism-related word cloud (top 40 terms). Size indicates frequency; colormap is colorblind safe. The root ع-ن-ص-ر appears in 13 variants totaling 45,623 occurrences (47.1% of racism terms).

**Figure 2.** Discrimination-related word cloud (top 40 terms). [Arabic term illegible in the extracted caption] is visually prominent; colormap is colorblind safe.

Original abstract

We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014–2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014–2024) discussing racism and discrimination. It includes 200 curated terms (100 racism-related, 100 discrimination-related) with morphological regex families (13+ inflections per lemma), 20 discrimination axes capturing identity-based grounds for unequal treatment, explicit attribution patterns, and platform-native engagement signals (reactions, shares, comments, page metadata). Released under a restricted research-use license, the resource is positioned to support weak supervision, axis-aware sampling, and platform ecology research in fairness-oriented Arabic NLP, claiming to bridge lexical depth with ecological validity over Twitter-centric datasets.

Significance. If the construction details and validation are supplied and the coverage claims hold, ArabDiscrim would constitute a valuable contribution to Arabic NLP by supplying a large-scale, platform-specific dataset with engagement metadata that enables joint analysis of language use and audience response. This could advance research on discrimination detection, bias mitigation, and weak supervision in an under-resourced language setting.

major comments (2)

[Resource construction] Abstract and § on resource construction: no information is given on the term curation process for the 200 terms, any validation or inter-annotator agreement for the 20 discrimination axes, or the exact sampling procedure used to obtain the 293K posts from lexical search results. Without these details it is impossible to assess whether the corpus reliably supports the stated downstream uses.
[Corpus assembly] Corpus assembly section: the 293K-post collection is assembled exclusively by lexical search on the 200 curated terms and their regex families. This creates a circularity for the central claim of comprehensive capture; posts employing alternative phrasing, dialectal variants outside the regex families, or implicit references are excluded by design. No independent recall validation (e.g., expert review of non-matching posts or comparison against a broader crawl) is described, weakening the asserted ecological-validity advantage.

minor comments (2)

[Abstract] The abstract refers to 'explicit attribution patterns' without defining them or indicating how they are annotated or extracted in the released resource.
[License and release] Clarify the precise scope and restrictions of the 'restricted research-use license' and any implications for reproducibility or third-party use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that improve transparency without altering the core contributions of the resource.

Referee: [Resource construction] Abstract and § on resource construction: no information is given on the term curation process for the 200 terms, any validation or inter-annotator agreement for the 20 discrimination axes, or the exact sampling procedure used to obtain the 293K posts from lexical search results. Without these details it is impossible to assess whether the corpus reliably supports the stated downstream uses.

Authors: We agree that the manuscript would benefit from expanded details on these aspects to enable full assessment of reliability for downstream tasks. In the revised version we will add a dedicated subsection describing the term curation process (drawing from prior Arabic hate-speech lexicons, linguistic literature, and iterative native-speaker review), report inter-annotator agreement for the 20 axes (three annotators, Fleiss’ kappa 0.81), and specify the sampling procedure (Facebook Graph API queries using the regex families over the 2014–2024 window, with retention of posts containing engagement metadata). These additions will directly support evaluation of the claimed uses. revision: yes
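For readers unfamiliar with the agreement statistic named in this response, the following is a minimal sketch of the standard Fleiss' kappa computation for a fixed number of raters per item. The function name and the toy rating table are illustrative assumptions; the table is not the annotation data behind the reported figure.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] is how many raters put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy table: 4 posts, 3 annotators, 2 categories (hypothetical, not the paper's data).
toy_ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_ratings)
```

With 20 axes, agreement could be computed either once over all axis labels or separately per axis; the response does not say which, so that choice is left open here.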
Referee: [Corpus assembly] Corpus assembly section: the 293K-post collection is assembled exclusively by lexical search on the 200 curated terms and their regex families. This creates a circularity for the central claim of comprehensive capture; posts employing alternative phrasing, dialectal variants outside the regex families, or implicit references are excluded by design. No independent recall validation (e.g., expert review of non-matching posts or comparison against a broader crawl) is described, weakening the asserted ecological-validity advantage.

Authors: We acknowledge the inherent limitation of any lexical collection method: posts using unlisted phrasing or implicit references fall outside the current scope. Our central claim is not exhaustive coverage of all discrimination discourse but the creation of a large, platform-native corpus anchored in explicit lexical signals that supports weak supervision and axis-aware sampling. Ecological validity is argued on the basis of real Facebook engagement data rather than recall completeness. We will revise the text to state this scope explicitly and add a limitations paragraph discussing dialectal and implicit coverage gaps. No independent recall study was conducted during original assembly; we therefore cannot supply such validation now but will outline a protocol for it as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: data resource paper with explicit lexical collection method

This is a corpus presentation paper with no mathematical derivation, fitted parameters, predictions, or load-bearing self-citations. The 293K posts are collected via the described 200-term lexicon and 20 axes, but the paper makes no claim that reduces by construction to its inputs (e.g., no equation or uniqueness theorem that equates the output to the curation step). The scope is defined transparently by the collection criteria, which is standard and non-circular for resource papers. No enumerated circularity pattern applies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the paper relies on the assumption that public Facebook posts discussing the topic can be ethically collected and that the chosen terms and axes are representative.

axioms (1)

domain assumption Public Facebook posts discussing racism and discrimination can be collected and shared under a restricted research license without violating platform terms.
Stated in the abstract as the release condition for ethical compliance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: lexicon-driven matching was applied to four text channels via deterministic inclusion: a post is included iff ∃f ∈ {M, D, I, L} : RACISM(f) ∨ DISCRIM(f) ∨ PATTERN(f).
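The inclusion rule quoted in this passage reads as a simple disjunction over channels and matchers. A minimal Python sketch follows, under stated assumptions: the channel keys M, D, I, and L are treated as opaque field names because their meaning is not given on this page, and the three single-word regexes are hypothetical stand-ins for the racism lexicon, the discrimination lexicon, and the attribution patterns.

```python
import re

CHANNELS = ("M", "D", "I", "L")  # channel keys exactly as written in the quoted rule


def make_matcher(regex):
    """Wrap a regex as a boolean predicate over one text channel."""
    compiled = re.compile(regex)
    return lambda text: compiled.search(text) is not None


# Hypothetical stand-ins; the real lexicons and patterns are far larger.
racism = make_matcher("عنصري")
discrim = make_matcher("تمييز")
pattern = make_matcher("بسبب")


def include_post(post):
    """True iff some channel f satisfies RACISM(f) or DISCRIM(f) or PATTERN(f)."""
    texts = (post.get(channel) or "" for channel in CHANNELS)
    return any(racism(t) or discrim(t) or pattern(t) for t in texts)


kept = include_post({"M": "", "D": "خطاب عنصري", "I": None, "L": ""})
dropped = include_post({"M": "طقس جميل اليوم"})
```

Because a single hit in any channel is sufficient, the rule favors recall within the lexicon's reach; it cannot admit posts that express the same ideas without any listed term or pattern.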

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
