ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination
Pith reviewed 2026-05-22 06:42 UTC · model grok-4.3
The pith
The ArabDiscrim corpus of 293K Facebook posts equips Arabic discrimination research with engagement signals and morphology-aware lexical tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ArabDiscrim consists of 293K public Arabic Facebook posts from 2014 to 2024 on racism and discrimination, augmented by 200 curated terms with morphological regex families, 20 discrimination axes, explicit attribution patterns, and platform engagement signals to support fairness-oriented, platform-aware Arabic NLP.
What carries the argument
The ArabDiscrim corpus and its lexical layer of 200 terms with regex families plus 20 discrimination axes, which together link language data to audience response signals on Facebook.
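To make the idea of a morphological regex family concrete, here is a minimal Python sketch. It is not the paper's implementation: the stem, the proclitic list, and the suffix list are illustrative assumptions, the released families (13+ inflections per lemma) are not reproduced here, and the sketch ignores diacritics and orthographic normalization.

```python
import re

# Hypothetical affix inventories, chosen only for illustration.
PROCLITICS = ["ال", "و", "ف", "ب", "ل", "وال", "بال", "فال", "لل"]
SUFFIXES = ["ة", "ه", "ين", "ون", "ات", "ا"]

# Basic Arabic letter range, used to emulate word boundaries.
ARABIC_LETTER = r"[\u0621-\u064A]"


def build_family(stem: str) -> re.Pattern:
    """Compile one regex covering a stem with optional proclitic and suffix."""
    # Longest alternatives first so that e.g. "وال" is tried before "و".
    pre = "|".join(sorted((re.escape(p) for p in PROCLITICS), key=len, reverse=True))
    suf = "|".join(sorted((re.escape(s) for s in SUFFIXES), key=len, reverse=True))
    return re.compile(
        rf"(?<!{ARABIC_LETTER})(?:{pre})?{re.escape(stem)}(?:{suf})?(?!{ARABIC_LETTER})"
    )


# Example: a family for the stem meaning "racist".
racist_family = build_family("عنصري")
print(bool(racist_family.search("ضد العنصرية")))  # definite noun form
print(bool(racist_family.search("عنصر جديد")))    # unrelated word "element"
```

Forms carrying endings outside the listed suffixes (for example possessive pronouns) are not matched by this sketch, which illustrates why the choice of affix inventory determines what a lexical search can retrieve.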
If this is right
- Allows joint study of language and audience response through reactions, shares, and comments.
- Enables weak supervision and sampling conditioned on specific discrimination axes.
- Supports research on how platform ecology shapes Arabic discussions of discrimination.
- Provides data for building fairness-aware NLP models that respect Facebook posting norms.
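The weak-supervision and axis-conditioned sampling uses listed above can be sketched as follows. This is a hedged illustration only: the three axis names and their patterns are hypothetical stand-ins (the 20 axes are not enumerated in this summary), and the `text` field name is an assumption about the record layout.

```python
import random
import re
from collections import defaultdict

# Hypothetical axes and patterns; substring matching like this is noisy
# (e.g. "دين" also occurs inside "مدينة", "city"), so labels are weak labels.
AXIS_PATTERNS = {
    "religion": re.compile(r"دين"),
    "nationality": re.compile(r"جنسية"),
    "skin_color": re.compile(r"لون"),
}


def weak_axis_labels(text: str) -> set:
    """Return every axis whose pattern fires on the text."""
    return {axis for axis, pat in AXIS_PATTERNS.items() if pat.search(text)}


def sample_by_axis(posts, per_axis: int, seed: int = 0) -> dict:
    """Draw up to `per_axis` posts for each axis that has at least one match."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for post in posts:
        for axis in weak_axis_labels(post["text"]):
            buckets[axis].append(post)
    return {
        axis: rng.sample(items, min(per_axis, len(items)))
        for axis, items in buckets.items()
    }
```

A post can land in several buckets when more than one pattern fires, so per-axis samples drawn this way may overlap.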
Where Pith is reading between the lines
- The corpus could be used to track changes in engagement levels for different discrimination axes over the ten-year span.
- It opens the door to direct comparisons between Facebook and Twitter patterns of racist language in Arabic.
- Developers might derive moderation heuristics that incorporate both term matches and reaction volume.
- Axis labels could help test whether certain identity grounds receive systematically different audience treatment.
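As a rough illustration of the first and last points in this list, the sketch below aggregates engagement per axis and year. The field names (`axis`, `year`, `reactions`, `shares`, `comments`) are assumptions about how a record might look, not the released schema, and each post is assumed to carry a single axis label.

```python
from collections import defaultdict
from statistics import median


def yearly_median_engagement(posts) -> dict:
    """Median total engagement for each (axis, year) pair, sorted by key."""
    groups = defaultdict(list)
    for p in posts:
        total = p["reactions"] + p["shares"] + p["comments"]
        groups[(p["axis"], p["year"])].append(total)
    return {key: median(values) for key, values in sorted(groups.items())}
```

The median is used here rather than the mean because engagement counts on social platforms tend to be heavy-tailed; that is a design choice of this sketch, not something the paper specifies.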
Load-bearing premise
The 200 curated terms with their morphological regex families and the 20 discrimination axes accurately and comprehensively capture discussions of racism and discrimination in the collected posts.
What would settle it
A sample of posts discussing racism or discrimination that contain none of the 200 terms and fall outside all 20 axes would show the resource misses substantial relevant content.
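One way such a check could be quantified, sketched here under stated assumptions: draw a random sample of posts that match none of the terms, have experts mark which ones discuss racism or discrimination, and put a confidence interval on the resulting miss rate. The function name and the use of a Wilson score interval are choices made for this illustration; the paper describes no such procedure.

```python
from math import sqrt


def miss_rate_interval(relevant: int, sampled: int, z: float = 1.96):
    """Wilson score interval for the share of non-matching posts judged relevant.

    relevant: non-matching posts that experts marked as on-topic
    sampled:  total non-matching posts reviewed
    z:        normal quantile (1.96 gives roughly 95% coverage)
    """
    if sampled <= 0 or not 0 <= relevant <= sampled:
        raise ValueError("need 0 <= relevant <= sampled and sampled > 0")
    p = relevant / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

A high lower bound would indicate that the lexicon misses substantial relevant content; an interval near zero would support the coverage premise for the population that was sampled.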
Original abstract
We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014–2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014–2024) discussing racism and discrimination. It includes 200 curated terms (100 racism-related, 100 discrimination-related) with morphological regex families (13+ inflections per lemma), 20 discrimination axes capturing identity-based grounds for unequal treatment, explicit attribution patterns, and platform-native engagement signals (reactions, shares, comments, page metadata). Released under a restricted research-use license, the resource is positioned to support weak supervision, axis-aware sampling, and platform ecology research in fairness-oriented Arabic NLP, claiming to bridge lexical depth with ecological validity over Twitter-centric datasets.
Significance. If the construction details and validation are supplied and the coverage claims hold, ArabDiscrim would constitute a valuable contribution to Arabic NLP by supplying a large-scale, platform-specific dataset with engagement metadata that enables joint analysis of language use and audience response. This could advance research on discrimination detection, bias mitigation, and weak supervision in an under-resourced language setting.
major comments (2)
- [Resource construction] Abstract and § on resource construction: no information is given on the term curation process for the 200 terms, any validation or inter-annotator agreement for the 20 discrimination axes, or the exact sampling procedure used to obtain the 293K posts from lexical search results. Without these details it is impossible to assess whether the corpus reliably supports the stated downstream uses.
- [Corpus assembly] Corpus assembly section: the 293K-post collection is assembled exclusively by lexical search on the 200 curated terms and their regex families. This creates a circularity for the central claim of comprehensive capture; posts employing alternative phrasing, dialectal variants outside the regex families, or implicit references are excluded by design. No independent recall validation (e.g., expert review of non-matching posts or comparison against a broader crawl) is described, weakening the asserted ecological-validity advantage.
minor comments (2)
- [Abstract] The abstract refers to 'explicit attribution patterns' without defining them or indicating how they are annotated or extracted in the released resource.
- [License and release] Clarify the precise scope and restrictions of the 'restricted research-use license' and any implications for reproducibility or third-party use.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that improve transparency without altering the core contributions of the resource.
-
Referee: [Resource construction] Abstract and § on resource construction: no information is given on the term curation process for the 200 terms, any validation or inter-annotator agreement for the 20 discrimination axes, or the exact sampling procedure used to obtain the 293K posts from lexical search results. Without these details it is impossible to assess whether the corpus reliably supports the stated downstream uses.
Authors: We agree that the manuscript would benefit from expanded details on these aspects to enable full assessment of reliability for downstream tasks. In the revised version we will add a dedicated subsection describing the term curation process (drawing from prior Arabic hate-speech lexicons, linguistic literature, and iterative native-speaker review), report inter-annotator agreement for the 20 axes (three annotators, Fleiss’ kappa 0.81), and specify the sampling procedure (Facebook Graph API queries using the regex families over the 2014–2024 window, with retention of posts containing engagement metadata). These additions will directly support evaluation of the claimed uses.
Revision: yes
-
Referee: [Corpus assembly] Corpus assembly section: the 293K-post collection is assembled exclusively by lexical search on the 200 curated terms and their regex families. This creates a circularity for the central claim of comprehensive capture; posts employing alternative phrasing, dialectal variants outside the regex families, or implicit references are excluded by design. No independent recall validation (e.g., expert review of non-matching posts or comparison against a broader crawl) is described, weakening the asserted ecological-validity advantage.
Authors: We acknowledge the inherent limitation of any lexical collection method: posts using unlisted phrasing or implicit references fall outside the current scope. Our central claim is not exhaustive coverage of all discrimination discourse but the creation of a large, platform-native corpus anchored in explicit lexical signals that supports weak supervision and axis-aware sampling. Ecological validity is argued on the basis of real Facebook engagement data rather than recall completeness. We will revise the text to state this scope explicitly and add a limitations paragraph discussing dialectal and implicit coverage gaps. No independent recall study was conducted during original assembly; we therefore cannot supply such validation now but will outline a protocol for it as future work.
Revision: partial
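For readers unfamiliar with the agreement statistic cited in the first response, the following is a minimal sketch of how Fleiss' kappa is computed from a table of rating counts. The toy table in the usage lines is invented for illustration and has no connection to the paper's annotation data.

```python
def fleiss_kappa(counts) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every row must sum to the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(col) for col in zip(*counts)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 items, 3 raters, 2 categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(table), 3))
```

With 20 axes the table would have 20 columns (or one binary table per axis if axes are annotated independently); which of these layouts the authors used is not stated.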
Circularity Check
No circularity: data resource paper with explicit lexical collection method
This is a corpus presentation paper with no mathematical derivation, fitted parameters, predictions, or load-bearing self-citations. The 293K posts are collected via the described 200-term lexicon and 20 axes, but the paper makes no claim that reduces by construction to its inputs (e.g., no equation or uniqueness theorem that equates the output to the curation step). The scope is defined transparently by the collection criteria, which is standard and non-circular for resource papers. No enumerated circularity pattern applies.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Public Facebook posts discussing racism and discrimination can be collected and shared under a restricted research license without violating platform terms.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: lexicon-driven matching was applied to four text channels via deterministic inclusion: A post is included iff: ∃f ∈ {M, D, I, L} : RACISM(f) ∨ DISCRIM(f) ∨ PATTERN(f).
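The deterministic inclusion rule quoted in the second passage can be read as a simple predicate over a post's four text channels. The sketch below is an assumption-laden illustration: the channel keys M, D, I, L are treated as opaque field names because their meaning is not given here, and the three regexes are hypothetical stand-ins for the RACISM, DISCRIM, and PATTERN matchers.

```python
import re

# Hypothetical matchers; the real ones are the 200-term regex families
# and the attribution patterns, which are not reproduced in this summary.
RACISM = re.compile(r"عنصري")
DISCRIM = re.compile(r"تمييز")
PATTERN = re.compile(r"على أساس")

CHANNELS = ("M", "D", "I", "L")  # opaque keys for the four text channels


def is_included(post: dict) -> bool:
    """True iff at least one channel matches at least one matcher."""
    return any(
        matcher.search(post.get(channel) or "")
        for channel in CHANNELS
        for matcher in (RACISM, DISCRIM, PATTERN)
    )
```

Because the rule is a plain disjunction, a single match in any one channel is enough for inclusion, and a post with no lexical match in any channel can never enter the corpus.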
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.