Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Abdellah El Mekki; AbdelRahim A. Elmadany; Abdulaziz Hafiz; Abdurrahman Gerrio; Ahlam Bashiti; Alaa Aoun; Alshima Alkhazimi; Al-Yas Al-Ghafri; Amir Azad Adli Alkathiri; Anas Belfathi

arXiv: 2601.13099 · v2 · submitted 2026-01-19 · cs.CL


Abdellah El Mekki, Samar M. Magdy, Houdaifa Atou, Ruwa AbuHweidi, Baraah Qawasmeh, Omer Nacar, Thikra Al-hibiri, Razan Saadie, Hamzah Alsayadi, Nadia Ghezaiel Hammouda, Alshima Alkhazimi, Aya Hamod, Al-Yas Al-Ghafri, Wesam El-Sayed, Asila Al sharji, Mohamad Ballout, Anas Belfathi, Karim Ghaddar, Serry Sibaee, Alaa Aoun, Areej Asiri, Lina Abureesh, Ahlam Bashiti, Majdal Yousef, Abdulaziz Hafiz, Yehdih Mohamed, Emira Hamedtou, Brakehe Brahim, Rahaf Alhamouri, Youssef Nafea, Aya El Aatar, Walid Al-Dhabyani, Emhemed Hamed, Sara Shatnawi, Fakhraddin Alwajih, Khalid Elkhidir, Ashwag Alasmari, Abdurrahman Gerrio, Omar Alshahri, AbdelRahim A. Elmadany, Ismail Berrada, Amir Azad Adli Alkathiri, Fadi A Zaraket, Mustafa Jarrar, Yahya Mohamed El Hadj, Hassan Alhuzali, Muhammad Abdul-Mageed


Pith reviewed 2026-05-16 12:58 UTC · model grok-4.3


keywords: dialectal Arabic · machine translation · dataset · LLM evaluation · Arabic dialects · community data · gender annotation · multi-domain


The pith

Alexandria supplies 107K turns of English-dialectal Arabic conversations from 13 countries and 11 domains to train and benchmark machine translation for real-world Arabic use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Alexandria as a community-collected parallel dataset that pairs English with dialectal Arabic across multiple turns. Contributions carry city-of-origin labels and gender-of-speaker metadata, moving past broad regional categories to finer local varieties. The resource is intended both to supply training data for MT systems and LLMs and to serve as an evaluation benchmark that reveals where current models still struggle with authentic dialect input. Because daily Arabic communication is overwhelmingly dialectal rather than Modern Standard Arabic, the dataset directly targets a practical gap that limits model usefulness for millions of speakers.

Core claim

Alexandria consists of 107K English-dialectal Arabic turns drawn from multi-turn conversational scenarios in 11 high-impact domains, each turn annotated with city-of-origin metadata and speaker-addressee gender configurations, providing both training material and a fine-grained testbed for assessing how well MT systems and LLMs handle sub-dialectal variation across 13 Arab countries.
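To make the shape of one annotated turn concrete, here is a minimal sketch of a record type. Every field name, the gender label set, and the example values are assumptions made for illustration; the released files in the repository may use a different schema, and the example sentence is invented rather than drawn from the dataset.

```python
from dataclasses import dataclass
from typing import Literal

# Assumed label set; the paper's actual gender encoding may differ.
Gender = Literal["male", "female"]


@dataclass(frozen=True)
class Turn:
    """One parallel turn of a multi-turn conversation (hypothetical schema)."""
    conversation_id: str      # groups the turns of one scenario
    turn_index: int           # position of the turn within the conversation
    domain: str               # e.g. "health", "education", "agriculture"
    country: str
    city: str                 # city-of-origin metadata
    speaker_gender: Gender
    addressee_gender: Gender
    english: str
    dialectal_arabic: str

    @property
    def gender_config(self) -> str:
        """Speaker-addressee configuration as a single label."""
        return f"{self.speaker_gender}->{self.addressee_gender}"


# Invented example, not taken from Alexandria.
example = Turn(
    conversation_id="conv-0001",
    turn_index=0,
    domain="health",
    country="Egypt",
    city="Alexandria",
    speaker_gender="female",
    addressee_gender="male",
    english="How are you doing today?",
    dialectal_arabic="عامل إيه النهارده؟",
)
```

Keeping the conversation identifier and turn index on every record is one way to preserve the conversational structure when turns are filtered by city or gender configuration.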

What carries the argument

City-of-origin metadata attached to each translation, which distinguishes local sub-dialects beyond coarse country or regional tags while preserving conversational structure and gender-conditioned variation.

If this is right

MT and LLM systems trained on Alexandria can be evaluated for performance on health, education, and agriculture dialogues in specific city varieties.
Gender annotations allow measurement of how well models preserve or distort gender-conditioned dialect features.
The same data can be used to adapt existing Arabic-aware models toward better handling of sub-dialectal input.
Release of prompts and guidelines enables replication for other diglossic languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

City-level granularity could support downstream applications such as localized voice assistants or region-specific medical translation tools.
If the dataset's conversational format proves effective, similar multi-turn collections may become standard for evaluating dialogue systems in other low-resource language varieties.
Persistent benchmark failures highlighted by Alexandria may accelerate development of dialect-specific tokenization or adaptation techniques.

Load-bearing premise

Human translators recruited from each city produce translations that faithfully reflect the local spoken dialect without introducing noticeable standardization or selection bias.

What would settle it

An LLM that achieves high automatic and human scores on Alexandria's sub-dialect test splits without any training or fine-tuning on the dataset itself would indicate that the claimed persistent challenges do not require this resource.

Original abstract

Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves as both a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Alexandria is a practical new dataset for dialectal Arabic MT with city-level detail and broad coverage, but the evaluation claims rest on thin validation details.


Alexandria is a new multi-domain dataset for dialectal Arabic machine translation that stands out for its city-level metadata across 13 countries and 11 domains, plus multi-turn conversations annotated for gender. That's the key point to take away if you're working in this area. The authors have done solid work assembling 107K turns through community contributions and making the whole thing public along with the creation prompts, translation guidelines, and evaluation code. The shift to finer-grained city origins instead of broad regions is a clear improvement over previous resources, and including gender-conditioned annotations opens up interesting angles for studying variation.

Where it falls short is in the strength of the supporting evidence. The abstract talks about automatic and human evaluations that benchmark current LLMs and highlight ongoing challenges, but without quantitative results or details on how they controlled for translation quality and dialect fidelity, it's hard to judge how rigorous the benchmark really is. The stress on community-driven data is good for scale, but without reported checks like inter-annotator agreement on dialect markers or protocols to minimize MSA interference, there's room for artifacts that could affect what the evaluations actually measure.

Overall, this is the kind of resource that researchers focused on inclusive NLP for Arabic speakers will want to look at and potentially build on. It fills a gap in high-impact domains like health and agriculture. I would recommend sending it for peer review. The dataset contribution is substantial enough to warrant referee input on tightening the evaluation section and addressing potential biases in the collection process.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Alexandria, a community-driven dataset of 107K parallel English-Dialectal Arabic multi-turn conversational turns spanning 13 Arab countries and 11 domains (health, education, agriculture, etc.). It supplies city-of-origin metadata for sub-dialect granularity and speaker-addressee gender annotations, positioning the resource as both a training corpus and a benchmark for MT systems and Arabic-aware LLMs, with automatic and human evaluations that highlight persistent translation challenges.

Significance. If the data quality and authenticity claims hold, the dataset would be a valuable addition to dialectal Arabic resources, enabling more culturally inclusive LLMs and MT for real-world diglossic use cases. The public release of prompts, guidelines, and evaluation code, combined with the scale and metadata granularity, strengthens its potential impact over prior coarse-grained dialect corpora.

major comments (2)

[Abstract] The claim that city-of-origin metadata yields 'authentic local varieties beyond coarse regional labels' and enables a 'rigorous benchmark' is load-bearing for the central contribution, yet the construction description supplies no verification of translator nativeness, inter-annotator agreement on sub-dialect markers, or explicit revision protocols to limit MSA interference; without these, automatic/human evaluations risk conflating model limitations with possible data artifacts or dialect leveling.
[Evaluation, as referenced in the abstract] The statement that evaluations 'expose significant persistent challenges' is central to the benchmark utility, but the abstract provides no quantitative metrics (e.g., BLEU, COMET, or human adequacy scores), baselines, or error analysis; this absence prevents verification of the challenges' severity and scope.
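For readers unfamiliar with the automatic metrics named in the second comment, the following is a deliberately simplified BLEU sketch: whitespace tokenization, one reference per hypothesis, no smoothing. It is not the paper's evaluation code and will not reproduce scores from standard tools such as sacreBLEU; the function names are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 1.0, and one sharing no words with it scores 0.0. Whitespace tokenization is a weak fit for morphologically rich dialectal Arabic, which is one reason a learned metric such as COMET and human adequacy judgments are usually reported alongside BLEU.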

minor comments (2)

[Abstract] The repository URL is provided, but the manuscript should explicitly confirm that all promised assets (prompts, guidelines, code) are included and versioned for reproducibility.
A table or figure breaking down the 107K turns by country/domain would clarify coverage balance and support the multi-domain claim.
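The breakdown requested in the last minor comment could be produced along these lines. This is a sketch under assumptions: the `country` and `domain` keys are hypothetical field names, and the toy rows are placeholders, not counts from Alexandria.

```python
from collections import Counter


def coverage_table(turns):
    """Count turns per (country, domain) and return a header plus rows.

    `turns` is an iterable of dicts with hypothetical keys
    "country" and "domain".
    """
    counts = Counter((t["country"], t["domain"]) for t in turns)
    countries = sorted({country for country, _ in counts})
    domains = sorted({domain for _, domain in counts})
    header = ["country"] + domains + ["total"]
    rows = []
    for country in countries:
        # Counter returns 0 for (country, domain) pairs that never occur.
        cells = [counts[(country, domain)] for domain in domains]
        rows.append([country] + cells + [sum(cells)])
    return header, rows


# Placeholder data for illustration only.
toy_turns = [
    {"country": "CountryA", "domain": "health"},
    {"country": "CountryA", "domain": "health"},
    {"country": "CountryA", "domain": "education"},
    {"country": "CountryB", "domain": "agriculture"},
]
header, rows = coverage_table(toy_turns)
```

Empty cells show up as explicit zeros, which makes gaps in country-by-domain coverage visible at a glance.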

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve clarity and substantiation of our claims.


Referee: [Abstract] The claim that city-of-origin metadata yields 'authentic local varieties beyond coarse regional labels' and enables a 'rigorous benchmark' is load-bearing for the central contribution, yet the construction description supplies no verification of translator nativeness, inter-annotator agreement on sub-dialect markers, or explicit revision protocols to limit MSA interference; without these, automatic/human evaluations risk conflating model limitations with possible data artifacts or dialect leveling.

Authors: We acknowledge the referee's concern that the abstract and construction description do not sufficiently detail verification steps. The full manuscript (Section 3) describes recruitment of native speakers from the target cities through community networks, along with multi-stage review processes to reduce MSA influence. To directly address this, we will add an explicit quality-control subsection that reports translator nativeness verification methods, the revision protocols applied, and any inter-annotator agreement figures obtained for sub-dialect markers. This addition will strengthen the authenticity claims without altering the dataset itself.
revision: yes
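The response does not say which agreement statistic would be reported. As one common option, Cohen's kappa for two annotators labeling the same items can be sketched as follows; the label names in the usage lines are hypothetical and the values are placeholders, not results.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        # Kappa is undefined here; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Placeholder judgments: does a turn read as local dialect or as MSA?
annotator_1 = ["dialect", "dialect", "msa", "msa"]
annotator_2 = ["dialect", "dialect", "msa", "dialect"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5 on this toy input
```

Kappa corrects raw agreement for chance, which matters when one label (for example "dialect") dominates. With more than two annotators per item, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual choice instead.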
Referee: [Evaluation, as referenced in the abstract] The statement that evaluations 'expose significant persistent challenges' is central to the benchmark utility, but the abstract provides no quantitative metrics (e.g., BLEU, COMET, or human adequacy scores), baselines, or error analysis; this absence prevents verification of the challenges' severity and scope.

Authors: We agree that the abstract should contain concrete quantitative support for the claim of persistent challenges. The full paper (Section 5) reports automatic metrics (BLEU, COMET), human adequacy/fluency scores, baseline comparisons, and error analysis. We will revise the abstract to include key numerical results (e.g., average COMET scores by dialect and domain) and a brief reference to the observed error patterns, thereby making the benchmark contribution verifiable from the abstract alone.
revision: yes
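The per-dialect, per-domain averages mentioned in this response amount to a grouped mean over segment-level scores. A minimal sketch follows, assuming each scored segment is a dict with hypothetical keys `dialect`, `domain`, and `score`; the numbers are placeholders, not results from the paper.

```python
from collections import defaultdict


def mean_scores(records, keys=("dialect", "domain")):
    """Average segment-level scores within each group defined by `keys`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = tuple(record[key] for key in keys)
        sums[group] += record["score"]
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Placeholder scores for illustration only.
toy_records = [
    {"dialect": "dialect_a", "domain": "health", "score": 0.5},
    {"dialect": "dialect_a", "domain": "health", "score": 1.0},
    {"dialect": "dialect_b", "domain": "education", "score": 0.25},
]
averages = mean_scores(toy_records)
```

Passing a different `keys` tuple, such as a city or gender-configuration field, gives the same summary along other metadata dimensions. Reporting group sizes next to the means helps readers judge how stable each average is.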

Circularity Check

0 steps flagged

Dataset release paper exhibits no circularity in claimed derivations


The paper introduces the Alexandria dataset as a community-driven resource for dialectal Arabic MT, with claims centered on its scale, metadata granularity, and utility as a benchmark. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. All load-bearing elements (dataset construction, annotations, public release) are external to any self-referential logic and do not reduce to inputs by definition. Self-citations, if present, are not invoked to justify uniqueness theorems or ansatzes. This is a standard data contribution paper whose central claim rests on the dataset itself rather than any internal derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data resource paper whose central contribution is the curated dataset rather than a derived theoretical or mathematical claim. No free parameters, axioms, or invented entities are required.

pith-pipeline@v0.9.0 · 5809 in / 1010 out tokens · 32483 ms · 2026-05-16T12:58:12.450218+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: "Alexandria serves as both a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs) in translating across diverse Arabic dialects"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
cs.CL · 2026-04 · unverdicted · novelty 7.0

LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.