SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Khalid Yusuf Dahir

arXiv: 2605.18232 · v1 · pith:GUVSMPMUnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.IR


Pith reviewed 2026-05-20 10:30 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords: Somali web corpus · data filtering pipeline · BPE tokenizer · language identification · low-resource languages · pretraining data · corpus quality


The pith

SomaliWeb v1 provides the first dedicated quality-filtered Somali corpus along with a matched tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Somali has around 25 million speakers but until now lacked a documented, dedicated public pretraining corpus. The paper constructs SomaliWeb v1 by applying a six-stage pipeline to three sources (HPLT v2, CC100, and Somali Wikipedia), resulting in 819,322 documents and roughly 303 million tokens. It also releases a BPE-16K tokenizer and the first public side-by-side benchmark of three production language identifiers on Somali. This matters because it supports language model training for Somali and exposes measured quality problems in an existing multilingual dataset, HPLT v2: byte-exact duplicates, near-duplicates, and encoding errors (mojibake).

Core claim

The central discovery is the creation of SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents totaling about 303 million tokens, built through a reproducible six-stage pipeline from three sources. The authors also provide a matched BPE-16K tokenizer that produces 40.2% fewer tokens than GPT-4's cl100k_base on Somali text from FLORES-200, and establish the first public benchmark comparing three production language identifiers on Somali.
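The 40.2% figure is a relative reduction in token count on the same text. A minimal sketch of that arithmetic follows; the function name and the example counts are illustrative assumptions, not values from the paper.

```python
def token_reduction(baseline_tokens: int, candidate_tokens: int) -> float:
    """Relative reduction in token count versus a baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# Hypothetical counts chosen only to illustrate the arithmetic.
print(round(token_reduction(1000, 598), 1))
```

In practice the two counts would come from encoding the same evaluation sentences with each tokenizer and summing the lengths.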

What carries the argument

The six-stage filtering pipeline that removes low-quality and non-Somali content from web sources while preserving useful Somali text.

If this is right

Existing multilingual corpora, HPLT v2 in particular, contain measurable quality defects: byte-exact duplicates, near-duplicates, and mojibake.
The matched BPE-16K tokenizer emits 40.2% fewer tokens on Somali text than GPT-4's tokenizer.
The public language-identification benchmark enables direct evaluation of production systems on Somali.
Future language models can be pretrained on this dedicated corpus for improved Somali performance.
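Mojibake, mentioned in the first consequence above, is text that was decoded with the wrong character encoding. The paper pins ftfy for this; the sketch below does not use ftfy. It is a naive standard-library heuristic (an assumption for illustration) that flags text which changes when re-encoded as cp1252 and decoded as UTF-8.

```python
def looks_like_mojibake(text: str) -> bool:
    """Naive check: does a cp1252 -> UTF-8 round trip repair the text?"""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip is impossible, so this is not the cp1252/UTF-8 mix-up.
        return False
    return repaired != text


clean = "wax\u2019aan"  # contains a typographic apostrophe
broken = clean.encode("utf-8").decode("cp1252")  # simulate the encoding mix-up
print(looks_like_mojibake(broken), looks_like_mojibake(clean))
```

This heuristic covers only one encoding confusion and will miss others; a dedicated library handles many more cases.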

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar quality-filtering pipelines could be replicated for other low-resource languages lacking dedicated corpora.
More efficient tokenization for Somali may improve the performance of multilingual models on Somali-related tasks.
The released benchmark may spur development of more accurate language identifiers for Somali and related languages.

Load-bearing premise

The six-stage filtering pipeline removes low-quality or non-Somali text while retaining representative and useful Somali content without introducing systematic bias or excessive data loss.

What would settle it

A manual audit of a sample of the SomaliWeb v1 corpus revealing a large fraction of non-Somali or poor-quality documents would disprove the effectiveness of the filtering pipeline.

Figures

Figures reproduced from arXiv: 2605.18232 by Khalid Yusuf Dahir.

**Figure 1.** SomaliWeb v1: the six-stage corpus construction pipeline with per-phase retention. See §5 for equations and §7 for retention tables.

Surrounding text:
• C3 (Measurement). Three concrete, quantified quality defects in HPLT v2's "cleaned" Somali distribution (17.3% byte-duplicates, 56.1% mojibake-bearing documents, 10.7% near-duplicates) with per-phase retention and per-source breakdowns.
• C4 (Tool). A BPE-16K tokenizer tra…

**Figure 2.** LSH S-curves (7) for three (b, r) configurations. Our choice of (16, 4) is balanced around s* ≈ 0.50 with near-certain capture at τ = 0.80.

Surrounding text: Coverage. $\mathrm{cov}(d) = \dfrac{|G^{(5)}(d) \cap G^{(5)}_{\mathrm{seed}}|}{|G^{(5)}(d)|}$ (8). We drop the bottom 15% by coverage; empirically this corresponds to threshold cov ≥ 0.9029. Retention. 819,322 / 963,908 = 85.00%. Per-source drop rate reveals source-quality asymmetry: Wikipedia 20.61%, HPLT 18.…
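The S-curve in Figure 2 can be sketched numerically. This assumes equation (7) is the standard banding formula P(s) = 1 - (1 - s^r)^b from Mining of Massive Datasets, and that (16, 4) denotes (b, r) as the caption's ordering suggests; both are assumptions, since the equation itself is not reproduced here.

```python
def lsh_capture_probability(s: float, b: int, r: int) -> float:
    """Probability that two documents with Jaccard similarity s collide in at least one of b bands of r rows."""
    return 1.0 - (1.0 - s ** r) ** b


def lsh_threshold(b: int, r: int) -> float:
    """Common approximation (1/b)^(1/r) of the similarity where the S-curve rises steeply."""
    return (1.0 / b) ** (1.0 / r)


b, r = 16, 4  # assumed reading of the caption's (16, 4)
print(lsh_threshold(b, r))
print(lsh_capture_probability(0.80, b, r))
```

Under these assumptions the approximate threshold is 0.5 and the capture probability at similarity 0.80 is above 0.999, which agrees with the caption's description.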

**Figure 3.** Per-source retention across the six pipeline phases.

Surrounding text: Software. Python 3.10. Pinned versions: numpy<2, ftfy==6.1.3, langdetect==1.0.9, tokenizers==0.15.2, datasets==2.19.0, zstandard==0.22.0, tiktoken==0.7.0. Full requirements.txt in Appendix B. Determinism. All seeds fixed at 0: random.seed(0), np.random.seed(0), DetectorFactory.seed = 0, tokenizer trainer shuffle_seed=0. With pinned versions the full pipe…
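The determinism settings quoted with Figure 3 can be collected into one helper. This is a sketch, not the author's code: `seed_everything` is a hypothetical name, numpy and langdetect are seeded only if they are installed, and the tokenizer trainer's `shuffle_seed` is omitted because its exact API is not shown in the excerpt.

```python
import random


def seed_everything(seed: int = 0) -> None:
    """Fix the seeds named in the excerpt; optional libraries are skipped if missing."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        from langdetect import DetectorFactory
        DetectorFactory.seed = seed
    except ImportError:
        pass


seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)
```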

**Figure 4.** Somali LID confusion matrices on the 200-row test set (40 per class). langdetect dominates on Somali recall.

**Figure 5.** Tokenizer fertility distribution on FLORES-200 Somali devtest (1,012 sentences). SomaliWeb v1 ties HPLT-raw at 30% smaller training corpus, and emits 40.2% fewer tokens than GPT-4's cl100k_base.

Surrounding text: Source composition. Qualitative audit. From a random sample of 20 release documents, all 20 were judged by a native Somali speaker as recognizable, well-formed Somali text suitable for pretraining. Full rubric-base…

**Figure 6.** Char-5-gram coverage distribution (Phase 5) with τ = 0.9029 drop threshold.

Surrounding text:
5. Standard Somali only. No Maay Maay coverage; pipeline would need dialect-aware LID adjustments.
6. No Somali-aware PII scrub. Empirical scan: ∼7.9% of release documents contain at least one email-shaped string. Presidio does not cover Somali. Consumer-facing downstream uses must apply additional PII filtering.
7. Source coverage…
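The Phase 5 coverage score in Figure 6 is the share of a document's distinct character 5-grams that also appear in a seed set. The sketch below illustrates that ratio; how the paper builds the seed set is not stated in this excerpt, so the toy seed text is an assumption, as are the function names.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Distinct character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def coverage(doc: str, seed_ngrams: set[str], n: int = 5) -> float:
    """Fraction of the document's distinct n-grams found in the seed set."""
    grams = char_ngrams(doc, n)
    if not grams:
        return 0.0  # documents shorter than n characters have no n-grams
    return len(grams & seed_ngrams) / len(grams)


THRESHOLD = 0.9029  # drop threshold reported in the figure caption

seed = char_ngrams("abcdefg")  # toy seed set, for illustration only
for doc in ("abcdef", "abcdexyz"):
    score = coverage(doc, seed)
    print(doc, score, score >= THRESHOLD)
```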

Original abstract

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
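The duplicate measurements in the abstract rest on two simple quantities: exact byte equality and Jaccard similarity between documents. A minimal sketch of both follows. The SHA-256 hashing, the word 3-shingles, and the way the duplicate rate is counted are illustrative assumptions; the paper's exact definitions are not given in the abstract.

```python
import hashlib


def byte_duplicate_rate(docs: list[str]) -> float:
    """Fraction of documents whose UTF-8 bytes exactly match an earlier document."""
    seen: set[bytes] = set()
    duplicates = 0
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(docs) if docs else 0.0


def word_shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Distinct runs of k consecutive whitespace-separated words."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(byte_duplicate_rate(["a", "b", "a", "a"]))
print(jaccard(word_shingles("a b c d"), word_shingles("b c d e")))
```

A pair of documents would count as near-duplicates when their Jaccard similarity reaches the chosen threshold, 0.80 in the abstract.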

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper releases the first dedicated Somali corpus and tokenizer with direct measurements of quality defects in existing multilingual sources.


The main takeaway is that this paper releases a dedicated Somali corpus of about 819k documents and 303M tokens, built with a reproducible filter from public sources, along with a 16K BPE tokenizer that emits 40.2% fewer tokens on Somali than GPT-4's cl100k_base, and the first side-by-side language-ID benchmark for the language.

What stands out is the concrete defect measurements on HPLT v2: 17.3% byte-exact duplicates, 56.1% of documents with fixable mojibake, and 10.7% near-duplicates among byte-unique documents. Those are direct counts, not derived from any model. The tokenizer comparison on FLORES-200 is also a clean, like-for-like check. Releasing everything makes it immediately usable for anyone starting Somali NLP work.

The filtering pipeline is the softer part. It has six stages to remove low-quality or non-Somali text, but without held-out validation or stage-wise manual checks, it is hard to know whether it keeps the right distribution of content or drops too much useful material. The authors note that downstream LM perplexity comparisons are deferred to a follow-up release, so for now we have the resource but no evidence that it improves models.

This is aimed at researchers in low-resource NLP who need a Somali starting point or want to compare against the big multilingual dumps. A reader building tokenizers or cleaning web data for African languages will get practical value from the numbers and the released files.

It has enough substance and new artifacts to warrant a serious referee. I'd send it to peer review. The measurements are solid and the release addresses a real gap, even if the bias checks could be stronger.

Referee Report

1 major / 2 minor

Summary. The paper introduces SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from HPLT v2, CC100, and Somali Wikipedia via a six-stage reproducible pipeline. It releases the corpus, a matched BPE-16K tokenizer (showing 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest), and the first public side-by-side benchmark of three production language identifiers for Somali. The work also reports concrete defect rates in existing distributions, including 17.3% byte-exact duplicates, 56.1% mojibake documents, and 10.7% near-duplicates (Jaccard 0.80) in HPLT v2.

Significance. If the filtering pipeline preserves representative Somali content, this resource addresses a clear gap for a low-resource language with ~25 million speakers that is currently scattered across multilingual crawls. The public release of the corpus, tokenizer, and benchmark, together with direct quantitative measurements of upstream defects and tokenizer efficiency, supports reproducible work in Somali NLP and provides a falsifiable baseline for future corpus construction efforts.

major comments (1)

[six-stage filtering pipeline] Description of the six-stage filtering pipeline: the claim that the pipeline removes low-quality or non-Somali text while retaining representative content rests on the pipeline description alone; the manuscript provides no quantitative validation (e.g., stage-wise precision, recall, or manual inspection statistics) for the effectiveness of individual filters. This is load-bearing for the central quality-filtered corpus claim.

minor comments (2)

[language-identification benchmark] The language-identification benchmark section would benefit from error bars, confidence intervals, or statistical significance tests on the performance differences reported for the three production identifiers.
[abstract] The abstract notes that downstream language-model perplexity comparisons are deferred; a short forward-looking sentence on the planned evaluation protocol would improve completeness.
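The first minor comment asks for confidence intervals on the LID results. With 40 examples per class, a Wilson score interval is one common choice for a per-class recall. The sketch below is an illustration of that choice, not something the paper or the referee specifies; the counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 38 of 40 Somali test rows identified correctly.
low, high = wilson_interval(38, 40)
print(round(low, 3), round(high, 3))
```

With only 40 rows per class the intervals are wide, which is why small differences between identifiers may not be distinguishable on this test set.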

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We are pleased that the significance of the work is recognized and that a minor revision is recommended. Below we provide a point-by-point response to the major comment.


Referee: Description of the six-stage filtering pipeline: the claim that the pipeline removes low-quality or non-Somali text while retaining representative content rests on the pipeline description alone; the manuscript provides no quantitative validation (e.g., stage-wise precision, recall, or manual inspection statistics) for the effectiveness of individual filters. This is load-bearing for the central quality-filtered corpus claim.

Authors: We agree that providing quantitative validation for the effectiveness of the individual filters would strengthen the manuscript's central claim regarding the quality of the filtered corpus. While the six-stage pipeline is fully described and reproducible, allowing independent verification, we did not include stage-wise metrics or manual inspection results in the original submission. In the revised version, we will add a new subsection with manual evaluation statistics. Specifically, we will report the results of inspecting a random sample of 100 documents at each stage, including the proportion classified as Somali, the presence of low-quality text, and the estimated precision and recall for non-Somali removal where applicable. This will provide the requested quantitative support for the pipeline's performance.

Revision: yes
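The audit proposed in the rebuttal has two mechanical parts: drawing a reproducible random sample of 100 documents per stage, and turning manual labels into precision and recall for non-Somali removal. A minimal sketch follows; the function names and the label layout are hypothetical.

```python
import random


def audit_sample(doc_ids: list[str], k: int = 100, seed: int = 0) -> list[str]:
    """Draw a reproducible sample of up to k document ids without replacement."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(k, len(doc_ids)))


def precision_recall(removed: list[bool], is_non_somali: list[bool]) -> tuple[float, float]:
    """Precision and recall of a filter, treating non-Somali documents as the positive class."""
    tp = sum(r and y for r, y in zip(removed, is_non_somali))
    fp = sum(r and not y for r, y in zip(removed, is_non_somali))
    fn = sum((not r) and y for r, y in zip(removed, is_non_somali))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


ids = [f"doc{i}" for i in range(1000)]
sample = audit_sample(ids)
print(len(sample), sample[:3])
print(precision_recall([True, True, False, False], [True, False, True, False]))
```

Estimating recall requires sampling from documents the filter kept as well as those it removed, since missed non-Somali documents only appear in the kept set.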

Circularity Check

0 steps flagged

No significant circularity


The paper introduces a new Somali corpus, matched BPE tokenizer, and language-ID benchmark via a reproducible six-stage pipeline applied to public upstream sources (HPLT v2, CC100, Somali Wikipedia). All key measurements—17.3% byte-exact duplicates, 56.1% mojibake documents, 10.7% near-duplicates, and 40.2% token reduction versus cl100k_base on FLORES-200—are direct empirical counts and comparisons on the released data or standard test sets. No equations, fitted parameters, predictions, or self-citations reduce any central claim to its own inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard NLP data-cleaning and tokenization methods with no new free parameters beyond the conventional choice of 16K vocabulary size and the assumption that the upstream sources contain extractable Somali text.

free parameters (1)

BPE vocabulary size
The tokenizer is built with a 16K vocabulary; this is a standard modeling choice rather than a fitted constant.
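To make concrete what the vocabulary size controls: in BPE the vocabulary is roughly the base symbols plus one new symbol per learned merge, so a 16K vocabulary fixes how many merges are learned. The toy trainer below is a from-scratch illustration of the merge loop, not the `tokenizers` library the paper pins and not the author's training code.

```python
from collections import Counter


def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["low", "low", "lower"], 2))
```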

axioms (1)

Domain assumption: The six-stage pipeline correctly identifies and retains representative Somali text while removing noise and non-Somali content.
Invoked when constructing the final corpus from HPLT v2, CC100, and Somali Wikipedia.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: six-stage reproducible pipeline... Phase 1 Byte-exact dedup... Phase 5 Char-5-gram quality... BPE-16K tokenizer

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: SomaliWeb v1... 819,322 documents (~303M tokens)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. Towards a cleaner document-oriented multilingual crawled corpus. In Proc. LREC, 2022.

[2] David Ifeoluwa Adelani et al. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9, 2021.

[3] David Ifeoluwa Adelani et al. MasakhaNEWS: News topic classification for African languages. In Proc. IJCNLP-AACL, 2023.

[4] Mehdi Ali et al. Tokenizer choice for LLM training: Negligible or crucial? In Findings of NAACL, 2024. https://arxiv.org/abs/2310.08754

[5] Emily M. Bender and Batya Friedman. Data statements for NLP: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 2018.

[6] Andrei Z. Broder. On the resemblance and containment of documents. In SEQUENCES, 1997.

[7] Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield. An open dataset and model for language identification. In Proc. ACL, 2023.

[8] Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, et al. An expanded massive multilingual dataset for high-performance language technologies (HPLT). In Proc. ACL (Volume 1: Long Papers), pages 17452–17485, Vienna, Austria, 2025. https://arxiv.org/abs/2503.10267

[9] Common Crawl Foundation. CommonLID: Re-evaluating state-of-the-art language identification performance on web data, 2025. Workshop on Multilingual Data Quality Signals (WMDQS), co-located with COLM 2025. https://arxiv.org/abs/2601.18026

[10] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. Unsupervised cross-lingual representation learning at scale. In Proc. ACL, 2020.

[11] DDD-Kenya. Somali-ASR-Subset-68H: 68 hours of Somali automatic-speech-recognition data. https://huggingface.co/datasets/DDD-Kenya/Somali-ASR-Subset-68H, 2026. Dataset, accessed 2026-04-27.

[12] Bonaventure F. P. Dossou et al. AfroLM: A self-active learning-based multilingual pretrained language model for 23 African languages. In SustaiNLP Workshop, 2022.

[13] FarmerlineML. somali_cleaned_dataset. https://huggingface.co/datasets/FarmerlineML/somali_cleaned_dataset, 2024. Dataset, no declared license, accessed 2026-04-27.

[14] IbraahimLab. fineweb-somali: Somali subset extraction of fineweb-2. https://huggingface.co/datasets/IbraahimLab/fineweb-somali, 2026. Dataset, accessed 2026-04-27.

[15] Ayyoob ImaniGooghari et al. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proc. ACL, 2023.

[16] Armand Joulin, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. Bag of tricks for efficient text classification. In Proc. EACL, 2017.

[17] Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schütze. GlotLID: Language identification for low-resource languages. In Findings of EMNLP, 2023.

[18] Sneha Kudugunta et al. MADLAD-400: A multilingual and document-level large audited dataset. In Proc. NeurIPS Datasets & Benchmarks, 2023.

[19] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 3rd edition, 2020.

[20] Thuat Nguyen et al. CulturaX: A cleaned, enormous, and multilingual dataset for large language models. In Proc. LREC-COLING, 2024.

[21] Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In MRL Workshop, 2021.

[22] Guilherme Penedo et al. FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025. https://arxiv.org/abs/2506.20920

[23] Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In Proc. NeurIPS, 2023.

[24] Robyn Speer. ftfy, 2019.

[25] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Édouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proc. LREC, 2020.

[26] Linting Xue et al. mT5: A massively multilingual pre-trained text-to-text transformer. In Proc. NAACL-HLT, 2021.

Appendix A, Data Statement [5]. A.1 Curation rationale. SomaliWeb v1 is a pretraining corpus intended for language-model and tokenizer training on Standard Somali. Documents were selected by aggregating three upstream sources and applying a six-stage dedupl...
