SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark
Pith reviewed 2026-05-20 10:30 UTC · model grok-4.3
The pith
SomaliWeb v1 provides the first documented, dedicated quality-filtered Somali pretraining corpus released together with a matched tokenizer and a language-identification benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central contribution is SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (about 303 million tokens), built from three upstream sources through a reproducible six-stage pipeline. The authors also release a matched BPE-16K tokenizer that emits 40.2% fewer tokens than GPT-4's cl100k_base on the FLORES-200 Somali devtest set, and they establish the first public side-by-side benchmark of three production language identifiers on Somali.
What carries the argument
The six-stage filtering pipeline that removes low-quality and non-Somali content from web sources while preserving useful Somali text.
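The passage quoted further down this page names byte-exact deduplication as Phase 1 of the pipeline. A minimal sketch of that kind of stage is shown here, assuming documents are plain strings and that "byte-exact" means identical UTF-8 byte sequences; the function name `dedup_byte_exact` and the sample strings are illustrative and not taken from the paper.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_byte_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its exact UTF-8 bytes are seen."""
    seen = set()
    for doc in docs:
        # Hash the raw bytes so only fixed-size digests are kept in memory.
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Toy input (assumption): the third string differs only by a trailing space,
# so a byte-exact filter keeps it.
docs = ["Soomaaliya waa dal.", "Soomaaliya waa dal.", "Soomaaliya waa dal. "]
unique = list(dedup_byte_exact(docs))
```

Because the comparison is on bytes, documents that differ only in whitespace or encoding survive this stage, which is presumably why a separate near-duplicate measurement is also reported.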
If this is right
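- Existing multilingual distributions carry measurable quality defects: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, and 56.1% of its documents contain fixable mojibake.
- The matched BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on the FLORES-200 Somali devtest set.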
- The public language-identification benchmark enables direct evaluation of production systems on Somali.
- Future language models can be pretrained on this dedicated corpus for improved Somali performance.
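The "40.2% fewer tokens" figure is a relative reduction in token count on the same text. The arithmetic can be sketched as follows; the counts 598 and 1000 are hypothetical values chosen only to reproduce the percentage and are not the paper's measurements.

```python
def token_reduction(candidate_tokens: int, baseline_tokens: int) -> float:
    """Fraction of tokens saved by the candidate tokenizer relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - candidate_tokens / baseline_tokens


# Hypothetical counts (assumption): candidate = matched tokenizer,
# baseline = reference tokenizer, both applied to the same evaluation text.
reduction = token_reduction(candidate_tokens=598, baseline_tokens=1000)
print(f"{reduction:.1%}")
```

A reduction measured this way describes the tokenizer only; as the abstract notes, it says nothing yet about downstream model quality.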
Where Pith is reading between the lines
- Similar quality-filtering pipelines could be replicated for other low-resource languages lacking dedicated corpora.
- More efficient tokenization for Somali may improve the performance of multilingual models on Somali-related tasks.
- The released benchmark may spur development of more accurate language identifiers for Somali and related languages.
Load-bearing premise
The six-stage filtering pipeline removes low-quality or non-Somali text while retaining representative and useful Somali content without introducing systematic bias or excessive data loss.
What would settle it
A manual audit of a sample of the SomaliWeb v1 corpus revealing a large fraction of non-Somali or poor-quality documents would disprove the effectiveness of the filtering pipeline.
Original abstract
Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
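The near-duplicate figure is stated at a Jaccard threshold of tau = 0.80. A minimal sketch of an exact Jaccard test over word shingles follows. The shingle unit (three consecutive lowercase words) is an assumption, since this page does not say how the paper shingles documents, and the helper names are illustrative.

```python
def shingles(text: str, n: int = 3) -> set:
    """Return the set of n-word shingles of a document (assumed shingle unit)."""
    words = text.lower().split()
    if len(words) < n:
        # Documents shorter than n words become a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(x: str, y: str, tau: float = 0.80, n: int = 3) -> bool:
    """True when the shingle sets of x and y reach the similarity threshold tau."""
    return jaccard(shingles(x, n), shingles(y, n)) >= tau


# Toy documents (assumption): `near` appends one word to `base`.
base = " ".join(f"w{i}" for i in range(12))
near = base + " w12"
far = " ".join(f"x{i}" for i in range(12))
similarity = jaccard(shingles(base), shingles(near))
```

Exact pairwise comparison is quadratic in the number of documents, so at corpus scale an approximation such as MinHash is commonly used instead; whether the paper does so is not stated on this page.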
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from HPLT v2, CC100, and Somali Wikipedia via a six-stage reproducible pipeline. It releases the corpus, a matched BPE-16K tokenizer (showing 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest), and the first public side-by-side benchmark of three production language identifiers for Somali. The work also reports concrete defect rates in existing distributions, including 17.3% byte-exact duplicates, 56.1% mojibake documents, and 10.7% near-duplicates (Jaccard 0.80) in HPLT v2.
Significance. If the filtering pipeline preserves representative Somali content, this resource addresses a clear gap for a low-resource language with ~25 million speakers whose text is currently scattered across multilingual crawls. The public release of the corpus, tokenizer, and benchmark, together with direct quantitative measurements of upstream defects and tokenizer efficiency, supports reproducible work in Somali NLP and provides a falsifiable baseline for future corpus construction efforts.
major comments (1)
- [six-stage filtering pipeline] Description of the six-stage filtering pipeline: the claim that the pipeline removes low-quality or non-Somali text while retaining representative content rests on the pipeline description alone; the manuscript provides no quantitative validation (e.g., stage-wise precision, recall, or manual inspection statistics) for the effectiveness of individual filters. This is load-bearing for the central quality-filtered corpus claim.
minor comments (2)
- [language-identification benchmark] The language-identification benchmark section would benefit from error bars, confidence intervals, or statistical significance tests on the performance differences reported for the three production identifiers.
- [abstract] The abstract notes that downstream language-model perplexity comparisons are deferred; a short forward-looking sentence on the planned evaluation protocol would improve completeness.
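The request for confidence intervals on the language-identification results could be met in several ways. One common option for an accuracy computed from a count of correct predictions is the Wilson score interval, sketched below. The counts used are hypothetical, and nothing here implies the authors would choose this particular interval.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result (assumption): 950 of 1000 Somali sentences identified correctly.
low, high = wilson_interval(correct=950, total=1000)
```

Overlapping intervals for two identifiers would suggest that the difference between them may not be reliable at that sample size, which is the kind of reading the referee's comment asks the paper to support.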
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We are pleased that the significance of the work is recognized and that a minor revision is recommended. Below we provide a point-by-point response to the major comment.
- Referee: Description of the six-stage filtering pipeline: the claim that the pipeline removes low-quality or non-Somali text while retaining representative content rests on the pipeline description alone; the manuscript provides no quantitative validation (e.g., stage-wise precision, recall, or manual inspection statistics) for the effectiveness of individual filters. This is load-bearing for the central quality-filtered corpus claim.
  Authors: We agree that providing quantitative validation for the effectiveness of the individual filters would strengthen the manuscript's central claim regarding the quality of the filtered corpus. While the six-stage pipeline is fully described and reproducible, allowing independent verification, we did not include stage-wise metrics or manual inspection results in the original submission. In the revised version, we will add a new subsection with manual evaluation statistics. Specifically, we will report the results of inspecting a random sample of 100 documents at each stage, including the proportion classified as Somali, the presence of low-quality text, and the estimated precision and recall for non-Somali removal where applicable. This will provide the requested quantitative support for the pipeline's performance.
  Revision: yes
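The proposed manual evaluation amounts to comparing a filter's decisions against human labels on a sample. A minimal sketch of the precision and recall computation for non-Somali removal follows; the `AuditedDoc` record and the sample composition (18 true positives, 2 false positives, 5 false negatives, 75 true negatives) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuditedDoc:
    removed_by_filter: bool  # the filter's decision
    is_non_somali: bool      # the human annotator's label


def precision_recall(sample: Iterable[AuditedDoc]) -> tuple:
    """Precision and recall of the filter at removing non-Somali documents."""
    docs = list(sample)
    tp = sum(d.removed_by_filter and d.is_non_somali for d in docs)
    fp = sum(d.removed_by_filter and not d.is_non_somali for d in docs)
    fn = sum(not d.removed_by_filter and d.is_non_somali for d in docs)
    # NaN marks a ratio that is undefined because its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Hypothetical audit of 100 documents (assumption, not data from the paper).
sample = (
    [AuditedDoc(True, True)] * 18
    + [AuditedDoc(True, False)] * 2
    + [AuditedDoc(False, True)] * 5
    + [AuditedDoc(False, False)] * 75
)
precision, recall = precision_recall(sample)
```

Recall can only be estimated if the sample includes documents the filter kept as well as those it removed, so the audit would need to draw from the stage's input rather than only from its output.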
Circularity Check
No significant circularity
The paper introduces a new Somali corpus, matched BPE tokenizer, and language-ID benchmark via a reproducible six-stage pipeline applied to public upstream sources (HPLT v2, CC100, Somali Wikipedia). All key measurements—17.3% byte-exact duplicates, 56.1% mojibake documents, 10.7% near-duplicates, and 40.2% token reduction versus cl100k_base on FLORES-200—are direct empirical counts and comparisons on the released data or standard test sets. No equations, fitted parameters, predictions, or self-citations reduce any central claim to its own inputs by construction; the work is self-contained against external benchmarks.
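Of the counts listed above, the mojibake rate depends most on how "fixable" is defined, which this page does not specify. The sketch below assumes the common case of UTF-8 bytes that were wrongly decoded as Windows-1252, and treats a document as fixable when reversing that step changes the text. The function names are illustrative.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mix-up; return the input unchanged if it does not apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252 or the recovered
        # bytes are not valid UTF-8, so this repair does not apply.
        return text


def has_fixable_mojibake(text: str) -> bool:
    """True when the round trip above yields different text."""
    return repair_mojibake(text) != text


# Simulate the corruption on a string containing a right single quote (U+2019).
clean = "Soomaaliya\u2019s"
garbled = clean.encode("utf-8").decode("cp1252")
restored = repair_mojibake(garbled)
```

Plain ASCII text and correctly decoded accented text pass through unchanged, because the round trip either reproduces the input or fails and falls back to it.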
Axiom & Free-Parameter Ledger
free parameters (1)
- BPE vocabulary size
axioms (1)
- domain assumption: The six-stage pipeline correctly identifies and retains representative Somali text while removing noise and non-Somali content.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: six-stage reproducible pipeline... Phase 1 Byte-exact dedup... Phase 5 Char-5-gram quality... BPE-16K tokenizer
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: SomaliWeb v1... 819,322 documents (~303M tokens)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. Towards a cleaner document-oriented multilingual crawled corpus. In Proc. LREC, 2022.
- [2] David Ifeoluwa Adelani et al. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9, 2021.
- [3] David Ifeoluwa Adelani et al. MasakhaNEWS: News topic classification for African languages. In Proc. IJCNLP-AACL, 2023.
- [4] Mehdi Ali et al. Tokenizer choice for LLM training: Negligible or crucial? In Findings of NAACL, 2024. https://arxiv.org/abs/2310.08754
- [5] Emily M. Bender and Batya Friedman. Data statements for NLP: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 2018.
- [6] Andrei Z. Broder. On the resemblance and containment of documents. In SEQUENCES, 1997.
- [7] Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield. An open dataset and model for language identification. In Proc. ACL, 2023.
- [8] Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, et al. An expanded massive multilingual dataset for high-performance language technologies (HPLT). In Proc. ACL (Volume 1: Long Papers), pages 17452–17485, Vienna, Austria, 2025. https://arxiv.org/abs/2503.10267
- [9] Common Crawl Foundation. CommonLID: Re-evaluating state-of-the-art language identification performance on web data, 2025. Workshop on Multilingual Data Quality Signals (WMDQS), co-located with COLM 2025. https://arxiv.org/abs/2601.18026
- [10] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. Unsupervised cross-lingual representation learning at scale. In Proc. ACL, 2020.
- [11] DDD-Kenya. Somali-ASR-Subset-68H: 68 hours of Somali automatic-speech-recognition data. https://huggingface.co/datasets/DDD-Kenya/Somali-ASR-Subset-68H, 2026. Dataset, accessed 2026-04-27.
- [12] Bonaventure F. P. Dossou et al. AfroLM: A self-active learning-based multilingual pretrained language model for 23 African languages. In SustaiNLP Workshop, 2022.
- [13] FarmerlineML. somali_cleaned_dataset. https://huggingface.co/datasets/FarmerlineML/somali_cleaned_dataset, 2024. Dataset, no declared license, accessed 2026-04-27.
- [14] IbraahimLab. fineweb-somali: Somali subset extraction of fineweb-2. https://huggingface.co/datasets/IbraahimLab/fineweb-somali, 2026. Dataset, accessed 2026-04-27.
- [15] Ayyoob ImaniGooghari et al. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proc. ACL, 2023.
- [16] Armand Joulin, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. Bag of tricks for efficient text classification. In Proc. EACL, 2017.
- [17] Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schütze. GlotLID: Language identification for low-resource languages. In Findings of EMNLP, 2023.
- [18] Sneha Kudugunta et al. MADLAD-400: A multilingual and document-level large audited dataset. In Proc. NeurIPS Datasets & Benchmarks, 2023.
- [19] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 3rd edition, 2020.
- [20] Thuat Nguyen et al. CulturaX: A cleaned, enormous, and multilingual dataset for large language models. In Proc. LREC-COLING, 2024.
- [21] Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In MRL Workshop, 2021.
- [22] Guilherme Penedo et al. FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025. https://arxiv.org/abs/2506.20920
- [23] Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In Proc. NeurIPS, 2023.
- [24]
- [25] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Édouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proc. LREC, 2020.
- [26] Linting Xue et al. mT5: A massively multilingual pre-trained text-to-text transformer. In Proc. NAACL-HLT, 2021.
A Data Statement [5]
A.1 Curation rationale. SomaliWeb v1 is a pretraining corpus intended for language-model and tokenizer training on Standard Somali. Documents were selected by aggregating three upstream sources and applying a six-stage dedupl...