
arxiv: 2605.18232 · v1 · pith:GUVSMPMUnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.IR

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Pith reviewed 2026-05-20 10:30 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords Somali web corpus · data filtering pipeline · BPE tokenizer · language identification · low-resource languages · pretraining data · corpus quality

The pith

SomaliWeb v1 provides the first dedicated quality-filtered Somali corpus along with a matched tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Somali has around 25 million speakers but until now lacked a dedicated public pretraining corpus. The paper constructs SomaliWeb v1 by applying a six-stage pipeline to sources like HPLT v2, CC100, and Somali Wikipedia, resulting in 819,322 documents and roughly 303 million tokens. It also releases a BPE-16K tokenizer and the first public benchmark for Somali language identification. This matters because it allows better language model training for Somali and exposes quality problems in current multilingual datasets such as high rates of duplicates and encoding errors.

Core claim

The central contribution is SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents totaling about 303 million tokens, built from three sources through a reproducible six-stage pipeline. The authors also provide a matched BPE-16K tokenizer that produces 40.2% fewer tokens than GPT-4's cl100k_base on the FLORES-200 Somali devtest set, and they establish the first public benchmark comparing three production language identifiers on Somali.

What carries the argument

The six-stage filtering pipeline that removes low-quality and non-Somali content from web sources while preserving useful Somali text.

If this is right

  • Existing multilingual corpora contain significant quality defects including duplicates and mojibake.
  • The matched BPE-16K tokenizer emits 40.2% fewer tokens on Somali text than GPT-4's tokenizer.
  • The public language-identification benchmark enables direct evaluation of production systems on Somali.
  • Future language models can be pretrained on this dedicated corpus for improved Somali performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar quality-filtering pipelines could be replicated for other low-resource languages lacking dedicated corpora.
  • More efficient tokenization for Somali may improve the performance of multilingual models on Somali-related tasks.
  • The released benchmark may spur development of more accurate language identifiers for Somali and related languages.

Load-bearing premise

The six-stage filtering pipeline removes low-quality or non-Somali text while retaining representative and useful Somali content without introducing systematic bias or excessive data loss.

What would settle it

A manual audit of a sample of the SomaliWeb v1 corpus revealing a large fraction of non-Somali or poor-quality documents would disprove the effectiveness of the filtering pipeline.

Figures

Figures reproduced from arXiv: 2605.18232 by Khalid Yusuf Dahir.

Figure 1
SomaliWeb v1: the six-stage corpus construction pipeline with per-phase retention. See §5 for equations and §7 for retention tables.

  • C3 (Measurement). Three concrete, quantified quality defects in HPLT v2's "cleaned" Somali distribution (17.3% byte-duplicates, 56.1% mojibake-bearing documents, 10.7% near-duplicates) with per-phase retention and per-source breakdowns.
  • C4 (Tool). A BPE-16K tokenizer tra…
Figure 2
LSH S-curves (7) for three (b, r) configurations. Our choice of (16, 4) is balanced around s* ≈ 0.50 with near-certain capture at τ = 0.80.

Coverage.

$$\mathrm{cov}(d) = \frac{\left|G^{(5)}(d) \cap G^{(5)}_{\mathrm{seed}}\right|}{\left|G^{(5)}(d)\right|} \tag{8}$$

We drop the bottom 15% by coverage; empirically this corresponds to the threshold cov ≥ 0.9029.

Retention. 819,322 / 963,908 = 85.00%. Per-source drop rate reveals source-quality asymmetry: Wikipedia 20.61%, HPLT 18.…
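
The (b, r) choice in this caption can be sanity-checked numerically. The sketch below assumes that equation (7) is the standard MinHash-LSH banding curve, P(s) = 1 - (1 - s^r)^b, with the usual threshold approximation s* ≈ (1/b)^(1/r). The paper's exact formulation is not reproduced on this page, and the function names are illustrative only.

```python
def lsh_candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity s
    collide in at least one of `bands` bands of `rows` hashes each
    (standard banding formula; assumed to match the paper's eq. 7)."""
    return 1.0 - (1.0 - s ** rows) ** bands


def lsh_threshold(bands: int, rows: int) -> float:
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    b, r = 16, 4  # configuration named in the caption
    print(f"s* ~ {lsh_threshold(b, r):.2f}")
    for s in (0.30, 0.50, 0.80):
        p = lsh_candidate_probability(s, b, r)
        print(f"P(candidate | s={s:.2f}) = {p:.4f}")
```

Under this assumption, raising r sharpens the curve and pushes the threshold up, while raising b pulls it down, so the pair is chosen to place the steep part of the curve below the target τ.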
Figure 3
Per-source retention across the six pipeline phases.

Software. Python 3.10. Pinned versions: numpy<2, ftfy==6.1.3, langdetect==1.0.9, tokenizers==0.15.2, datasets==2.19.0, zstandard==0.22.0, tiktoken==0.7.0. Full requirements.txt in Appendix B.

Determinism. All seeds fixed at 0: random.seed(0), np.random.seed(0), DetectorFactory.seed = 0, tokenizer trainer shuffle_seed=0. With pinned versions the full pipe…
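
The seeding calls listed in this caption could be gathered into one helper, roughly as follows. This is a minimal sketch, not the authors' code: the helper name `seed_everything` is hypothetical, the third-party imports are treated as optional so the snippet still runs where they are missing, and the tokenizer trainer's `shuffle_seed` is left out because its exact API is not shown on this page.

```python
import random


def seed_everything(seed: int = 0) -> list[str]:
    """Fix the seeds named in the caption and report which ones were set."""
    seeded = []

    random.seed(seed)
    seeded.append("random")

    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        from langdetect import DetectorFactory  # optional dependency
        DetectorFactory.seed = seed  # makes langdetect's sampling repeatable
        seeded.append("langdetect")
    except ImportError:
        pass

    return seeded


if __name__ == "__main__":
    print("seeded:", seed_everything(0))
```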
Figure 4
Somali LID confusion matrices on the 200-row test set (40 per class). langdetect dominates on Somali recall.
Figure 5
Tokenizer fertility distribution on FLORES-200 Somali devtest (1,012 sentences). SomaliWeb v1 ties HPLT-raw with a 30% smaller training corpus and emits 40.2% fewer tokens than GPT-4's cl100k_base.

Source composition.

Qualitative audit. From a random sample of 20 release documents, all 20 were judged by a native Somali speaker as recognizable, well-formed Somali text suitable for pretraining. Full rubric-base…
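
A tokenizer-level comparison of this kind can be expressed with two small functions. The sketch below assumes fertility means tokens per whitespace-delimited word (a common definition; the paper's own definition is not shown here) and takes any `encode` callable, so it does not depend on a particular tokenizer library. The toy encoders and the sample sentence are placeholders, not the paper's setup.

```python
from typing import Callable, Iterable, Sequence

Encoder = Callable[[str], Sequence]


def total_tokens(sentences: Iterable[str], encode: Encoder) -> int:
    """Total number of tokens the encoder emits over all sentences."""
    return sum(len(encode(s)) for s in sentences)


def fertility(sentences: Iterable[str], encode: Encoder) -> float:
    """Tokens per whitespace-delimited word (assumed definition)."""
    sentences = list(sentences)
    words = sum(len(s.split()) for s in sentences)
    if words == 0:
        raise ValueError("no words in input")
    return total_tokens(sentences, encode) / words


def token_reduction(sentences: Iterable[str],
                    encode_new: Encoder,
                    encode_ref: Encoder) -> float:
    """Fraction of tokens saved by encode_new relative to encode_ref."""
    sentences = list(sentences)
    return 1.0 - total_tokens(sentences, encode_new) / total_tokens(sentences, encode_ref)


if __name__ == "__main__":
    sample = ["Soomaaliya waa dal ku yaal Geeska Afrika."]  # toy sentence
    word_level = str.split  # stand-in for a language-matched tokenizer
    char_level = list       # stand-in for a poorly matched tokenizer
    print(f"fertility (word-level): {fertility(sample, word_level):.2f}")
    print(f"fertility (char-level): {fertility(sample, char_level):.2f}")
    print(f"reduction: {token_reduction(sample, word_level, char_level):.1%}")
```

With real tokenizers, `encode` could be something like `lambda s: tok.encode(s).ids` for a Hugging Face `tokenizers.Tokenizer`, or the `encode` method of `tiktoken.get_encoding("cl100k_base")`.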
Figure 6
Char-5-gram coverage distribution (Phase 5) with τ = 0.9029 drop threshold.

  5. Standard Somali only. No Maay Maay coverage; pipeline would need dialect-aware LID adjustments.
  6. No Somali-aware PII scrub. Empirical scan: ∼7.9% of release documents contain at least one email-shaped string. Presidio does not cover Somali. Consumer-facing downstream uses must apply additional PII filtering.
  7. Source coverage…
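
The coverage score plotted in Figure 6 is the share of a document's character 5-grams that also occur in a seed set, per equation (8). The sketch below assumes that G^(5)(d) is the set of distinct 5-grams (the paper may use a multiset) and that the bottom-15% cut is taken as an empirical quantile; all function names are illustrative.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Distinct character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def coverage(doc: str, seed_grams: set[str], n: int = 5) -> float:
    """Fraction of the document's n-grams that also appear in the seed set."""
    grams = char_ngrams(doc, n)
    if not grams:
        return 0.0  # documents shorter than n characters get no credit
    return len(grams & seed_grams) / len(grams)


def quantile_threshold(scores: list[float], fraction: float = 0.15) -> float:
    """Score below which roughly `fraction` of the documents fall."""
    if not scores:
        raise ValueError("no scores")
    ranked = sorted(scores)
    cut = min(int(len(ranked) * fraction), len(ranked) - 1)
    return ranked[cut]


def keep_by_coverage(docs: list[str], seed_grams: set[str],
                     fraction: float = 0.15) -> list[str]:
    """Keep documents whose coverage is at or above the quantile threshold."""
    scores = [coverage(d, seed_grams) for d in docs]
    tau = quantile_threshold(scores, fraction)
    return [d for d, s in zip(docs, scores) if s >= tau]
```

Because ties at the threshold are kept, this variant may drop slightly less than the requested fraction.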
Original abstract

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
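
Two of the defect measurements in the abstract correspond to simple checks that can be sketched with the standard library. The code below estimates a byte-exact duplicate rate by hashing UTF-8 bytes and tests near-duplication with exact Jaccard similarity at τ = 0.80. It is an illustration under assumptions: the paper's shingle definition and MinHash settings are not shown on this page, so word 3-shingles are a placeholder choice, and mojibake detection is omitted.

```python
import hashlib


def byte_duplicate_rate(docs: list[str]) -> float:
    """Share of documents whose UTF-8 bytes repeat an earlier document."""
    seen: set[str] = set()
    duplicates = 0
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(docs) if docs else 0.0


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Word k-shingles (assumed unit; the paper may shingle differently)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_near_duplicate(x: str, y: str, tau: float = 0.80) -> bool:
    """True if the two texts reach the similarity threshold tau."""
    return jaccard(shingles(x), shingles(y)) >= tau
```

Exact pairwise Jaccard is quadratic in the number of documents, which is why corpus-scale pipelines typically approximate it with MinHash and LSH banding.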

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from HPLT v2, CC100, and Somali Wikipedia via a six-stage reproducible pipeline. It releases the corpus, a matched BPE-16K tokenizer (showing 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest), and the first public side-by-side benchmark of three production language identifiers for Somali. The work also reports concrete defect rates in existing distributions, including 17.3% byte-exact duplicates, 56.1% mojibake documents, and 10.7% near-duplicates (Jaccard 0.80) in HPLT v2.

Significance. If the filtering pipeline preserves representative Somali content, this resource addresses a clear gap for a low-resource language with ~25 million speakers that is currently scattered across multilingual crawls. The public release of the corpus, tokenizer, and benchmark, together with direct quantitative measurements of upstream defects and tokenizer efficiency, supports reproducible work in Somali NLP and provides a falsifiable baseline for future corpus construction efforts.

major comments (1)
  1. [six-stage filtering pipeline] Description of the six-stage filtering pipeline: the claim that the pipeline removes low-quality or non-Somali text while retaining representative content rests on the pipeline description alone; the manuscript provides no quantitative validation (e.g., stage-wise precision, recall, or manual inspection statistics) for the effectiveness of individual filters. This is load-bearing for the central quality-filtered corpus claim.
minor comments (2)
  1. [language-identification benchmark] The language-identification benchmark section would benefit from error bars, confidence intervals, or statistical significance tests on the performance differences reported for the three production identifiers.
  2. [abstract] The abstract notes that downstream language-model perplexity comparisons are deferred; a short forward-looking sentence on the planned evaluation protocol would improve completeness.
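
For the first minor comment, one lightweight option is a Wilson score interval on per-class recall, since each class in the test set has only 40 rows (Figure 4). The sketch below is illustrative: the function name is hypothetical and the count in the usage example is a made-up placeholder, not a result from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    hypothetical_hits = 38  # placeholder count out of 40, not from the paper
    low, high = wilson_interval(hypothetical_hits, 40)
    print(f"recall = {hypothetical_hits / 40:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

At n = 40 the interval is wide, so small differences between identifiers on a single class may not be distinguishable without more test rows.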

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We are pleased that the significance of the work is recognized and that a minor revision is recommended. Below we provide a point-by-point response to the major comment.

Point-by-point responses
  1. Referee: Description of the six-stage filtering pipeline: the claim that the pipeline removes low-quality or non-Somali text while retaining representative content rests on the pipeline description alone; the manuscript provides no quantitative validation (e.g., stage-wise precision, recall, or manual inspection statistics) for the effectiveness of individual filters. This is load-bearing for the central quality-filtered corpus claim.

    Authors: We agree that providing quantitative validation for the effectiveness of the individual filters would strengthen the manuscript's central claim regarding the quality of the filtered corpus. While the six-stage pipeline is fully described and reproducible, allowing independent verification, we did not include stage-wise metrics or manual inspection results in the original submission. In the revised version, we will add a new subsection with manual evaluation statistics. Specifically, we will report the results of inspecting a random sample of 100 documents at each stage, including the proportion classified as Somali, the presence of low-quality text, and the estimated precision and recall for non-Somali removal where applicable. This will provide the requested quantitative support for the pipeline's performance.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a new Somali corpus, matched BPE tokenizer, and language-ID benchmark via a reproducible six-stage pipeline applied to public upstream sources (HPLT v2, CC100, Somali Wikipedia). All key measurements—17.3% byte-exact duplicates, 56.1% mojibake documents, 10.7% near-duplicates, and 40.2% token reduction versus cl100k_base on FLORES-200—are direct empirical counts and comparisons on the released data or standard test sets. No equations, fitted parameters, predictions, or self-citations reduce any central claim to its own inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard NLP data-cleaning and tokenization methods with no new free parameters beyond the conventional choice of 16K vocabulary size and the assumption that the upstream sources contain extractable Somali text.

free parameters (1)
  • BPE vocabulary size
    The tokenizer is built with a 16K vocabulary; this is a standard modeling choice rather than a fitted constant.
axioms (1)
  • domain assumption: The six-stage pipeline correctly identifies and retains representative Somali text while removing noise and non-Somali content.
    Invoked when constructing the final corpus from HPLT v2, CC100, and Somali Wikipedia.

pith-pipeline@v0.9.0 · 5806 in / 1408 out tokens · 57085 ms · 2026-05-20T10:30:49.159345+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  [1] Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. Towards a cleaner document-oriented multilingual crawled corpus. In Proc. LREC, 2022.
  [2] David Ifeoluwa Adelani et al. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9, 2021.
  [3] David Ifeoluwa Adelani et al. MasakhaNEWS: News topic classification for African languages. In Proc. IJCNLP-AACL, 2023.
  [4] Mehdi Ali et al. Tokenizer choice for LLM training: Negligible or crucial? In Findings of NAACL, 2024. https://arxiv.org/abs/2310.08754
  [5] Emily M. Bender and Batya Friedman. Data statements for NLP: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 2018.
  [6] Andrei Z. Broder. On the resemblance and containment of documents. In SEQUENCES, 1997.
  [7] Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield. An open dataset and model for language identification. In Proc. ACL, 2023.
  [8] Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, et al. An expanded massive multilingual dataset for high-performance language technologies (HPLT). In Proc. ACL (Volume 1: Long Papers), pages 17452–17485, Vienna, Austria, 2025. https://arxiv.org/abs/2503.10267
  [9] Common Crawl Foundation. CommonLID: Re-evaluating state-of-the-art language identification performance on web data, 2025. Workshop on Multilingual Data Quality Signals (WMDQS), co-located with COLM 2025. https://arxiv.org/abs/2601.18026
  [10] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. Unsupervised cross-lingual representation learning at scale. In Proc. ACL, 2020.
  [11] DDD-Kenya. Somali-ASR-Subset-68H: 68 hours of Somali automatic-speech-recognition data. https://huggingface.co/datasets/DDD-Kenya/Somali-ASR-Subset-68H, 2026. Dataset, accessed 2026-04-27.
  [12] Bonaventure F. P. Dossou et al. AfroLM: A self-active learning-based multilingual pretrained language model for 23 African languages. In SustaiNLP Workshop, 2022.
  [13] FarmerlineML. somali_cleaned_dataset. https://huggingface.co/datasets/FarmerlineML/somali_cleaned_dataset, 2024. Dataset, no declared license, accessed 2026-04-27.
  [14] IbraahimLab. fineweb-somali: Somali subset extraction of fineweb-2. https://huggingface.co/datasets/IbraahimLab/fineweb-somali, 2026. Dataset, accessed 2026-04-27.
  [15] Ayyoob ImaniGooghari et al. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proc. ACL, 2023.
  [16] Armand Joulin, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. Bag of tricks for efficient text classification. In Proc. EACL, 2017.
  [17] Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schütze. GlotLID: Language identification for low-resource languages. In Findings of EMNLP, 2023.
  [18] Sneha Kudugunta et al. MADLAD-400: A multilingual and document-level large audited dataset. In Proc. NeurIPS Datasets & Benchmarks, 2023.
  [19] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 3rd edition, 2020.
  [20] Thuat Nguyen et al. CulturaX: A cleaned, enormous, and multilingual dataset for large language models. In Proc. LREC-COLING, 2024.
  [21] Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In MRL Workshop, 2021.
  [22] Guilherme Penedo et al. FineWeb2: One pipeline to scale them all -- adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025. https://arxiv.org/abs/2506.20920
  [23] Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In Proc. NeurIPS, 2023.
  [24] Robyn Speer. ftfy, 2019.
  [25] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Édouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proc. LREC, 2020.
  [26] Linting Xue et al. mT5: A massively multilingual pre-trained text-to-text transformer. In Proc. NAACL-HLT, 2021.

A Data Statement [5]

A.1 Curation rationale. SomaliWeb v1 is a pretraining corpus intended for language-model and tokenizer training on Standard Somali. Documents were selected by aggregating three upstream sources and applying a six-stage dedupl...