arxiv: 2601.18026 · v2 · pith:XUBDE4GEnew · submitted 2026-01-25 · 💻 cs.CL

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, and 89 more authors
classification 💻 cs.CL
keywords: commonlid, languages, language, many, models, corpora, data, domain

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

    cs.CL 2026-05 accept novelty 6.0

    SomaliWeb v1 delivers a cleaned Somali corpus, efficient BPE tokenizer, and side-by-side language identification benchmark while documenting defects in prior multilingual datasets.