How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in NLP. However, in recent years, such assumptions of high quality have become the subject of scrutiny in low-resource and multilingual contexts. In this study, we subject the entirety of non-English Wikipedia to a data filtering procedure typically reserved for noisy web-text -- a process which removes a large percentage of the collection's data. In analysing the removed data, we reveal numerous systematic quality issues, such as script and language contamination, repeated template and placeholder articles, and a high concentration of bot-generated content. We consolidate these findings into a 4-level quality ranking of Wikipedia, which shows strong correspondence with alternative quality measures and heuristics. Lastly, we evaluate the downstream impact of quality filtering in three practical language modelling scenarios, showing that models trained on filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains observed for lower-quality language editions. Ultimately, our experiments serve as a first step in establishing quality-aware best practices for Wikipedia utilization in NLP, laying groundwork that can inform future dataset creation and curation efforts.

fields

cs.CL 1

years

2025 1

verdicts

UNVERDICTED 1

representative citing papers

Factual Inconsistencies in Multilingual Wikipedia Tables

cs.CL · 2025-07-24 · unverdicted · novelty 4.0

The study introduces a method for detecting and categorizing cross-lingual factual inconsistencies in Wikipedia tables using alignment techniques and metrics on sample data.

citing papers explorer


  • Factual Inconsistencies in Multilingual Wikipedia Tables cs.CL · 2025-07-24 · unverdicted · none · ref 13 · internal anchor
