WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Pith reviewed 2026-05-24 23:26 UTC · model grok-4.3
The pith
Sentence embeddings extract 135 million parallel sentences from Wikipedia for 1620 language pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An approach based on multilingual sentence embeddings extracts 135M parallel sentences for 1620 language pairs from Wikipedia in 85 languages, with only 34M involving English; neural MT baselines trained solely on these bitexts achieve strong BLEU scores on the TED corpus for many pairs, including distant languages.
What carries the argument
Multilingual sentence embeddings that identify parallel sentences by computing similarity scores across different languages.
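A minimal sketch of this idea, assuming plain cosine similarity with a single global threshold as the referee report below describes it. The toy 3-dimensional vectors stand in for real sentence embeddings, and the names `cosine`, `mine_pairs`, and the threshold value 0.9 are hypothetical choices for illustration, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)

def mine_pairs(src_vecs, tgt_vecs, threshold):
    """Keep each source sentence's best-scoring target if it clears the threshold."""
    pairs = []
    if not tgt_vecs:
        return pairs
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy vectors (assumption): real embeddings would come from a multilingual encoder.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 2.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, -1.0]]
mined = mine_pairs(src, tgt, threshold=0.9)  # 0.9 is an arbitrary illustrative value
```

In this toy run the third source vector has no target above the threshold and is dropped, which is the behaviour a uniform cutoff is meant to provide. A brute-force scan like this is quadratic in corpus size, so a real pipeline at Wikipedia scale would need approximate nearest-neighbour search.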
If this is right
- Neural MT systems trained only on the mined data achieve strong BLEU scores for many language pairs on the TED corpus.
- The extracted bitexts are useful for training systems between distant languages without pivoting through English.
- A large public corpus of parallel sentences becomes available for 1620 language pairs.
Where Pith is reading between the lines
- The method could scale to additional languages or other multilingual text sources to further increase parallel data.
- Performance gains might be largest for language pairs with little existing parallel data.
- Combining the mined data with other resources could improve overall translation quality for low-resource settings.
Load-bearing premise
The multilingual sentence embeddings reliably identify true parallel sentences with acceptable precision for all 1620 language pairs including low-resource and distant ones.
What would settle it
Human inspection of extracted sentence pairs showing many are not true translations, or MT models trained on the data underperforming on independent test sets for most pairs.
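One way such a human inspection could be summarised is a precision estimate with a confidence interval for each language pair. The sketch below uses the Wilson score interval; the sample of 100 judgments with 90 marked as true translations is invented purely for illustration and is not data from the paper.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half

# Hypothetical annotation result: True means the annotator judged the pair a translation.
labels = [True] * 90 + [False] * 10
precision = sum(labels) / len(labels)
low, high = wilson_interval(sum(labels), len(labels))
```

Reporting the interval rather than the point estimate matters most for low-resource pairs, where only a small sample may be available for annotation.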
Original abstract
We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available at https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix. To get an indication on the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 languages pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a pipeline that applies LASER multilingual sentence embeddings to mine parallel sentences from Wikipedia articles across 85 languages. It extracts bitexts for all language pairs (not limited to English alignments), yielding 135M parallel sentences in 1620 pairs. Quality is assessed by training NMT baselines on the mined data for 1886 language pairs and reporting BLEU scores on the TED test sets, with the claim that the resource is especially useful for distant-language MT without English pivoting. The corpus is released publicly.
Significance. If the precision of the mined bitexts holds across the claimed language pairs, the work supplies a large-scale, publicly released resource that substantially expands available training data for multilingual MT, particularly for non-English and low-resource pairs. The all-pairs extraction approach and the TED evaluation results provide concrete evidence of practical utility at this scale, and the open release of the data supports further research.
major comments (3)
- [Section 3] The extraction declares parallel sentences via a cosine-similarity threshold applied uniformly in the shared LASER space for all 1620 pairs. No language-pair-specific precision estimates, human judgments, or threshold ablations stratified by resource level or typological distance are reported; this directly underpins both the 135M count and the quality claims for non-English pairs.
- [Section 4] MT systems are trained exclusively on the mined WikiMatrix data and evaluated on TED. The evaluation does not include controls (e.g., comparison against pivoted baselines or noise-injection tests) that would confirm the mined pairs satisfy strict parallelism rather than semantic relatedness, which is load-bearing for the 'strong BLEU without English pivot' claim for distant pairs.
- [Abstract and Section 4] The statement that the bitexts are 'particularly interesting to train MT systems between distant languages' is not supported by any breakdown of BLEU scores by language distance, resource level, or explicit comparison to English-pivoted systems; the TED results alone do not isolate this advantage.
minor comments (3)
- [Abstract] The abstract contains the typo 'limit the the extraction'.
- [Abstract] The number '1886 language pairs' for MT training appears inconsistent with the 1620 extracted pairs; clarify the exact count and selection criterion.
- [Section 3] The manuscript provides limited detail on the precise LASER model variant, embedding dimensionality, or exact similarity threshold value used in the mining step.
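For context on the pair counts questioned above, the upper bounds implied by 85 languages can be checked directly. This is plain combinatorics and says nothing about how the paper itself counts pairs; whether the reported figures are unordered pairs or translation directions is an open question here.

```python
from math import comb

n_languages = 85

# Unordered pairs {A, B}: each pair of distinct languages counted once.
unordered_pairs = comb(n_languages, 2)

# Directed pairs A->B and B->A: each unordered pair counted in both directions.
directed_pairs = 2 * unordered_pairs

reported_mined = 1620    # pairs with mined bitexts, as stated in the abstract
reported_trained = 1886  # pairs with NMT baselines, as stated in the abstract
```

Both reported figures fall below the unordered ceiling of 3570, so the arithmetic alone does not reveal which count is off; it only confirms that many possible pairs yielded no released bitext.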
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, providing clarifications on our design choices while agreeing to revisions where they strengthen the presentation of results.
- Referee: [Section 3] The extraction declares parallel sentences via a cosine-similarity threshold applied uniformly in the shared LASER space for all 1620 pairs. No language-pair-specific precision estimates, human judgments, or threshold ablations stratified by resource level or typological distance are reported; this directly underpins both the 135M count and the quality claims for non-English pairs.
  Authors: A single threshold was chosen to enable consistent, scalable extraction across 1620 pairs in the shared embedding space without prohibitive per-pair tuning. Direct precision estimates or human judgments for every pair were not feasible at this scale; instead, quality is assessed via downstream NMT performance on TED as a practical proxy. We will add an explicit discussion of the uniform threshold choice and its limitations in the revised manuscript.
  Revision: partial
- Referee: [Section 4] MT systems are trained exclusively on the mined WikiMatrix data and evaluated on TED. The evaluation does not include controls (e.g., comparison against pivoted baselines or noise-injection tests) that would confirm the mined pairs satisfy strict parallelism rather than semantic relatedness, which is load-bearing for the 'strong BLEU without English pivot' claim for distant pairs.
  Authors: The experimental design evaluates the mined data in isolation to demonstrate its utility as a standalone resource, especially for non-English pairs. The resulting BLEU scores on TED provide evidence that the extracted sentences contain sufficient parallel signal. While pivoted baselines or noise-injection tests would offer additional validation, they fall outside the scope of showing direct mining value; we will expand the discussion section to address potential semantic relatedness concerns.
  Revision: partial
- Referee: [Abstract and Section 4] The statement that the bitexts are 'particularly interesting to train MT systems between distant languages' is not supported by any breakdown of BLEU scores by language distance, resource level, or explicit comparison to English-pivoted systems; the TED results alone do not isolate this advantage.
  Authors: The reported results include strong performance on numerous non-English pairs from TED, supporting utility beyond English-centric data. However, we acknowledge that the manuscript lacks an explicit stratification by distance or resource level and direct pivoted comparisons. We will revise the abstract and Section 4 to qualify the claim and add any feasible supporting analysis or caveats in the revision.
  Revision: partial
Circularity Check
Empirical mining pipeline with external evaluation shows no circularity
The paper describes a data-mining pipeline that applies pre-existing LASER multilingual embeddings to Wikipedia, thresholds cosine similarity to extract bitexts across 1620 pairs, counts the resulting 135M sentences, and trains/evaluates NMT systems on the external TED corpus. No equations, derivations, fitted parameters, or predictions appear; the central claims are direct empirical outputs. Self-citation of LASER is not load-bearing for any reduction because the embeddings are treated as a fixed external tool and the new results (counts, BLEU scores) are independently verifiable on TED. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: multilingual sentence embeddings can identify parallel sentences via vector similarity across language pairs.
Forward citations
Cited by 1 Pith paper
- Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification
  The authors built and publicly released sentence-aligned simplification corpora for five languages by processing crowd-sourced data from comparable documents.