WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Francisco Guzmán; Holger Schwenk; Hongyu Gong; Shuo Sun; Vishrav Chaudhary

arxiv: 1907.05791 · v2 · pith:U77OXXT7new · submitted 2019-07-10 · 💻 cs.CL


Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, Francisco Guzmán

Pith reviewed 2026-05-24 23:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords bitext mining · parallel sentences · Wikipedia · machine translation · multilingual embeddings · low-resource languages · distant language pairs


The pith

Sentence embeddings extract 135 million parallel sentences from Wikipedia for 1620 language pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors use multilingual sentence embeddings to scan Wikipedia articles in 85 languages and pull out sentences that are translations of each other. They apply this to every possible pair of languages rather than stopping at alignments with English. The result is a dataset of 135 million parallel sentences covering 1620 language pairs. Translation models trained only on this data reach strong BLEU scores on the TED corpus for many pairs, and the data looks especially useful for pairs of distant languages that would otherwise require pivoting through English.

Core claim

An approach based on multilingual sentence embeddings extracts 135M parallel sentences for 1620 language pairs from Wikipedia in 85 languages, with only 34M involving English; neural MT baselines trained solely on these bitexts achieve strong BLEU scores on the TED corpus for many pairs, including distant languages.
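
The figures in this claim can be put in proportion with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the coverage figure assumes each language pair is counted once regardless of direction, which the text does not state.

```python
import math

# Figures quoted in the claim above (sentence counts in millions).
total_millions = 135
english_millions = 34
languages = 85
mined_pairs = 1620

# Sentences that do not involve English, and their share of the corpus.
non_english_millions = total_millions - english_millions
non_english_share = non_english_millions / total_millions

# Upper bound on unordered language pairs among 85 languages (assumption:
# a pair is counted once, not once per translation direction).
possible_pairs = math.comb(languages, 2)
coverage = mined_pairs / possible_pairs

print(f"non-English: {non_english_millions}M ({non_english_share:.1%})")
print(f"pairs covered: {mined_pairs} of {possible_pairs} ({coverage:.1%})")
```

Under that counting assumption, roughly three quarters of the mined sentences are non-English alignments, and the 1620 pairs cover a bit under half of all pairs the 85 languages could form.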

What carries the argument

Multilingual sentence embeddings that identify parallel sentences by computing similarity scores across different languages.
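
A minimal sketch of similarity-based mining, assuming sentences have already been mapped into a shared vector space. The function names, the toy three-dimensional vectors, and the threshold of 0.9 are all hypothetical; this illustrates the general idea of scoring and thresholding, not the authors' actual pipeline or settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mine_pairs(src_vecs, tgt_vecs, threshold):
    """For each source sentence, keep its best-scoring target sentence
    if that score clears the threshold."""
    if not tgt_vecs:
        return []
    mined = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            mined.append((i, j, scores[j]))
    return mined

# Toy "embeddings" (made up); real sentence embeddings have far more dimensions.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

# Hypothetical threshold; the value used in the paper is not given here.
pairs = mine_pairs(src, tgt, threshold=0.9)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (score {score:.3f})")
```

The third source vector has no close counterpart, so it is dropped. A single global threshold like this is exactly the design choice the referee report later questions, since a score that indicates a translation in one language pair may not mean the same in another.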

If this is right

Neural MT systems trained only on the mined data achieve strong BLEU scores for many language pairs on the TED corpus.
The extracted bitexts are useful for training systems between distant languages without pivoting through English.
A large public corpus of parallel sentences becomes available for 1620 language pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could scale to additional languages or other multilingual text sources to further increase parallel data.
Performance gains might be largest for language pairs with little existing parallel data.
Combining the mined data with other resources could improve overall translation quality for low-resource settings.

Load-bearing premise

The multilingual sentence embeddings reliably identify true parallel sentences with acceptable precision for all 1620 language pairs including low-resource and distant ones.

What would settle it

Human inspection of extracted sentence pairs showing many are not true translations, or MT models trained on the data underperforming on independent test sets for most pairs.
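
The human-inspection check could be quantified roughly as follows: annotate a random sample of mined pairs for one language pair and report the share judged to be true translations, with an interval that reflects the sample size. This is a sketch using a Wilson score interval; the function name is hypothetical and the counts in the usage lines are made up for illustration.

```python
import math

def wilson_interval(correct, sample_size, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = correct / sample_size
    denom = 1.0 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    )
    return center - half, center + half

# Made-up example: annotators judge 90 of 100 sampled pairs to be translations.
low, high = wilson_interval(correct=90, sample_size=100)
print(f"estimated precision: 0.90, interval [{low:.3f}, {high:.3f}]")
```

With only 100 annotated pairs the interval is several points wide on each side, so separating a high-precision pair from a mediocre one needs either larger samples or a large true gap.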

Original abstract

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available at https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix. To get an indication on the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 languages pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WikiMatrix delivers a large public bitext resource focused on non-English pairs, but its quality claims rest on indirect TED BLEU scores rather than direct checks.


The paper extracts 135 million parallel sentences from Wikipedia across 1620 language pairs in 85 languages using LASER embeddings, with only 34 million involving English. The authors release the data and train MT systems on subsets that reach decent BLEU on TED for many pairs, including some distant ones without English pivots. That non-English-centric scale is the main new piece compared to earlier mining work.

The release itself is straightforward and useful for anyone needing bitext outside high-resource English alignments. The MT baselines give a practical signal that the mined data can support training. The extraction pipeline is described at a high level in the abstract, with a cosine threshold applied uniformly in the shared embedding space.

The soft spot is the validation. Quality is judged only through downstream TED performance, with no reported precision numbers, human judgments, or threshold ablations broken out by resource level or language distance. Noisy pairs could still produce usable BLEU while falling short of strict parallelism, especially for low-resource or typologically distant combinations. That leaves the central assumption about reliable identification across all 1620 pairs untested in the provided details.

Readers working on low-resource MT will get immediate value from the released corpus for experiments. The work is solid enough on the resource side to warrant referee time, even if reviewers will likely press for more direct quality evidence. I would send it for peer review.

Referee Report

3 major / 3 minor

Summary. The paper presents a pipeline that applies LASER multilingual sentence embeddings to mine parallel sentences from Wikipedia articles across 85 languages. It extracts bitexts for all language pairs (not limited to English alignments), yielding 135M parallel sentences in 1620 pairs. Quality is assessed by training NMT baselines on the mined data for 1886 language pairs and reporting BLEU scores on the TED test sets, with the claim that the resource is especially useful for distant-language MT without English pivoting. The corpus is released publicly.

Significance. If the precision of the mined bitexts holds across the claimed language pairs, the work supplies a large-scale, publicly released resource that substantially expands available training data for multilingual MT, particularly for non-English and low-resource pairs. The all-pairs extraction approach and the TED evaluation results provide concrete evidence of practical utility at this scale, and the open release of the data supports further research.

major comments (3)

[Section 3] The extraction declares parallel sentences via a cosine-similarity threshold applied uniformly in the shared LASER space for all 1620 pairs. No language-pair-specific precision estimates, human judgments, or threshold ablations stratified by resource level or typological distance are reported; this directly underpins both the 135M count and the quality claims for non-English pairs.
[Section 4] MT systems are trained exclusively on the mined WikiMatrix data and evaluated on TED. The evaluation does not include controls (e.g., comparison against pivoted baselines or noise-injection tests) that would confirm the mined pairs satisfy strict parallelism rather than semantic relatedness, which is load-bearing for the 'strong BLEU without English pivot' claim for distant pairs.
[Abstract and Section 4] The statement that the bitexts are 'particularly interesting to train MT systems between distant languages' is not supported by any breakdown of BLEU scores by language distance, resource level, or explicit comparison to English-pivoted systems; the TED results alone do not isolate this advantage.
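
The stratified threshold ablation requested in the first comment could take roughly the following shape: given human-labeled candidate pairs tagged with a stratum, report precision and the number of pairs kept at each threshold, per stratum. This is a sketch only; the function name, the stratum labels, the scores, and the judgments are all made up for illustration.

```python
from collections import defaultdict

def precision_by_stratum(scored_pairs, thresholds):
    """scored_pairs: iterable of (stratum, score, is_parallel) tuples,
    where is_parallel is a human judgment.
    Returns {stratum: {threshold: (precision or None, pairs kept)}}."""
    by_stratum = defaultdict(list)
    for stratum, score, is_parallel in scored_pairs:
        by_stratum[stratum].append((score, is_parallel))

    report = {}
    for stratum, items in by_stratum.items():
        report[stratum] = {}
        for t in thresholds:
            kept = [label for score, label in items if score >= t]
            precision = sum(kept) / len(kept) if kept else None
            report[stratum][t] = (precision, len(kept))
    return report

# Made-up labeled sample: (stratum, similarity score, human judgment).
sample = [
    ("high-resource", 0.95, True),
    ("high-resource", 0.85, True),
    ("high-resource", 0.75, False),
    ("low-resource", 0.95, True),
    ("low-resource", 0.85, False),
    ("low-resource", 0.75, False),
]

report = precision_by_stratum(sample, thresholds=[0.7, 0.8, 0.9])
for stratum, rows in report.items():
    for t, (precision, kept) in rows.items():
        print(f"{stratum:14s} threshold {t}: precision {precision}, kept {kept}")
```

In this toy sample the same threshold of 0.8 yields perfect precision in one stratum and 0.5 in the other, which is the kind of divergence a uniform threshold would hide and a stratified report would expose.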

minor comments (3)

[Abstract] Abstract contains the typo 'limit the the extraction'.
[Abstract] The number '1886 language pairs' for MT training appears inconsistent with the 1620 extracted pairs; clarify the exact count and selection criterion.
[Section 3] The manuscript provides limited detail on the precise LASER model variant, embedding dimensionality, or exact similarity threshold value used in the mining step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, providing clarifications on our design choices while agreeing to revisions where they strengthen the presentation of results.


Referee: [Section 3] The extraction declares parallel sentences via a cosine-similarity threshold applied uniformly in the shared LASER space for all 1620 pairs. No language-pair-specific precision estimates, human judgments, or threshold ablations stratified by resource level or typological distance are reported; this directly underpins both the 135M count and the quality claims for non-English pairs.

Authors: A single threshold was chosen to enable consistent, scalable extraction across 1620 pairs in the shared embedding space without prohibitive per-pair tuning. Direct precision estimates or human judgments for every pair were not feasible at this scale; instead, quality is assessed via downstream NMT performance on TED as a practical proxy. We will add an explicit discussion of the uniform threshold choice and its limitations in the revised manuscript.
revision: partial

Referee: [Section 4] MT systems are trained exclusively on the mined WikiMatrix data and evaluated on TED. The evaluation does not include controls (e.g., comparison against pivoted baselines or noise-injection tests) that would confirm the mined pairs satisfy strict parallelism rather than semantic relatedness, which is load-bearing for the 'strong BLEU without English pivot' claim for distant pairs.

Authors: The experimental design evaluates the mined data in isolation to demonstrate its utility as a standalone resource, especially for non-English pairs. The resulting BLEU scores on TED provide evidence that the extracted sentences contain sufficient parallel signal. While pivoted baselines or noise-injection tests would offer additional validation, they fall outside the scope of showing direct mining value; we will expand the discussion section to address potential semantic relatedness concerns.
revision: partial

Referee: [Abstract and Section 4] The statement that the bitexts are 'particularly interesting to train MT systems between distant languages' is not supported by any breakdown of BLEU scores by language distance, resource level, or explicit comparison to English-pivoted systems; the TED results alone do not isolate this advantage.

Authors: The reported results include strong performance on numerous non-English pairs from TED, supporting utility beyond English-centric data. However, we acknowledge that the manuscript lacks an explicit stratification by distance or resource level and direct pivoted comparisons. We will revise the abstract and Section 4 to qualify the claim and add any feasible supporting analysis or caveats in the revision.
revision: partial

Circularity Check

0 steps flagged

Empirical mining pipeline with external evaluation shows no circularity


The paper describes a data-mining pipeline that applies pre-existing LASER multilingual embeddings to Wikipedia, thresholds cosine similarity to extract bitexts across 1620 pairs, counts the resulting 135M sentences, and trains/evaluates NMT systems on the external TED corpus. No equations, derivations, fitted parameters, or predictions appear; the central claims are direct empirical outputs. Self-citation of LASER is not load-bearing for any reduction because the embeddings are treated as a fixed external tool and the new results (counts, BLEU scores) are independently verifiable on TED. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The extraction rests on the domain assumption that pre-trained multilingual embeddings capture cross-lingual parallelism sufficiently for mining; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Multilingual sentence embeddings can identify parallel sentences via vector similarity across language pairs.
Core mechanism of the extraction process stated in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification
cs.CL · 2026-05 · unverdicted · novelty 4.0

The authors built and publicly released sentence-aligned simplification corpora for five languages by processing crowd-sourced data from comparable documents.