Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

Francisco Guzmán; Holger Schwenk; Philipp Koehn; Vishrav Chaudhary; Yuqing Tang

arxiv: 1906.08885 · v1 · pith:4GENHVMKnew · submitted 2019-06-20 · 💻 cs.CL

Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, Philipp Koehn

Pith reviewed 2026-05-25 19:19 UTC · model grok-4.3

classification 💻 cs.CL

keywords low-resource machine translation · parallel corpus filtering · multilingual sentence embeddings · LASER · WMT19 shared task · Nepali-English · Sinhala-English · noisy data filtering

The pith

LASER multilingual embeddings score and filter noisy parallel sentences directly for low-resource machine translation without training an extra scorer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sentence representations from the LASER encoder-decoder, already trained on other parallel data, can be used off-the-shelf to rank candidate sentence pairs by quality. This ranking then filters a noisy crawl down to the 1M subsets evaluated in the shared task for Nepali-English and Sinhala-English. The filtered data produces the highest-scoring translation systems in the WMT19 shared task for both language pairs. Combining LASER scores with other scoring techniques in an ensemble yields additional gains. The authors note that the approach requires no separately trained scoring function and looks promising for low-resource and even no-resource scenarios.

Core claim

The central claim is that multilingual sentence embeddings produced by an encoder-decoder trained on parallel corpora can be applied directly, without any further training or adaptation of a scoring function, to rank and retain the highest-quality sentence pairs from a noisy low-resource corpus; when this filtered corpus is used for training, the resulting translation models achieve the best reported BLEU scores on the Nepali-English and Sinhala-English 1M tasks.

What carries the argument

LASER sentence representations obtained from an encoder-decoder architecture trained on parallel text, used directly as a similarity or quality score for filtering.
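To make the idea of scoring "directly" from embeddings concrete, here is a minimal sketch that scores each candidate pair by the cosine similarity of its two sentence vectors. The abstract does not state which similarity metric the authors use, so cosine similarity is an assumption here, and the toy three-dimensional vectors merely stand in for real sentence embeddings. The names `cosine` and `score_pairs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_pairs(src_embs, tgt_embs):
    """Score each (source, target) pair by the similarity of its embeddings."""
    return [cosine(s, t) for s, t in zip(src_embs, tgt_embs)]

# Toy vectors standing in for real multilingual sentence embeddings (assumption).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tgt = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
scores = score_pairs(src, tgt)  # first pair is aligned, second is orthogonal
```

Nothing is trained in this step: the only learned component is the encoder that produced the vectors, which is what makes the approach training-free for the target language pair.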

If this is right

Direct LASER scoring outperforms several alternative filtering methods on the two target language pairs.
An ensemble that combines LASER scores with other methods produces further BLEU gains.
The technique requires no additional parallel data or labeled quality judgments for the target low-resource pair.
The same procedure shows promise for even lower-resource or zero-resource filtering settings.
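One way to picture the ensemble mentioned above is a weighted average of per-method scores after rescaling each method to a common range. The abstract does not describe how the authors combine their scorers, so the min-max normalization, the weights, and the function names below are all illustrative assumptions rather than the paper's procedure.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble(score_lists, weights):
    """Weighted average of min-max-normalized scores, one list per scoring method."""
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per scoring method")
    normed = [min_max(scores) for scores in score_lists]
    total = sum(weights)
    n = len(normed[0])
    return [
        sum(w * col[i] for w, col in zip(weights, normed)) / total
        for i in range(n)
    ]

# Hypothetical scores for three sentence pairs from two different methods.
embedding_scores = [0.9, 0.2, 0.6]
other_scores = [10.0, 30.0, 20.0]
combined = ensemble([embedding_scores, other_scores], [0.5, 0.5])
```

Normalizing first keeps a method with a large numeric range from dominating the average; rank-based combination would be another reasonable choice.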

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the transfer holds, the same embeddings could be tested on additional low-resource pairs without new training runs.
The method might reduce the need for language-specific quality classifiers when new crawls become available.
One could measure how performance changes when the original LASER training data is replaced by data closer to the target languages.

Load-bearing premise

Embeddings learned from other language pairs transfer to scoring Nepali-English and Sinhala-English data without any retraining or language-specific adjustment.

What would settle it

The claim would be undermined if applying the LASER scoring procedure to the WMT19 Nepali-English and Sinhala-English crawls, and training translation systems on the resulting 1M subsets, produced BLEU scores no higher than those of the second-best submitted systems.
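The selection step in such a test can be sketched as follows: sort pairs by score and keep the best ones until a size budget is used up. The budget unit (whitespace tokens on the English side), the function name `select_top`, and the toy data are assumptions made for illustration; the text above does not specify how the subset size is counted.

```python
def select_top(pairs, scores, budget_tokens):
    """Keep the highest-scored pairs until the English-side token budget is reached.

    pairs: list of (source_sentence, english_sentence) tuples.
    scores: one quality score per pair, higher is better.
    budget_tokens: stop adding pairs once this many English tokens are kept.
    """
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in order:
        if used >= budget_tokens:
            break
        kept.append(pairs[i])
        used += len(pairs[i][1].split())  # whitespace tokenization (assumption)
    return kept

# Toy corpus: placeholder source strings paired with English sentences.
pairs = [("a", "one two"), ("b", "three"), ("c", "four five six")]
scores = [0.1, 0.9, 0.5]
kept = select_top(pairs, scores, budget_tokens=3)
```

The filtered subset would then be used to train a translation system whose BLEU score is compared against the other submissions.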

Original abstract

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LASER embeddings win the WMT19 low-resource filtering task for Nepali and Sinhala but the paper does not isolate how much the direct embedding scores contribute versus the ensemble.

The main thing to know is that this paper takes the existing LASER multilingual sentence embeddings and uses them directly, with no extra training, to score and filter noisy parallel data. Their submission placed first in the WMT19 shared task for both the Nepali-English and Sinhala-English 1M tracks, beating the second-best systems by 1.3 and 1.4 BLEU respectively. They also note that an ensemble of methods adds further gains and suggest the approach could help in no-resource cases too.

Referee Report

1 major / 1 minor

Summary. The manuscript describes the authors' submission to the WMT19 low-resource parallel corpus filtering shared task. The core method applies pre-trained LASER multilingual sentence embeddings directly to score and filter noisy sentence pairs for Nepali-English and Sinhala-English without training any additional scoring function. The paper contrasts this with other approaches, forms an ensemble that yields further gains, and reports that the submission achieved the best overall performance on both 1M tasks by margins of 1.3 and 1.4 BLEU over the second-best systems. It also suggests the technique is promising for low- and no-resource scenarios.

Significance. If the reported margins hold under scrutiny and the method generalizes, the work shows that unmodified multilingual embeddings can serve as an effective, training-free component for corpus filtering in low-resource MT, providing a simple baseline that may reduce reliance on language-specific parallel data for the filtering stage itself.

major comments (1)

[Abstract] Abstract and results: the headline claim of 1.3–1.4 BLEU margins is attributed to the LASER-based submission, yet no ablation is described that isolates the LASER-only scoring performance on the identical 1M noisy corpus against either the full ensemble or against simple baselines (e.g., length or language-model filters). Without this isolation the transfer assumption for unmodified LASER embeddings remains untested and the attribution of the margin to the core method is not fully supported.

minor comments (1)

[Abstract] The abstract states that LASER embeddings are used 'directly to score and filter' but does not specify the precise similarity metric, threshold, or ranking procedure employed; adding one sentence on this point would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the opportunity to clarify our submission. We address the single major comment below.

Referee: [Abstract] Abstract and results: the headline claim of 1.3–1.4 BLEU margins is attributed to the LASER-based submission, yet no ablation is described that isolates the LASER-only scoring performance on the identical 1M noisy corpus against either the full ensemble or against simple baselines (e.g., length or language-model filters). Without this isolation the transfer assumption for unmodified LASER embeddings remains untested and the attribution of the margin to the core method is not fully supported.

Authors: We agree that the abstract could be more precise. The manuscript presents LASER embeddings used directly (without training an additional scorer) as the core method, shows that this approach yields strong results when contrasted with other techniques, and states that an ensemble of scoring methods produces further gains. The 1.3 and 1.4 BLEU margins refer to the performance of our final shared-task submission, which is the ensemble. We did not include an explicit ablation that isolates LASER-only filtering on the exact 1M noisy corpus against the ensemble or against simple baselines such as length or language-model filters. We will revise the abstract to attribute the headline margins clearly to the ensemble submission while noting the LASER component as the primary, training-free method. This revision will better delimit the claims about unmodified multilingual embeddings.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results rest on external shared-task evaluation of pre-trained embeddings

The paper applies the published LASER encoder (trained on separate parallel data) to score and filter noisy corpora for Nepali-English and Sinhala-English, then reports BLEU on the WMT19 1M shared-task test sets. No equations, fitted parameters, or self-referential definitions are present; the central performance numbers are obtained from an external evaluation protocol on public data. Self-citation of LASER is non-load-bearing because the embeddings are used off-the-shelf and the headline gains are measured against independent baselines and other submissions. The derivation chain therefore contains no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are described in the abstract; the work is an empirical application of an existing public toolkit.

pith-pipeline@v0.9.0 · 5689 in / 1055 out tokens · 26747 ms · 2026-05-25T19:19:26.994436+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 2 internal anchors

[3] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In International Conference on Learning Representations (ICLR). https://arxiv.org/pdf/1710.11041.pdf

[4] Mikel Artetxe and Holger Schwenk. 2018a. Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings. arXiv preprint arXiv:1811.01136. https://arxiv.org/abs/1811.01136

[5] Mikel Artetxe and Holger Schwenk. 2018b. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. arXiv preprint arXiv:1812.10464. https://arxiv.org/pdf/1812.10464.pdf

[6] Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. arXiv preprint arXiv:1902.01382. https://arxiv.org/abs/1902.01382

[7] Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria. https://kheafield.com/papers/edinburgh/estimate_paper.pdf

[8] Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 901–908, Brussels, Belgium. Association for Computational Linguistics. https://www.statmt.org/wmt18/pdf/WMT106.pdf

[9] Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne, Australia. Association for Computational Linguistics. http://www.aclweb.org/anthology/W18-2709

[10] Huda Khayrallah, Hainan Xu, and Philipp Koehn. 2018. The JHU parallel corpus filtering systems for WMT 2018. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 909–912, Brussels, Belgium. Association for Computational Linguistics. http://www.aclweb.org/anthology/W18-6480

[11] Philipp Koehn, Francisco Guzmán, Vishrav Chaudhary, and Juan M. Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation, Volume 2: Shared Task Papers, Florence, Italy. Association for Computational Linguistics

[12] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Ondrej Bojar, Chris Dyer, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of t... https://www.aclweb.org/anthology/P07-2045

[13] Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 726–739, Brussels, Belgium. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-6453

[14] Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics. https://aclweb.org/anthology/D18-1549

[15] Fantine Mordelet and J-P Vert. 2014. A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters, 37:201–209

[16] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations

[17] Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics. arXiv:1804.08771. https://www.aclweb.org/anthology/W18-6319

[18] Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz Rojas, and Gema Ramírez. 2018. Prompsit's submission to WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 955–962, Brussels, Belgium. Association for Com... https://www.statmt.org/wmt18/pdf/WMT116.pdf

[19] Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 228–234, Melbourne, Australia. Association for Computational Linguistics. https://aclweb.org/anthology/P18-2037

[20] Hainan Xu and Philipp Koehn. 2017. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950, Copenhagen, Denmark. Association for Computational Linguistics. https://www.aclweb.org/anthology/D17-1319
