Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
Pith reviewed 2026-05-25 19:19 UTC · model grok-4.3
The pith
LASER multilingual embeddings score and filter noisy parallel sentences directly for low-resource machine translation without training an extra scorer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that multilingual sentence embeddings produced by an encoder-decoder trained on parallel corpora can be applied directly, without further training or adaptation of a scoring function, to rank a noisy low-resource corpus and retain its highest-quality sentence pairs. Translation models trained on the filtered corpus achieve the best reported BLEU scores on the Nepali-English and Sinhala-English 1M tasks.
What carries the argument
LASER sentence representations obtained from an encoder-decoder architecture trained on parallel text, used directly as a similarity or quality score for filtering.
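To make "used directly as a similarity score" concrete, here is a minimal sketch that scores each sentence pair by the cosine similarity of its two embeddings. The abstract does not say which similarity measure the authors used, so cosine is an assumption, and the small arrays below are toy stand-ins for encoder output rather than real LASER vectors. The function name `cosine_scores` is hypothetical.

```python
import numpy as np

def cosine_scores(src_emb, tgt_emb):
    """Row-wise cosine similarity between two aligned embedding matrices.

    Row i of src_emb and row i of tgt_emb are assumed to embed the two
    sides of the same candidate sentence pair.
    """
    src = np.asarray(src_emb, dtype=np.float64)
    tgt = np.asarray(tgt_emb, dtype=np.float64)
    if src.shape != tgt.shape:
        raise ValueError("source and target embeddings must align row by row")
    dots = np.sum(src * tgt, axis=1)
    norms = np.linalg.norm(src, axis=1) * np.linalg.norm(tgt, axis=1)
    # Zero vectors (for example, empty sentences) get a score of 0.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Toy embeddings (assumption: a real run would use encoder output instead).
src = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
tgt = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 0.0]])
scores = cosine_scores(src, tgt)
```

Higher scores indicate pairs whose two sides lie closer together in the shared embedding space; no scorer is trained at any point.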
If this is right
- Direct LASER scoring outperforms several alternative filtering methods on the two target language pairs.
- An ensemble that combines LASER scores with other methods produces further BLEU gains.
- The technique requires no additional parallel data or labeled quality judgments for the target low-resource pair.
- The same procedure shows promise for even lower-resource or zero-resource filtering settings.
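The ensemble mentioned above can be illustrated with a simple score-combination sketch. The abstract does not describe how the authors combined their scoring methods, so the z-score normalization and weighted average below are assumptions chosen for illustration, and the names `zscore` and `ensemble_scores` are hypothetical.

```python
import numpy as np

def zscore(x):
    """Standardize one scorer's output so different scales are comparable."""
    x = np.asarray(x, dtype=np.float64)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def ensemble_scores(score_lists, weights=None):
    """Weighted average of standardized scores, one score list per method."""
    stacked = np.vstack([zscore(s) for s in score_lists])
    if weights is None:
        weights = np.ones(len(score_lists))
    weights = np.asarray(weights, dtype=np.float64)
    return weights @ stacked / weights.sum()

# Toy scores for three sentence pairs from two hypothetical methods.
embedding_sim = [0.9, 0.2, 0.6]    # e.g. an embedding similarity
other_score = [-1.0, -5.0, -2.0]   # e.g. a log-probability style score
combined = ensemble_scores([embedding_sim, other_score])
ranking = np.argsort(-combined)    # best pair first
```

Standardizing first keeps a method with a wide numeric range from dominating the average; rank averaging would be another reasonable choice.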
Where Pith is reading between the lines
- If the transfer holds, the same embeddings could be tested on additional low-resource pairs without new training runs.
- The method might reduce the need for language-specific quality classifiers when new crawls become available.
- One could measure how performance changes when the original LASER training data is replaced by data closer to the target languages.
Load-bearing premise
Embeddings learned from other language pairs transfer to scoring Nepali-English and Sinhala-English data without any retraining or language-specific adjustment.
What would settle it
Apply the LASER scoring procedure to the WMT19 Nepali-English and Sinhala-English crawls and train translation systems on the resulting 1M subsets. The claim would fail if the BLEU scores came out no higher than those of the second-best submitted systems.
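The selection step in such a test can be sketched as a budgeted cut over score-ranked pairs. This is an illustration under stated assumptions: the budget is counted in whitespace tokens on the English side, which the text here does not specify, and `select_within_budget` is a hypothetical helper. Counting pairs instead of tokens would only change the cost line.

```python
def select_within_budget(pairs, scores, budget):
    """Keep the highest-scoring pairs until the token budget would be exceeded.

    pairs  : list of (source_sentence, english_sentence) tuples
    scores : one quality score per pair, higher is better
    budget : maximum number of English-side whitespace tokens (assumption)
    """
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in order:
        n_tokens = len(pairs[i][1].split())
        if used + n_tokens > budget:
            break  # stop at the first pair that does not fit
        kept.append(i)
        used += n_tokens
    return kept, used

# Toy data: placeholder source strings and short English sides.
pairs = [
    ("a", "one two three"),
    ("b", "four five"),
    ("c", "six seven eight nine"),
    ("d", "ten"),
]
scores = [0.1, 0.9, 0.8, 0.5]
kept, used = select_within_budget(pairs, scores, budget=6)
```

With this toy input the two best-scored pairs fill the budget exactly, so the remaining pairs are dropped.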
Original abstract
In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the authors' submission to the WMT19 low-resource parallel corpus filtering shared task. The core method applies pre-trained LASER multilingual sentence embeddings directly to score and filter noisy sentence pairs for Nepali-English and Sinhala-English without training any additional scoring function. The paper contrasts this with other approaches, forms an ensemble that yields further gains, and reports that the submission achieved the best overall performance on both 1M tasks by margins of 1.3 and 1.4 BLEU over the second-best systems. It also suggests the technique is promising for low- and no-resource scenarios.
Significance. If the reported margins hold under scrutiny and the method generalizes, the work shows that unmodified multilingual embeddings can serve as an effective, training-free component for corpus filtering in low-resource MT, providing a simple baseline that may reduce reliance on language-specific parallel data for the filtering stage itself.
major comments (1)
- [Abstract] Abstract and results: the headline claim of 1.3–1.4 BLEU margins is attributed to the LASER-based submission, yet no ablation is described that isolates the LASER-only scoring performance on the identical 1M noisy corpus against either the full ensemble or against simple baselines (e.g., length or language-model filters). Without this isolation the transfer assumption for unmodified LASER embeddings remains untested and the attribution of the margin to the core method is not fully supported.
minor comments (1)
- [Abstract] The abstract states that LASER embeddings are used 'directly to score and filter' but does not specify the precise similarity metric, threshold, or ranking procedure employed; adding one sentence on this point would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the opportunity to clarify our submission. We address the single major comment below.
Point-by-point responses
Referee: [Abstract] Abstract and results: the headline claim of 1.3–1.4 BLEU margins is attributed to the LASER-based submission, yet no ablation is described that isolates the LASER-only scoring performance on the identical 1M noisy corpus against either the full ensemble or against simple baselines (e.g., length or language-model filters). Without this isolation the transfer assumption for unmodified LASER embeddings remains untested and the attribution of the margin to the core method is not fully supported.
Authors: We agree that the abstract could be more precise. The manuscript presents LASER embeddings used directly (without training an additional scorer) as the core method, shows that this approach yields strong results when contrasted with other techniques, and states that an ensemble of scoring methods produces further gains. The 1.3 and 1.4 BLEU margins refer to the performance of our final shared-task submission, which is the ensemble. We did not include an explicit ablation that isolates LASER-only filtering on the exact 1M noisy corpus against the ensemble or against simple baselines such as length or language-model filters. We will revise the abstract to attribute the headline margins clearly to the ensemble submission while noting the LASER component as the primary, training-free method. This revision will better delimit the claims about unmodified multilingual embeddings.
Revision: partial
Circularity Check
No circularity: empirical results rest on external shared-task evaluation of pre-trained embeddings
The paper applies the published LASER encoder (trained on separate parallel data) to score and filter noisy corpora for Nepali-English and Sinhala-English, then reports BLEU on the WMT19 1M shared-task test sets. No equations, fitted parameters, or self-referential definitions are present; the central performance numbers are obtained from an external evaluation protocol on public data. Self-citation of LASER is non-load-bearing because the embeddings are used off-the-shelf and the headline gains are measured against independent baselines and other submissions. The derivation chain therefore contains no reduction of outputs to inputs by construction.