Facebook FAIR's WMT19 News Translation Task Submission

Alexei Baevski; Kyra Yee; Michael Auli; Myle Ott; Nathan Ng; Sergey Edunov

arxiv: 1907.06616 · v1 · pith:NTGL3YIBnew · submitted 2019-07-15 · 💻 cs.CL

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov

Pith reviewed 2026-05-24 21:19 UTC · model grok-4.3

classification 💻 cs.CL

keywords machine translation · news translation · WMT19 · transformer models · back-translation · ensembling · noisy channel reranking · data filtering

The pith

Large transformer models with bitext filtering, filtered back-translations, ensembling, domain fine-tuning, and noisy channel reranking rank first in all four WMT19 human evaluations and beat human translators on English to German.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes improvements to machine translation systems for the WMT19 news task in English-German and English-Russian. The authors begin with large BPE-based transformer models trained on sampled back-translations and then apply bitext data filtering schemes, add filtered back-translated data, ensemble the models, fine-tune on domain-specific data, and decode with noisy channel model reranking. These changes produce systems that rank first in human evaluations across all four directions, with the English-to-German system outperforming both competing systems and human translators while gaining 4.5 BLEU points over the prior year's submission. A sympathetic reader would care because the work shows how targeted data and decoding steps can lift real translation performance in a competitive setting.
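As an editorial illustration of what a bitext filtering step can look like, the sketch below drops sentence pairs that are empty, overly long, or badly mismatched in length. The function names and both thresholds are assumptions chosen for the example; this summary does not state which filters or cut-offs the authors used.

```python
from typing import Iterable, List, Tuple

# Illustrative thresholds only (assumptions, not values taken from this summary).
MAX_TOKENS = 250
MAX_LENGTH_RATIO = 1.5


def keep_pair(src: str, tgt: str,
              max_tokens: int = MAX_TOKENS,
              max_ratio: float = MAX_LENGTH_RATIO) -> bool:
    """Return True if a (source, target) sentence pair passes the heuristics."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    # Empty sides carry no translation signal.
    if src_len == 0 or tgt_len == 0:
        return False
    # Very long sentences are often misaligned or boilerplate.
    if src_len > max_tokens or tgt_len > max_tokens:
        return False
    # A large length mismatch suggests the pair is not a true translation.
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio


def filter_bitext(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass keep_pair with the default thresholds."""
    return [(s, t) for s, t in pairs if keep_pair(s, t)]
```

Whitespace tokenization keeps the sketch dependency-free; a real pipeline would more plausibly count subword tokens and add a language-identification check.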

Core claim

The submissions achieve first place in the human evaluation campaign for all four language directions by combining bitext data filtering schemes, filtered back-translated data, model ensembling, domain-specific fine-tuning, and noisy channel model reranking on top of large BPE-based transformer models trained with sampled back-translations. On the English to German direction, the system significantly outperforms other systems as well as human translations and improves 4.5 BLEU points upon the WMT'18 submission.

What carries the argument

The pipeline of bitext filtering, filtered back-translation augmentation, ensembling, domain fine-tuning, and noisy-channel reranking applied to large transformer models.
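Of the pipeline stages, ensembling is the easiest to show in a few lines. A minimal sketch, assuming each model exposes next-token log-probabilities as a dictionary over a shared vocabulary: the ensemble averages the probabilities and picks the most likely token. The function names and data layout are hypothetical, and probability averaging is one common combination rule rather than a confirmed detail of the authors' setup.

```python
import math
from typing import Dict, List


def ensemble_log_probs(per_model: List[Dict[str, float]]) -> Dict[str, float]:
    """Average next-token probabilities across models; return log of the mean."""
    if not per_model:
        raise ValueError("need at least one model distribution")
    n = len(per_model)
    combined = {}
    for token in per_model[0]:
        # Average in probability space, then move back to log space.
        mean_prob = sum(math.exp(lp[token]) for lp in per_model) / n
        combined[token] = math.log(mean_prob)
    return combined


def pick_next_token(per_model: List[Dict[str, float]]) -> str:
    """Greedy choice under the ensembled distribution."""
    combined = ensemble_log_probs(per_model)
    return max(combined, key=combined.get)
```

In a real decoder this combination would run at every step of beam search, over tensors rather than dictionaries.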

If this is right

Bitext filtering schemes can remove noise from parallel corpora and raise translation quality.
Adding filtered back-translated data expands usable training resources without introducing excessive noise.
Ensembling and domain fine-tuning refine model outputs beyond what single models achieve.
Noisy channel reranking improves final translations by rescoring candidate outputs.
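The rescoring in the last point can be made concrete. The sketch below assumes the usual noisy-channel formulation, in which each candidate translation y of a source x is scored by a weighted sum of a direct model log P(y|x), a channel model log P(x|y), and a language model log P(y). The class, the function names, and the default weights are hypothetical; in practice the weights would be tuned on held-out data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    direct: float   # log P(y | x) from the forward translation model
    channel: float  # log P(x | y) from the reverse (channel) model
    lm: float       # log P(y) from a target-side language model


def noisy_channel_score(c: Candidate, lam_channel: float, lam_lm: float) -> float:
    """Weighted sum of the three log-probabilities (weights are assumptions)."""
    return c.direct + lam_channel * c.channel + lam_lm * c.lm


def rerank(candidates: List[Candidate],
           lam_channel: float = 1.0,
           lam_lm: float = 1.0) -> List[Candidate]:
    """Sort candidates from best to worst under the combined score."""
    return sorted(
        candidates,
        key=lambda c: noisy_channel_score(c, lam_channel, lam_lm),
        reverse=True,
    )
```

Setting both weights to zero recovers the forward model's own ranking, which makes it easy to check what the extra models change for a given n-best list.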

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results suggest data curation and decoding refinements can matter as much as raw model scale for translation quality.
Similar filtering and reranking steps could be tested on other language pairs or non-news domains to check generality.
That the system outperforms human translations in one direction raises the question of whether the same pipeline would exceed human baselines on additional language pairs under matched conditions.

Load-bearing premise

The observed rankings and BLEU gains are caused by the listed changes rather than by unstated differences in training scale, random seeds, or evaluation artifacts.

What would settle it

A controlled replication that applies the same filtering, back-translation, ensembling, fine-tuning, and reranking steps and evaluates on the WMT19 test sets, but obtains lower human rankings or smaller BLEU gains, would falsify the claim that these steps produce the reported results.

Original abstract

This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes, as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations. This system improves upon our WMT'18 submission by 4.5 BLEU points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper describes Facebook FAIR's WMT19 shared-task submission for English<->German and English<->Russian news translation. Baselines are large BPE-based transformers trained in Fairseq with sampled back-translation; the authors add bitext filtering, filtered back-translated data, ensembling, domain fine-tuning, and noisy-channel reranking. They report first place in all four human-evaluation directions, with the En->De system significantly outperforming both other systems and human translators, and a 4.5 BLEU improvement over their own WMT'18 submission.

Significance. If the performance gains are attributable to the listed techniques, the work supplies a concrete, externally validated recipe for strong news-translation systems and demonstrates the effectiveness of combining data filtering, ensembling, and noisy-channel reranking at scale. The official shared-task human rankings and the stated 4.5 BLEU delta supply external grounding that is rare in system-description papers.

major comments (2)

[Abstract, §4] The central claim that the listed techniques (bitext filtering schemes, filtered back-translated data, ensembling, domain fine-tuning, noisy-channel reranking) produce the 4.5 BLEU gain and first-place human rankings is not supported by any ablation that holds total training compute, bitext volume, or model capacity fixed while toggling individual components. Without such controls it is impossible to isolate the contribution of the described methods from possible unstated increases in scale.
[Abstract] No error bars, standard deviations across random seeds, or multiple-run statistics are supplied for the BLEU scores or for the human-evaluation rankings, weakening the reliability of the reported deltas and the claim of statistically significant outperformance of human translators on En->De.
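One inexpensive way to attach uncertainty to a metric difference, short of retraining, is paired bootstrap resampling over test sentences. The sketch below is an editorial illustration under a simplifying assumption: it works on per-sentence scores and compares their means, whereas corpus-level BLEU is not an average of sentence scores and would need to be recomputed on each resample. It also captures only test-set sampling variance, not the seed-to-seed variance the comment asks about. The function name and defaults are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap(scores_a: List[float],
                     scores_b: List[float],
                     n_resamples: int = 1000,
                     seed: int = 0) -> Tuple[float, float, float]:
    """Return (ci_low, ci_high, win_rate) for mean(A) - mean(B).

    scores_a[i] and scores_b[i] must be scores for the same test sentence.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        deltas.append(mean_a - mean_b)
    deltas.sort()
    # Approximate 95% percentile interval on the difference.
    ci_low = deltas[int(0.025 * n_resamples)]
    ci_high = deltas[min(n_resamples - 1, int(0.975 * n_resamples))]
    # Fraction of resamples in which system A beats system B.
    win_rate = sum(d > 0 for d in deltas) / n_resamples
    return ci_low, ci_high, win_rate
```

An interval that excludes zero, together with a win rate near 1, would indicate that the difference is stable under resampling of the test set.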

minor comments (1)

The manuscript would be strengthened by explicit statements of total training FLOPs, bitext sizes before/after filtering, and model hyperparameters so that readers can assess whether the gains are reproducible at comparable scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify aspects of our WMT19 system-description paper. We respond to each major comment below.

Point-by-point responses

Referee: [Abstract, §4] The central claim that the listed techniques (bitext filtering schemes, filtered back-translated data, ensembling, domain fine-tuning, noisy-channel reranking) produce the 4.5 BLEU gain and first-place human rankings is not supported by any ablation that holds total training compute, bitext volume, or model capacity fixed while toggling individual components. Without such controls it is impossible to isolate the contribution of the described methods from possible unstated increases in scale.

Authors: We agree that the manuscript contains no controlled ablations that hold compute, data volume, and model capacity fixed while varying individual components. The 4.5 BLEU delta is reported relative to our own WMT'18 submission, which used a similar base transformer architecture and back-translation approach but omitted the additional filtering, ensembling, fine-tuning, and reranking steps. This comparison supplies partial evidence that the combination of techniques contributed to the improvement and to the human-evaluation ranking, but it does not fully isolate each factor from possible scale increases. We will revise the abstract and §4 to state more precisely that the gains are attributable to the overall system rather than to any single listed technique in isolation.

revision: yes

Referee: [Abstract] No error bars, standard deviations across random seeds, or multiple-run statistics are supplied for the BLEU scores or for the human-evaluation rankings, weakening the reliability of the reported deltas and the claim of statistically significant outperformance of human translators on En->De.

Authors: The BLEU scores reflect single training runs; retraining these large models multiple times was not feasible under shared-task deadlines and compute budgets. The human-evaluation rankings and significance statements are those supplied by the WMT organizers following their established evaluation protocol. We will add a brief note in the revised manuscript acknowledging the single-run nature of the automatic metrics while retaining the organizers' human-evaluation results.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external shared-task evaluation

Full rationale

The paper is a systems description of MT submissions to WMT19. Headline claims (first-place human rankings, 4.5 BLEU gain over WMT'18) are measured by the shared task's external evaluation protocol and human judgments, not by any internal fitted quantity or self-referential definition. No equations, ansatzes, uniqueness theorems, or predictions appear; the manuscript simply enumerates training choices and reports official scores. Self-citation to the authors' WMT'18 submission is used only for delta comparison and does not bear any uniqueness or derivation load. This is the normal case of an externally benchmarked empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a systems description that relies on standard neural MT assumptions and an external competition evaluation; it introduces no new free parameters, axioms, or invented entities beyond those already established in the transformer and back-translation literature.

axioms (1)

[standard math] Transformer architecture and training procedures function as described in prior literature.
The submission treats large BPE-based transformers as a black-box baseline without re-deriving them.

pith-pipeline@v0.9.0 · 5680 in / 1311 out tokens · 32426 ms · 2026-05-24T21:19:44.553511+00:00 · methodology
