Naver Labs Europe's Systems for the WMT19 Machine Translation Robustness Task

Alexandre Bérard; Claude Roux; Ioan Calapodescu

arxiv: 1907.06488 · v1 · pith:TXUBXYJ7new · submitted 2019-07-15 · 💻 cs.CL

Pith reviewed 2026-05-24 21:34 UTC · model grok-4.3

classification 💻 cs.CL

keywords machine translation · robustness · social media noise · domain adaptation · ensemble systems · pre-processing · WMT19 · BLEU evaluation

The pith

Ensemble models rank first on the WMT19 machine translation robustness task for noisy social media text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents single and ensemble machine translation systems built for the WMT19 robustness task, which tests performance on noisy parallel data drawn from social media sources in French-English and Japanese-English. The authors focus on pre-processing steps and adaptation methods intended to handle informal language, spelling mistakes, and orthographic variations. Their ensemble systems placed first across all language pairs when scored by BLEU on held-out test sets from the same source. A sympathetic reader would care because reliable translation of everyday online text could make machine translation more usable in real-world conditions where clean formal text is rare.

Core claim

Our ensemble models, built with targeted pre-processing choices and solutions for noise robustness plus domain adaptation, ranked first in all language pairs according to BLEU evaluation on the unseen test sets provided for the WMT19 Machine Translation Robustness Task.

What carries the argument

Ensemble of adapted translation models combined with pre-processing tuned to social-media noise patterns.

If this is right

Targeted pre-processing improves handling of spelling mistakes and orthographic variations in both translation directions.
Domain adaptation from the social-media parallel data allows the models to generalize to unseen test sets drawn from the same source.
Ensemble combination yields higher BLEU scores than the corresponding single systems across all four translation directions.
The approach applies equally to French-English and Japanese-English pairs.
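The ensemble idea in the list above can be illustrated with a minimal sketch. The abstract does not say how the submitted ensembles combine their member models, so the code below assumes the common approach of averaging each model's next-token probabilities at every decoding step. The function names `ensemble_step` and `greedy_pick` and the toy dictionary format are hypothetical and are not taken from the paper.

```python
def ensemble_step(distributions):
    """Average next-token probabilities from several models.

    `distributions` holds one dict per model, mapping token -> probability.
    This toy format is an assumption made for illustration only.
    """
    vocab = set().union(*distributions)
    n = len(distributions)
    # A token missing from one model's dict counts as probability 0.0.
    return {tok: sum(d.get(tok, 0.0) for d in distributions) / n for tok in vocab}


def greedy_pick(distributions):
    """Return the token with the highest averaged probability."""
    averaged = ensemble_step(distributions)
    # Iterate over sorted tokens so that ties break deterministically.
    return max(sorted(averaged), key=averaged.get)


model_a = {"hello": 0.6, "hi": 0.4}
model_b = {"hello": 0.3, "hi": 0.7}
print(greedy_pick([model_a]))           # choice of model A alone
print(greedy_pick([model_a, model_b]))  # choice after averaging both models
```

Averaging lets a confident model outvote a hesitant one, which is one plausible reason an ensemble can beat each of its members; whether that explains the reported rankings is not something the paper's abstract establishes.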

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-processing and adaptation pipeline could be tested on additional language pairs that also exhibit social-media-style noise.
If the noise distributions diverge, performance would drop unless the adaptation step is repeated on new data.
Human evaluation results, mentioned as part of the task, would be needed to check whether the BLEU gains correspond to better actual usability.

Load-bearing premise

The noise patterns in the provided social-media training data match those in the unseen test sets closely enough that the chosen adaptations generalize rather than overfit to training artifacts.

What would settle it

Running the same systems on a fresh collection of social-media translations whose noise types differ from the WMT19 training distribution and observing that the ensembles no longer achieve top BLEU scores.
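Since the proposed check hinges on comparing BLEU scores, a simplified sketch of the metric may help. It assumes whitespace tokenization, a single reference per sentence, n-grams up to length 4, and no smoothing; the official task scoring may differ in tokenization and other details, so numbers from this sketch are not comparable to the leaderboard. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))
print(corpus_bleu(["the cat sat on a mat"], reference))
```

Scoring each system's output on the fresh collection with the same function and comparing the results would show whether the ranking survives the change in noise distribution.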

Original abstract

This paper describes the systems that we submitted to the WMT19 Machine Translation robustness task. This task aims to improve MT's robustness to noise found on social media, like informal language, spelling mistakes and other orthographic variations. The organizers provide parallel data extracted from a social media website in two language pairs: French-English and Japanese-English (in both translation directions). The goal is to obtain the best scores on unseen test sets from the same source, according to automatic metrics (BLEU) and human evaluation. We proposed one single and one ensemble system for each translation direction. Our ensemble models ranked first in all language pairs, according to BLEU evaluation. We discuss the pre-processing choices that we made, and present our solutions for robustness to noise and domain adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Their ensembles ranked first on BLEU across all pairs in the WMT19 robustness task, but the paper is a standard shared-task report with little analysis of why the methods worked or whether they generalize.

Their ensembles ranked first on BLEU for every language pair in the WMT19 MT robustness task. The paper describes the single and ensemble systems they submitted for French-English and Japanese-English, built on the social-media parallel data the organizers supplied. They walk through the pre-processing steps they picked to handle spelling mistakes, informal language, and orthographic noise, plus the domain adaptation they applied. That part is straightforward and gives other teams a concrete picture of what the top entry did on the official test sets. The ranking itself is a new empirical result for this benchmark.

The main limitation is the thin analysis. The paper does not include ablations that would show which pieces of the pipeline drove the gains, nor any error analysis or direct comparison of noise statistics between the training data and the hidden test sets. The stress-test concern stands: without evidence that the noise patterns match, it is hard to know whether the performance reflects real robustness or just a good fit to the training distribution. This is common in shared-task system papers, but it keeps the work from offering broader insight.

The paper is mainly useful for people who follow the WMT robustness track or build MT systems for noisy social-media text. A reader looking for practical pre-processing ideas would get something out of it. It deserves peer review because documenting the winning systems on an official task benchmark adds to the record, even if the supporting analysis stays light. I would send it to referees.

Referee Report

1 major / 1 minor

Summary. The paper describes Naver Labs Europe's submissions to the WMT19 Machine Translation Robustness Task for French-English and Japanese-English (both directions). Parallel data from social media is used; the authors apply pre-processing, noise-robustness techniques (informal language, spelling, orthographic variation), and domain adaptation. They submit one single and one ensemble system per direction; the ensembles rank first on all pairs by official BLEU on hidden test sets drawn from the same source.

Significance. The top ranking on an official shared-task benchmark supplies concrete evidence that the chosen pre-processing and adaptation pipeline can be effective for social-media noise. Because the evaluation is external and automatic metrics are reported, the result is reproducible within the task setting. However, the manuscript is primarily a systems report and contains no ablation studies or error analysis, limiting the strength of any claim about which components drive robustness.

major comments (1)

[pre-processing choices and solutions for robustness to noise and domain adaptation] The headline result (first place on unseen test sets) rests on the assumption that noise statistics in the provided social-media parallel corpora are sufficiently close to those in the hidden test sets. The manuscript provides no distributional comparison of spelling mistakes, informal constructions, or orthographic variants between training and test data, nor any ablation that perturbs noise type while holding other factors fixed. This assumption is load-bearing for interpreting the ranking as evidence of general robustness rather than training-set-specific adaptation.
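A distributional comparison of the kind requested here could look roughly like the sketch below, given access to both corpora. The noise categories (`tag`, `elongated`, `shouting`, `plain`) are hypothetical surface indicators chosen for illustration and do not come from the manuscript; a real study would need language-specific features, particularly for Japanese, where whitespace tokenization does not apply. The Jensen-Shannon divergence is computed in base 2, so it ranges from 0 (identical distributions) to 1 (disjoint ones).

```python
import math
import re
from collections import Counter

# Same character three or more times in a row, e.g. "sooooo" (assumed noise marker).
ELONGATED = re.compile(r"(.)\1{2,}")


def noise_category(token):
    """Assign a token to one coarse, hypothetical noise category."""
    if token.startswith(("#", "@")):
        return "tag"
    if ELONGATED.search(token):
        return "elongated"
    if len(token) > 1 and token.isupper():
        return "shouting"
    return "plain"


def category_distribution(sentences):
    """Relative frequency of each noise category over whitespace tokens."""
    counts = Counter(noise_category(t) for s in sentences for t in s.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two categorical distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


train = ["this is sooooo good", "@user thanks a lot"]
test = ["WOW that is great", "#wmt19 was fun"]
print(js_divergence(category_distribution(train), category_distribution(test)))
```

A value near 0 would support the assumption that training and test noise are alike; a large value would suggest that the ranking reflects fit to the training distribution.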

minor comments (1)

[abstract] The abstract uses past tense ('We proposed') while describing the systems; present tense would be more conventional for a systems paper.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and for noting the top ranking of our systems. Below we respond to the major comment.

Referee: [pre-processing choices and solutions for robustness to noise and domain adaptation] The headline result (first place on unseen test sets) rests on the assumption that noise statistics in the provided social-media parallel corpora are sufficiently close to those in the hidden test sets. The manuscript provides no distributional comparison of spelling mistakes, informal constructions, or orthographic variants between training and test data, nor any ablation that perturbs noise type while holding other factors fixed. This assumption is load-bearing for interpreting the ranking as evidence of general robustness rather than training-set-specific adaptation.

Authors: We agree that the manuscript is a systems report without ablation studies or distributional comparisons, which limits the strength of claims about the sources of robustness. The test sets are hidden and were not released to participants, so direct comparison of noise statistics between training and test data is not feasible. Our focus was on describing the pre-processing, noise-robustness techniques, and domain adaptation that produced the submitted systems and their official BLEU rankings. In a revised version we will add an explicit discussion of these limitations and the assumptions involved.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical systems description on external shared-task benchmark

The paper reports submitted MT systems for the WMT19 robustness task, detailing pre-processing, noise-robustness methods, and domain adaptation choices, then states empirical rankings (ensembles first by BLEU) on organizer-provided unseen test sets. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear. The derivation chain consists solely of engineering decisions evaluated on an external benchmark; performance claims do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This paper is a description of submitted systems to a shared task and introduces no mathematical models, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5660 in / 919 out tokens · 20239 ms · 2026-05-24T21:34:24.057912+00:00 · methodology
