The University of Sydney's Machine Translation System for WMT19

Dacheng Tao; Liang Ding

arxiv: 1907.00494 · v1 · pith:BPVYDBGNnew · submitted 2019-06-30 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-25 12:11 UTC · model grok-4.3

keywords: machine translation · WMT 2019 · Transformer · back translation · data augmentation · Finnish-English · BLEU · synthetic data

The pith

A Transformer system with added back-translation, ensembles, and two new data methods reaches 33.0 BLEU and wins the WMT 2019 Finnish-to-English task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the University of Sydney submission to the WMT 2019 news translation shared task in the Finnish-to-English direction. The system begins with self-attentional Transformer networks and layers on established methods including BPE, back translation, data selection, augmentation, model ensembles, reranking, system combination, and post-processing. It adds two new components: a Cycle Translation augmentation technique and a Big/Small parallel construction strategy for exploiting synthetic data more fully. Experiments show that these additions produce steady BLEU gains, with the final system scoring 33.0 and beating the baseline ensemble by 5.3 points. A reader would care because the result demonstrates concrete ways to improve translation quality on a standard benchmark without relying solely on larger data volumes.
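Of the established methods listed above, back translation is the one that produces the synthetic corpus. The following is a minimal sketch of the general technique, not the authors' pipeline: `back_translate` and the `reverse_model` callable are hypothetical names, and the toy model in the usage lines merely stands in for a trained English-to-Finnish system.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_mono: Iterable[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Pair each monolingual target sentence with a machine-made source.

    `reverse_model` is an assumed target-to-source translator
    (English-to-Finnish for a Finnish-to-English system).
    """
    pairs: List[Pair] = []
    for tgt in target_mono:
        synthetic_src = reverse_model(tgt)
        # Skip empty outputs so they never reach the training set.
        if synthetic_src.strip():
            pairs.append((synthetic_src, tgt))
    return pairs


# Toy stand-in for a real reverse model (assumption, illustration only).
toy_reverse = lambda sentence: "<fi> " + sentence
synthetic_pairs = back_translate(["The weather is nice."], toy_reverse)
print(synthetic_pairs)
```

The target side of every synthetic pair is genuine text; only the source side is machine-generated, which is why the technique is usually applied to monolingual data in the output language.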

Core claim

The authors establish that their Transformer-based pipeline, after integrating recent effective strategies and introducing Cycle Translation together with Big/Small parallel construction, delivers a BLEU score of 33.0 on the WMT 2019 test set. This score is the highest among all participants and exceeds the baseline Transformer ensemble trained on the original parallel corpus by approximately 5.3 BLEU points, thereby reaching state-of-the-art performance for the Finnish-to-English news translation task.

What carries the argument

Cycle Translation augmentation method and Big/Small parallel construction strategy, which together enable fuller exploitation of synthetic corpora on top of standard Transformer components.

If this is right

Adding the listed techniques produces continuous improvements in BLEU scores.
The complete system reaches a test BLEU of 33.0.
This result exceeds the baseline ensemble by 5.3 BLEU points.
The approach secures the highest score among all WMT 2019 participants in the Finnish-to-English direction.
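
The figures above imply a baseline score that the summary never states outright. As a rough check, taking the "approximately 5.3" gap at face value:

```latex
\mathrm{BLEU}_{\text{baseline}} \approx 33.0 - 5.3 = 27.7,
\qquad
\frac{5.3}{27.7} \approx 0.19
```

On that reading the full system improves on the baseline ensemble by roughly 19% in relative terms. Both numbers inherit the approximation in the reported gap.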

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Cycle Translation may supply a general pattern for generating higher-quality synthetic pairs that could be tested on other language directions with limited parallel data.
The Big/Small mixture rule offers a tunable way to balance quality and quantity that might interact with different model sizes or training regimes.
If the gains hold under controlled data conditions, the methods could lower the data volume needed to reach a given translation quality level.

Load-bearing premise

The reported BLEU gains are produced by the listed techniques rather than by undisclosed differences in total training data volume, hyperparameter search, or test-set-specific tuning.

What would settle it

A re-run of the baseline Transformer ensemble on exactly the same data volume and compute budget, without Cycle Translation or Big/Small construction, that still reaches or exceeds 33.0 BLEU.
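
One way to set up such a volume-matched comparison is sketched below. It assumes two pools of synthetic pairs are available, one from plain back translation and one from the method under test; `build_arms` and its parameter names are hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_arms(
    original: Sequence[Pair],
    plain_synth: Sequence[Pair],
    treated_synth: Sequence[Pair],
    n_synth: int,
    seed: int = 0,
) -> Dict[str, List[Pair]]:
    """Build two training sets with identical example counts.

    Both arms contain all original pairs plus exactly `n_synth`
    synthetic pairs; only the origin of the synthetic pairs differs.
    """
    if n_synth > len(plain_synth) or n_synth > len(treated_synth):
        raise ValueError("n_synth exceeds the size of a synthetic pool")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    control = list(original) + rng.sample(list(plain_synth), n_synth)
    treated = list(original) + rng.sample(list(treated_synth), n_synth)
    return {"control": control, "treated": treated}
```

Training the same architecture with the same compute budget on each arm would then attribute any BLEU difference to the synthetic-data method rather than to the number of training examples.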

Original abstract

This paper describes the University of Sydney's submission of the WMT 2019 shared news translation task. We participated in the Finnish→English direction and got the best BLEU (33.0) score among all the participants. Our system is based on the self-attentional Transformer networks, into which we integrated the most recent effective strategies from academic research (e.g., BPE, back translation, multi-features data selection, data augmentation, greedy model ensemble, reranking, ConMBR system combination, and post-processing). Furthermore, we propose a novel augmentation method Cycle Translation and a data mixture strategy Big/Small parallel construction to entirely exploit the synthetic corpus. Extensive experiments show that adding the above techniques can make continuous improvements of the BLEU scores, and the best result outperforms the baseline (Transformer ensemble model trained with the original parallel corpus) by approximately 5.3 BLEU score, achieving the state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This WMT19 systems paper wins Finnish-English with a 5.3 BLEU gain using two new data tricks, but the gains are not isolated from extra training data volume.

This paper reports the University of Sydney entry that took first place on the WMT19 Finnish-to-English news task at 33.0 BLEU. The headline number is a 5.3 point lift over their own Transformer ensemble baseline trained only on the original parallel data. They reach that by stacking standard pieces (back-translation, BPE, data selection, ensembles, reranking) plus two methods they present as new: Cycle Translation for augmentation and Big/Small parallel construction for mixing corpora. Those two look like the actual additions beyond routine practice, and the paper claims they produce continuous gains when added one by one. For anyone building production systems on similar language pairs, the recipe is worth seeing in full because it worked on the fixed test set.

The soft spot is exactly the one the stress-test flags. The baseline uses only the original corpus while the submitted system adds large amounts of synthetic data along with the new techniques. No ablation holds total example count fixed, and there is no component-wise table that controls for search effort or test-set tuning. The abstract says extensive experiments show the improvements, but without those controls the attribution of the full 5.3 points to the novel pieces stays unverified.

This is a standard shared-task systems paper. Readers who need practical MT augmentation ideas or who follow WMT results will get value from the details. It is coherent on its own terms and reports a competitive, reproducible benchmark result, so it deserves a serious referee even if the final version needs clearer ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the University of Sydney's submission to the WMT 2019 news translation shared task (Finnish→English direction). The system is built on Transformer networks and incorporates standard techniques (BPE, back-translation, multi-feature data selection, data augmentation, greedy ensembling, reranking, ConMBR combination, post-processing) plus two novel components (Cycle Translation augmentation and Big/Small parallel data construction). The authors report that their final system achieved the highest BLEU score (33.0) among all participants and outperformed their baseline Transformer ensemble (trained only on the original parallel corpus) by approximately 5.3 BLEU points.

Significance. The work documents a winning entry on the fixed WMT19 test set using standard automatic metrics, which supplies a reproducible reference point for the community. The two proposed methods (Cycle Translation and Big/Small construction) are potentially useful if their incremental value can be isolated. No machine-checked proofs or parameter-free derivations are present, but the evaluation protocol itself is a strength.

major comments (2)

[Abstract] The headline attribution of the 5.3 BLEU gain (and the 33.0 winning score) to the listed techniques, including the two novel ones, is not supported by any ablation table or controlled experiment that holds total training-example count fixed while varying only Cycle Translation or Big/Small construction.
[Abstract] The baseline is defined as a Transformer ensemble on the original parallel corpus only, while the submitted system adds back-translated synthetic data plus the new augmentation strategies; without a component-wise delta table that controls for data volume, hyperparameter-search effort, and test-set tuning, the contribution of the novel methods remains unverified.

minor comments (1)

The abstract refers to 'extensive experiments' showing 'continuous improvements' but supplies no table numbers, incremental BLEU scores, or error bars; adding a results table with per-component deltas would strengthen the presentation.
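
A per-component delta table of the kind requested can be derived mechanically from cumulative scores. The sketch below shows one way to do it; `component_deltas` is a hypothetical helper, and the scores in the usage lines are toy values chosen for illustration, not figures from the paper.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, float, float]  # (component, cumulative BLEU, delta)


def component_deltas(steps: List[Tuple[str, float]]) -> List[Row]:
    """Turn cumulative BLEU scores into per-step deltas.

    `steps` lists (component name, BLEU after adding it) in the order
    the components were added; the first entry is the baseline.
    """
    rows: List[Row] = []
    previous: Optional[float] = None
    for name, bleu in steps:
        # The baseline has no predecessor, so its delta is zero.
        delta = 0.0 if previous is None else round(bleu - previous, 1)
        rows.append((name, bleu, delta))
        previous = bleu
    return rows


# Toy values only (assumption): NOT scores reported in the paper.
toy_steps = [("baseline", 10.0), ("+ step A", 12.0), ("+ step B", 15.0)]
for name, bleu, delta in component_deltas(toy_steps):
    print(f"{name:<10} {bleu:5.1f} {delta:+5.1f}")
```

Deltas computed this way depend on the order in which components are added, so they describe a sequential build-up rather than the isolated contribution of each component.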

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our WMT 2019 system description. We address the two major comments below and will revise the abstract accordingly to avoid over-attributing isolated contributions.

Referee: [Abstract] The headline attribution of the 5.3 BLEU gain (and the 33.0 winning score) to the listed techniques, including the two novel ones, is not supported by any ablation table or controlled experiment that holds total training-example count fixed while varying only Cycle Translation or Big/Small construction.

Authors: We acknowledge that the abstract attributes the overall 5.3 BLEU improvement to the full set of techniques. The manuscript reports sequential additions that produce progressive BLEU gains, but does not contain ablations that hold total training-example count fixed while varying only Cycle Translation or Big/Small construction. We will revise the abstract to state that the final system incorporating all listed techniques achieves the reported score, without claiming that the novel components have been isolated under controlled data-volume conditions.
Revision: yes

Referee: [Abstract] The baseline is defined as a Transformer ensemble on the original parallel corpus only, while the submitted system adds back-translated synthetic data plus the new augmentation strategies; without a component-wise delta table that controls for data volume, hyperparameter-search effort, and test-set tuning, the contribution of the novel methods remains unverified.

Authors: The referee is correct that the baseline uses only the original parallel data while the submitted system includes back-translated data and the proposed augmentation strategies. The paper demonstrates cumulative improvements from the complete pipeline but does not provide component-wise tables that control for data volume, hyperparameter-search effort, or test-set tuning. We will revise the abstract to describe the baseline and final system more precisely and to refrain from implying that the novel methods' individual contributions have been verified under such controls.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical reporting on fixed external benchmark

The paper reports measured BLEU scores on the fixed WMT19 test set using standard automatic metrics. The system description integrates known techniques plus two proposed methods, with performance gains stated relative to a baseline Transformer ensemble on the original parallel corpus. No derivation, equation, or prediction is presented that reduces by construction to fitted parameters or self-citations internal to the paper; the results are externally falsifiable on the shared-task test data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical performance of a Transformer-based system on the WMT19 Finnish-English test set; it draws on standard architecture assumptions and prior data-augmentation techniques without introducing new free parameters, axioms, or invented entities beyond the two named strategies.

pith-pipeline@v0.9.0 · 5691 in / 1203 out tokens · 32558 ms · 2026-05-25T12:11:12.053794+00:00 · methodology
