The University of Sydney's Machine Translation System for WMT19
Pith reviewed 2026-05-25 12:11 UTC · model grok-4.3
The pith
A Transformer system with added back-translation, ensembles, and two new data methods reaches 33.0 BLEU and wins the WMT 2019 Finnish-to-English task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that their Transformer-based pipeline, after integrating recent effective strategies and introducing Cycle Translation together with Big/Small parallel construction, delivers a BLEU score of 33.0 on the WMT 2019 test set. This score is the highest among all participants and exceeds the baseline Transformer ensemble trained on the original parallel corpus by approximately 5.3 BLEU points, thereby reaching state-of-the-art performance for the Finnish-to-English news translation task.
What carries the argument
Two additions carry the argument: the Cycle Translation augmentation method and the Big/Small parallel construction strategy, which together aim to exploit the synthetic corpus more fully on top of standard Transformer components.
If this is right
- Each technique added to the pipeline yields a further improvement in BLEU.
- The complete system reaches a test BLEU of 33.0.
- This result exceeds the baseline ensemble by approximately 5.3 BLEU points.
- The approach secures the highest score among all WMT 2019 participants in the Finnish-to-English direction.
Where Pith is reading between the lines
- Cycle Translation may supply a general pattern for generating higher-quality synthetic pairs that could be tested on other language directions with limited parallel data.
- The Big/Small mixture rule offers a tunable way to balance quality and quantity that might interact with different model sizes or training regimes.
- If the gains hold under controlled data conditions, the methods could lower the data volume needed to reach a given translation quality level.
Load-bearing premise
The reported BLEU gains are produced by the listed techniques rather than by undisclosed differences in total training data volume, hyperparameter search, or test-set-specific tuning.
What would settle it
A re-run of the baseline Transformer ensemble on exactly the same data volume and compute budget, without Cycle Translation or Big/Small construction, that still reaches or exceeds 33.0 BLEU.
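A minimal sketch of how such a size-matched comparison could be set up, assuming each ablation condition keeps the full original parallel corpus and fills the remaining budget with synthetic pairs. The function name, the condition names, and the toy data are hypothetical and are not taken from the paper; the sketch only fixes the training-example count, not the compute budget.

```python
import random


def size_matched_conditions(parallel, synthetic_by_method, budget, seed=0):
    """Build one training set per ablation condition, all with `budget` pairs.

    parallel: list of (source, target) pairs from the original corpus.
    synthetic_by_method: dict mapping a condition name to its synthetic pairs.
    budget: total number of pairs every condition must contain.
    """
    if budget < len(parallel):
        raise ValueError("budget must cover the full original parallel corpus")
    extra = budget - len(parallel)
    conditions = {}
    for name, pairs in synthetic_by_method.items():
        if len(pairs) < extra:
            raise ValueError(f"not enough synthetic pairs for condition {name!r}")
        # A fresh seeded generator per condition keeps the sampling reproducible.
        rng = random.Random(seed)
        conditions[name] = parallel + rng.sample(pairs, extra)
    return conditions


# Toy data only (hypothetical): 4 original pairs, two synthetic pools.
parallel = [(f"fi-{i}", f"en-{i}") for i in range(4)]
synthetic = {
    "back_translation_only": [(f"bt-fi-{i}", f"en-mono-{i}") for i in range(10)],
    "with_cycle_translation": [(f"ct-fi-{i}", f"en-clean-{i}") for i in range(8)],
}
conditions = size_matched_conditions(parallel, synthetic, budget=9)
for name, data in conditions.items():
    print(name, len(data))
```

Training the same architecture on each condition would then vary only the source of the synthetic pairs, which is the control the premise above asks for.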
Original abstract
This paper describes the University of Sydney's submission of the WMT 2019 shared news translation task. We participated in the Finnish→English direction and got the best BLEU (33.0) score among all the participants. Our system is based on the self-attentional Transformer networks, into which we integrated the most recent effective strategies from academic research (e.g., BPE, back translation, multi-features data selection, data augmentation, greedy model ensemble, reranking, ConMBR system combination, and post-processing). Furthermore, we propose a novel augmentation method Cycle Translation and a data mixture strategy Big/Small parallel construction to entirely exploit the synthetic corpus. Extensive experiments show that adding the above techniques can make continuous improvements of the BLEU scores, and the best result outperforms the baseline (Transformer ensemble model trained with the original parallel corpus) by approximately 5.3 BLEU score, achieving the state-of-the-art performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the University of Sydney's submission to the WMT 2019 news translation shared task (Finnish→English direction). The system is built on Transformer networks and incorporates standard techniques (BPE, back-translation, multi-feature data selection, data augmentation, greedy ensembling, reranking, ConMBR combination, post-processing) plus two novel components (Cycle Translation augmentation and Big/Small parallel data construction). The authors report that their final system achieved the highest BLEU score (33.0) among all participants and outperformed their baseline Transformer ensemble (trained only on the original parallel corpus) by approximately 5.3 BLEU points.
Significance. The work documents a winning entry on the fixed WMT19 test set using standard automatic metrics, which supplies a reproducible reference point for the community. The two proposed methods (Cycle Translation and Big/Small construction) are potentially useful if their incremental value can be isolated. No machine-checked proofs or parameter-free derivations are present, but the evaluation protocol itself is a strength.
major comments (2)
- [Abstract] The headline attribution of the 5.3 BLEU gain (and the 33.0 winning score) to the listed techniques, including the two novel ones, is not supported by any ablation table or controlled experiment that holds total training-example count fixed while varying only Cycle Translation or Big/Small construction.
- [Abstract] The baseline is defined as a Transformer ensemble on the original parallel corpus only, while the submitted system adds back-translated synthetic data plus the new augmentation strategies; without a component-wise delta table that controls for data volume, hyperparameter-search effort, and test-set tuning, the contribution of the novel methods remains unverified.
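As an illustration of the component-wise delta table the comments above ask for, here is a small sketch. Only two rows can be grounded in the abstract: the final score of 33.0 and a baseline of roughly 27.7, which is merely implied by "approximately 5.3" and is therefore itself approximate. The function name `delta_table` is hypothetical, and the intermediate per-component rows would have to come from the paper's own experiments.

```python
def delta_table(steps):
    """Turn an ordered list of (label, bleu) into rows with deltas.

    Each row is (label, bleu, delta_vs_previous_row, delta_vs_first_row).
    """
    rows = []
    base = steps[0][1]
    prev = None
    for label, bleu in steps:
        d_prev = 0.0 if prev is None else round(bleu - prev, 1)
        rows.append((label, bleu, d_prev, round(bleu - base, 1)))
        prev = bleu
    return rows


# Only the endpoints are known from the abstract; 27.7 is implied (33.0 - 5.3).
steps = [
    ("baseline ensemble (implied, approximate)", 27.7),
    ("full submitted system", 33.0),
]
for label, bleu, d_prev, d_base in delta_table(steps):
    print(f"{label:<42} {bleu:5.1f} {d_prev:+5.1f} {d_base:+5.1f}")
```

Filling in one row per added technique, each trained under the same data and tuning budget, would show which step accounts for how much of the overall gain.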
minor comments (1)
- The abstract refers to 'extensive experiments' showing 'continuous improvements' but supplies no table numbers, incremental BLEU scores, or error bars; adding a results table with per-component deltas would strengthen the presentation.
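One common way to obtain the error bars mentioned above is a percentile bootstrap over test sentences. The sketch below is a generic illustration, not the paper's procedure: `bootstrap_interval` is a hypothetical helper, the toy statistics are invented for the example, and unigram precision stands in for BLEU so that the block stays self-contained.

```python
import random


def bootstrap_interval(sentence_stats, corpus_metric, n_resamples=1000,
                       alpha=0.05, seed=0):
    """Percentile bootstrap interval for a corpus-level metric.

    sentence_stats: one entry per test sentence (whatever corpus_metric needs).
    corpus_metric: function mapping a list of such entries to a float.
    """
    rng = random.Random(seed)
    n = len(sentence_stats)
    scores = []
    for _ in range(n_resamples):
        # Resample sentences with replacement, then rescore the whole sample.
        sample = [sentence_stats[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_metric(sample))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def unigram_precision(stats):
    """Stand-in metric (NOT BLEU): matched tokens / hypothesis tokens, in percent."""
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return 100.0 * matched / total if total else 0.0


# Toy per-sentence (matched_tokens, hypothesis_tokens) counts, invented for illustration.
toy_stats = [(7, 10), (5, 8), (9, 12), (4, 9), (6, 6), (3, 7)]
point = unigram_precision(toy_stats)
low, high = bootstrap_interval(toy_stats, unigram_precision)
print(f"{point:.1f} [{low:.1f}, {high:.1f}]")
```

Resampling whole sentences and recomputing the metric on each sample matters because corpus-level scores such as BLEU are not averages of sentence-level scores.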
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our WMT 2019 system description. We address the two major comments below and will revise the abstract accordingly to avoid over-attributing isolated contributions.
Point-by-point responses
- Referee: [Abstract] The headline attribution of the 5.3 BLEU gain (and the 33.0 winning score) to the listed techniques, including the two novel ones, is not supported by any ablation table or controlled experiment that holds total training-example count fixed while varying only Cycle Translation or Big/Small construction.
  Authors: We acknowledge that the abstract attributes the overall 5.3 BLEU improvement to the full set of techniques. The manuscript reports sequential additions that produce progressive BLEU gains, but does not contain ablations that hold total training-example count fixed while varying only Cycle Translation or Big/Small construction. We will revise the abstract to state that the final system incorporating all listed techniques achieves the reported score, without claiming that the novel components have been isolated under controlled data-volume conditions.
  Revision: yes
- Referee: [Abstract] The baseline is defined as a Transformer ensemble on the original parallel corpus only, while the submitted system adds back-translated synthetic data plus the new augmentation strategies; without a component-wise delta table that controls for data volume, hyperparameter-search effort, and test-set tuning, the contribution of the novel methods remains unverified.
  Authors: The referee is correct that the baseline uses only the original parallel data while the submitted system includes back-translated data and the proposed augmentation strategies. The paper demonstrates cumulative improvements from the complete pipeline but does not provide component-wise tables that control for data volume, hyperparameter-search effort, or test-set tuning. We will revise the abstract to describe the baseline and final system more precisely and to refrain from implying that the novel methods' individual contributions have been verified under such controls.
  Revision: yes
Circularity Check
No circularity: empirical reporting on fixed external benchmark
full rationale
The paper reports measured BLEU scores on the fixed WMT19 test set using standard automatic metrics. The system description integrates known techniques plus two proposed methods, with performance gains stated relative to a baseline Transformer ensemble on the original parallel corpus. No derivation, equation, or prediction is presented that reduces by construction to fitted parameters or self-citations internal to the paper; the results are externally falsifiable on the shared-task test data.