arxiv: 1906.11943 · v2 · pith:3EBEENMXnew · submitted 2019-06-27 · 💻 cs.CL

Findings of the First Shared Task on Machine Translation Robustness

Pith reviewed 2026-05-25 14:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords machine translation · robustness · shared task · noisy input · domain mismatch · BLEU evaluation · English-French · English-Japanese

The pith

The first shared task on machine translation robustness finds all submitted systems improve substantially over baselines on real-world noisy data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents results from a shared task designed to test and improve machine translation models on noisy inputs and domain shifts. Eleven teams submitted 23 systems for the English-French and English-Japanese language pairs, evaluated on a blind test set of noisy Reddit comments paired with professional translations. All systems improved substantially over the baselines, by up to 22.33 BLEU points, and automatic BLEU scores correlated strongly with human judgments. The organizers also ran a qualitative analysis to understand how systems handle colloquial language and other challenges.

Core claim

All 23 submitted systems achieved large improvements over baselines on the blind test set, with the best system gaining +22.33 BLEU. Human judgments and BLEU scores correlated highly: the abstract reports Pearson's r values of 0.94 and 0.95, without saying which setting each value belongs to. Qualitative analysis using compare-mt highlighted differences in how systems manage noisy input and domain mismatch.
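
To make concrete what such a correlation measures, here is a minimal sketch of Pearson's r computed over system-level scores. The function name `pearson_r` and the example numbers are hypothetical illustrations, not values or code from the paper, and the paper's own computation may differ in detail.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up system-level scores (one entry per system), NOT from the paper.
bleu_scores = [20.0, 25.0, 30.0, 35.0]
human_scores = [50.0, 70.0, 60.0, 80.0]
print(round(pearson_r(bleu_scores, human_scores), 2))  # 0.8 for these made-up numbers
```

A value near 1 means that systems ranked higher by BLEU also tend to be ranked higher by human judges. It does not rule out disagreement on individual systems, which is where the qualitative analysis comes in.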

What carries the argument

The blind test set of noisy Reddit comments and professionally sourced translations, used to evaluate robustness to noisy input and domain mismatch for English-French and English-Japanese pairs.

If this is right

  • All submitted systems outperformed baselines, demonstrating that robustness can be improved through various approaches.
  • High correlation between human judgment and BLEU suggests automatic metrics remain reliable for this task.
  • Qualitative differences in handling colloquial expressions explain cases where human and automatic scores disagree.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on this testbed suggests MT models can be made more reliable for user-generated content without major architectural changes.
  • Future tasks might expand to more language pairs or different noise types to further test generalization.
  • The shared task format encourages diverse solutions and provides a standardized benchmark for robustness research.

Load-bearing premise

The blind test set consisting of noisy comments on Reddit and professionally sourced translations accurately represents the challenges facing MT models deployed in the real world.

What would settle it

If future systems that excel on this test set perform poorly on other real-world noisy sources, such as other social media platforms or speech transcripts, the claim of improved robustness would be challenged.

Figures

Figures reproduced from arXiv: 1906.11943 by Antonios Anastasopoulos, Graham Neubig, Hassan Sajjad, Juan Pino, Nadir Durrani, Orhan Firat, Paul Michel, Philipp Koehn, Xian Li, Yonatan Belinkov.

Figure 1: Annotation interface for human evaluations.
Figure 3: Word F-measure by casing of the words in
Original abstract

We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models' robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evaluated on a blind test set consisting of noisy comments on Reddit and professionally sourced translations. As a new task, we received 23 submissions by 11 participating teams from universities, companies, national labs, etc. All submitted systems achieved large improvements over baselines, with the best improvement having +22.33 BLEU. We evaluated submissions by both human judgment and automatic evaluation (BLEU), which shows high correlations (Pearson's r = 0.94 and 0.95). Furthermore, we conducted a qualitative analysis of the submitted systems using compare-mt, which revealed their salient differences in handling challenges in this task. Such analysis provides additional insights when there is occasional disagreement between human judgment and BLEU, e.g. systems better at producing colloquial expressions received higher score from human judgment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper reports the findings of the first shared task on Machine Translation robustness for English-French and English-Japanese. It describes 23 submissions from 11 teams evaluated on a blind test set of noisy Reddit comments and professional translations. All systems showed improvements over baselines, with the largest being +22.33 BLEU. Human and automatic (BLEU) evaluations correlate highly (Pearson's r = 0.94 and 0.95), and a qualitative analysis using compare-mt is provided to explain differences in system performance.

Significance. This shared task findings paper documents community progress on MT robustness to noise and domain mismatch, providing concrete performance benchmarks and correlation data that can serve as a reference for future research. The qualitative analysis offers additional insights into system behaviors.

minor comments (2)
  1. [Abstract] The specific correspondence between the two correlation values (0.94 and 0.95) and the language pairs or judgment types is not specified.
  2. [Abstract] The construction of the baselines, data filtering procedures, and any statistical significance testing for the reported improvements are not detailed, which would aid in interpreting the magnitude of gains.
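
As an illustration of the kind of significance testing the second comment asks about, here is a minimal sketch of paired bootstrap resampling, a common choice when comparing MT systems. The abstract does not say whether the paper uses this or any other test. The function name `paired_bootstrap`, its parameters, and the per-sentence scores are hypothetical. The sketch also averages per-sentence scores for simplicity, whereas corpus-level BLEU would have to be recomputed on each resample.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A's mean beats B's.

    scores_a and scores_b hold per-sentence quality scores for the same
    sentences, in the same order (this pairing is what makes the test paired).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Draw n sentence indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / num_samples


# Made-up per-sentence scores for a submission and a baseline, NOT from the paper.
submission = [0.62, 0.48, 0.71, 0.55, 0.66, 0.59]
baseline = [0.41, 0.44, 0.52, 0.39, 0.47, 0.50]
print(paired_bootstrap(submission, baseline))
```

A win fraction close to 1.0 suggests the gain is unlikely to be an artifact of which sentences happened to be in the test set.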

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our shared task findings paper and for recommending minor revision. No major comments were provided in the report, so we have no specific points requiring rebuttal or revision at this stage. We remain available to address any additional minor suggestions from the editor.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is a shared-task findings report whose central claims consist of descriptive statements about observed outcomes: 23 submissions from 11 teams, BLEU gains up to +22.33 over baselines on the supplied test sets, and Pearson correlations of 0.94/0.95 between human and automatic judgments. These statements are direct reports of competition results on the defined data; they contain no derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations. The test-set representativeness assumption is motivational framing only and is not required for the factual reporting of the measured numbers. The derivation chain is therefore empty and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This paper is an empirical report on a shared task competition and introduces no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5761 in / 1077 out tokens · 25130 ms · 2026-05-25T14:33:47.077551+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 8 internal anchors

  1. [1]

    Antonios Anastasopoulos, Alison Lui, Toan Q. Nguyen, and David Chiang. 2019. Neural machine translation of text from non-native speakers. In Proc. NAACL HLT

  2. [2]

    Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations (ICLR)

  3. [3]

    Yonatan Belinkov and James Glass. 2019. https://doi.org/10.1162/tacl_a_00254 Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics (TACL), 7:49–72

  4. [4]

    Alexandre Bérard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe’s Systems for the WMT19 Machine Translation Robustness Task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  5. [5]

    Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT³: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268

  6. [6]

    Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. 2018a. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR, abs/1803.01128

  7. [7]

    Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In ACL. Association for Computational Linguistics

  8. [8]

    Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018b. http://arxiv.org/abs/1805.06130 Towards robust neural machine translation. CoRR, abs/1805.06130

  9. [9]

    Raj Dabre and Eiichiro Sumita. 2019. NICT’s Supervised MT Systems for the Translation Robustness Task in WMT19. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  10. [10]

    Nadir Durrani, Fahim Dalvi, Hassan Sajjad, Yonatan Belinkov, and Preslav Nakov. 2019. https://www.aclweb.org/anthology/N19-1154 One size does not fit all: Comparing NMT representations of different granularities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

  11. [11]

    Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018. On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA. Association for Computational Linguistics

  12. [12]

    Cristian Grozea. 2019. The submission of FOKUS to the WMT 19 robustness task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  13. [13]

    Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567

  14. [14]

    Georg Heigold, Günter Neumann, and Josef van Genabith. 2017. How robust are character-based word embeddings in tagging and mt against wrod scramlbing or randdm nouse? arXiv preprint arXiv:1704.04441

  15. [15]

    Jindřich Helcl, Jindřich Libovický, and Martin Popel. 2019. CUNI System for the WMT19 Robustness Task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  16. [16]

    Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, and Marjan Ghazvininejad. 2019. http://arxiv.org/abs/1902.01509 Training on synthetic noise improves robustness to natural noise in machine translation. CoRR, abs/1902.01509

  17. [17]

    Huda Khayrallah and Philipp Koehn. 2018. https://www.aclweb.org/anthology/W18-2709 On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne, Australia. Association for Computational Linguistics

  18. [18]

    Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. https://doi.org/10.18653/v1/P17-4012 OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL

  19. [19]

    Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872

  20. [20]

    Taku Kudo and John Richardson. 2018. https://www.aclweb.org/anthology/D18-2012 SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for ...

  21. [21]

    Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation. In Interpretability and Robustness in Audio, Speech, and Language Workshop Conference on Neural Information Processing Systems

  22. [22]

    Paul Michel, Xian Li, Graham Neubig, and Juan Miguel Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. In Proc. NAACL HLT

  23. [23]

    Paul Michel and Graham Neubig. 2018. MTNT: A testbed for Machine Translation of Noisy Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  24. [24]

    Soichiro Murakami, Makoto Morishita, Tsutomu Hirao, and Masaaki Nagata. 2019. NTT’s Machine Translation Systems for WMT19 Robustness Task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  25. [25]

    Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt

  26. [26]

    Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. https://www.aclweb.org/anthology/N19-4007 compare-mt: A tool for holistic comparison of language generation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 35–41, M...

  27. [27]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. https://doi.org/10.3115/1073083.1073135 Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics

  28. [28]

    Matt Post. 2018. https://www.aclweb.org/anthology/W18-6319 A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics

  29. [29]

    Matt Post and Kevin Duh. 2019. JHU 2019 Robustness Task System Description. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  30. [30]

    R. Pryzant, Y. Chung, D. Jurafsky, and D. Britz. http://arxiv.org/abs/1710.10639 JESC: Japanese-English subtitle corpus. ArXiv e-prints

  31. [31]

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. https://doi.org/10.18653/v1/P16-1162 Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics

  32. [32]

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199

  33. [33]

    Vaibhav Vaibhav, Sumeet Singh, Craig Stewart, and Graham Neubig. 2019. https://www.aclweb.org/anthology/N19-1190 Improving robustness of machine translation with synthetic noise. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pape...

  34. [34]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008

  35. [35]

    Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. https://openreview.net/forum?id=H1BLjgZCb Generating natural adversarial examples. In International Conference on Learning Representations

  36. [36]

    Renjie Zheng, Hairong Liu, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019. Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)

  37. [37]

    Shuyan Zhou, Xiangkai Zeng, Yingqi Zhou, Antonios Anastasopoulos, and Graham Neubig. 2019. Improving Robustness of Neural Machine Translation with Multi-task Learning. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT)