Findings of the First Shared Task on Machine Translation Robustness

Antonios Anastasopoulos; Graham Neubig; Hassan Sajjad; Juan Pino; Nadir Durrani; Orhan Firat; Paul Michel; Philipp Koehn; Xian Li; Yonatan Belinkov

arxiv: 1906.11943 · v2 · pith:3EBEENMXnew · submitted 2019-06-27 · 💻 cs.CL


Pith reviewed 2026-05-25 14:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords: machine translation · robustness · shared task · noisy input · domain mismatch · BLEU evaluation · English-French · English-Japanese


The pith

The first shared task on machine translation robustness finds all submitted systems improve substantially over baselines on real-world noisy data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the results of a shared task designed to test and improve the robustness of machine translation models to noisy input and domain mismatch. Eleven teams submitted 23 systems for English-French and English-Japanese, evaluated on a blind test set of noisy Reddit comments paired with professional translations. All systems showed large gains over the baselines, up to +22.33 BLEU, and BLEU scores correlated strongly with human judgments. The organizers also ran a qualitative analysis with compare-mt to understand how systems handle colloquial language and other challenges.

Core claim

All 23 submitted systems achieved large improvements over the baselines on the blind test set, with the best system gaining +22.33 BLEU. Human judgments and BLEU scores correlated highly (Pearson's r = 0.94 and 0.95). Qualitative analysis using compare-mt highlighted differences in how systems handle noisy input and domain mismatch.
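The correlation figures above are Pearson coefficients between human and BLEU scores, presumably computed over per-system scores. As a minimal sketch of how such a coefficient is obtained, the hypothetical `pearson_r` function below takes two equal-length score lists; the example scores are invented for illustration and are not the paper's data, and this summary does not say how human scores were aggregated per system.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores for one language pair (NOT the paper's data).
bleu = [25.0, 30.0, 35.0, 40.0]
human = [60.0, 65.0, 72.0, 80.0]
print(round(pearson_r(bleu, human), 3))
```

With one value per submitted system, r close to 1 means the two evaluations rank and space the systems almost identically.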

What carries the argument

The blind test set of noisy Reddit comments and professionally sourced translations, used to evaluate robustness to noisy input and domain mismatch for English-French and English-Japanese pairs.

If this is right

All submitted systems outperformed baselines, demonstrating that robustness can be improved through various approaches.
High correlation between human judgment and BLEU suggests that BLEU remains a reliable proxy for human judgment on this task.
Qualitative differences in handling colloquial expressions help explain the occasional cases where human and automatic scores disagree.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on this testbed suggests MT models can be made more reliable for user-generated content without major architectural changes.
Future tasks might expand to more language pairs or different noise types to further test generalization.
The shared task format encourages diverse solutions and provides a standardized benchmark for robustness research.

Load-bearing premise

The blind test set consisting of noisy comments on Reddit and professionally sourced translations accurately represents the challenges facing MT models deployed in the real world.

What would settle it

If future systems that excel on this test set perform poorly on other real-world noisy sources, such as other social media platforms or speech transcripts, the claim of improved robustness would be challenged.

Figures

Figures reproduced from arXiv: 1906.11943 by Antonios Anastasopoulos, Graham Neubig, Hassan Sajjad, Juan Pino, Nadir Durrani, Orhan Firat, Paul Michel, Philipp Koehn, Xian Li, Yonatan Belinkov.

**Figure 3.** Word F-measure by casing of the words in …
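Figure 3 breaks word-level F-measure down by casing, the kind of bucketed analysis compare-mt produces. The sketch below illustrates the idea under stated assumptions: whitespace-tokenized, case-sensitive text, a hypothetical casing scheme (`lower`, `UPPER`, `Title`, `mixed`, and `other` for tokens without letters), and clipped bag-of-words matching per sentence. The function names are invented for illustration; this is not compare-mt's implementation.

```python
from collections import Counter
from typing import Dict, List


def casing_bucket(word: str) -> str:
    """Assign a token to a casing bucket (hypothetical scheme for illustration)."""
    if not any(ch.isalpha() for ch in word):
        return "other"  # numbers, punctuation, emoji
    if word.islower():
        return "lower"
    if word.isupper():
        return "UPPER"
    if word.istitle():
        return "Title"
    return "mixed"


def f_measure_by_casing(refs: List[List[str]], hyps: List[List[str]]) -> Dict[str, float]:
    """Word-level F-measure per casing bucket, using case-sensitive clipped matches."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of sentences")
    matched, ref_total, hyp_total = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # A word is matched at most min(count in ref, count in hyp) times.
        for word, n in (ref_counts & hyp_counts).items():
            matched[casing_bucket(word)] += n
        for word, n in ref_counts.items():
            ref_total[casing_bucket(word)] += n
        for word, n in hyp_counts.items():
            hyp_total[casing_bucket(word)] += n
    scores: Dict[str, float] = {}
    for bucket in set(ref_total) | set(hyp_total):
        precision = matched[bucket] / hyp_total[bucket] if hyp_total[bucket] else 0.0
        recall = matched[bucket] / ref_total[bucket] if ref_total[bucket] else 0.0
        total = precision + recall
        scores[bucket] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical example: the system lowercases "Paris" and "LOL".
refs = [["I", "love", "Paris", "LOL"]]
hyps = [["I", "love", "paris", "lol"]]
print(f_measure_by_casing(refs, hyps))
```

In this toy example the `UPPER` bucket scores F = 2/3 (only "I" is recovered), the `lower` bucket scores F = 0.5 (the lowercased words count as lowercase false positives), and the `Title` bucket scores 0, which is the kind of pattern a casing breakdown is meant to surface.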

Original abstract

We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models' robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evaluated on a blind test set consisting of noisy comments on Reddit and professionally sourced translations. As a new task, we received 23 submissions by 11 participating teams from universities, companies, national labs, etc. All submitted systems achieved large improvements over baselines, with the best improvement being +22.33 BLEU. We evaluated submissions by both human judgment and automatic evaluation (BLEU), which show high correlations (Pearson's r = 0.94 and 0.95). Furthermore, we conducted a qualitative analysis of the submitted systems using compare-mt, which revealed their salient differences in handling challenges in this task. Such analysis provides additional insights when there is occasional disagreement between human judgment and BLEU, e.g. systems better at producing colloquial expressions received higher scores from human judgment.
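Baselines and submissions are both scored with BLEU, so an improvement such as +22.33 is the difference between two corpus-level BLEU scores. The sketch below is a textbook corpus BLEU (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty), assuming pre-tokenized input, one reference per segment, and no smoothing. Official shared-task scores depend on tokenization and tooling choices that this summary does not specify, so numbers from this sketch will not match them.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[List[str]], refs: List[List[str]], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale: one reference per segment, no smoothing."""
    if len(hyps) != len(refs):
        raise ValueError("hyps and refs must have the same number of segments")
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngram_counts(hyp, n), ngram_counts(ref, n)
            # Clipped counts: each hypothesis n-gram is credited at most as often as in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    # Geometric mean of the modified n-gram precisions.
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)


hyp = "the cat sat on the".split()
ref = "the cat sat on the mat".split()
print(round(corpus_bleu([hyp], [ref]), 2))  # 81.87: every precision is 1, only the brevity penalty applies
```

An improvement over a baseline would then be `corpus_bleu(system_hyps, refs) - corpus_bleu(baseline_hyps, refs)` on the same test set and tokenization.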

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Desk Editor's Note private letter to a colleague

This is the findings report from the first shared task on MT robustness, with a new noisy Reddit test set for En-Fr and En-Ja showing up to +22 BLEU gains and strong human-BLEU correlation.


The core thing to know is that this paper reports the findings of the first shared task on machine translation robustness. It supplies a blind test set of noisy Reddit comments with professional translations for English-French and English-Japanese. Twenty-three submissions from eleven teams all improved over the baselines, with the largest gain at +22.33 BLEU, and human judgments correlated with BLEU at Pearson's r = 0.94 and 0.95. A compare-mt analysis also highlights differences in how systems handle colloquial expressions, which helps explain the occasional disagreements between human judgment and BLEU.

Referee Report

0 major / 2 minor

Summary. The paper reports the findings of the first shared task on Machine Translation robustness for English-French and English-Japanese. It describes 23 submissions from 11 teams evaluated on a blind test set of noisy Reddit comments and professional translations. All systems showed improvements over baselines, with the largest being +22.33 BLEU. Human and automatic (BLEU) evaluations correlate highly (Pearson's r = 0.94 and 0.95), and a qualitative analysis using compare-mt is provided to explain differences in system performance.

Significance. This shared task findings paper documents community progress on MT robustness to noise and domain mismatch, providing concrete performance benchmarks and correlation data that can serve as a reference for future research. The qualitative analysis offers additional insights into system behaviors.

minor comments (2)

[Abstract] The specific correspondence between the two correlation values (0.94 and 0.95) and the language pairs or judgment types is not specified.
[Abstract] The construction of the baselines, data filtering procedures, and any statistical significance testing for the reported improvements are not detailed, which would aid in interpreting the magnitude of gains.
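One standard way to address the second comment is paired bootstrap resampling, widely used in MT evaluation: resample test segments with replacement and count how often one system fails to beat the other on the resampled set. The sketch below is generic and assumes a hypothetical `Metric` callable that scores a list of segment indices, so a corpus-level metric such as BLEU can be recomputed on every resample; nothing in this summary says whether the paper ran such a test.

```python
import random
from typing import Callable, List, Sequence

# Scores a resampled list of segment indices at corpus level (hypothetical interface).
Metric = Callable[[Sequence[int]], float]


def paired_bootstrap(metric_a: Metric, metric_b: Metric, n_segments: int,
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of bootstrap resamples on which system A does not beat system B.

    A small value (e.g. below 0.05) suggests A's advantage is unlikely to be an
    artifact of which test segments happened to be sampled.
    """
    if n_segments < 1:
        raise ValueError("need at least one test segment")
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_samples):
        # Resample segments with replacement; both systems are scored on the same sample.
        sample = [rng.randrange(n_segments) for _ in range(n_segments)]
        if metric_a(sample) <= metric_b(sample):
            not_better += 1
    return not_better / n_samples


def mean_of(scores: List[float]) -> Metric:
    """Toy metric: average of per-segment scores over the sampled indices."""
    return lambda idx: sum(scores[i] for i in idx) / len(idx)


# Hypothetical per-segment scores for two systems (NOT data from the paper).
sys_a = [0.9, 0.8, 0.85, 0.7, 0.95, 0.9, 0.8, 0.75]
sys_b = [0.5, 0.6, 0.4, 0.55, 0.5, 0.45, 0.6, 0.5]
print(paired_bootstrap(mean_of(sys_a), mean_of(sys_b), len(sys_a)))  # 0.0: A wins on every segment
```

For BLEU, `metric_a` and `metric_b` would recompute corpus BLEU over the sampled segments rather than averaging sentence-level scores, because BLEU is not an average of per-sentence values.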

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our shared task findings paper and for recommending minor revision. No major comments were provided in the report, so we have no specific points requiring rebuttal or revision at this stage. We remain available to address any additional minor suggestions from the editor.

Circularity Check

0 steps flagged

No significant circularity identified


The paper is a shared-task findings report whose central claims consist of descriptive statements about observed outcomes: 23 submissions from 11 teams, BLEU gains up to +22.33 over baselines on the supplied test sets, and Pearson correlations of 0.94/0.95 between human and automatic judgments. These statements are direct reports of competition results on the defined data; they contain no derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations. The test-set representativeness assumption is motivational framing only and is not required for the factual reporting of the measured numbers. The derivation chain is therefore empty and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This paper is an empirical report on a shared task competition and introduces no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5761 in / 1077 out tokens · 25130 ms · 2026-05-25T14:33:47.077551+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 8 internal anchors

[1] Antonios Anastasopoulos, Alison Lui, Toan Q. Nguyen, and David Chiang. 2019. Neural machine translation of text from non-native speakers. In Proc. NAACL HLT.

[2] Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations (ICLR).

[3] Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics (TACL), 7:49-72. https://doi.org/10.1162/tacl_a_00254

[4] Alexandre Bérard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe’s Systems for the WMT19 Machine Translation Robustness Task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[5] Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT³: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261-268.

[6] Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. 2018a. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR, abs/1803.01128.

[7] Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In ACL. Association for Computational Linguistics.

[8] Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018b. Towards robust neural machine translation. CoRR, abs/1805.06130. http://arxiv.org/abs/1805.06130

[9] Raj Dabre and Eiichiro Sumita. 2019. NICT’s Supervised MT Systems for the Translation Robustness Task in WMT19. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[10] Nadir Durrani, Fahim Dalvi, Hassan Sajjad, Yonatan Belinkov, and Preslav Nakov. 2019. One size does not fit all: Comparing NMT representations of different granularities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog... https://www.aclweb.org/anthology/N19-1154

[11] Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018. On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

[12] Cristian Grozea. 2019. The submission of FOKUS to the WMT 19 robustness task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[13] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

[14] Georg Heigold, Günter Neumann, and Josef van Genabith. 2017. How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? arXiv preprint arXiv:1704.04441.

[15] Jindřich Helcl, Jindřich Libovický, and Martin Popel. 2019. CUNI System for the WMT19 Robustness Task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[16] Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, and Marjan Ghazvininejad. 2019. Training on synthetic noise improves robustness to natural noise in machine translation. CoRR, abs/1902.01509. http://arxiv.org/abs/1902.01509

[17] Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74-83, Melbourne, Australia. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-2709

[18] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL. https://doi.org/10.18653/v1/P17-4012

[19] Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

[20] Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66-71, Brussels, Belgium. Association for ... https://www.aclweb.org/anthology/D18-2012

[21] Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation. In Interpretability and Robustness in Audio, Speech, and Language Workshop, Conference on Neural Information Processing Systems.

[22] Paul Michel, Xian Li, Graham Neubig, and Juan Miguel Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. In Proc. NAACL HLT.

[23] Paul Michel and Graham Neubig. 2018. MTNT: A testbed for Machine Translation of Noisy Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[24] Soichiro Murakami, Makoto Morishita, Tsutomu Hirao, and Masaaki Nagata. 2019. NTT’s Machine Translation Systems for WMT19 Robustness Task. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[25] Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt

[26] Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, and Xinyi Wang. 2019. compare-mt: A tool for holistic comparison of language generation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 35-41, M... https://www.aclweb.org/anthology/N19-4007

[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

[28] Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Belgium, Brussels. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-6319

[29] Matt Post and Kevin Duh. 2019. JHU 2019 Robustness Task System Description. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[30] R. Pryzant, Y. Chung, D. Jurafsky, and D. Britz. JESC: Japanese-English Subtitle Corpus. ArXiv e-prints. http://arxiv.org/abs/1710.10639

[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162

[32] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

[33] Vaibhav Vaibhav, Sumeet Singh, Craig Stewart, and Graham Neubig. 2019. Improving robustness of machine translation with synthetic noise. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pape... https://www.aclweb.org/anthology/N19-1190

[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998-6008.

[35] Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In International Conference on Learning Representations. https://openreview.net/forum?id=H1BLjgZCb

[36] Renjie Zheng, Hairong Liu, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019. Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).

[37] Shuyan Zhou, Xiangkai Zeng, Yingqi Zhou, Antonios Anastasopoulos, and Graham Neubig. 2019. Improving Robustness of Neural Machine Translation with Multi-task Learning. In Proceedings of the 2019 Shared task on Machine Translation Robustness, Conference on Machine Translation (WMT).
