Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection

Ciprian Chelba; Macduff Hughes; Taro Watanabe; Tetsuji Nakagawa; Wei Wang

arXiv: 1809.00068 · v1 · pith:CBMWOMIZnew · submitted 2018-08-31 · cs.CL · cs.LG · stat.ML

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, Ciprian Chelba

keywords: data · denoising · training · domain · approach · machine · measuring · neural

Measuring the domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not. Denoising concerns a different type of data quality: it tries to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations show that the approach is significantly effective for training NMT on data with severe noise.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
cs.CL · 2019-07 · unverdicted · novelty 5.0

A single multilingual NMT model for 103 languages trained on 25B examples demonstrates transfer learning benefits for low-resource languages.