The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation

arxiv: 1906.11751 · v1 · pith:U32ILPHT · submitted 2019-06-27 · 💻 cs.CL

Mai Oudah, Amjad Almahairi, Nizar Habash

classification 💻 cs.CL

keywords: neural, statistical, data, tokenization, translation, arabic-english, compare, machine

Neural networks have become the state-of-the-art approach for machine translation (MT) in many languages. While linguistically motivated tokenization techniques were shown to have significant effects on the performance of statistical MT, it remains unclear if those techniques are well suited for neural MT. In this paper, we systematically compare neural and statistical MT models for Arabic-English translation on data preprocessed by various prominent tokenization schemes. Furthermore, we consider a range of data and vocabulary sizes and compare their effect on both approaches. Our empirical results show that the best choice of tokenization scheme is largely based on the type of model and the size of data. We also show that we can gain significant improvements using a system selection that combines the output from neural and statistical MT.
