Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks

Fahim Dalvi; Hassan Sajjad; James Glass; Lluís Màrquez; Nadir Durrani; Yonatan Belinkov

arxiv: 1801.07772 · v1 · pith:HCT4HTFJnew · submitted 2018-01-23 · 💻 cs.CL

classification 💻 cs.CL

keywords models, layers, quality, part-of-speech, representations, tagging, tasks, translation

While neural machine translation (NMT) models provide improved translation quality in an elegant, end-to-end framework, it is less clear what they learn about language. Recent work has started evaluating the quality of vector representations learned by NMT models on morphological and syntactic tasks. In this paper, we investigate the representations learned at different layers of NMT encoders. We train NMT systems on parallel data and use the trained models to extract features for training a classifier on two tasks: part-of-speech and semantic tagging. We then measure the performance of the classifier as a proxy to the quality of the original NMT model for the given task. Our quantitative analysis yields interesting insights regarding representation learning in NMT models. For instance, we find that higher layers are better at learning semantics while lower layers tend to be better for part-of-speech tagging. We also observe little effect of the target language on source-side representations, especially with higher quality NMT models.
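The evaluation recipe in the abstract (freeze a trained model, extract per-token features, train a classifier on them, and read its accuracy as a proxy for representation quality) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the features are synthetic Gaussian clusters rather than real NMT encoder activations, the probe is a plain softmax regression that may differ from the classifier used in the paper, and the names `train_probe`, `probe_accuracy`, and `make_features` are hypothetical.

```python
import numpy as np


def train_probe(features, labels, num_tags, epochs=300, lr=0.1):
    """Fit a softmax-regression probe on frozen features; return (W, b)."""
    n, d = features.shape
    W = np.zeros((d, num_tags))
    b = np.zeros(num_tags)
    onehot = np.eye(num_tags)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(W, b, features, labels):
    """Tagging accuracy of the probe on held-out tokens."""
    preds = (features @ W + b).argmax(axis=1)
    return float((preds == labels).mean())


def make_features(centers, labels, noise, rng):
    """Hypothetical stand-in for extracted activations: one cluster per tag."""
    dim = centers.shape[1]
    return centers[labels] + noise * rng.normal(size=(len(labels), dim))


rng = np.random.default_rng(0)
num_tags, dim = 5, 16
train_labels = rng.integers(0, num_tags, size=500)
test_labels = rng.integers(0, num_tags, size=200)

# Two synthetic "representations" that differ only in how cleanly they
# encode the tag. The noise levels are arbitrary, not measured values.
noise_levels = {"clean_features": 0.5, "noisy_features": 3.0}
results = {}
for name, noise in noise_levels.items():
    centers = rng.normal(size=(num_tags, dim))
    X_train = make_features(centers, train_labels, noise, rng)
    X_test = make_features(centers, test_labels, noise, rng)
    W, b = train_probe(X_train, train_labels, num_tags)
    results[name] = probe_accuracy(W, b, X_test, test_labels)
    print(name, round(results[name], 3))
```

In an actual study, each entry in `noise_levels` would be replaced by the activations of one encoder layer, and comparing the resulting accuracies across layers would give the kind of layer-wise comparison the abstract describes.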

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
cs.CL 2019-06 conditional novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.