Training Deeper Neural Machine Translation Models with Transparent Attention

· 2018 · cs.CL · arXiv 1808.07561

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and Bi-RNN encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT'14 English-German and WMT'15 Czech-English tasks for both architectures.

representative citing papers

Deep Modular Co-Attention Networks for Visual Question Answering

cs.CV · 2019-06-25 · conditional · novelty 7.0

MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts

cs.CL · 2019-06-28 · conditional · novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

cs.CL · 2026-06-01

citing papers explorer

Showing 3 of 3 citing papers.

Deep Modular Co-Attention Networks for Visual Question Answering

cs.CV · 2019-06-25 · conditional · none · ref 5 · internal anchor

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts

cs.CL · 2019-06-28 · conditional · none · ref 1 · internal anchor

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

cs.CL · 2026-06-01 · unreviewed · ref 62 · internal anchor
