Learning Deep Transformer Models for Machine Translation

Learning Deep Transformer Models for Machine Translation · 2019 · cs.CL · arXiv 1906.01787

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
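
The abstract names two ingredients without detailing them. As a minimal sketch, assuming that "proper use of layer normalization" refers to where normalization sits relative to the residual connection, and that "passing the combination of previous layers to the next" means feeding each layer a weighted sum of all earlier layer outputs, the idea can be illustrated in plain Python. The function names, the toy `sublayer`, and the uniform combination weights are hypothetical stand-ins for illustration, not the authors' implementation; in a real model the weights would be learned.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for self-attention or a feed-forward network."""
    return [0.5 * v + 0.1 for v in x]


def post_norm_block(x):
    # Normalization applied after the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x):
    # Normalization applied to the sublayer input; the residual path is an identity.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def combine_layers(outputs, weights):
    """Weighted sum of the normalized outputs of all previous layers."""
    normed = [layer_norm(y) for y in outputs]
    dim = len(normed[0])
    return [sum(w * y[i] for w, y in zip(weights, normed)) for i in range(dim)]


def deep_encoder(x, num_layers):
    """Stack of blocks where each block sees a combination of all earlier outputs."""
    outputs = [x]
    for _ in range(num_layers):
        # Assumption: uniform weights; a trained model would learn these.
        weights = [1.0 / len(outputs)] * len(outputs)
        layer_input = combine_layers(outputs, weights)
        outputs.append(pre_norm_block(layer_input))
    return outputs[-1]
```

Calling `deep_encoder([1.0, 2.0, 3.0, 4.0], 30)` runs a 30-block stack on a toy vector. The only structural difference between the two block functions is whether `layer_norm` wraps the residual sum or just the sublayer input.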

representative citing papers

climt-paraformer: Stable Emulation of Convective Parameterization using a Temporal Memory-aware Transformer

physics.ao-ph · 2026-04-22 · unverdicted · novelty 5.0

A temporal memory-aware Transformer emulator for the Emanuel convective parameterization shows lower offline errors and 10-year stability in single-column model tests compared to memory-less MLP and LSTM baselines.

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

cs.CL · 2026-05-19 · unverdicted · novelty 4.0

m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.

citing papers explorer

Showing 2 of 2 citing papers.

climt-paraformer: Stable Emulation of Convective Parameterization using a Temporal Memory-aware Transformer physics.ao-ph · 2026-04-22 · unverdicted · none · ref 10
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder cs.CL · 2026-05-19 · unverdicted · none · ref 41 · internal anchor
