Scaling Neural Machine Translation

Ott, M · 2018 · cs.CL · arXiv 1806.00187

7 Pith papers cite this work. Polarity classification is still indexing.


abstract

Sequence-to-sequence learning models still require several days to reach state-of-the-art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large-batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14 English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 85 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset. On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.

representative citing papers

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

cs.CL · 2019-07-25 · unverdicted · novelty 6.0

DropAttention regularizes attention weights in fully-connected self-attention networks to reduce overfitting and improve performance.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

eess.AS · 2019-07-10 · unverdicted · novelty 5.0

ADPSGD and Hierarchical-ADPSGD support 3x larger batches than SSGD for ASR, training SWB-2000 to 7.6% WER on SWB and 13.2% on CH in 5.2 hours on 64 V100 GPUs.

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

cs.CL · 2026-05-20 · conditional · novelty 4.0

Development of domain-specific scientific corpora for English-Spanish, English-French, and English-Portuguese and their application to fine-tuning NMT models.

citing papers explorer


LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG · 2022-08-15 · conditional · none · ref 152

Scaling Laws for Transfer
cs.LG · 2021-02-02 · unverdicted · none · ref 124 · internal anchor

DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks
cs.CL · 2019-07-25 · unverdicted · none · ref 10 · internal anchor

Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 243

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12-01 · conditional · none · ref 166

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition
eess.AS · 2019-07-10 · unverdicted · none · ref 27 · internal anchor

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
cs.CL · 2026-05-20 · conditional · none · ref 16 · internal anchor
