Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Cho-Jui Hsieh; James Demmel; Jing Li; Jonathan Hseu; Kurt Keutzer; Sanjiv Kumar; Sashank Reddi; Srinadh Bhojanapalli; Xiaodan Song; Yang You

arxiv: 1904.00962 · v5 · pith:J655K2KRnew · submitted 2019-04-01 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Pith reviewed 2026-05-21 21:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML

keywords: large batch optimization · LAMB optimizer · BERT training · layerwise adaptive learning rates · deep neural networks · stochastic gradient descent · TPU acceleration

The pith

LAMB optimizer trains BERT using batch sizes of 32868 without performance loss, cutting training from 3 days to 76 minutes on a TPU pod.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops LAMB as a layerwise adaptive optimizer for large-batch training of deep networks, extending ideas from LARS to handle attention models like BERT where prior methods fall short. It proves convergence to stationary points in nonconvex settings and shows that the method requires little hyperparameter tuning across tasks. Very large batches become feasible without degrading final model quality, which directly reduces the wall-clock time needed for full training runs on high-memory hardware. This matters for practitioners because it lowers the compute cost and iteration time for large language models. The core result is demonstrated on BERT pretraining with batch sizes scaled to the memory limit of a TPUv3 Pod.

Core claim

LAMB applies a principled layerwise adaptation rule that sets each layer's effective learning rate from the ratio of the layer's weight norm to the norm of its Adam-style update, enabling stable training at batch sizes up to 32868. The paper proves convergence to a stationary point in general nonconvex settings, and empirical tests show that BERT training completes in 76 minutes on a TPUv3 Pod with no accuracy drop relative to smaller-batch baselines.

What carries the argument

The LAMB optimizer, which scales each layer's update by a trust ratio: the norm of the layer's parameters divided by the norm of its Adam-style update.
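The trust-ratio mechanism can be sketched for a single layer as follows. This is a minimal NumPy illustration, not the released implementation: the function name `lamb_step`, its default hyperparameters, the identity scaling of the weight norm, and the fall-back ratio of 1.0 for zero norms are all assumptions made for the example. In the paper's algorithm the denominator of the ratio is the norm of the Adam-style update plus the weight-decay term.

```python
import numpy as np

def lamb_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (illustrative names and defaults)."""
    # Adam-style first and second moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Per-element normalized update plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x

    # Layerwise trust ratio: weight norm over update norm.
    # Falling back to 1.0 when either norm is zero is an assumption of this sketch.
    w_norm = np.linalg.norm(x)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0.0 and u_norm > 0.0 else 1.0

    x = x - lr * trust_ratio * update
    return x, m, v, trust_ratio
```

With these choices the step length for a layer equals `lr * ||x||` whenever both norms are nonzero, independent of the gradient's scale, so each layer moves in proportion to the size of its own weights.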

If this is right

BERT pretraining and similar attention-based models can use batch sizes limited only by hardware memory without accuracy loss.
Training time on TPU pods drops from multiple days to under two hours for full BERT runs.
The same optimizer works for both ResNet-style and Transformer-style architectures with minimal retuning.
Convergence guarantees hold for nonconvex objectives typical in deep learning.
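As a quick check on the wall-clock claim, the implied speedup can be worked out directly. The sketch below assumes "3 days" means exactly 72 hours, which the source does not state; the variable names are illustrative.

```python
# Assumption: the "3 days" baseline is taken as exactly 72 hours.
baseline_minutes = 3 * 24 * 60      # 4320 minutes
lamb_minutes = 76                   # reported large-batch training time

speedup = baseline_minutes / lamb_minutes
print(f"approximate speedup: {speedup:.1f}x")
```

Under that assumption the reduction is roughly 57-fold, consistent with the "multiple days to under two hours" summary.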

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other large transformer models where memory limits currently force smaller batches.
Combining LAMB with mixed-precision training could yield further speedups on the same hardware.
The layerwise scaling idea might help stabilize training when batch sizes increase in other domains such as vision or reinforcement learning.

Load-bearing premise

The layerwise adaptation rule that works on ResNet transfers to attention models like BERT with only minor adjustments.

What would settle it

Training BERT with batch size 32868 and observing a clear drop in downstream task accuracy or failure to reach the same validation loss as the standard small-batch run would falsify the central claim.

Original abstract

Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes (Table 1). The LAMB implementation is available at https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
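For contrast with LAMB, the LARS update quoted from the paper's Appendix A, in which each layer i moves by a step of size η·φ(‖x_i‖) along its normalized gradient g_i/‖g_i‖, can be sketched in a few lines. The function name `lars_step`, the choice of φ as the identity, and the skip for zero gradients are assumptions for illustration; the momentum and weight decay used in practical LARS implementations are omitted.

```python
import numpy as np

def lars_step(layers, grads, lr=0.1):
    """Apply x_i <- x_i - lr * ||x_i|| * g_i / ||g_i|| to every layer.

    Assumes phi is the identity; a zero gradient leaves the layer unchanged.
    """
    new_layers = []
    for x, g in zip(layers, grads):
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            new_layers.append(x.copy())
            continue
        new_layers.append(x - lr * np.linalg.norm(x) * g / g_norm)
    return new_layers
```

Because the gradient is normalized per layer, a layer with a very large gradient and a layer with a tiny one both move by the same fraction of their own weight norm.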

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LAMB lets you train BERT to target quality in 76 minutes on a TPU pod with a 32k batch, but the no-degradation claim rests on a single run without variance or sensitivity checks.

LAMB gets BERT pretraining down to 76 minutes on a full TPUv3 pod using a batch size of 32868 while matching the small-batch baseline. That is the central practical result the paper wants you to take away. They also supply a convergence proof for the optimizer in nonconvex smooth settings, covering both LAMB and the earlier LARS method. The adaptation rule itself is a modest change: it applies the LARS-style layerwise trust ratio to an Adam-style update (including weight decay) rather than to the momentum-SGD update, which turns out to be enough to make the method work on BERT where plain LARS failed. The experiments cover BERT pretraining plus fine-tuning on GLUE and a ResNet-50 ImageNet run, all with very little hyperparameter search. The theory section is a clear step beyond the original LARS paper, and the fact that they release the implementation helps.

The main limitation is that the headline BERT result is reported from one run after minimal tuning. There are no standard deviations, no repeated seeds, and no plots showing what happens if the layer-wise trust ratio or the learning-rate scaling is moved by 10-20 percent. In nonconvex deep learning those choices can matter, so the equivalence claim would be more convincing with additional trials or a sensitivity check. The transfer story from ResNet to attention models is presented as straightforward once the rule is adjusted, and the numbers support that in this case, but broader testing would strengthen it.

This paper is for people who train large transformers or other models across many accelerators and want to push batch size up to the hardware limit. A reader working on scaling or on practical optimizer design will get direct value from the numbers and the proof. I would send it to peer review. The core contribution is clean, the benchmarks are the right ones, and the speed claim is worth a careful look even if the current evidence needs a bit more statistical grounding.

Referee Report

1 major / 2 minor

Summary. The paper introduces LAMB, a layerwise adaptive large-batch optimizer extending LARS with a principled adaptation rule. It supplies convergence guarantees to a stationary point for both LAMB and LARS under general nonconvex assumptions, and reports that LAMB trains BERT to the same final performance as the small-batch baseline using a batch size of 32868 on a TPUv3 Pod, reducing wall-clock time from 3 days to 76 minutes (Table 1) with very little hyperparameter tuning.

Significance. If the empirical equivalence holds under repeated trials, the work would meaningfully advance practical large-batch training for attention-based models, directly addressing the scaling limitations of LARS on transformers. The nonconvex convergence analysis supplies theoretical support that is often absent in optimizer papers and strengthens the case for layerwise trust-ratio methods.

major comments (1)

[Table 1 and surrounding experimental text] Table 1: the central claim that LAMB with batch size 32868 matches the small-batch baseline “without any degradation of performance” is supported by a single reported outcome. In nonconvex deep-learning landscapes, such equivalence can be fragile to initialization, schedule details, or small changes in the layer-wise trust ratio; the manuscript does not report multiple random seeds, standard deviations, or an ablation perturbing the trust-ratio or learning-rate scaling by 10–20 %.
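The repeated-seed check requested here could be organized along the following lines. This is only a sketch of one possible protocol: the helper names `summarize` and `within_noise`, and the rule of comparing the mean gap against `k` baseline standard deviations, are hypothetical choices and not something the paper or the report specifies.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation); needs at least two runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def within_noise(baseline_scores, large_batch_scores, k=2.0):
    """True if the large-batch mean is no more than k baseline standard
    deviations below the baseline mean (an illustrative criterion)."""
    base_mean, base_std = summarize(baseline_scores)
    large_mean, _ = summarize(large_batch_scores)
    return (base_mean - large_mean) <= k * base_std
```

Each list would hold one downstream metric per random seed, for example one score per pretraining run, so that "no degradation" is judged against run-to-run variation rather than a single outcome.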

minor comments (2)

[Abstract] The abstract states that convergence analysis is provided but does not indicate the precise assumptions (e.g., smoothness, bounded variance) or the nature of the rate; a one-sentence summary of the main theoretical result would improve accessibility.
[Implementation section] The GitHub link to the LAMB implementation is a positive reproducibility feature; confirming that the released code reproduces the exact Table 1 numbers would further strengthen the submission.

Circularity Check

0 steps flagged

No significant circularity; LAMB derivation is self-contained from layerwise adaptation principle and convergence analysis

The paper introduces LAMB by extending a layerwise adaptive rate strategy from LARS, supplies a general nonconvex convergence proof for both, and reports empirical results on BERT and ResNet-50 with large batches. No step reduces a claimed prediction or first-principles result to a fitted parameter or self-citation by construction; the optimizer definition and analysis stand independently of the final BERT timing numbers, which are presented as experimental outcomes rather than derived quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the definition of the LAMB update rule and standard nonconvex stochastic optimization assumptions rather than many fitted parameters or new entities.

axioms (1)

standard math: Standard assumptions for convergence of stochastic gradient methods in nonconvex settings
Invoked for the convergence analysis of both LAMB and LARS.
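For orientation, assumptions of this kind are usually a smoothness condition and a bound on gradient noise. The block below is a hedged sketch of how they are commonly written in a layerwise setting; the symbols $L_i$, $\sigma$, $\ell$ and $s$ are chosen here for illustration, and the exact statement in the paper may differ in detail.

```latex
% Layerwise smoothness: the gradient block for layer i is L_i-Lipschitz.
\|\nabla_i f(x) - \nabla_i f(y)\| \le L_i \,\|x^{(i)} - y^{(i)}\|
\quad \text{for all } x, y \text{ and } i \in [h],

% Bounded variance of the stochastic gradient computed on a sample s.
\mathbb{E}_s \big\| \nabla \ell(x, s) - \nabla f(x) \big\|^2 \le \sigma^2
\quad \text{for all } x .
```

Under conditions of this form, "convergence to a stationary point" means that the expected gradient norm at a suitably chosen iterate shrinks as the number of iterations grows.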

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
cs.LG 2026-05 unverdicted novelty 7.0

Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from ...
Learning PDEs for Portfolio Optimization with Quantum Physics-Informed Neural Networks
quant-ph 2026-04 unverdicted novelty 7.0

Quantum PINNs using tensor-rank polynomials solve the Merton portfolio optimization PDE more accurately and with far fewer parameters than classical neural networks.
Training Deep Learning Models with Norm-Constrained LMOs
cs.LG 2025-02 unverdicted novelty 7.0

Scion is a new stochastic LMO-based optimizer family that unifies existing methods, supports unconstrained problems, and delivers hyperparameter transferability plus speedups on nanoGPT training.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
cs.LG 2019-10 accept novelty 7.0

ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
cs.LG 2026-05 conditional novelty 6.0

Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.
TextTeacher: What Can Language Teach About Images?
cs.CV 2026-05 unverdicted novelty 6.0

TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
ShardTensor: Domain Parallelism for Scientific Machine Learning
cs.DC 2026-05 unverdicted novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
cs.LG 2026-05 unverdicted novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Foundation Models for Discovery and Exploration in Chemical Space
physics.chem-ph 2025-10 unverdicted novelty 6.0

MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
cs.LG 2025-10 unverdicted novelty 6.0

Presents a model-based proximal framework for adaptive momentum in first-order optimizers by using a two-plane approximation of the objective to dynamically set the memory coefficient online.
PLD: A Choice-Theoretic List-Wise Knowledge Distillation
cs.LG 2025-06 unverdicted novelty 6.0

PLD recasts knowledge distillation as a weighted list-wise ranking loss under the Plackett-Luce model that optimizes a teacher-optimal class ranking and subsumes weighted cross-entropy.
Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling
stat.ML 2026-05 unverdicted novelty 5.0

Adaptive Batch Scaling dynamically increases batch size in on-policy RL as policy volatility drops, measured by a new Behavioral Divergence metric, and shows larger networks plus larger batches outperform on ALE with PQN.
On the Stability of Growth in Structural Plasticity
cs.LG 2026-05 unverdicted novelty 5.0

Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in compl...
AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
cs.LG 2026-05 unverdicted novelty 5.0

AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
On the Convergence Analysis of Muon
stat.ML 2025-05 unverdicted novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
cs.CL 2019-07 accept novelty 5.0

With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training
cs.LG 2026-02 unverdicted novelty 4.0

Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 23 Pith papers · 18 internal anchors

[1]

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325.

[2]

signSGD: Compressed Optimisation for Non-Convex Problems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. CoRR, abs/1802.04434.

[3]

Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291.

[4]

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Aditya Devarakonda, Maxim Naumov, and Michael Garland. AdaBatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029.

[5]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[6]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

[7]

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741.

[8]

Firecaffe: near-linear acceleration of deep neural network training on compute clusters

Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600.

[9]

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205.

[10]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

[11]

One weird trick for parallelizing convolutional neural networks

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

[12]

Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash

Hiroaki Mikami, Hisahiro Suganuma, Yoshiki Tanaka, Yuichi Kageyama, et al. ImageNet/ResNet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233.

[13]

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Second-order optimization method for large mini-batch: Training ResNet-50 on ImageNet in 35 epochs. arXiv preprint arXiv:1811.12019.

[14]

Measuring the Effects of Data Parallelism on Neural Network Training

Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.

[15]

Don't Decay the Learning Rate, Increase the Batch Size

Samuel L Smith, Pieter-Jan Kindermans, and Quoc V Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.

[16]

Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds

Masafumi Yamazaki, Akihiko Kasagi, Akihiro Tabuchi, Takumi Honda, Masahiro Miwa, Naoto Fukumoto, Tsuguchika Tabaru, Atsushi Ike, and Kohta Nakashima. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. arXiv preprint arXiv:1903.12650.

[17]

Image Classification at Supercomputer Scale

Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992.

[18]

Large Batch Training of Convolutional Networks

Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for ImageNet training. arXiv preprint arXiv:1708.03888.

[19]

Large-Batch Training for LSTM and Beyond

Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large-batch training for LSTM and beyond. arXiv preprint arXiv:1901.08256.

[20]

We analyze the convergence of LARS for general minibatch size here

Published as a conference paper at ICLR 2020. Appendix A: Proof of Theorem 2. Proof. We analyze the convergence of LARS for general minibatch size here. Recall that the update of LARS is the following: $x^{(i)}_{t+1} = x^{(i)}_t - \eta_t \, \phi(\|x^{(i)}_t\|) \, \frac{g^{(i)}_t}{\|g^{(i)}_t\|}$, for all $i \in [h]$. For simplicity of notation, we reason the Since the function $f$ is $L$-smooth, we have th...

[21]

Then LARS convergence rate can be written in the following manner: $(\mathbb{E}[\|\nabla f(x_a)\|])^2 \le O\left( \frac{(f(x_1) - f(x^*)) L_\infty}{T} \frac{\psi_L}{\psi_g^2} + \frac{\|\sigma\|^2}{T} \frac{\psi_\sigma^2}{\psi_g^2} \right)$

for comparing SIGN SGD with SGD, we define the following quantities: $\left( \sum_{i=1}^{h} \|\nabla_i f(x_t)\| \right)^2 = \frac{\psi(\nabla f(x_t)) \, d \, \|\nabla f(x_t)\|^2}{h} \ge \frac{\psi_g \, d \, \|\nabla f(x_t)\|^2}{h}$, $\|L\|_1^2 \le \frac{\psi_L \, d^2 \, \|L\|_\infty^2}{h^2}$, $\|\sigma\|_1^2 = \frac{\psi_\sigma \, d \, \|\sigma\|^2}{h}$. Then LARS convergence rate can be written in the following manner: $(\mathbb{E}[\|\nabla f(x_a)\|])^2 \le O\left( \frac{(f(x_1) - f(x^*)) L_\infty}{T} \frac{\psi_L}{\psi_g^2} + \frac{\|\sigma\|^2}{T} \frac{\psi_\sigma^2}{\psi_g^2} \right)$. If $\psi_L \ll \psi_g^2$ and $\psi_\sigma \ll \psi_g^2$ then LARS (i.e., gradient ...

[22]

We used the same settings for N-LAMB and NN-LAMB

Dozat (2016) suggested the best performance of Nadam was achieved by $\beta_1 = 0.975$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We used the same settings for N-LAMB and NN-LAMB. We scaled the batch size to 32K for ImageNet training with ResNet-50. Our experimental results show that N-LAMB and NN-LAMB can achieve a comparable accuracy compared to LAMB optimizer. Their performa...

[23]

According to our experimental results, adam-correction essentially has the same effect as learning rate warmup (see Figure 2)

It has an impact on the learning rate by $\eta_t := \eta_t \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$. According to our experimental results, adam-correction essentially has the same effect as learning rate warmup (see Figure 2). The warmup function often was implemented in the modern deep learning system. Thus, we can remove adam-correction from the LAMB optimizer. We did not observe...

[24]

LAMB optimizer is able to achieve 94.08% test accuracy in 24 epochs, which is better than other adaptive optimizers and momentum SGD

We use the implementation of TensorFlow on TPUs. LAMB optimizer is able to achieve 94.08% test accuracy in 24 epochs, which is better than other adaptive optimizers and momentum SGD. Even on the smaller tasks like MNIST training with LeNet, LAMB is able to achieve a better accuracy than existing solvers (Table 7). Footnote 5: https://dawn.cs.stanford.edu/benchmark/C...

[25]

This figure shows that LAMB can make the training converge smoothly at the batch size of 64K. Figure 8 shows that we can achieve 76.8% scaling efficiency by scaling the batch size (49.1 times speedup by 64 times computational resources) and 101.8% scaling efficiency with mixed-batch (65.2 times speedup by 64 times computational resources) ...

[26]

The target F1 score is 90.5

Table 8: ADAMW stops scaling at the batch size of 16K. The target F1 score is 90.5. LAMB achieves a F1 score of 91.345. The table shows the tuning information of ADAMW. In this table, we report the best F1 score we observed from our experiments. Solver batch size warmup steps LR last step information F1 score...

[27]

Figure 7: This figure shows the training loss curve of LAMB optimizer

Based on our comprehensive tuning results, we conclude the existing adaptive solvers do not perform well on ImageNet training or at least it is hard to tune them. Figure 7: This figure shows the training loss curve of LAMB optimizer. This figure shows that LAMB can make the training converge smoothly at the ex...

[1] [1]

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

signSGD: Compressed Optimisation for Non-Convex Problems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd: compressed optimisation for non-convex problems. CoRR, abs/1802.04434,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch sgd: Residual network training on imagenet-1k with improved accuracy and reduced time to train.arXiv preprint arXiv:1711.04291,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Aditya Devarakonda, Maxim Naumov, and Michael Garland. Adabatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Firecaffe: near-linear acceleration of deep neural network training on compute clusters

9 Published as a conference paper at ICLR 2020 Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600,

work page 2020

[9] [9]

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

One weird trick for parallelizing convolutional neural networks

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash

Hiroaki Mikami, Hisahiro Suganuma, Yoshiki Tanaka, Yuichi Kageyama, et al. Imagenet/resnet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Second-order optimization method for large mini-batch: Training resnet-50 on imagenet in 35 epochs. arXiv preprint arXiv:1811.12019,

[14]

Measuring the Effects of Data Parallelism on Neural Network Training

Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600,

[15]

Don't Decay the Learning Rate, Increase the Batch Size

Samuel L Smith, Pieter-Jan Kindermans, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489,

[16]

Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds

Masafumi Yamazaki, Akihiko Kasagi, Akihiro Tabuchi, Takumi Honda, Masahiro Miwa, Naoto Fukumoto, Tsuguchika Tabaru, Atsushi Ike, and Kohta Nakashima. Yet another accelerated sgd: Resnet-50 training on imagenet in 74.7 seconds. arXiv preprint arXiv:1903.12650,

[17]

Image Classification at Supercomputer Scale

Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992,

[18]

Large Batch Training of Convolutional Networks

Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888,

[19]

Large-Batch Training for LSTM and Beyond

Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large-batch training for lstm and beyond. arXiv preprint arXiv:1901.08256,

[20]

We analyze the convergence of LARS for general minibatch size here

Appendix A: Proof of Theorem 2

Proof. We analyze the convergence of LARS for general minibatch size here. Recall that the update of LARS is the following: $x^{(i)}_{t+1} = x^{(i)}_t - \eta_t \, \phi(\|x^{(i)}_t\|) \, \frac{g^{(i)}_t}{\|g^{(i)}_t\|}$, for all $i \in [h]$. For simplicity of notation, we reason the ... Since the function $f$ is $L$-smooth, we have th...
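The layerwise update recalled above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `l2_norm` and `lars_step` are hypothetical, the scaling function `phi` defaults to the identity as an assumption, and leaving a layer unchanged when its gradient is exactly zero is a choice made here only to avoid dividing by zero.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def lars_step(layers, grads, lr, phi=lambda z: z):
    """One normalized layerwise step.

    For each layer i: x_i <- x_i - lr * phi(||x_i||) * g_i / ||g_i||.
    `layers` and `grads` are lists of flat lists, one per layer.
    `phi` is the scaling function (identity here, by assumption).
    """
    new_layers = []
    for x, g in zip(layers, grads):
        g_norm = l2_norm(g)
        if g_norm == 0.0:
            # Assumption: skip the update when the gradient vanishes.
            new_layers.append(list(x))
            continue
        scale = lr * phi(l2_norm(x)) / g_norm
        new_layers.append([xi - scale * gi for xi, gi in zip(x, g)])
    return new_layers


# Usage: one layer with ||x|| = 5 and a gradient along the second axis.
updated = lars_step([[3.0, 4.0]], [[0.0, 2.0]], lr=0.1)
print(updated)
```

Because each layer's gradient is normalized, rescaling a gradient by a positive constant leaves the step unchanged; the step length is set by `lr * phi(||x||)` alone.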

[21]

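Then LARS convergence rate can be written in the following manner: $\left(\mathbb{E}[\|\nabla f(x_a)\|]\right)^2 \le O\left(\frac{(f(x_1) - f(x^*))\, L_\infty}{T} \frac{\psi_L}{\psi_g^2} + \frac{\|\sigma\|^2}{T} \frac{\psi_\sigma^2}{\psi_g^2}\right)$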

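for comparing SIGN SGD with SGD, we define the following quantities: $\left(\sum_{i=1}^{h} \|\nabla_i f(x_t)\|\right)^2 = \frac{\psi(\nabla f(x_t))\, d\, \|\nabla f(x_t)\|^2}{h} \ge \frac{\psi_g\, d\, \|\nabla f(x_t)\|^2}{h}$, $\|L\|_1^2 \le \frac{\psi_L\, d^2\, \|L\|_\infty^2}{h^2}$, and $\|\sigma\|_1^2 = \frac{\psi_\sigma\, d\, \|\sigma\|^2}{h}$. Then LARS convergence rate can be written in the following manner: $\left(\mathbb{E}[\|\nabla f(x_a)\|]\right)^2 \le O\left(\frac{(f(x_1) - f(x^*))\, L_\infty}{T} \frac{\psi_L}{\psi_g^2} + \frac{\|\sigma\|^2}{T} \frac{\psi_\sigma^2}{\psi_g^2}\right)$. If $\psi_L \ll \psi_g^2$ and $\psi_\sigma \ll \psi_g^2$ then LARS (i.e., gradient ...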

[22]

We used the same settings for N-LAMB and NN-LAMB

Dozat (2016) suggested the best performance of Nadam was achieved by β1 = 0.975, β2 = 0.999, and ϵ = 1e-8. We used the same settings for N-LAMB and NN-LAMB. We scaled the batch size to 32K for ImageNet training with ResNet-50. Our experimental results show that N-LAMB and NN-LAMB achieve accuracy comparable to the LAMB optimizer. Their performa...

[23]

According to our experimental results, adam-correction essentially has the same effect as learning rate warmup (see Figure 2)

It has an impact on the learning rate by $\eta_t := \eta_t \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$. According to our experimental results, adam-correction essentially has the same effect as learning rate warmup (see Figure 2). Warmup is commonly implemented in modern deep learning systems, so we can remove adam-correction from the LAMB optimizer. We did not observe...
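To see why the correction resembles warmup, one can tabulate the multiplier it applies to the learning rate. The sketch below assumes the standard Adam form of the factor, sqrt(1 - β2^t) / (1 - β1^t), and the commonly used defaults β1 = 0.9 and β2 = 0.999; neither the function name `adam_correction_factor` nor these defaults come from the text above.

```python
import math


def adam_correction_factor(t, beta1=0.9, beta2=0.999):
    """Multiplier applied to the learning rate at step t (t >= 1).

    Assumed form: sqrt(1 - beta2**t) / (1 - beta1**t).
    """
    return math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


# Print the multiplier at a few steps to see how it evolves.
for t in (1, 10, 100, 1000, 10000, 100000):
    print(t, round(adam_correction_factor(t), 4))
```

With these assumed values the multiplier stays well below 1 for the first few hundred steps and approaches 1 as t grows, which is qualitatively what a warmup schedule does.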

[24]

LAMB optimizer is able to achieve 94.08% test accuracy in 24 epochs, which is better than other adaptive optimizers and momentum SGD

We use the implementation of TensorFlow on TPUs. LAMB optimizer is able to achieve 94.08% test accuracy in 24 epochs, which is better than other adaptive optimizers and momentum SGD. Even on the smaller tasks like MNIST training with LeNet, LAMB is able to achieve a better accuracy than existing solvers (Table 7). Footnote 5: https://dawn.cs.stanford.edu/benchmark/C...

[25]

This figure shows that LAMB can make the training converge smoothly at the batch size of 64K. Figure 8 shows that we can achieve 76.8% scaling efficiency by scaling the batch size (a 49.1 times speedup with 64 times the computational resources) and 101.8% scaling efficiency with mixed-batch (a 65.2 times speedup with 64 times the computational resources).
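The scaling-efficiency figures quoted above follow from dividing the speedup by the resource multiple. The short sketch below reproduces that arithmetic; the function name `scaling_efficiency` is hypothetical, and because the speedups in the text are already rounded, the results match the quoted percentages only to within about a tenth of a point.

```python
def scaling_efficiency(speedup, resource_multiple):
    """Scaling efficiency in percent: 100 * speedup / resource multiple."""
    return 100.0 * speedup / resource_multiple


# Speedups and the 64x resource multiple are taken from the text above.
print(round(scaling_efficiency(49.1, 64), 1))  # large-batch run
print(round(scaling_efficiency(65.2, 64), 1))  # mixed-batch run
```

An efficiency above 100% simply means the measured speedup exceeded the increase in computational resources.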

[26]

The target F1 score is 90.5

Table 8: ADAMW stops scaling at the batch size of 16K. The target F1 score is 90.5. LAMB achieves an F1 score of 91.345. The table shows the tuning information of ADAMW. In this table, we report the best F1 score we observed from our experiments. Solver batch size warmup steps LR last step information F1 score...

[27]

Figure 7: This figure shows the training loss curve of LAMB optimizer

Based on our comprehensive tuning results, we conclude that the existing adaptive solvers either do not perform well on ImageNet training or are hard to tune. Figure 7: This figure shows the training loss curve of LAMB optimizer. This figure shows that LAMB can make the training converge smoothly at the ex...
