Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki, Keisuke Fukuda · 2017 · cs.DC · arXiv 1711.04325

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, we employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule. This paper also describes the details of the hardware and software of the system used to achieve the above performance.

citation-role summary

background: 1

citation-polarity summary

background: 1

representative citing papers

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

cs.LG · 2019-04-01 · conditional · novelty 6.0

LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Fast Training of Sparse Graph Neural Networks on Dense Hardware

stat.ML · 2019-06-27 · unverdicted · novelty 5.0

Techniques enable training the sparse GNN from Allamanis et al. [2018] on dense TPU hardware in 13 minutes versus a full day originally.

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

cs.LG · 2019-06-26 · unverdicted · novelty 5.0

GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

citing papers explorer

Showing 7 of 7 citing papers.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods · cs.DC · 2026-04-02 · unverdicted · none · ref 9
Scaling Laws for Transfer · cs.LG · 2021-02-02 · unverdicted · none · ref 125 · internal anchor
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes · cs.LG · 2019-04-01 · conditional · none · ref 1 · internal anchor
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 244
A General Language Assistant as a Laboratory for Alignment · cs.CL · 2021-12-01 · conditional · none · ref 167
Fast Training of Sparse Graph Neural Networks on Dense Hardware · stat.ML · 2019-06-27 · unverdicted · none · ref 1 · internal anchor
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD · cs.LG · 2019-06-26 · unverdicted · none · ref 2 · internal anchor
