Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Canonical reference. 82% of citing Pith papers cite this work as background.

83 Pith papers cite it.
Background: 82% of classified citations.
abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
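The linear scaling rule and warmup scheme named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's Caffe2 implementation: the reference learning rate of 0.1 at a minibatch of 256 and the linear shape of the warmup ramp are assumptions not stated in the abstract, and the function names `scaled_lr` and `warmup_lr` are hypothetical.

```python
def scaled_lr(minibatch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the learning rate by the same
    factor as the minibatch size.

    base_lr and base_batch are assumed reference values; they are
    not given in the abstract.
    """
    return base_lr * minibatch_size / base_batch


def warmup_lr(iteration, warmup_iters, minibatch_size,
              base_lr=0.1, base_batch=256):
    """Gradual warmup: ramp linearly from base_lr to the scaled
    target over the first warmup_iters iterations, then hold the
    target. The linear ramp is an assumption for illustration.
    """
    target = scaled_lr(minibatch_size, base_lr, base_batch)
    if warmup_iters <= 0 or iteration >= warmup_iters:
        return target
    alpha = iteration / warmup_iters  # fraction of warmup completed
    return base_lr + alpha * (target - base_lr)


if __name__ == "__main__":
    # Minibatch of 8192 is 32x the assumed reference of 256,
    # so the target rate is 32x the assumed base rate.
    print(scaled_lr(8192))
    for it in (0, 50, 100):
        print(it, warmup_lr(it, warmup_iters=100, minibatch_size=8192))
```

Starting from the small-batch rate and ramping up avoids applying the full scaled rate while the network is still changing rapidly early in training, which is the optimization difficulty the abstract refers to.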

citation-role summary: background 15, method 2

representative citing papers

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

Scalable Diffusion Models with Transformers

cs.CV · 2022-12-19 · unverdicted · novelty 7.0

DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Switchable Normalization for Learning-to-Normalize Deep Representation

cs.CV · 2019-07-22 · unverdicted · novelty 7.0

Switchable Normalization learns per-layer weights to combine channel, layer, and minibatch normalizers, claiming robustness to batch size and better results than fixed normalizers on ImageNet, COCO, CityScapes, ADE20K, MegaFace, and Kinetics.

The Adversarial Robustness of Sampling

cs.DS · 2019-06-26 · unverdicted · novelty 7.0

To achieve robustness to adaptive adversaries, Bernoulli and reservoir sampling require sample size Ω(log |R| / ε²) instead of the static VC-dimension bound.
