Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Feihu Zhou; Guangxiao Hu; Haidong Rong; Liqiang Xie; Liwei Yu; Shaohuai Shi; Shutao Song; Tiegang Chen; Wei He; Xianyan Jia

arxiv: 1807.11205 · v1 · pith:LBT725URnew · submitted 2018-07-30 · 💻 cs.LG · cs.DC· stat.ML

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia , Shutao Song , Wei He , Yangzihao Wang , Haidong Rong , Feihu Zhou , Liqiang Xie , Zhenyu Guo

show 6 more authors

Yuanzhou Yang Liwei Yu Tiegang Chen Guangxiao Hu Shaohuai Shi Xiaowen Chu

This is my paper

classification 💻 cs.LG cs.DCstat.ML

keywords trainingsystemaccuracyminutesachievedeepgpushighly

0 comments

read the original abstract

Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the communication-to-computation ratio, it may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch size (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50 respectively than NCCL-based training on a cluster with 1024 Tesla P40 GPUs. On training ResNet-50 with 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs spent 15 minutes and achieved 74.9\% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs spent 20 minutes and achieved 75.4\% accuracy. Our training system can achieve 75.8\% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet with 95 epochs, our system can achieve 58.7\% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
cs.LG 2019-04 conditional novelty 6.0

LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.