Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash

Hiroaki Mikami; Hisahiro Suganuma; Pongsakorn U-Chupala; Yoshiki Tanaka; Yuichi Kageyama

arxiv: 1811.05233 · v2 · pith:6VARSBRXnew · submitted 2018-11-13 · 💻 cs.LG · cs.CV

classification 💻 cs.LG cs.CV

keywords training · address · all-reduce · cluster · d-torus · distributed · gradient · imagenet

Scaling distributed deep learning to a massive GPU cluster is challenging because large mini-batch training is unstable and gradient synchronization adds overhead. We address the instability of large mini-batch training with batch-size control and label smoothing, and the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We have successfully trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on the ABCI cluster.
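To make the "logical 2D grid" idea concrete, the sketch below simulates a grid all-reduce in a single process using plain Python lists. The abstract only says that collectives run in different orientations; the specific three-step split used here (reduce-scatter along rows, all-reduce along columns, all-gather along rows) is an assumption for illustration, as are the function name `grid_all_reduce` and its parameters. It does no real communication and is not the NNL implementation.

```python
def grid_all_reduce(grads, rows, cols):
    """Simulate an all-reduce over a rows x cols logical grid of workers.

    grads[r][c] is the gradient (a list of numbers) held by worker (r, c).
    Returns a grid of the same shape in which every worker holds the
    elementwise sum of all gradients. Hypothetical helper for illustration.
    """
    n = len(grads[0][0])
    if n % cols != 0:
        raise ValueError("gradient length must be divisible by cols")
    chunk = n // cols

    # Step 1 (horizontal): reduce-scatter within each row.
    # Worker (r, c) ends up owning the row-wise sum of chunk c only.
    partial = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            partial[r][c] = [
                sum(grads[r][k][i] for k in range(cols)) for i in range(lo, hi)
            ]

    # Step 2 (vertical): all-reduce within each column.
    # Each column sums its chunk across rows; only 1/cols of the data moves.
    for c in range(cols):
        col_sum = [
            sum(partial[r][c][i] for r in range(rows)) for i in range(chunk)
        ]
        for r in range(rows):
            partial[r][c] = list(col_sum)

    # Step 3 (horizontal): all-gather within each row.
    # Every worker collects all chunks and now holds the full global sum.
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        full = [x for c in range(cols) for x in partial[r][c]]
        for c in range(cols):
            result[r][c] = list(full)
    return result


if __name__ == "__main__":
    rows, cols, n = 2, 2, 4
    # Worker (r, c) holds a distinct integer gradient so the sum is easy to check.
    grads = [[[(r * cols + c) * 10 + i for i in range(n)] for c in range(cols)]
             for r in range(rows)]
    out = grid_all_reduce(grads, rows, cols)
    print(out[0][0])
```

In this arrangement the vertical step operates on a chunk that is `1/cols` of the gradient, which is the intuition for why a grid layout can reduce synchronization cost compared with a single flat ring over all workers.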

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
cs.LG 2019-04 conditional novelty 6.0

LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.