The Effect of Network Width on the Performance of Large-batch Training

Dimitris Papailiopoulos; Hongyi Wang; Jinman Zhao; Lingjiao Chen; Paraschos Koutris

arxiv: 1806.03791 · v1 · pith:VZCQPYOSnew · submitted 2018-06-11 · stat.ML · cs.DC · cs.LG · math.OC · stat.CO

Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris

classification stat.ML · cs.DC · cs.LG · math.OC · stat.CO

keywords training · batches · large-batch · networks · performance · convergence · deeper · gradient

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
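The abstract compares wider and deeper networks at a fixed number of parameters. As a rough illustration of what such a matched comparison involves for fully-connected networks, the sketch below counts weights and biases and finds the hidden width at which a deeper network fits the same parameter budget as a wide, shallow one. The function names and the layer sizes (784 inputs, 10 outputs, 1024 hidden units, 8 hidden layers) are assumptions chosen for illustration; they are not taken from the paper, and this is not the authors' experimental setup.

```python
def fc_param_count(input_dim, hidden_width, num_hidden_layers, output_dim):
    """Count weights and biases of a fully-connected network whose
    hidden layers all share the same width."""
    dims = [input_dim] + [hidden_width] * num_hidden_layers + [output_dim]
    # Each layer contributes a (d_in x d_out) weight matrix plus d_out biases.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


def width_for_budget(budget, input_dim, num_hidden_layers, output_dim):
    """Largest uniform hidden width whose parameter count stays within budget."""
    if fc_param_count(input_dim, 1, num_hidden_layers, output_dim) > budget:
        raise ValueError("budget too small for even a width-1 network")
    width = 1
    while fc_param_count(input_dim, width + 1, num_hidden_layers, output_dim) <= budget:
        width += 1
    return width


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    input_dim, output_dim = 784, 10
    wide_params = fc_param_count(input_dim, 1024, 1, output_dim)
    deep_width = width_for_budget(wide_params, input_dim, 8, output_dim)
    deep_params = fc_param_count(input_dim, deep_width, 8, output_dim)
    print("wide: 1 hidden layer of width 1024,", wide_params, "parameters")
    print("deep: 8 hidden layers of width", deep_width, ",", deep_params, "parameters")
```

Matching budgets this way isolates the shape of the network from its size, so any difference in how well the two configurations tolerate larger batches cannot be attributed to one simply having more parameters.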

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.