Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

Ashwin Pananjady; Dimitris Papailiopoulos; Dong Yin; Kannan Ramchandran; Max Lam; Peter Bartlett

arxiv: 1706.05699 · v3 · pith:DY5RCOPBnew · submitted 2017-06-18 · 💻 cs.LG · cs.DC

Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

classification 💻 cs.LG cs.DC

keywords gradient diversity mini-batch distributed performance batch-size beyond generalization

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity that measures the dissimilarity between concurrent gradient updates, and show its key role in the performance of mini-batch SGD. We prove that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one sample) SGD. We further establish lower bounds on convergence where mini-batch SGD slows down beyond a particular batch-size, solely due to the lack of gradient diversity. We provide experimental evidence indicating the key role of gradient diversity in distributed learning, and discuss how heuristics like dropout, Langevin dynamics, and quantization can improve it.
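
The abstract describes gradient diversity only in words, as a measure of dissimilarity between concurrent gradient updates. As a minimal sketch, assuming the ratio form of the definition (the sum of squared per-example gradient norms divided by the squared norm of the summed gradient), the quantity could be computed as below. The function name `gradient_diversity` and the toy gradient vectors are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def gradient_diversity(grads: Sequence[Sequence[float]]) -> float:
    """Assumed ratio form: sum_i ||g_i||^2 / ||sum_i g_i||^2.

    `grads` holds one gradient vector per example, all of equal length.
    Returns infinity when the gradients cancel exactly.
    """
    if not grads:
        raise ValueError("need at least one gradient vector")
    dim = len(grads[0])
    if any(len(g) != dim for g in grads):
        raise ValueError("all gradient vectors must have the same length")

    # Numerator: sum of squared Euclidean norms of the individual gradients.
    sum_of_sq_norms = sum(sum(x * x for x in g) for g in grads)

    # Denominator: squared Euclidean norm of the coordinate-wise sum.
    summed = [sum(g[j] for g in grads) for j in range(dim)]
    sq_norm_of_sum = sum(x * x for x in summed)

    if sq_norm_of_sum == 0.0:
        return float("inf")
    return sum_of_sq_norms / sq_norm_of_sum


# Four identical gradients: the lowest-diversity case, value 1/n.
identical = [[1.0, 0.0]] * 4
# Two orthogonal gradients of equal norm: value 1.
orthogonal = [[1.0, 0.0], [0.0, 1.0]]

print(gradient_diversity(identical))   # 0.25
print(gradient_diversity(orthogonal))  # 1.0
```

Under this assumed form, identical gradients give 1/n (here 0.25 for n = 4), while mutually orthogonal gradients give 1, which matches the abstract's intuition that similar concurrent gradients add little beyond a single update.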

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.