
arXiv: 1809.07599 · v2 · pith:LUPMCTUS · submitted 2018-09-20 · cs.LG · cs.DC · cs.DS · stat.ML

Sparsified SGD with Memory

keywords: algorithms, communication, distributed, gradient, instance, memory, rate, same

Abstract

Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). While such schemes have shown very promising performance in practice, they have eluded theoretical analysis so far. In this work, we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) while still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
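
The scheme the abstract describes (top-k sparsification plus a memory of accumulated errors) can be sketched in a few lines of NumPy. This is a minimal single-worker illustration written from the abstract's description, not the authors' reference code: the names `top_k`, `mem_sgd`, `grad_fn`, and `compress`, the constant step size, and the toy least-squares problem are all assumptions made for the example.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out


def mem_sgd(grad_fn, x0, k, lr, steps, compress=top_k):
    """SGD with k-sparsified updates and error compensation (hypothetical sketch).

    grad_fn(x) returns a (stochastic) gradient at x. Only `sent` would be
    communicated in a distributed setting; whatever the compressor drops
    is kept in `memory` and added back before the next compression.
    """
    x = np.asarray(x0, dtype=float).copy()
    memory = np.zeros_like(x)
    for _ in range(steps):
        update = memory + lr * grad_fn(x)  # scaled gradient plus accumulated error
        sent = compress(update, k)         # sparse part that is actually applied
        x -= sent
        memory = update - sent             # residual carried to the next step
    return x


if __name__ == "__main__":
    # Toy consistent least-squares problem (assumed for illustration only).
    rng = np.random.default_rng(0)
    A = rng.normal(size=(200, 20))
    x_star = rng.normal(size=20)
    b = A @ x_star

    def stochastic_grad(x):
        i = rng.integers(A.shape[0])       # sample one data point
        return (A[i] @ x - b[i]) * A[i]

    x = mem_sgd(stochastic_grad, np.zeros(20), k=2, lr=0.002, steps=20000)
    print("distance to solution:", np.linalg.norm(x - x_star))
```

Here each step applies only `k` of the 20 coordinates, so a worker would transmit `k` index-value pairs instead of a dense vector. Passing a different `compress` function (for example one that picks `k` coordinates at random) gives a random-k variant under the same assumptions.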



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

    cs.LG · 2025-09 · unverdicted · novelty 7.0

    DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x thr...