Local SGD Converges Fast and Communicates Little

Sebastian U. Stich

arxiv: 1805.09767 · v3 · pith:XTSUYJTUnew · submitted 2018-05-24 · 🧮 math.OC · cs.DC · cs.LG

classification 🧮 math.OC · cs.DC · cs.LG

keywords local · number · mini-batch · scheme · communication · large · workers · converges

Mini-batch stochastic gradient descent (SGD) is state of the art in large-scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice, as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size. The number of communication rounds can be reduced by up to a factor of T^{1/2} compared to mini-batch SGD, where T denotes the total number of steps. This also holds for asynchronous implementations. Local SGD can also be used for large-scale training of deep learning models. The results shown here are intended to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.
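The scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the least-squares objective, the function name `local_sgd`, and the parameters `num_workers`, `sync_every`, and `lr` are all assumptions chosen for illustration, and the workers are simulated sequentially in one process.

```python
import numpy as np


def local_sgd(A, b, num_workers=4, total_steps=200, sync_every=10, lr=0.01, seed=0):
    """Illustrative local SGD on f(x) = 1/(2n) * ||A x - b||^2 (assumed objective).

    Returns the averaged iterate and the number of averaging (communication) rounds.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # One row per worker; all workers start from the same point.
    x = np.zeros((num_workers, d))
    rounds = 0
    for t in range(1, total_steps + 1):
        # Local phase: each worker takes one SGD step on its own random sample.
        for k in range(num_workers):
            i = rng.integers(n)
            grad = (A[i] @ x[k] - b[i]) * A[i]
            x[k] -= lr * grad
        # Communication phase: average the worker iterates only every sync_every steps.
        if t % sync_every == 0:
            x[:] = x.mean(axis=0)
            rounds += 1
    # Report the average of the worker iterates (not counted as a round here).
    return x.mean(axis=0), rounds
```

With `sync_every=1` the workers are averaged after every step, so the averaged iterate follows mini-batch SGD with batch size `num_workers`. Larger values of `sync_every` cut the number of communication rounds to `total_steps // sync_every`. How large the interval can be without hurting the rate is what the paper analyses; this sketch does not verify that.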

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
stat.ML 2026-05 unverdicted novelty 7.0

Establishes non-asymptotic Gaussian approximation bounds for federated LSA with explicit communication-heterogeneity trade-offs and introduces an online multiplier bootstrap for last-iterate inference with validity gu...

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning
stat.ML 2026-05 unverdicted novelty 6.0

Introduces FedHybrid and FedNewton for DP federated M-estimation, with finite-sample MSE bounds, minimax lower bound, and evaluations on vision datasets.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
cs.DC 2026-05 unverdicted novelty 6.0

Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-stalenes...

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 6.0

Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
cs.LG 2026-05 unverdicted novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...