Achieving Linear Speedup with ProxSkip in Distributed Stochastic Optimization
Pith reviewed 2026-05-24 06:45 UTC · model grok-4.3
The pith
Decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decentralized ProxSkip attains linear speedup in the number of nodes under stochastic gradients, with a unified convergence analysis for non-convex, convex, and strongly convex objectives that reveals the joint influence of gradient noise, local updates, network connectivity, and data heterogeneity on the rate.
What carries the argument
Decentralized ProxSkip, which performs local updates and skips proximal steps while exchanging information over a connected network to reduce communication.
If this is right
- Local updates reduce communication frequency while preserving linear speedup.
- The linear speedup holds for non-convex problems as well as convex ones.
- Network connectivity and data heterogeneity appear explicitly in the convergence bound.
- Stochastic gradients do not prevent the linear scaling with more nodes.
Where Pith is reading between the lines
- Similar local-update and skipping patterns could produce linear speedup in other distributed first-order methods.
- Choosing the number of local steps according to measured network connectivity may further improve efficiency in practice.
- The same analysis approach might extend to federated settings where data partitions are fixed and heterogeneous.
Load-bearing premise
The communication network remains connected and the variance of the stochastic gradients stays finite.
What would settle it
An experiment that doubles the number of nodes yet shows no proportional reduction in iterations needed to reach target accuracy under stochastic non-convex conditions would disprove the linear speedup.
Figures
read the original abstract
The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve linear speedup with respect to the number of nodes. Key questions regarding its behavior in the non-convex setting and the achievability of linear speedup remain open. In this paper, we revisit decentralized ProxSkip and answer these questions affirmatively. We provide a unified convergence analysis for stochastic non-convex, convex, and strongly convex problems, revealing how gradient noise, local updates, network connectivity, and data heterogeneity jointly determine the convergence behavior. To the best of our knowledge, this is the first analysis showing that decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients. Moreover, our results demonstrate that local updates can effectively reduce communication frequency and improve communication efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits decentralized ProxSkip and supplies a unified convergence analysis for stochastic non-convex, convex, and strongly convex regimes. The rates explicitly separate the effects of gradient noise variance, local-update count, graph connectivity (mixing time or spectral gap), and data heterogeneity; linear speedup in node count is obtained when local-update and communication parameters are chosen to balance these terms. The manuscript claims this is the first such analysis achieving linear speedup under stochastic gradients.
Significance. If the derivations hold, the work supplies the first linear-speedup guarantee for decentralized ProxSkip across convexity regimes under stochastic gradients. The explicit decomposition of convergence into noise, local steps, connectivity, and heterogeneity terms is a concrete strength that clarifies how to tune communication frequency. The unified treatment of three problem classes is also useful for the distributed optimization literature.
minor comments (2)
- [Abstract] The abstract states that rates 'jointly determine the convergence behavior'; the main text should include a short table or corollary that lists the precise dependence on each term (noise variance, local steps K, spectral gap, heterogeneity) for each convexity regime.
- Related-work discussion should explicitly contrast the new linear-speedup result with prior ProxSkip analyses (which were limited to strongly convex deterministic or single-node settings) by citing the exact rates or assumptions that prevented linear speedup in those works.
Simulated Author's Rebuttal
We thank the referee for their careful reading and positive assessment of the manuscript, including the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity identified
full rationale
The manuscript supplies a unified convergence analysis for decentralized ProxSkip across stochastic non-convex, convex, and strongly convex regimes. Rate expressions explicitly separate contributions from gradient noise variance, local-update count, graph connectivity (mixing time or spectral gap), and heterogeneity; linear speedup emerges when local-update and communication parameters balance these terms. No derivation step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The analysis rests on standard stochastic optimization techniques and is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,
D. T. A. Nguyen, D. T. Nguyen and A. Nedi ´c, “Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,” IEEE Trans. Control Netw. Syst., vol. 11, no. 3, pp. 1395-1407, 2024
work page 2024
-
[2]
X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 5330–5340, 2017
work page 2017
-
[3]
Y . Zhou, L. Guo, X. Shi and J. Cao, “A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,” IEEE Trans. Control Netw. Syst., vol. 12, no. 1, pp. 202–215, 2025
work page 2025
-
[4]
An improved analysis of gradient tracking for decentralized machine learning,
A. Koloskova, T. Lin, and S. Stich, “An improved analysis of gradient tracking for decentralized machine learning,”, in Proc. Adv. Neural Inf. Process. Sys., pp. 11422–11435, 2021
work page 2021
-
[5]
A unified and refined convergence analysis for non-convex decentralized learning,
S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022
work page 2022
-
[6]
DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,
L. Guo, X. Shi, S. Yang, and J. Cao, “DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,” IEEE Trans. Autom. Control, vol. 69, no. 5, pp. 2995–3010, 2024
work page 2024
-
[7]
Communication-efficient distributed cooperative learning with compressed beliefs,
M. T. Toghani and C. A. Uribe, “Communication-efficient distributed cooperative learning with compressed beliefs,” IEEE Trans. Control Netw. Syst., vol. 9, no. 3, pp. 1215–1226, 2022
work page 2022
-
[8]
CEDAS: A compressed decentralized stochastic gradient method with improved convergence,
K. Huang and S. Pu, “CEDAS: A compressed decentralized stochastic gradient method with improved convergence,” IEEE Trans. Autom. Control, vol. 70, no. 4, pp. 2242–2257, 2025
work page 2025
-
[9]
Momentum provably im- proves error feedback!,
I. Fatkhullin, A. Tyurin, and P. Richt ´arik, “Momentum provably im- proves error feedback!,” in Proc. Adv. Neural Inf. Process. Sys. , 2023
work page 2023
-
[10]
An accelerated distributed stochas- tic gradient method with momentum,
K. Huang, S. Pu, and A Nedi ´c, “An accelerated distributed stochas- tic gradient method with momentum,” Math. Program. , 2025, doi: 10.1007/s10107-025-02217-0
-
[11]
A. Sadiev, D. Kovalev, and P. Richt ´arik, “Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,” in Proc. Adv. Neural Inf. Process. Sys., pp. 21777–21791, 2022
work page 2022
-
[12]
Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,
D. Kovalev, A. Salim, and P. Richt ´arik, “Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 18342–18352, 2020
work page 2020
-
[13]
Decentralized accelerated gradi- ent methods with increasing penalty parameters,
H. Li, C. Fang, W. Yin and Z. Lin, “Decentralized accelerated gradi- ent methods with increasing penalty parameters,” IEEE Trans. Signal Process., vol. 68, pp. 4855–4870, 2020
work page 2020
-
[14]
H. Li, Z. Lin, and Y . Fang, “Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,” J. Mach. Learn. Res. , vol. 23, pp. 1–41, 2022
work page 2022
-
[15]
An optimal algorithm for decentralized finite-sum optimization,
H. Hendrikx, F. Bach, and L. Massouli ´e, “An optimal algorithm for decentralized finite-sum optimization,” SIAM J. Optim. , vol. 31, no. 4, pp. 2753–2783, 2021
work page 2021
-
[16]
Optimal gradient tracking for decentralized optimization,
Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Math. Program., vol. 207, pp. 1–53, 2024
work page 2024
-
[17]
Don’t use large mini-batches, use local SGD,
T. Lin, S. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. Int. Conf. Learn. Represent. , 2018
work page 2018
-
[18]
Is Local SGD Better than Minibatch SGD?,
B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro, “Is Local SGD Better than Minibatch SGD?,” in Proc. Int. Conf. Mach. Learn. , pp. 10334–10343, 2020
work page 2020
-
[19]
Local SGD converges fast and communicates little,
S. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learn. Represent. , 2018
work page 2018
-
[20]
Achieving linear speedup with partial worker participation in non-IID federated learning,
H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. Int. Conf. Learn. Represent., 2021
work page 2021
-
[21]
Tighter theory for local SGD on identical and heterogeneous data,
A. Khaled, K. Mishchenko, and P. Richt ´arik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 4519–4529, 2020
work page 2020
-
[22]
Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,
J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,” J. Mach. Learn. Res., vol. 22, no. 213, 2021
work page 2021
-
[23]
SCAFFOLD: Stochastic controlled averaging for federated learning
S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning” , in Proc. Int. Conf. Mach. Learn. , pp. 5132–5143, 2020
work page 2020
-
[24]
Momentum benefits non-IID federated learning simply and provably,
Z. Cheng, X. Huang, and K. Yuan, “Momentum benefits non-IID federated learning simply and provably, ” in Proc. Int. Conf. Learn. Represent., 2024
work page 2024
-
[25]
Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,
A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 14606–14619, 2021
work page 2021
-
[26]
FedPD: A federated learning framework with adaptivity to Non-IID data,
X. Zhang, M. Hong, S. Dhople, W. Yin, and Y . Liu, “FedPD: A federated learning framework with adaptivity to Non-IID data,”IEEE Trans. Signal Process., vol. 69, pp. 6055–6070, 2021
work page 2021
-
[27]
Federated learning based on dynamic regularization,
A. E. Durmus, Z. Yue, M. Ramon, M. Matthew, W. Paul, and S. Venkatesh, “Federated learning based on dynamic regularization,” in Proc. Int. Conf. Learn. Represent. , 2021
work page 2021
-
[28]
Vari- ance reduced local SGD with lower communication complexity,
X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y . Cheng, “Vari- ance reduced local SGD with lower communication complexity,” 2019, arXiv:1912.12844
-
[29]
Feder- ated learning with compression: Unified analysis and sharp guarantees,
F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Feder- ated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Statist. , pp. 2350–2358, 2021
work page 2021
-
[30]
Stochastic controlled averaging for federated learning with communication compression,
X. Huang, P. Li, and X. Li, “Stochastic controlled averaging for federated learning with communication compression,” in Proc. Int. Conf. Learn. Represent., 2024
work page 2024
-
[31]
A unified theory of decentralized SGD with changing topology and local updates,
A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. Int. Conf. Mach. Learn. , pp. 5381–5393, 2020
work page 2020
-
[32]
Distributed stochastic gradient tracking methods,
S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, pp. 409–457, 2021
work page 2021
-
[33]
Achieving geometric convergence for distributed optimization over time-varying graphs,
A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim. , vol. 27, no. 4, pp. 2597–2633, 2017
work page 2017
-
[34]
Harnessing smoothness to accelerate distributed optimization,
G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245– 1260, Sep. 2018
work page 2018
-
[35]
On the computation-communication trade-off with a flexible gradient tracking approach,
Y . Huang and J. Xu. “On the computation-communication trade-off with a flexible gradient tracking approach,” in Proc. IEEE Conf. Decis. Control, pp. 284–289, 2023. 10
work page 2023
-
[36]
On the performance of gradient tracking with local updates,
E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” in Proc. IEEE Conf. Decis. Control , pp. 4309–4313, 2023
work page 2023
-
[37]
Decentralized gradient tracking with local steps,
Y . Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,” Optimization Methods and Software, pp. 1-28, 2024
work page 2024
-
[38]
Exact diffusion for distributed optimization and learning-part I: Algorithm development,
K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning-part I: Algorithm development,” IEEE Trans. Signal Process. , vol. 67, no. 3, pp. 708–723, Feb. 2019
work page 2019
-
[39]
Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019
work page 2019
-
[40]
D 2: Decentralized training over decentralized data,
H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D 2: Decentralized training over decentralized data,” in Proc. Int. Conf. Mach. Learn. , pp. 4848–4856, 2018
work page 2018
-
[41]
L. Guo, X. Shi, J. Cao, and Z. Wang, “Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,” IEEE Trans. Signal Process., vol. 71, pp. 786–801, 2023
work page 2023
-
[42]
Local exact-diffusion for decentralized optimization and learning,
S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” vol. 69, no. 11, pp. 7371–7386, 2024
work page 2024
-
[43]
ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,
K. Mishchenko, G. Malinovsky, S. Stich, and P. Richt ´arik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,” in Proc. Int. Conf. Mach. Learn. , pp. 15750–15769, 2022
work page 2022
-
[44]
L. Condat, I. Agarsk ´y, G. Malinovsky, and P. Richt ´arik, “TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,” 2023, arXiv:2302.09832
-
[45]
L. Condat, I. Agarsk ´y, and P. Richt ´arik, “Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,” 2023, arXiv:2210.13277
-
[46]
LoCoDL: Communication- efficient distributed learning with local training and compression,
L. Condat, A. Maranjyan, and P. Richt ´arik, “LoCoDL: Communication- efficient distributed learning with local training and compression,” in Proc. Int. Conf. Learn. Represent. , 2025
work page 2025
-
[47]
Variance reduced ProxSkip: Algorithm, theory and application to federated learning,
G. Malinovsky, K. Yi, and P. Richt ´arik, “Variance reduced ProxSkip: Algorithm, theory and application to federated learning,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 15176–15189, 2022
work page 2022
-
[48]
Tighter analysis for ProxSkip,
Z. Hu and H. Huang, “Tighter analysis for ProxSkip,” in Proc. Int. Conf. Mach. Learn., pp. 13469–13496, 2023
work page 2023
-
[49]
RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,
L. Condat and P. Richt ´arik, “RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,” in Proc. Int. Conf. Learn. Represent., 2023
work page 2023
-
[50]
Optimal algorithms for smooth and strongly convex distributed optimization in networks,
K. Scaman, F. Bach, S. Bubeck, Y .-T. Lee, and L. Massouli ´e, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proc. Int. Conf. Mach. Learn. , pp. 3027–3036, 2017
work page 2017
-
[51]
H. Yuan, S. A. Alghunaim, and K. Yuan, “Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,” in Proc. IEEE Conf. Decis. Control , pp. 139-144, 2023
work page 2023
-
[52]
LibSVM: A library for support vector machines,
C.-C. Chang and C.-J. Lin, “LibSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol. , vol. 2, no. 3, 2011, Art. no. 27. Luyao Guo received the B.S. degree in infor- mation and computing science from Shanxi Uni- versity, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree in applied mathematics with the Jiangsu Prov...
work page 2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.