pith. sign in

arxiv: 2310.07983 · v5 · pith:Z5RVPP3Enew · submitted 2023-10-12 · 💻 cs.LG · math.OC· stat.ML

Achieving Linear Speedup with ProxSkip in Distributed Stochastic Optimization

Pith reviewed 2026-05-24 06:45 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords ProxSkipdistributed optimizationlinear speedupstochastic gradientsdecentralizednon-convex optimizationcommunication efficiencylocal updates
0
0 comments X

The pith

Decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified convergence analysis for decentralized ProxSkip that covers stochastic non-convex, convex, and strongly convex problems. It demonstrates for the first time that the method attains linear speedup with respect to the number of nodes even when gradients are stochastic. The analysis shows how gradient noise, local updates, network connectivity, and data heterogeneity together set the convergence rate. This matters because prior work left open whether such communication-saving methods could scale linearly in realistic noisy and non-convex settings while also cutting communication frequency through local steps.

Core claim

Decentralized ProxSkip attains linear speedup in the number of nodes under stochastic gradients, with a unified convergence analysis for non-convex, convex, and strongly convex objectives that reveals the joint influence of gradient noise, local updates, network connectivity, and data heterogeneity on the rate.

What carries the argument

Decentralized ProxSkip, which performs local updates and skips proximal steps while exchanging information over a connected network to reduce communication.

If this is right

  • Local updates reduce communication frequency while preserving linear speedup.
  • The linear speedup holds for non-convex problems as well as convex ones.
  • Network connectivity and data heterogeneity appear explicitly in the convergence bound.
  • Stochastic gradients do not prevent the linear scaling with more nodes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar local-update and skipping patterns could produce linear speedup in other distributed first-order methods.
  • Choosing the number of local steps according to measured network connectivity may further improve efficiency in practice.
  • The same analysis approach might extend to federated settings where data partitions are fixed and heterogeneous.

Load-bearing premise

The communication network remains connected and the variance of the stochastic gradients stays finite.

What would settle it

An experiment that doubles the number of nodes yet shows no proportional reduction in iterations needed to reach target accuracy under stochastic non-convex conditions would disprove the linear speedup.

Figures

Figures reproduced from arXiv: 2310.07983 by Jinde Cao, Kun Yuan, Laurent Condat, Luyao Guo, Sulaiman A. Alghunaim.

Figure 1
Figure 1. Figure 1: Learning synthetic convex function over 10 nodes with noise σ 2 = 1 (Local-DSGD [22], [31], K-GT [37], and LED [42]). All uses the same stepsize and are averaged by ten repetitions. The probability of communication for ProxSkip is p, and the number of local updates of local-DSGD, K-GT, and LED are 1/p. 0 10 20 30 40 50 10 4 10 2 10 0 x t x / x , = 0.1 p = 0.1 0 10 20 30 40 50 10 4 10 2 10 0 p = 0.2 0 20 40… view at source ↗
Figure 2
Figure 2. Figure 2: Experimental results for ProxSkip to logistic regression problem with a strongly convex regularizer [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Experimental results of logistic regression problem on the ijcnn1 dataset with regularizer [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Linear speedup of ProxSkip in non-convex settings over ijcnn1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Experimental comparison with the same stepsize in non-convex settings over ijcnn1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Experimental comparison with tuning stepsize in non-convex settings over ijcnn1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve linear speedup with respect to the number of nodes. Key questions regarding its behavior in the non-convex setting and the achievability of linear speedup remain open. In this paper, we revisit decentralized ProxSkip and answer these questions affirmatively. We provide a unified convergence analysis for stochastic non-convex, convex, and strongly convex problems, revealing how gradient noise, local updates, network connectivity, and data heterogeneity jointly determine the convergence behavior. To the best of our knowledge, this is the first analysis showing that decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients. Moreover, our results demonstrate that local updates can effectively reduce communication frequency and improve communication efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper revisits decentralized ProxSkip and supplies a unified convergence analysis for stochastic non-convex, convex, and strongly convex regimes. The rates explicitly separate the effects of gradient noise variance, local-update count, graph connectivity (mixing time or spectral gap), and data heterogeneity; linear speedup in node count is obtained when local-update and communication parameters are chosen to balance these terms. The manuscript claims this is the first such analysis achieving linear speedup under stochastic gradients.

Significance. If the derivations hold, the work supplies the first linear-speedup guarantee for decentralized ProxSkip across convexity regimes under stochastic gradients. The explicit decomposition of convergence into noise, local steps, connectivity, and heterogeneity terms is a concrete strength that clarifies how to tune communication frequency. The unified treatment of three problem classes is also useful for the distributed optimization literature.

minor comments (2)
  1. [Abstract] The abstract states that rates 'jointly determine the convergence behavior'; the main text should include a short table or corollary that lists the precise dependence on each term (noise variance, local steps K, spectral gap, heterogeneity) for each convexity regime.
  2. Related-work discussion should explicitly contrast the new linear-speedup result with prior ProxSkip analyses (which were limited to strongly convex deterministic or single-node settings) by citing the exact rates or assumptions that prevented linear speedup in those works.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment of the manuscript, including the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript supplies a unified convergence analysis for decentralized ProxSkip across stochastic non-convex, convex, and strongly convex regimes. Rate expressions explicitly separate contributions from gradient noise variance, local-update count, graph connectivity (mixing time or spectral gap), and heterogeneity; linear speedup emerges when local-update and communication parameters balance these terms. No derivation step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The analysis rests on standard stochastic optimization techniques and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full manuscript required to audit.

pith-pipeline@v0.9.0 · 5691 in / 912 out tokens · 17285 ms · 2026-05-24T06:45:11.562531+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

  1. [1]

    Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,

    D. T. A. Nguyen, D. T. Nguyen and A. Nedi ´c, “Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,” IEEE Trans. Control Netw. Syst., vol. 11, no. 3, pp. 1395-1407, 2024

  2. [2]

    Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,

    X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 5330–5340, 2017

  3. [3]

    A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,

    Y . Zhou, L. Guo, X. Shi and J. Cao, “A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,” IEEE Trans. Control Netw. Syst., vol. 12, no. 1, pp. 202–215, 2025

  4. [4]

    An improved analysis of gradient tracking for decentralized machine learning,

    A. Koloskova, T. Lin, and S. Stich, “An improved analysis of gradient tracking for decentralized machine learning,”, in Proc. Adv. Neural Inf. Process. Sys., pp. 11422–11435, 2021

  5. [5]

    A unified and refined convergence analysis for non-convex decentralized learning,

    S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022

  6. [6]

    DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,

    L. Guo, X. Shi, S. Yang, and J. Cao, “DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,” IEEE Trans. Autom. Control, vol. 69, no. 5, pp. 2995–3010, 2024

  7. [7]

    Communication-efficient distributed cooperative learning with compressed beliefs,

    M. T. Toghani and C. A. Uribe, “Communication-efficient distributed cooperative learning with compressed beliefs,” IEEE Trans. Control Netw. Syst., vol. 9, no. 3, pp. 1215–1226, 2022

  8. [8]

    CEDAS: A compressed decentralized stochastic gradient method with improved convergence,

    K. Huang and S. Pu, “CEDAS: A compressed decentralized stochastic gradient method with improved convergence,” IEEE Trans. Autom. Control, vol. 70, no. 4, pp. 2242–2257, 2025

  9. [9]

    Momentum provably im- proves error feedback!,

    I. Fatkhullin, A. Tyurin, and P. Richt ´arik, “Momentum provably im- proves error feedback!,” in Proc. Adv. Neural Inf. Process. Sys. , 2023

  10. [10]

    An accelerated distributed stochas- tic gradient method with momentum,

    K. Huang, S. Pu, and A Nedi ´c, “An accelerated distributed stochas- tic gradient method with momentum,” Math. Program. , 2025, doi: 10.1007/s10107-025-02217-0

  11. [11]

    Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,

    A. Sadiev, D. Kovalev, and P. Richt ´arik, “Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,” in Proc. Adv. Neural Inf. Process. Sys., pp. 21777–21791, 2022

  12. [12]

    Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,

    D. Kovalev, A. Salim, and P. Richt ´arik, “Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 18342–18352, 2020

  13. [13]

    Decentralized accelerated gradi- ent methods with increasing penalty parameters,

    H. Li, C. Fang, W. Yin and Z. Lin, “Decentralized accelerated gradi- ent methods with increasing penalty parameters,” IEEE Trans. Signal Process., vol. 68, pp. 4855–4870, 2020

  14. [14]

    Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,

    H. Li, Z. Lin, and Y . Fang, “Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,” J. Mach. Learn. Res. , vol. 23, pp. 1–41, 2022

  15. [15]

    An optimal algorithm for decentralized finite-sum optimization,

    H. Hendrikx, F. Bach, and L. Massouli ´e, “An optimal algorithm for decentralized finite-sum optimization,” SIAM J. Optim. , vol. 31, no. 4, pp. 2753–2783, 2021

  16. [16]

    Optimal gradient tracking for decentralized optimization,

    Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Math. Program., vol. 207, pp. 1–53, 2024

  17. [17]

    Don’t use large mini-batches, use local SGD,

    T. Lin, S. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. Int. Conf. Learn. Represent. , 2018

  18. [18]

    Is Local SGD Better than Minibatch SGD?,

    B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro, “Is Local SGD Better than Minibatch SGD?,” in Proc. Int. Conf. Mach. Learn. , pp. 10334–10343, 2020

  19. [19]

    Local SGD converges fast and communicates little,

    S. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learn. Represent. , 2018

  20. [20]

    Achieving linear speedup with partial worker participation in non-IID federated learning,

    H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. Int. Conf. Learn. Represent., 2021

  21. [21]

    Tighter theory for local SGD on identical and heterogeneous data,

    A. Khaled, K. Mishchenko, and P. Richt ´arik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 4519–4529, 2020

  22. [22]

    Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,

    J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,” J. Mach. Learn. Res., vol. 22, no. 213, 2021

  23. [23]

    SCAFFOLD: Stochastic controlled averaging for federated learning

    S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning” , in Proc. Int. Conf. Mach. Learn. , pp. 5132–5143, 2020

  24. [24]

    Momentum benefits non-IID federated learning simply and provably,

    Z. Cheng, X. Huang, and K. Yuan, “Momentum benefits non-IID federated learning simply and provably, ” in Proc. Int. Conf. Learn. Represent., 2024

  25. [25]

    Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,

    A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 14606–14619, 2021

  26. [26]

    FedPD: A federated learning framework with adaptivity to Non-IID data,

    X. Zhang, M. Hong, S. Dhople, W. Yin, and Y . Liu, “FedPD: A federated learning framework with adaptivity to Non-IID data,”IEEE Trans. Signal Process., vol. 69, pp. 6055–6070, 2021

  27. [27]

    Federated learning based on dynamic regularization,

    A. E. Durmus, Z. Yue, M. Ramon, M. Matthew, W. Paul, and S. Venkatesh, “Federated learning based on dynamic regularization,” in Proc. Int. Conf. Learn. Represent. , 2021

  28. [28]

    Vari- ance reduced local SGD with lower communication complexity,

    X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y . Cheng, “Vari- ance reduced local SGD with lower communication complexity,” 2019, arXiv:1912.12844

  29. [29]

    Feder- ated learning with compression: Unified analysis and sharp guarantees,

    F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Feder- ated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Statist. , pp. 2350–2358, 2021

  30. [30]

    Stochastic controlled averaging for federated learning with communication compression,

    X. Huang, P. Li, and X. Li, “Stochastic controlled averaging for federated learning with communication compression,” in Proc. Int. Conf. Learn. Represent., 2024

  31. [31]

    A unified theory of decentralized SGD with changing topology and local updates,

    A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. Int. Conf. Mach. Learn. , pp. 5381–5393, 2020

  32. [32]

    Distributed stochastic gradient tracking methods,

    S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, pp. 409–457, 2021

  33. [33]

    Achieving geometric convergence for distributed optimization over time-varying graphs,

    A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim. , vol. 27, no. 4, pp. 2597–2633, 2017

  34. [34]

    Harnessing smoothness to accelerate distributed optimization,

    G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245– 1260, Sep. 2018

  35. [35]

    On the computation-communication trade-off with a flexible gradient tracking approach,

    Y . Huang and J. Xu. “On the computation-communication trade-off with a flexible gradient tracking approach,” in Proc. IEEE Conf. Decis. Control, pp. 284–289, 2023. 10

  36. [36]

    On the performance of gradient tracking with local updates,

    E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” in Proc. IEEE Conf. Decis. Control , pp. 4309–4313, 2023

  37. [37]

    Decentralized gradient tracking with local steps,

    Y . Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,” Optimization Methods and Software, pp. 1-28, 2024

  38. [38]

    Exact diffusion for distributed optimization and learning-part I: Algorithm development,

    K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning-part I: Algorithm development,” IEEE Trans. Signal Process. , vol. 67, no. 3, pp. 708–723, Feb. 2019

  39. [39]

    A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,

    Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019

  40. [40]

    D 2: Decentralized training over decentralized data,

    H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D 2: Decentralized training over decentralized data,” in Proc. Int. Conf. Mach. Learn. , pp. 4848–4856, 2018

  41. [41]

    Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,

    L. Guo, X. Shi, J. Cao, and Z. Wang, “Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,” IEEE Trans. Signal Process., vol. 71, pp. 786–801, 2023

  42. [42]

    Local exact-diffusion for decentralized optimization and learning,

    S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” vol. 69, no. 11, pp. 7371–7386, 2024

  43. [43]

    ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,

    K. Mishchenko, G. Malinovsky, S. Stich, and P. Richt ´arik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,” in Proc. Int. Conf. Mach. Learn. , pp. 15750–15769, 2022

  44. [44]

    TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,

    L. Condat, I. Agarsk ´y, G. Malinovsky, and P. Richt ´arik, “TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,” 2023, arXiv:2302.09832

  45. [45]

    Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,

    L. Condat, I. Agarsk ´y, and P. Richt ´arik, “Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,” 2023, arXiv:2210.13277

  46. [46]

    LoCoDL: Communication- efficient distributed learning with local training and compression,

    L. Condat, A. Maranjyan, and P. Richt ´arik, “LoCoDL: Communication- efficient distributed learning with local training and compression,” in Proc. Int. Conf. Learn. Represent. , 2025

  47. [47]

    Variance reduced ProxSkip: Algorithm, theory and application to federated learning,

    G. Malinovsky, K. Yi, and P. Richt ´arik, “Variance reduced ProxSkip: Algorithm, theory and application to federated learning,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 15176–15189, 2022

  48. [48]

    Tighter analysis for ProxSkip,

    Z. Hu and H. Huang, “Tighter analysis for ProxSkip,” in Proc. Int. Conf. Mach. Learn., pp. 13469–13496, 2023

  49. [49]

    RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,

    L. Condat and P. Richt ´arik, “RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,” in Proc. Int. Conf. Learn. Represent., 2023

  50. [50]

    Optimal algorithms for smooth and strongly convex distributed optimization in networks,

    K. Scaman, F. Bach, S. Bubeck, Y .-T. Lee, and L. Massouli ´e, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proc. Int. Conf. Mach. Learn. , pp. 3027–3036, 2017

  51. [51]

    Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,

    H. Yuan, S. A. Alghunaim, and K. Yuan, “Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,” in Proc. IEEE Conf. Decis. Control , pp. 139-144, 2023

  52. [52]

    LibSVM: A library for support vector machines,

    C.-C. Chang and C.-J. Lin, “LibSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol. , vol. 2, no. 3, 2011, Art. no. 27. Luyao Guo received the B.S. degree in infor- mation and computing science from Shanxi Uni- versity, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree in applied mathematics with the Jiangsu Prov...