Achieving Linear Speedup with ProxSkip in Distributed Stochastic Optimization

Jinde Cao; Kun Yuan; Laurent Condat; Luyao Guo; Sulaiman A. Alghunaim

arxiv: 2310.07983 · v5 · pith:Z5RVPP3Enew · submitted 2023-10-12 · 💻 cs.LG · math.OC· stat.ML

Achieving Linear Speedup with ProxSkip in Distributed Stochastic Optimization

Luyao Guo , Sulaiman A. Alghunaim , Kun Yuan , Laurent Condat , Jinde Cao This is my paper

Pith reviewed 2026-05-24 06:45 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords ProxSkipdistributed optimizationlinear speedupstochastic gradientsdecentralizednon-convex optimizationcommunication efficiencylocal updates

0 comments

The pith

Decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified convergence analysis for decentralized ProxSkip that covers stochastic non-convex, convex, and strongly convex problems. It demonstrates for the first time that the method attains linear speedup with respect to the number of nodes even when gradients are stochastic. The analysis shows how gradient noise, local updates, network connectivity, and data heterogeneity together set the convergence rate. This matters because prior work left open whether such communication-saving methods could scale linearly in realistic noisy and non-convex settings while also cutting communication frequency through local steps.

Core claim

Decentralized ProxSkip attains linear speedup in the number of nodes under stochastic gradients, with a unified convergence analysis for non-convex, convex, and strongly convex objectives that reveals the joint influence of gradient noise, local updates, network connectivity, and data heterogeneity on the rate.

What carries the argument

Decentralized ProxSkip, which performs local updates and skips proximal steps while exchanging information over a connected network to reduce communication.

If this is right

Local updates reduce communication frequency while preserving linear speedup.
The linear speedup holds for non-convex problems as well as convex ones.
Network connectivity and data heterogeneity appear explicitly in the convergence bound.
Stochastic gradients do not prevent the linear scaling with more nodes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar local-update and skipping patterns could produce linear speedup in other distributed first-order methods.
Choosing the number of local steps according to measured network connectivity may further improve efficiency in practice.
The same analysis approach might extend to federated settings where data partitions are fixed and heterogeneous.

Load-bearing premise

The communication network remains connected and the variance of the stochastic gradients stays finite.

What would settle it

An experiment that doubles the number of nodes yet shows no proportional reduction in iterations needed to reach target accuracy under stochastic non-convex conditions would disprove the linear speedup.

Figures

Figures reproduced from arXiv: 2310.07983 by Jinde Cao, Kun Yuan, Laurent Condat, Luyao Guo, Sulaiman A. Alghunaim.

**Figure 1.** Figure 1: Learning synthetic convex function over 10 nodes with noise σ 2 = 1 (Local-DSGD [22], [31], K-GT [37], and LED [42]). All uses the same stepsize and are averaged by ten repetitions. The probability of communication for ProxSkip is p, and the number of local updates of local-DSGD, K-GT, and LED are 1/p. 0 10 20 30 40 50 10 4 10 2 10 0 x t x / x , = 0.1 p = 0.1 0 10 20 30 40 50 10 4 10 2 10 0 p = 0.2 0 20 40… view at source ↗

**Figure 2.** Figure 2: Experimental results for ProxSkip to logistic regression problem with a strongly convex regularizer [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Experimental results of logistic regression problem on the ijcnn1 dataset with regularizer [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Linear speedup of ProxSkip in non-convex settings over ijcnn1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Experimental comparison with the same stepsize in non-convex settings over ijcnn1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Experimental comparison with tuning stepsize in non-convex settings over ijcnn1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve linear speedup with respect to the number of nodes. Key questions regarding its behavior in the non-convex setting and the achievability of linear speedup remain open. In this paper, we revisit decentralized ProxSkip and answer these questions affirmatively. We provide a unified convergence analysis for stochastic non-convex, convex, and strongly convex problems, revealing how gradient noise, local updates, network connectivity, and data heterogeneity jointly determine the convergence behavior. To the best of our knowledge, this is the first analysis showing that decentralized ProxSkip achieves linear speedup in the number of nodes under stochastic gradients. Moreover, our results demonstrate that local updates can effectively reduce communication frequency and improve communication efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first linear speedup guarantee for decentralized ProxSkip in the stochastic non-convex case.

read the letter

The main thing to know is that this paper claims and analyzes the first linear speedup for decentralized ProxSkip under stochastic gradients in non-convex problems. The work supplies a unified convergence analysis across the stochastic non-convex, convex, and strongly convex regimes. It explicitly separates the roles of gradient noise variance, the number of local updates, graph connectivity through mixing time or spectral gap, and data heterogeneity in the convergence rates. Linear speedup in the number of nodes appears once the local-update and communication parameters are set to balance those contributions. This extends prior analyses that were limited to strongly convex settings and did not achieve linear speedup. The paper shows that local updates can lower communication frequency while still delivering the speedup, which is relevant for practical distributed training. The soft spots are minor. The result depends on choosing the local step count and communication period to offset noise and heterogeneity terms, which is standard but means the speedup is conditional on good tuning. The assumptions are the usual bounded-variance and connectivity ones; no unusual restrictions show up in the rates. This paper is for people working on theoretical distributed stochastic optimization, particularly those interested in communication-efficient methods like ProxSkip. A reader who follows the literature on decentralized training will get value from the unified rates and the non-convex extension. It deserves a serious referee. The claim is new, the analysis appears to rest on standard techniques without circularity, and the separation of terms makes the result checkable.

Referee Report

0 major / 2 minor

Summary. The paper revisits decentralized ProxSkip and supplies a unified convergence analysis for stochastic non-convex, convex, and strongly convex regimes. The rates explicitly separate the effects of gradient noise variance, local-update count, graph connectivity (mixing time or spectral gap), and data heterogeneity; linear speedup in node count is obtained when local-update and communication parameters are chosen to balance these terms. The manuscript claims this is the first such analysis achieving linear speedup under stochastic gradients.

Significance. If the derivations hold, the work supplies the first linear-speedup guarantee for decentralized ProxSkip across convexity regimes under stochastic gradients. The explicit decomposition of convergence into noise, local steps, connectivity, and heterogeneity terms is a concrete strength that clarifies how to tune communication frequency. The unified treatment of three problem classes is also useful for the distributed optimization literature.

minor comments (2)

[Abstract] The abstract states that rates 'jointly determine the convergence behavior'; the main text should include a short table or corollary that lists the precise dependence on each term (noise variance, local steps K, spectral gap, heterogeneity) for each convexity regime.
Related-work discussion should explicitly contrast the new linear-speedup result with prior ProxSkip analyses (which were limited to strongly convex deterministic or single-node settings) by citing the exact rates or assumptions that prevented linear speedup in those works.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment of the manuscript, including the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript supplies a unified convergence analysis for decentralized ProxSkip across stochastic non-convex, convex, and strongly convex regimes. Rate expressions explicitly separate contributions from gradient noise variance, local-update count, graph connectivity (mixing time or spectral gap), and heterogeneity; linear speedup emerges when local-update and communication parameters balance these terms. No derivation step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The analysis rests on standard stochastic optimization techniques and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full manuscript required to audit.

pith-pipeline@v0.9.0 · 5691 in / 912 out tokens · 17285 ms · 2026-05-24T06:45:11.562531+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

[1]

Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,

D. T. A. Nguyen, D. T. Nguyen and A. Nedi ´c, “Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,” IEEE Trans. Control Netw. Syst., vol. 11, no. 3, pp. 1395-1407, 2024

work page 2024
[2]

Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,

X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 5330–5340, 2017

work page 2017
[3]

A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,

Y . Zhou, L. Guo, X. Shi and J. Cao, “A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,” IEEE Trans. Control Netw. Syst., vol. 12, no. 1, pp. 202–215, 2025

work page 2025
[4]

An improved analysis of gradient tracking for decentralized machine learning,

A. Koloskova, T. Lin, and S. Stich, “An improved analysis of gradient tracking for decentralized machine learning,”, in Proc. Adv. Neural Inf. Process. Sys., pp. 11422–11435, 2021

work page 2021
[5]

A unified and refined convergence analysis for non-convex decentralized learning,

S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022

work page 2022
[6]

DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,

L. Guo, X. Shi, S. Yang, and J. Cao, “DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,” IEEE Trans. Autom. Control, vol. 69, no. 5, pp. 2995–3010, 2024

work page 2024
[7]

Communication-efficient distributed cooperative learning with compressed beliefs,

M. T. Toghani and C. A. Uribe, “Communication-efficient distributed cooperative learning with compressed beliefs,” IEEE Trans. Control Netw. Syst., vol. 9, no. 3, pp. 1215–1226, 2022

work page 2022
[8]

CEDAS: A compressed decentralized stochastic gradient method with improved convergence,

K. Huang and S. Pu, “CEDAS: A compressed decentralized stochastic gradient method with improved convergence,” IEEE Trans. Autom. Control, vol. 70, no. 4, pp. 2242–2257, 2025

work page 2025
[9]

Momentum provably im- proves error feedback!,

I. Fatkhullin, A. Tyurin, and P. Richt ´arik, “Momentum provably im- proves error feedback!,” in Proc. Adv. Neural Inf. Process. Sys. , 2023

work page 2023
[10]

An accelerated distributed stochas- tic gradient method with momentum,

K. Huang, S. Pu, and A Nedi ´c, “An accelerated distributed stochas- tic gradient method with momentum,” Math. Program. , 2025, doi: 10.1007/s10107-025-02217-0

work page doi:10.1007/s10107-025-02217-0 2025
[11]

Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,

A. Sadiev, D. Kovalev, and P. Richt ´arik, “Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,” in Proc. Adv. Neural Inf. Process. Sys., pp. 21777–21791, 2022

work page 2022
[12]

Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,

D. Kovalev, A. Salim, and P. Richt ´arik, “Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 18342–18352, 2020

work page 2020
[13]

Decentralized accelerated gradi- ent methods with increasing penalty parameters,

H. Li, C. Fang, W. Yin and Z. Lin, “Decentralized accelerated gradi- ent methods with increasing penalty parameters,” IEEE Trans. Signal Process., vol. 68, pp. 4855–4870, 2020

work page 2020
[14]

Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,

H. Li, Z. Lin, and Y . Fang, “Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,” J. Mach. Learn. Res. , vol. 23, pp. 1–41, 2022

work page 2022
[15]

An optimal algorithm for decentralized finite-sum optimization,

H. Hendrikx, F. Bach, and L. Massouli ´e, “An optimal algorithm for decentralized finite-sum optimization,” SIAM J. Optim. , vol. 31, no. 4, pp. 2753–2783, 2021

work page 2021
[16]

Optimal gradient tracking for decentralized optimization,

Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Math. Program., vol. 207, pp. 1–53, 2024

work page 2024
[17]

Don’t use large mini-batches, use local SGD,

T. Lin, S. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. Int. Conf. Learn. Represent. , 2018

work page 2018
[18]

Is Local SGD Better than Minibatch SGD?,

B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro, “Is Local SGD Better than Minibatch SGD?,” in Proc. Int. Conf. Mach. Learn. , pp. 10334–10343, 2020

work page 2020
[19]

Local SGD converges fast and communicates little,

S. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learn. Represent. , 2018

work page 2018
[20]

Achieving linear speedup with partial worker participation in non-IID federated learning,

H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. Int. Conf. Learn. Represent., 2021

work page 2021
[21]

Tighter theory for local SGD on identical and heterogeneous data,

A. Khaled, K. Mishchenko, and P. Richt ´arik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 4519–4529, 2020

work page 2020
[22]

Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,

J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,” J. Mach. Learn. Res., vol. 22, no. 213, 2021

work page 2021
[23]

SCAFFOLD: Stochastic controlled averaging for federated learning

S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning” , in Proc. Int. Conf. Mach. Learn. , pp. 5132–5143, 2020

work page 2020
[24]

Momentum benefits non-IID federated learning simply and provably,

Z. Cheng, X. Huang, and K. Yuan, “Momentum benefits non-IID federated learning simply and provably, ” in Proc. Int. Conf. Learn. Represent., 2024

work page 2024
[25]

Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,

A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 14606–14619, 2021

work page 2021
[26]

FedPD: A federated learning framework with adaptivity to Non-IID data,

X. Zhang, M. Hong, S. Dhople, W. Yin, and Y . Liu, “FedPD: A federated learning framework with adaptivity to Non-IID data,”IEEE Trans. Signal Process., vol. 69, pp. 6055–6070, 2021

work page 2021
[27]

Federated learning based on dynamic regularization,

A. E. Durmus, Z. Yue, M. Ramon, M. Matthew, W. Paul, and S. Venkatesh, “Federated learning based on dynamic regularization,” in Proc. Int. Conf. Learn. Represent. , 2021

work page 2021
[28]

Vari- ance reduced local SGD with lower communication complexity,

X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y . Cheng, “Vari- ance reduced local SGD with lower communication complexity,” 2019, arXiv:1912.12844

work page arXiv 2019
[29]

Feder- ated learning with compression: Unified analysis and sharp guarantees,

F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Feder- ated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Statist. , pp. 2350–2358, 2021

work page 2021
[30]

Stochastic controlled averaging for federated learning with communication compression,

X. Huang, P. Li, and X. Li, “Stochastic controlled averaging for federated learning with communication compression,” in Proc. Int. Conf. Learn. Represent., 2024

work page 2024
[31]

A unified theory of decentralized SGD with changing topology and local updates,

A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. Int. Conf. Mach. Learn. , pp. 5381–5393, 2020

work page 2020
[32]

Distributed stochastic gradient tracking methods,

S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, pp. 409–457, 2021

work page 2021
[33]

Achieving geometric convergence for distributed optimization over time-varying graphs,

A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim. , vol. 27, no. 4, pp. 2597–2633, 2017

work page 2017
[34]

Harnessing smoothness to accelerate distributed optimization,

G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245– 1260, Sep. 2018

work page 2018
[35]

On the computation-communication trade-off with a flexible gradient tracking approach,

Y . Huang and J. Xu. “On the computation-communication trade-off with a flexible gradient tracking approach,” in Proc. IEEE Conf. Decis. Control, pp. 284–289, 2023. 10

work page 2023
[36]

On the performance of gradient tracking with local updates,

E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” in Proc. IEEE Conf. Decis. Control , pp. 4309–4313, 2023

work page 2023
[37]

Decentralized gradient tracking with local steps,

Y . Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,” Optimization Methods and Software, pp. 1-28, 2024

work page 2024
[38]

Exact diffusion for distributed optimization and learning-part I: Algorithm development,

K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning-part I: Algorithm development,” IEEE Trans. Signal Process. , vol. 67, no. 3, pp. 708–723, Feb. 2019

work page 2019
[39]

A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,

Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019

work page 2019
[40]

D 2: Decentralized training over decentralized data,

H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D 2: Decentralized training over decentralized data,” in Proc. Int. Conf. Mach. Learn. , pp. 4848–4856, 2018

work page 2018
[41]

Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,

L. Guo, X. Shi, J. Cao, and Z. Wang, “Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,” IEEE Trans. Signal Process., vol. 71, pp. 786–801, 2023

work page 2023
[42]

Local exact-diffusion for decentralized optimization and learning,

S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” vol. 69, no. 11, pp. 7371–7386, 2024

work page 2024
[43]

ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,

K. Mishchenko, G. Malinovsky, S. Stich, and P. Richt ´arik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,” in Proc. Int. Conf. Mach. Learn. , pp. 15750–15769, 2022

work page 2022
[44]

TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,

L. Condat, I. Agarsk ´y, G. Malinovsky, and P. Richt ´arik, “TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,” 2023, arXiv:2302.09832

work page arXiv 2023
[45]

Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,

L. Condat, I. Agarsk ´y, and P. Richt ´arik, “Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,” 2023, arXiv:2210.13277

work page arXiv 2023
[46]

LoCoDL: Communication- efficient distributed learning with local training and compression,

L. Condat, A. Maranjyan, and P. Richt ´arik, “LoCoDL: Communication- efficient distributed learning with local training and compression,” in Proc. Int. Conf. Learn. Represent. , 2025

work page 2025
[47]

Variance reduced ProxSkip: Algorithm, theory and application to federated learning,

G. Malinovsky, K. Yi, and P. Richt ´arik, “Variance reduced ProxSkip: Algorithm, theory and application to federated learning,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 15176–15189, 2022

work page 2022
[48]

Tighter analysis for ProxSkip,

Z. Hu and H. Huang, “Tighter analysis for ProxSkip,” in Proc. Int. Conf. Mach. Learn., pp. 13469–13496, 2023

work page 2023
[49]

RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,

L. Condat and P. Richt ´arik, “RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,” in Proc. Int. Conf. Learn. Represent., 2023

work page 2023
[50]

Optimal algorithms for smooth and strongly convex distributed optimization in networks,

K. Scaman, F. Bach, S. Bubeck, Y .-T. Lee, and L. Massouli ´e, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proc. Int. Conf. Mach. Learn. , pp. 3027–3036, 2017

work page 2017
[51]

Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,

H. Yuan, S. A. Alghunaim, and K. Yuan, “Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,” in Proc. IEEE Conf. Decis. Control , pp. 139-144, 2023

work page 2023
[52]

LibSVM: A library for support vector machines,

C.-C. Chang and C.-J. Lin, “LibSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol. , vol. 2, no. 3, 2011, Art. no. 27. Luyao Guo received the B.S. degree in infor- mation and computing science from Shanxi Uni- versity, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree in applied mathematics with the Jiangsu Prov...

work page 2011

[1] [1]

Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,

D. T. A. Nguyen, D. T. Nguyen and A. Nedi ´c, “Accelerated AB/Push–Pull methods for distributed optimization over time-varying directed networks,” IEEE Trans. Control Netw. Syst., vol. 11, no. 3, pp. 1395-1407, 2024

work page 2024

[2] [2]

Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,

X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 5330–5340, 2017

work page 2017

[3] [3]

A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,

Y . Zhou, L. Guo, X. Shi and J. Cao, “A distributed proximal alternating direction multiplier method for multiblock nonsmooth composite opti- mization,” IEEE Trans. Control Netw. Syst., vol. 12, no. 1, pp. 202–215, 2025

work page 2025

[4] [4]

An improved analysis of gradient tracking for decentralized machine learning,

A. Koloskova, T. Lin, and S. Stich, “An improved analysis of gradient tracking for decentralized machine learning,”, in Proc. Adv. Neural Inf. Process. Sys., pp. 11422–11435, 2021

work page 2021

[5] [5]

A unified and refined convergence analysis for non-convex decentralized learning,

S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022

work page 2022

[6] [6]

DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,

L. Guo, X. Shi, S. Yang, and J. Cao, “DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,” IEEE Trans. Autom. Control, vol. 69, no. 5, pp. 2995–3010, 2024

work page 2024

[7] [7]

Communication-efficient distributed cooperative learning with compressed beliefs,

M. T. Toghani and C. A. Uribe, “Communication-efficient distributed cooperative learning with compressed beliefs,” IEEE Trans. Control Netw. Syst., vol. 9, no. 3, pp. 1215–1226, 2022

work page 2022

[8] [8]

CEDAS: A compressed decentralized stochastic gradient method with improved convergence,

K. Huang and S. Pu, “CEDAS: A compressed decentralized stochastic gradient method with improved convergence,” IEEE Trans. Autom. Control, vol. 70, no. 4, pp. 2242–2257, 2025

work page 2025

[9] [9]

Momentum provably im- proves error feedback!,

I. Fatkhullin, A. Tyurin, and P. Richt ´arik, “Momentum provably im- proves error feedback!,” in Proc. Adv. Neural Inf. Process. Sys. , 2023

work page 2023

[10] [10]

An accelerated distributed stochas- tic gradient method with momentum,

K. Huang, S. Pu, and A Nedi ´c, “An accelerated distributed stochas- tic gradient method with momentum,” Math. Program. , 2025, doi: 10.1007/s10107-025-02217-0

work page doi:10.1007/s10107-025-02217-0 2025

[11] [11]

Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,

A. Sadiev, D. Kovalev, and P. Richt ´arik, “Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,” in Proc. Adv. Neural Inf. Process. Sys., pp. 21777–21791, 2022

work page 2022

[12] [12]

Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,

D. Kovalev, A. Salim, and P. Richt ´arik, “Optimal and practical algo- rithms for smooth and strongly convex decentralized optimization,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 18342–18352, 2020

work page 2020

[13] [13]

Decentralized accelerated gradi- ent methods with increasing penalty parameters,

H. Li, C. Fang, W. Yin and Z. Lin, “Decentralized accelerated gradi- ent methods with increasing penalty parameters,” IEEE Trans. Signal Process., vol. 68, pp. 4855–4870, 2020

work page 2020

[14] [14]

Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,

H. Li, Z. Lin, and Y . Fang, “Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimiza- tion,” J. Mach. Learn. Res. , vol. 23, pp. 1–41, 2022

work page 2022

[15] [15]

An optimal algorithm for decentralized finite-sum optimization,

H. Hendrikx, F. Bach, and L. Massouli ´e, “An optimal algorithm for decentralized finite-sum optimization,” SIAM J. Optim. , vol. 31, no. 4, pp. 2753–2783, 2021

work page 2021

[16] [16]

Optimal gradient tracking for decentralized optimization,

Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Math. Program., vol. 207, pp. 1–53, 2024

work page 2024

[17] [17]

Don’t use large mini-batches, use local SGD,

T. Lin, S. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. Int. Conf. Learn. Represent. , 2018

work page 2018

[18] [18]

Is Local SGD Better than Minibatch SGD?,

B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro, “Is Local SGD Better than Minibatch SGD?,” in Proc. Int. Conf. Mach. Learn. , pp. 10334–10343, 2020

work page 2020

[19] [19]

Local SGD converges fast and communicates little,

S. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learn. Represent. , 2018

work page 2018

[20] [20]

Achieving linear speedup with partial worker participation in non-IID federated learning,

H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. Int. Conf. Learn. Represent., 2021

work page 2021

[21] [21]

Tighter theory for local SGD on identical and heterogeneous data,

A. Khaled, K. Mishchenko, and P. Richt ´arik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 4519–4529, 2020

work page 2020

[22] [22]

Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,

J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,” J. Mach. Learn. Res., vol. 22, no. 213, 2021

work page 2021

[23] [23]

SCAFFOLD: Stochastic controlled averaging for federated learning

S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning” , in Proc. Int. Conf. Mach. Learn. , pp. 5132–5143, 2020

work page 2020

[24] [24]

Momentum benefits non-IID federated learning simply and provably,

Z. Cheng, X. Huang, and K. Yuan, “Momentum benefits non-IID federated learning simply and provably, ” in Proc. Int. Conf. Learn. Represent., 2024

work page 2024

[25] [25]

Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,

A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 14606–14619, 2021

work page 2021

[26] [26]

FedPD: A federated learning framework with adaptivity to Non-IID data,

X. Zhang, M. Hong, S. Dhople, W. Yin, and Y . Liu, “FedPD: A federated learning framework with adaptivity to Non-IID data,”IEEE Trans. Signal Process., vol. 69, pp. 6055–6070, 2021

work page 2021

[27] [27]

Federated learning based on dynamic regularization,

A. E. Durmus, Z. Yue, M. Ramon, M. Matthew, W. Paul, and S. Venkatesh, “Federated learning based on dynamic regularization,” in Proc. Int. Conf. Learn. Represent. , 2021

work page 2021

[28] [28]

Vari- ance reduced local SGD with lower communication complexity,

X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y . Cheng, “Vari- ance reduced local SGD with lower communication complexity,” 2019, arXiv:1912.12844

work page arXiv 2019

[29] [29]

Feder- ated learning with compression: Unified analysis and sharp guarantees,

F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Feder- ated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Statist. , pp. 2350–2358, 2021

work page 2021

[30] [30]

Stochastic controlled averaging for federated learning with communication compression,

X. Huang, P. Li, and X. Li, “Stochastic controlled averaging for federated learning with communication compression,” in Proc. Int. Conf. Learn. Represent., 2024

work page 2024

[31] [31]

A unified theory of decentralized SGD with changing topology and local updates,

A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. Int. Conf. Mach. Learn. , pp. 5381–5393, 2020

work page 2020

[32] [32]

Distributed stochastic gradient tracking methods,

S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, pp. 409–457, 2021

work page 2021

[33] [33]

Achieving geometric convergence for distributed optimization over time-varying graphs,

A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim. , vol. 27, no. 4, pp. 2597–2633, 2017

work page 2017

[34] [34]

Harnessing smoothness to accelerate distributed optimization,

G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245– 1260, Sep. 2018

work page 2018

[35] [35]

On the computation-communication trade-off with a flexible gradient tracking approach,

Y . Huang and J. Xu. “On the computation-communication trade-off with a flexible gradient tracking approach,” in Proc. IEEE Conf. Decis. Control, pp. 284–289, 2023. 10

work page 2023

[36] [36]

On the performance of gradient tracking with local updates,

E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” in Proc. IEEE Conf. Decis. Control , pp. 4309–4313, 2023

work page 2023

[37] [37]

Decentralized gradient tracking with local steps,

Y . Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,” Optimization Methods and Software, pp. 1-28, 2024

work page 2024

[38] [38]

Exact diffusion for distributed optimization and learning-part I: Algorithm development,

K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning-part I: Algorithm development,” IEEE Trans. Signal Process. , vol. 67, no. 3, pp. 708–723, Feb. 2019

work page 2019

[39] [39]

A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,

Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019

work page 2019

[40] [40]

D 2: Decentralized training over decentralized data,

H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D 2: Decentralized training over decentralized data,” in Proc. Int. Conf. Mach. Learn. , pp. 4848–4856, 2018

work page 2018

[41] [41]

Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,

L. Guo, X. Shi, J. Cao, and Z. Wang, “Decentralized inexact proximal gradient method with network-independent stepsizes for convex com- posite optimization,” IEEE Trans. Signal Process., vol. 71, pp. 786–801, 2023

work page 2023

[42] [42]

Local exact-diffusion for decentralized optimization and learning,

S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” vol. 69, no. 11, pp. 7371–7386, 2024

work page 2024

[43] [43]

ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,

K. Mishchenko, G. Malinovsky, S. Stich, and P. Richt ´arik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,” in Proc. Int. Conf. Mach. Learn. , pp. 15750–15769, 2022

work page 2022

[44] [44]

TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,

L. Condat, I. Agarsk ´y, G. Malinovsky, and P. Richt ´arik, “TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,” 2023, arXiv:2302.09832

work page arXiv 2023

[45] [45]

Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,

L. Condat, I. Agarsk ´y, and P. Richt ´arik, “Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,” 2023, arXiv:2210.13277

work page arXiv 2023

[46] [46]

LoCoDL: Communication- efficient distributed learning with local training and compression,

L. Condat, A. Maranjyan, and P. Richt ´arik, “LoCoDL: Communication- efficient distributed learning with local training and compression,” in Proc. Int. Conf. Learn. Represent. , 2025

work page 2025

[47] [47]

Variance reduced ProxSkip: Algorithm, theory and application to federated learning,

G. Malinovsky, K. Yi, and P. Richt ´arik, “Variance reduced ProxSkip: Algorithm, theory and application to federated learning,” in Proc. Adv. Neural Inf. Process. Sys. , pp. 15176–15189, 2022

work page 2022

[48] [48]

Tighter analysis for ProxSkip,

Z. Hu and H. Huang, “Tighter analysis for ProxSkip,” in Proc. Int. Conf. Mach. Learn., pp. 13469–13496, 2023

work page 2023

[49] [49]

RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,

L. Condat and P. Richt ´arik, “RandProx: Primal-dual optimization algo- rithms with randomized proximal updates,” in Proc. Int. Conf. Learn. Represent., 2023

work page 2023

[50] [50]

Optimal algorithms for smooth and strongly convex distributed optimization in networks,

K. Scaman, F. Bach, S. Bubeck, Y .-T. Lee, and L. Massouli ´e, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proc. Int. Conf. Mach. Learn. , pp. 3027–3036, 2017

work page 2017

[51] [51]

Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,

H. Yuan, S. A. Alghunaim, and K. Yuan, “Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,” in Proc. IEEE Conf. Decis. Control , pp. 139-144, 2023

work page 2023

[52] [52]

LibSVM: A library for support vector machines,

C.-C. Chang and C.-J. Lin, “LibSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol. , vol. 2, no. 3, 2011, Art. no. 27. Luyao Guo received the B.S. degree in infor- mation and computing science from Shanxi Uni- versity, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree in applied mathematics with the Jiangsu Prov...

work page 2011