Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks

Angelia Nedi\'c; Kun Huang; Shi Pu

arxiv: 2412.13054 · v3 · submitted 2024-12-17 · 🧮 math.OC

Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks

Kun Huang , Shi Pu , Angelia Nedi\'c This is my paper

Pith reviewed 2026-05-23 06:40 UTC · model grok-4.3

classification 🧮 math.OC

keywords distributed optimizationstochastic proximal gradientnormal mapgradient trackingexact diffusionconvergence ratesnetwork topologycomposite objectives

0 comments

The pith

Two distributed stochastic proximal gradient algorithms achieve convergence rates comparable to the centralized method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified framework for distributed agents to minimize the average of local cost functions plus a shared nonsmooth function using stochastic proximal gradient methods based on normal maps. It proposes two algorithms, norM-DSGT and norM-ED, that asymptotically match the convergence rates of the centralized stochastic proximal gradient descent under a general variance condition on the stochastic gradients. This condition does not depend on the network topology or the nonsmooth term. The transient iterations needed are analyzed in terms of network size and spectral gap, matching or approaching those of non-proximal counterparts.

Core claim

By incorporating the normal map update scheme into distributed stochastic gradient tracking and exact diffusion, the methods norM-DSGT and norM-ED attain the same asymptotic convergence rate as the centralized stochastic proximal gradient descent for composite objectives, with transient times of O(n^3/(1-λ)^2) for norM-ED and O(max{n^3/(1-λ)^2, n/(1-λ)^3}) for norM-DSGT, under the general variance condition.

What carries the argument

The normal map update scheme, which reformulates the proximal step to enable distributed implementation while handling the nonsmooth regularizer.

If this is right

Both algorithms achieve comparable rates to centralized stochastic proximal gradient descent under the variance condition.
norM-ED's transient time matches the non-proximal exact diffusion algorithm.
norM-DSGT matches the transient time of non-proximal DSGT when the spectral gap satisfies (1-λ)^{-1} = O(n^2).
The results establish state-of-the-art convergence for distributed composite optimization over networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The normal map may serve as a template for extending other gradient-based distributed methods to nonsmooth composite settings.
Network connectivity mainly influences the length of the transient phase rather than the final asymptotic rate.
The independence of the variance condition from topology suggests the approach could apply directly to time-varying or directed networks with similar mixing properties.

Load-bearing premise

The stochastic gradients satisfy a general variance condition that is independent of the network topology and the nonsmooth term.

What would settle it

Numerical simulation on a fixed network showing the error decay rate deviates from the centralized rate when the variance condition holds but the claimed transient time bound is violated.

Figures

Figures reproduced from arXiv: 2412.13054 by Angelia Nedi\'c, Kun Huang, Shi Pu.

**Figure 1.** Figure 1: An illustration of the subsequence {kj}k≥0. The goal is to upper bound PK−1 k=0 E[∥F γ nor(¯zk)∥ 2 ] and PK−1 k=0 E[∥Πzk∥ 2 ] according to Lemma 3.5. To achieve the goal, we utilize the multi-step analysis from Section 3. Regarding PK−1 k=0 E[∥F γ nor(¯zk)∥ 2 ], we divide the summation into two parts: one summing over the points k0, k1, . . . , kT , and the other summing over the interval kj , kj + 1, . . … view at source ↗

**Figure 2.** Figure 2: Illustration of ring graph topology with [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison among norM-ED, norM-DSGT, norM-CSGD, Prox-CSGD, Prox-DASA, and Prox-DASA-GT [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison among norM-ED, norM-DSGT, norM-CSGD, Prox-CSGD, Prox-DASA, and Prox-DASA-GT [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

Consider $n$ agents connected over a network collaborating to minimize the average of their local cost functions combined with a common nonsmooth function. This paper introduces a unified algorithmic framework for solving such a problem through distributed stochastic proximal gradient methods, leveraging the normal map update scheme. Within this framework, we propose two new algorithms, termed Normal Map-based Distributed Stochastic Gradient Tracking (norM-DSGT) and Normal Map-based Exact Diffusion (norM-ED). We demonstrate that both methods can asymptotically achieve comparable convergence rates to the centralized stochastic proximal gradient descent method under a general variance condition on the stochastic gradients. Additionally, the number of iterations required for norM-ED to achieve such a rate (i.e., the transient time) behaves as $\mathcal{O}(n^{3}/(1-\lambda)^2)$ for minimizing composite objective functions, matching the performance of the non-proximal ED algorithm. Here $1-\lambda$ denotes the spectral gap of the mixing matrix related to the underlying network topology. To our knowledge, such a convergence result is state-of-the-art for the considered composite problem. Under the same condition, norM-DSGT enjoys a transient time of $\mathcal{O}(\max\{n^3/(1-\lambda)^2, n/(1-\lambda)^3\})$, which matches that of the non-proximal DSGT algorithm and norM-ED under the condition $(1-\lambda)^{-1}=\mathcal{O}(n^{2})$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives two normal-map distributed stochastic proximal algorithms that match centralized rates asymptotically with transient times matching the smooth case.

read the letter

The main point is that this work introduces norM-DSGT and norM-ED, which apply a normal map reformulation to existing distributed tracking schemes so they can handle a shared nonsmooth term in stochastic composite optimization. Both reach the same asymptotic rate as centralized stochastic proximal gradient descent under a variance condition on the gradients that does not depend on the network or the nonsmooth function. The transient time for norM-ED is stated as O(n^3/(1-λ)^2), matching the non-proximal ED, and norM-DSGT gets O(max{n^3/(1-λ)^2, n/(1-λ)^3}), which aligns with DSGT when the spectral gap satisfies the given relation. This is presented as state-of-the-art for the composite setting over networks.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces two distributed stochastic proximal gradient algorithms, norM-DSGT and norM-ED, based on the normal map reformulation for solving composite optimization problems where agents minimize the average of local smooth costs plus a shared nonsmooth function over a network. It claims that both methods asymptotically achieve convergence rates comparable to the centralized stochastic proximal gradient descent under a general variance condition on the stochastic gradients that is independent of the network topology and the nonsmooth term. The transient time for norM-ED is O(n^3 / (1-λ)^2), matching the non-proximal ED, and for norM-DSGT it is O(max{n^3/(1-λ)^2, n/(1-λ)^3}), matching DSGT under certain conditions on the spectral gap.

Significance. If the convergence results hold, this work advances the field by providing state-of-the-art rates for distributed composite stochastic optimization, with the key strength being the extension to proximal settings without degrading the rates or introducing dependence on the nonsmooth term in the variance bound. The matching transient times to non-proximal counterparts is a significant contribution.

minor comments (2)

[Abstract] The abstract is quite dense with technical claims; expanding slightly on the variance condition would improve accessibility.
[Abstract] The notation for the spectral gap is introduced as 1-λ without an immediate definition, which may confuse readers unfamiliar with the context.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces norM-DSGT and norM-ED via normal map reformulation of distributed stochastic proximal gradient methods and derives asymptotic rates matching centralized stochastic proximal gradient descent under a topology-independent variance condition on stochastic gradients. Transient times are stated explicitly in terms of n and the spectral gap 1-λ, with direct comparison to non-proximal ED/DSGT performance; these are external benchmarks rather than reductions to fitted constants or self-referential definitions. No equations or claims in the provided abstract reduce a prediction to an input by construction, and no load-bearing self-citation chain is invoked to establish the central rates. The derivation chain is therefore self-contained against the stated assumptions and external comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; ledger entries are inferred from stated assumptions rather than verified in proofs.

axioms (2)

domain assumption Stochastic gradients obey a general variance condition independent of network topology
Invoked to obtain centralized rates; location: abstract paragraph on convergence.
domain assumption Mixing matrix has spectral gap 1-λ > 0 (connected network)
Used to bound transient time; location: abstract discussion of norM-ED and norM-DSGT.

pith-pipeline@v0.9.0 · 5793 in / 1178 out tokens · 34744 ms · 2026-05-23T06:40:37.824257+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

[1]

A LGHUNAIM , K

S. A LGHUNAIM , K. Y UAN, AND A. H. S AYED, A linearly convergent proximal gradient algorithm for decentral- ized optimization, Advances in Neural Information Processing Systems, 32 (2019)

work page 2019
[2]

S. A. A LGHUNAIM , E. K. R YU, K. Y UAN, AND A. H. S AYED, Decentralized proximal gradient algorithms with linear convergence rates, IEEE Transactions on Automatic Control, 66 (2020), pp. 2787–2794

work page 2020
[3]

S. A. A LGHUNAIM AND K. Y UAN, A unified and refined convergence analysis for non-convex decentralized learning, IEEE Transactions on Signal Processing, 70 (2022), pp. 3264–3279

work page 2022
[4]

ARJEVANI , Y

Y. ARJEVANI , Y. CARMON , J. C. D UCHI , D. J. F OSTER , N. S REBRO , AND B. W OODWORTH , Lower bounds for non-convex stochastic optimization, Mathematical Programming, 199 (2023), pp. 165–214

work page 2023
[5]

ATTOUCH , J

H. ATTOUCH , J. B OLTE , AND B. F. S VAITER , Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods, Mathematical Programming, 137 (2013), pp. 91–129

work page 2013
[6]

B ECK, First-order methods in optimization, SIAM, 2017

A. B ECK, First-order methods in optimization, SIAM, 2017

work page 2017
[7]

BIANCHI AND J

P. BIANCHI AND J. J AKUBOWICZ , Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization, IEEE transactions on automatic control, 58 (2012), pp. 391–405

work page 2012
[8]

B OTTOU , F

L. B OTTOU , F. E. C URTIS , AND J. N OCEDAL , Optimization methods for large-scale machine learning, Siam Review, 60 (2018), pp. 223–311

work page 2018
[9]

A. I. C HEN AND A. O ZDAGLAR , A fast distributed proximal-gradient method, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2012, pp. 601–608

work page 2012
[10]

C HIERCHIA , E

G. C HIERCHIA , E. C HOUZENOUX , P. L. C OMBETTES , AND J.-C. P ESQUET , The proximity operator repository, User‚Äôs guide http://proximity-operator. net/download/guide. pdf. Accessed, 6 (2020)

work page 2020
[11]

D AVIS AND D

D. D AVIS AND D. D RUSVYATSKIY , Stochastic model-based minimization of weakly convex functions, SIAM Journal on Optimization, 29 (2019), pp. 207–239. 32 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT

work page 2019
[12]

DI LORENZO AND G

P. DI LORENZO AND G. S CUTARI , Next: In-network nonconvex optimization, IEEE Transactions on Signal and Information Processing over Networks, 2 (2016), pp. 120–136

work page 2016
[13]

D RUSVYATSKIY AND A

D. D RUSVYATSKIY AND A. S. L EWIS , Error bounds, quadratic growth, and linear convergence of proximal methods, Mathematics of Operations Research, 43 (2018), pp. 919–948

work page 2018
[14]

E L GHECHE , G

M. E L GHECHE , G. C HIERCHIA , AND J.-C. P ESQUET , Proximity operators of discrete information divergences, IEEE Transactions on Information Theory, 64 (2017), pp. 1092–1104

work page 2017
[15]

G HADIMI , G

S. G HADIMI , G. L AN, AND H. Z HANG , Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Mathematical Programming, 155 (2016), pp. 267–305

work page 2016
[16]

L. G UO, X. S HI, J. C AO, AND Z. W ANG, Decentralized inexact proximal gradient method with network- independent stepsizes for convex composite optimization, IEEE Transactions on Signal Processing, 71 (2023), pp. 786–801

work page 2023
[17]

H UANG , X

K. H UANG , X. L I, AND S. P U, Distributed stochastic optimization under a general variance condition, IEEE Transactions on Automatic Control, (2024)

work page 2024
[18]

H UANG AND S

K. H UANG AND S. P U, Cedas: A compressed decentralized stochastic gradient method with improved conver- gence, 2023, https://arxiv.org/abs/2301.05872

work page arXiv 2023
[19]

H UANG , S

K. H UANG , S. P U, AND A. N EDI ´C, An accelerated distributed stochastic gradient method with momentum, arXiv preprint arXiv:2402.09714, (2024)

work page arXiv 2024
[20]

H UANG , L

K. H UANG , L. Z HOU , AND S. P U, Distributed random reshuffling methods with improved convergence, 2023, https://arxiv.org/abs/2306.12037

work page arXiv 2023
[21]

B. M. I DREES , S. D. S HARMA , AND K. R AJAWAT, Analysis of decentralized stochastic successive convex approximation for composite non-convex problems, arXiv preprint arXiv:2405.07100, (2024)

work page arXiv 2024
[22]

Y. JI, G. S CUTARI , Y. S UN, AND H. H ONNAPPA , Distributed sparse regression via penalization, Journal of Machine Learning Research, 24 (2023), pp. 1–62

work page 2023
[23]

K HALED AND P

A. K HALED AND P. R ICHTÁRIK , Better theory for SGD in the nonconvex world , Transactions on Machine Learning Research, (2023), https://openreview.net/forum?id=AU4qHN2VkS. Survey Certification

work page 2023
[24]

L ECUN, Y

Y. L ECUN, Y. BENGIO , AND G. H INTON , Deep learning, nature, 521 (2015), pp. 436–444

work page 2015
[25]

Gradient-based learning applied to document recognition,

Y. LECUN , L. B OTTOU , Y. BENGIO , AND P. HAFFNER , Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324, https://doi.org/10.1109/5.726791

work page doi:10.1109/5.726791 1998
[26]

Y. LEI, T. H U, G. L I, AND K. T ANG, Stochastic gradient descent for nonconvex learning without bounded gradient assumptions, IEEE transactions on neural networks and learning systems, 31 (2019), pp. 4394–4400

work page 2019
[27]

L I AND A

X. L I AND A. M ILZAREK , A unified convergence theorem for stochastic optimization methods, Advances in Neural Information Processing Systems, 35 (2022), pp. 33107–33119

work page 2022
[28]

X. L I, A. M ILZAREK , AND J. Q IU, A new random reshuffling method for nonsmooth nonconvex finite-sum optimization, arXiv preprint arXiv:2312.01047, (2023)

work page arXiv 2023
[29]

Y. LI, X. L IU, J. T ANG , M. YAN, AND K. Y UAN, Decentralized composite optimization with compression, arXiv preprint arXiv:2108.04448, (2021)

work page arXiv 2021
[30]

Z. L I, W. S HI, AND M. YAN, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, IEEE Transactions on Signal Processing, 67 (2019), pp. 4494–4506

work page 2019
[31]

M ANCINO -BALL , S

G. M ANCINO -BALL , S. M IAO, Y. XU, AND J. C HEN, Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 9055–9063

work page 2023
[32]

M ILZAREK AND J

A. M ILZAREK AND J. Q IU, Convergence of a normal map-based prox-sgd method under the kl inequality, arXiv preprint arXiv:2305.05828, (2023)

work page arXiv 2023
[33]

M ILZAREK , X

A. M ILZAREK , X. X IAO, S. C EN, Z. W EN, AND M. U LBRICH , A stochastic semismooth newton method for nonsmooth nonconvex optimization , SIAM Journal on Optimization, 29 (2019), pp. 2916–2948, https: //doi.org/10.1137/18M1181249

work page doi:10.1137/18m1181249 2019
[34]

N EDI ´C, A

A. N EDI ´C, A. O LSHEVSKY , AND M. G. R ABBAT, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE, 106 (2018), pp. 953–976

work page 2018
[35]

N EDI ´C, A

A. N EDI ´C, A. O LSHEVSKY , AND W. SHI, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, 27 (2017), pp. 2597–2633. 33 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT

work page 2017
[36]

N EDIC AND A

A. N EDIC AND A. O ZDAGLAR , Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 54 (2009), pp. 48–61

work page 2009
[37]

N EDI ´C, A

A. N EDI ´C, A. O ZDAGLAR , AND P. A. P ARRILO , Constrained consensus and optimization in multi-agent networks, IEEE Transactions on Automatic Control, 55 (2010), pp. 922–938

work page 2010
[38]

O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp

A. O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp. 1–32

work page 2022
[39]

O UYANG AND A

W. O UYANG AND A. M ILZAREK , A trust region-type normal map-based semismooth newton method for nonsmooth nonconvex composite optimization, Mathematical Programming, (2024), pp. 1–47

work page 2024
[40]

S. P U, A. O LSHEVSKY , AND I. C. P ASCHALIDIS , Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent , IEEE signal processing magazine, 37 (2020), pp. 114–122

work page 2020
[41]

R AVAZZI, S

C. R AVAZZI, S. M. F OSSON , AND E. M AGLI , Distributed iterative thresholding for ‚Ñì 0/‚Ñì 1-regularized linear inverse problems, IEEE Transactions on Information Theory, 61 (2015), pp. 2081–2100

work page 2015
[42]

S. M. R OBINSON , Normal maps induced by linear transformations, Mathematics of Operations Research, 17 (1992), pp. 691–714

work page 1992
[43]

W. S HI, Q. L ING , G. W U, AND W. Y IN, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25 (2015), pp. 944–966

work page 2015
[44]

W. SHI, Q. L ING , G. W U, AND W. YIN, A proximal gradient algorithm for decentralized composite optimization, IEEE Transactions on Signal Processing, 63 (2015), pp. 6013–6023

work page 2015
[45]

WANG , L

L. WANG , L. B AO, AND X. L IU, A decentralized proximal gradient tracking algorithm for composite optimization on riemannian manifolds, arXiv preprint arXiv:2401.11573, (2024)

work page arXiv 2024
[46]

WANG , J

Z. WANG , J. Z HANG , T.-H. C HANG , J. L I, AND Z.-Q. L UO, Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems, IEEE Transactions on Signal Processing, 69 (2021), pp. 4486– 4501

work page 2021
[47]

T. XIAO, X. C HEN , K. B ALASUBRAMANIAN , AND S. G HADIMI , A one-sample decentralized proximal algorithm for non-convex stochastic composite optimization, in Uncertainty in Artificial Intelligence, PMLR, 2023, pp. 2324– 2334

work page 2023
[48]

R. X IN, S. D AS, U. A. K HAN , AND S. K AR, A stochastic proximal gradient framework for decentralized non-convex composite optimization: Topology-independent sample complexity and communication efficiency , arXiv preprint arXiv:2110.01594, (2021)

work page arXiv 2021
[49]

J. X U, Y. T IAN , Y. S UN, AND G. S CUTARI , Distributed algorithms for composite optimization: Unified framework and convergence analysis, IEEE Transactions on Signal Processing, (2021)

work page 2021
[50]

J. X U, S. Z HU, Y. C. S OH, AND L. X IE, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in 2015 54th IEEE Conference on Decision and Control (CDC), IEEE, 2015, pp. 2055–2060

work page 2015
[51]

Y. YAN, J. C HEN , P.-Y. C HEN , X. C UI, S. L U, AND Y. XU, Compressed decentralized proximal stochastic gradient method for nonconvex composite problems with heterogeneous data , in International Conference on Machine Learning, PMLR, 2023, pp. 39035–39061

work page 2023
[52]

H. Y E, Z. Z HOU , L. L UO, AND T. ZHANG , Decentralized accelerated proximal gradient descent, Advances in Neural Information Processing Systems, 33 (2020), pp. 18308–18317

work page 2020
[53]

Y UAN, B

K. Y UAN, B. Y ING , X. Z HAO, AND A. H. S AYED, Exact diffusion for distributed optimization and learn- ing‚Äîpart i: Algorithm development, IEEE Transactions on Signal Processing, 67 (2018), pp. 708–723

work page 2018
[54]

Z ENG AND W

J. Z ENG AND W. YIN, On nonconvex decentralized gradient descent, IEEE Transactions on signal processing, 66 (2018), pp. 2834–2848

work page 2018
[55]

Z OU AND T

H. Z OU AND T. HASTIE , Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B: Statistical Methodology, 67 (2005), pp. 301–320. 34

work page 2005

[1] [1]

A LGHUNAIM , K

S. A LGHUNAIM , K. Y UAN, AND A. H. S AYED, A linearly convergent proximal gradient algorithm for decentral- ized optimization, Advances in Neural Information Processing Systems, 32 (2019)

work page 2019

[2] [2]

S. A. A LGHUNAIM , E. K. R YU, K. Y UAN, AND A. H. S AYED, Decentralized proximal gradient algorithms with linear convergence rates, IEEE Transactions on Automatic Control, 66 (2020), pp. 2787–2794

work page 2020

[3] [3]

S. A. A LGHUNAIM AND K. Y UAN, A unified and refined convergence analysis for non-convex decentralized learning, IEEE Transactions on Signal Processing, 70 (2022), pp. 3264–3279

work page 2022

[4] [4]

ARJEVANI , Y

Y. ARJEVANI , Y. CARMON , J. C. D UCHI , D. J. F OSTER , N. S REBRO , AND B. W OODWORTH , Lower bounds for non-convex stochastic optimization, Mathematical Programming, 199 (2023), pp. 165–214

work page 2023

[5] [5]

ATTOUCH , J

H. ATTOUCH , J. B OLTE , AND B. F. S VAITER , Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods, Mathematical Programming, 137 (2013), pp. 91–129

work page 2013

[6] [6]

B ECK, First-order methods in optimization, SIAM, 2017

A. B ECK, First-order methods in optimization, SIAM, 2017

work page 2017

[7] [7]

BIANCHI AND J

P. BIANCHI AND J. J AKUBOWICZ , Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization, IEEE transactions on automatic control, 58 (2012), pp. 391–405

work page 2012

[8] [8]

B OTTOU , F

L. B OTTOU , F. E. C URTIS , AND J. N OCEDAL , Optimization methods for large-scale machine learning, Siam Review, 60 (2018), pp. 223–311

work page 2018

[9] [9]

A. I. C HEN AND A. O ZDAGLAR , A fast distributed proximal-gradient method, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2012, pp. 601–608

work page 2012

[10] [10]

C HIERCHIA , E

G. C HIERCHIA , E. C HOUZENOUX , P. L. C OMBETTES , AND J.-C. P ESQUET , The proximity operator repository, User‚Äôs guide http://proximity-operator. net/download/guide. pdf. Accessed, 6 (2020)

work page 2020

[11] [11]

D AVIS AND D

D. D AVIS AND D. D RUSVYATSKIY , Stochastic model-based minimization of weakly convex functions, SIAM Journal on Optimization, 29 (2019), pp. 207–239. 32 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT

work page 2019

[12] [12]

DI LORENZO AND G

P. DI LORENZO AND G. S CUTARI , Next: In-network nonconvex optimization, IEEE Transactions on Signal and Information Processing over Networks, 2 (2016), pp. 120–136

work page 2016

[13] [13]

D RUSVYATSKIY AND A

D. D RUSVYATSKIY AND A. S. L EWIS , Error bounds, quadratic growth, and linear convergence of proximal methods, Mathematics of Operations Research, 43 (2018), pp. 919–948

work page 2018

[14] [14]

E L GHECHE , G

M. E L GHECHE , G. C HIERCHIA , AND J.-C. P ESQUET , Proximity operators of discrete information divergences, IEEE Transactions on Information Theory, 64 (2017), pp. 1092–1104

work page 2017

[15] [15]

G HADIMI , G

S. G HADIMI , G. L AN, AND H. Z HANG , Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Mathematical Programming, 155 (2016), pp. 267–305

work page 2016

[16] [16]

L. G UO, X. S HI, J. C AO, AND Z. W ANG, Decentralized inexact proximal gradient method with network- independent stepsizes for convex composite optimization, IEEE Transactions on Signal Processing, 71 (2023), pp. 786–801

work page 2023

[17] [17]

H UANG , X

K. H UANG , X. L I, AND S. P U, Distributed stochastic optimization under a general variance condition, IEEE Transactions on Automatic Control, (2024)

work page 2024

[18] [18]

H UANG AND S

K. H UANG AND S. P U, Cedas: A compressed decentralized stochastic gradient method with improved conver- gence, 2023, https://arxiv.org/abs/2301.05872

work page arXiv 2023

[19] [19]

H UANG , S

K. H UANG , S. P U, AND A. N EDI ´C, An accelerated distributed stochastic gradient method with momentum, arXiv preprint arXiv:2402.09714, (2024)

work page arXiv 2024

[20] [20]

H UANG , L

K. H UANG , L. Z HOU , AND S. P U, Distributed random reshuffling methods with improved convergence, 2023, https://arxiv.org/abs/2306.12037

work page arXiv 2023

[21] [21]

B. M. I DREES , S. D. S HARMA , AND K. R AJAWAT, Analysis of decentralized stochastic successive convex approximation for composite non-convex problems, arXiv preprint arXiv:2405.07100, (2024)

work page arXiv 2024

[22] [22]

Y. JI, G. S CUTARI , Y. S UN, AND H. H ONNAPPA , Distributed sparse regression via penalization, Journal of Machine Learning Research, 24 (2023), pp. 1–62

work page 2023

[23] [23]

K HALED AND P

A. K HALED AND P. R ICHTÁRIK , Better theory for SGD in the nonconvex world , Transactions on Machine Learning Research, (2023), https://openreview.net/forum?id=AU4qHN2VkS. Survey Certification

work page 2023

[24] [24]

L ECUN, Y

Y. L ECUN, Y. BENGIO , AND G. H INTON , Deep learning, nature, 521 (2015), pp. 436–444

work page 2015

[25] [25]

Gradient-based learning applied to document recognition,

Y. LECUN , L. B OTTOU , Y. BENGIO , AND P. HAFFNER , Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324, https://doi.org/10.1109/5.726791

work page doi:10.1109/5.726791 1998

[26] [26]

Y. LEI, T. H U, G. L I, AND K. T ANG, Stochastic gradient descent for nonconvex learning without bounded gradient assumptions, IEEE transactions on neural networks and learning systems, 31 (2019), pp. 4394–4400

work page 2019

[27] [27]

L I AND A

X. L I AND A. M ILZAREK , A unified convergence theorem for stochastic optimization methods, Advances in Neural Information Processing Systems, 35 (2022), pp. 33107–33119

work page 2022

[28] [28]

X. L I, A. M ILZAREK , AND J. Q IU, A new random reshuffling method for nonsmooth nonconvex finite-sum optimization, arXiv preprint arXiv:2312.01047, (2023)

work page arXiv 2023

[29] [29]

Y. LI, X. L IU, J. T ANG , M. YAN, AND K. Y UAN, Decentralized composite optimization with compression, arXiv preprint arXiv:2108.04448, (2021)

work page arXiv 2021

[30] [30]

Z. L I, W. S HI, AND M. YAN, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, IEEE Transactions on Signal Processing, 67 (2019), pp. 4494–4506

work page 2019

[31] [31]

M ANCINO -BALL , S

G. M ANCINO -BALL , S. M IAO, Y. XU, AND J. C HEN, Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 9055–9063

work page 2023

[32] [32]

M ILZAREK AND J

A. M ILZAREK AND J. Q IU, Convergence of a normal map-based prox-sgd method under the kl inequality, arXiv preprint arXiv:2305.05828, (2023)

work page arXiv 2023

[33] [33]

M ILZAREK , X

A. M ILZAREK , X. X IAO, S. C EN, Z. W EN, AND M. U LBRICH , A stochastic semismooth newton method for nonsmooth nonconvex optimization , SIAM Journal on Optimization, 29 (2019), pp. 2916–2948, https: //doi.org/10.1137/18M1181249

work page doi:10.1137/18m1181249 2019

[34] [34]

N EDI ´C, A

A. N EDI ´C, A. O LSHEVSKY , AND M. G. R ABBAT, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE, 106 (2018), pp. 953–976

work page 2018

[35] [35]

N EDI ´C, A

A. N EDI ´C, A. O LSHEVSKY , AND W. SHI, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, 27 (2017), pp. 2597–2633. 33 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT

work page 2017

[36] [36]

N EDIC AND A

A. N EDIC AND A. O ZDAGLAR , Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 54 (2009), pp. 48–61

work page 2009

[37] [37]

N EDI ´C, A

A. N EDI ´C, A. O ZDAGLAR , AND P. A. P ARRILO , Constrained consensus and optimization in multi-agent networks, IEEE Transactions on Automatic Control, 55 (2010), pp. 922–938

work page 2010

[38] [38]

O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp

A. O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp. 1–32

work page 2022

[39] [39]

O UYANG AND A

W. O UYANG AND A. M ILZAREK , A trust region-type normal map-based semismooth newton method for nonsmooth nonconvex composite optimization, Mathematical Programming, (2024), pp. 1–47

work page 2024

[40] [40]

S. P U, A. O LSHEVSKY , AND I. C. P ASCHALIDIS , Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent , IEEE signal processing magazine, 37 (2020), pp. 114–122

work page 2020

[41] [41]

R AVAZZI, S

C. R AVAZZI, S. M. F OSSON , AND E. M AGLI , Distributed iterative thresholding for ‚Ñì 0/‚Ñì 1-regularized linear inverse problems, IEEE Transactions on Information Theory, 61 (2015), pp. 2081–2100

work page 2015

[42] [42]

S. M. R OBINSON , Normal maps induced by linear transformations, Mathematics of Operations Research, 17 (1992), pp. 691–714

work page 1992

[43] [43]

W. S HI, Q. L ING , G. W U, AND W. Y IN, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25 (2015), pp. 944–966

work page 2015

[44] [44]

W. SHI, Q. L ING , G. W U, AND W. YIN, A proximal gradient algorithm for decentralized composite optimization, IEEE Transactions on Signal Processing, 63 (2015), pp. 6013–6023

work page 2015

[45] [45]

WANG , L

L. WANG , L. B AO, AND X. L IU, A decentralized proximal gradient tracking algorithm for composite optimization on riemannian manifolds, arXiv preprint arXiv:2401.11573, (2024)

work page arXiv 2024

[46] [46]

WANG , J

Z. WANG , J. Z HANG , T.-H. C HANG , J. L I, AND Z.-Q. L UO, Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems, IEEE Transactions on Signal Processing, 69 (2021), pp. 4486– 4501

work page 2021

[47] [47]

T. XIAO, X. C HEN , K. B ALASUBRAMANIAN , AND S. G HADIMI , A one-sample decentralized proximal algorithm for non-convex stochastic composite optimization, in Uncertainty in Artificial Intelligence, PMLR, 2023, pp. 2324– 2334

work page 2023

[48] [48]

R. X IN, S. D AS, U. A. K HAN , AND S. K AR, A stochastic proximal gradient framework for decentralized non-convex composite optimization: Topology-independent sample complexity and communication efficiency , arXiv preprint arXiv:2110.01594, (2021)

work page arXiv 2021

[49] [49]

J. X U, Y. T IAN , Y. S UN, AND G. S CUTARI , Distributed algorithms for composite optimization: Unified framework and convergence analysis, IEEE Transactions on Signal Processing, (2021)

work page 2021

[50] [50]

J. X U, S. Z HU, Y. C. S OH, AND L. X IE, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in 2015 54th IEEE Conference on Decision and Control (CDC), IEEE, 2015, pp. 2055–2060

work page 2015

[51] [51]

Y. YAN, J. C HEN , P.-Y. C HEN , X. C UI, S. L U, AND Y. XU, Compressed decentralized proximal stochastic gradient method for nonconvex composite problems with heterogeneous data , in International Conference on Machine Learning, PMLR, 2023, pp. 39035–39061

work page 2023

[52] [52]

H. Y E, Z. Z HOU , L. L UO, AND T. ZHANG , Decentralized accelerated proximal gradient descent, Advances in Neural Information Processing Systems, 33 (2020), pp. 18308–18317

work page 2020

[53] [53]

Y UAN, B

K. Y UAN, B. Y ING , X. Z HAO, AND A. H. S AYED, Exact diffusion for distributed optimization and learn- ing‚Äîpart i: Algorithm development, IEEE Transactions on Signal Processing, 67 (2018), pp. 708–723

work page 2018

[54] [54]

Z ENG AND W

J. Z ENG AND W. YIN, On nonconvex decentralized gradient descent, IEEE Transactions on signal processing, 66 (2018), pp. 2834–2848

work page 2018

[55] [55]

Z OU AND T

H. Z OU AND T. HASTIE , Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B: Statistical Methodology, 67 (2005), pp. 301–320. 34

work page 2005