Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks
Pith reviewed 2026-05-23 06:40 UTC · model grok-4.3
The pith
Two distributed stochastic proximal gradient algorithms achieve convergence rates comparable to the centralized method.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By incorporating the normal map update scheme into distributed stochastic gradient tracking and exact diffusion, the methods norM-DSGT and norM-ED attain the same asymptotic convergence rate as the centralized stochastic proximal gradient descent for composite objectives, with transient times of O(n^3/(1-λ)^2) for norM-ED and O(max{n^3/(1-λ)^2, n/(1-λ)^3}) for norM-DSGT, under the general variance condition.
What carries the argument
The normal map update scheme, which reformulates the proximal step to enable distributed implementation while handling the nonsmooth regularizer.
If this is right
- Both algorithms achieve comparable rates to centralized stochastic proximal gradient descent under the variance condition.
- norM-ED's transient time matches the non-proximal exact diffusion algorithm.
- norM-DSGT matches the transient time of non-proximal DSGT when the spectral gap satisfies (1-λ)^{-1} = O(n^2).
- The results establish state-of-the-art convergence for distributed composite optimization over networks.
Where Pith is reading between the lines
- The normal map may serve as a template for extending other gradient-based distributed methods to nonsmooth composite settings.
- Network connectivity mainly influences the length of the transient phase rather than the final asymptotic rate.
- The independence of the variance condition from topology suggests the approach could apply directly to time-varying or directed networks with similar mixing properties.
Load-bearing premise
The stochastic gradients satisfy a general variance condition that is independent of the network topology and the nonsmooth term.
What would settle it
Numerical simulation on a fixed network showing the error decay rate deviates from the centralized rate when the variance condition holds but the claimed transient time bound is violated.
Figures
read the original abstract
Consider $n$ agents connected over a network collaborating to minimize the average of their local cost functions combined with a common nonsmooth function. This paper introduces a unified algorithmic framework for solving such a problem through distributed stochastic proximal gradient methods, leveraging the normal map update scheme. Within this framework, we propose two new algorithms, termed Normal Map-based Distributed Stochastic Gradient Tracking (norM-DSGT) and Normal Map-based Exact Diffusion (norM-ED). We demonstrate that both methods can asymptotically achieve comparable convergence rates to the centralized stochastic proximal gradient descent method under a general variance condition on the stochastic gradients. Additionally, the number of iterations required for norM-ED to achieve such a rate (i.e., the transient time) behaves as $\mathcal{O}(n^{3}/(1-\lambda)^2)$ for minimizing composite objective functions, matching the performance of the non-proximal ED algorithm. Here $1-\lambda$ denotes the spectral gap of the mixing matrix related to the underlying network topology. To our knowledge, such a convergence result is state-of-the-art for the considered composite problem. Under the same condition, norM-DSGT enjoys a transient time of $\mathcal{O}(\max\{n^3/(1-\lambda)^2, n/(1-\lambda)^3\})$, which matches that of the non-proximal DSGT algorithm and norM-ED under the condition $(1-\lambda)^{-1}=\mathcal{O}(n^{2})$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces two distributed stochastic proximal gradient algorithms, norM-DSGT and norM-ED, based on the normal map reformulation for solving composite optimization problems where agents minimize the average of local smooth costs plus a shared nonsmooth function over a network. It claims that both methods asymptotically achieve convergence rates comparable to the centralized stochastic proximal gradient descent under a general variance condition on the stochastic gradients that is independent of the network topology and the nonsmooth term. The transient time for norM-ED is O(n^3 / (1-λ)^2), matching the non-proximal ED, and for norM-DSGT it is O(max{n^3/(1-λ)^2, n/(1-λ)^3}), matching DSGT under certain conditions on the spectral gap.
Significance. If the convergence results hold, this work advances the field by providing state-of-the-art rates for distributed composite stochastic optimization, with the key strength being the extension to proximal settings without degrading the rates or introducing dependence on the nonsmooth term in the variance bound. The matching transient times to non-proximal counterparts is a significant contribution.
minor comments (2)
- [Abstract] The abstract is quite dense with technical claims; expanding slightly on the variance condition would improve accessibility.
- [Abstract] The notation for the spectral gap is introduced as 1-λ without an immediate definition, which may confuse readers unfamiliar with the context.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions, and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity identified
full rationale
The paper introduces norM-DSGT and norM-ED via normal map reformulation of distributed stochastic proximal gradient methods and derives asymptotic rates matching centralized stochastic proximal gradient descent under a topology-independent variance condition on stochastic gradients. Transient times are stated explicitly in terms of n and the spectral gap 1-λ, with direct comparison to non-proximal ED/DSGT performance; these are external benchmarks rather than reductions to fitted constants or self-referential definitions. No equations or claims in the provided abstract reduce a prediction to an input by construction, and no load-bearing self-citation chain is invoked to establish the central rates. The derivation chain is therefore self-contained against the stated assumptions and external comparisons.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Stochastic gradients obey a general variance condition independent of network topology
- domain assumption Mixing matrix has spectral gap 1-λ > 0 (connected network)
Reference graph
Works this paper leans on
-
[1]
S. A LGHUNAIM , K. Y UAN, AND A. H. S AYED, A linearly convergent proximal gradient algorithm for decentral- ized optimization, Advances in Neural Information Processing Systems, 32 (2019)
work page 2019
-
[2]
S. A. A LGHUNAIM , E. K. R YU, K. Y UAN, AND A. H. S AYED, Decentralized proximal gradient algorithms with linear convergence rates, IEEE Transactions on Automatic Control, 66 (2020), pp. 2787–2794
work page 2020
-
[3]
S. A. A LGHUNAIM AND K. Y UAN, A unified and refined convergence analysis for non-convex decentralized learning, IEEE Transactions on Signal Processing, 70 (2022), pp. 3264–3279
work page 2022
-
[4]
Y. ARJEVANI , Y. CARMON , J. C. D UCHI , D. J. F OSTER , N. S REBRO , AND B. W OODWORTH , Lower bounds for non-convex stochastic optimization, Mathematical Programming, 199 (2023), pp. 165–214
work page 2023
-
[5]
H. ATTOUCH , J. B OLTE , AND B. F. S VAITER , Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods, Mathematical Programming, 137 (2013), pp. 91–129
work page 2013
-
[6]
B ECK, First-order methods in optimization, SIAM, 2017
A. B ECK, First-order methods in optimization, SIAM, 2017
work page 2017
-
[7]
P. BIANCHI AND J. J AKUBOWICZ , Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization, IEEE transactions on automatic control, 58 (2012), pp. 391–405
work page 2012
-
[8]
L. B OTTOU , F. E. C URTIS , AND J. N OCEDAL , Optimization methods for large-scale machine learning, Siam Review, 60 (2018), pp. 223–311
work page 2018
-
[9]
A. I. C HEN AND A. O ZDAGLAR , A fast distributed proximal-gradient method, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2012, pp. 601–608
work page 2012
-
[10]
G. C HIERCHIA , E. C HOUZENOUX , P. L. C OMBETTES , AND J.-C. P ESQUET , The proximity operator repository, User’s guide http://proximity-operator. net/download/guide. pdf. Accessed, 6 (2020)
work page 2020
-
[11]
D. D AVIS AND D. D RUSVYATSKIY , Stochastic model-based minimization of weakly convex functions, SIAM Journal on Optimization, 29 (2019), pp. 207–239. 32 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT
work page 2019
-
[12]
P. DI LORENZO AND G. S CUTARI , Next: In-network nonconvex optimization, IEEE Transactions on Signal and Information Processing over Networks, 2 (2016), pp. 120–136
work page 2016
-
[13]
D. D RUSVYATSKIY AND A. S. L EWIS , Error bounds, quadratic growth, and linear convergence of proximal methods, Mathematics of Operations Research, 43 (2018), pp. 919–948
work page 2018
-
[14]
M. E L GHECHE , G. C HIERCHIA , AND J.-C. P ESQUET , Proximity operators of discrete information divergences, IEEE Transactions on Information Theory, 64 (2017), pp. 1092–1104
work page 2017
-
[15]
S. G HADIMI , G. L AN, AND H. Z HANG , Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Mathematical Programming, 155 (2016), pp. 267–305
work page 2016
-
[16]
L. G UO, X. S HI, J. C AO, AND Z. W ANG, Decentralized inexact proximal gradient method with network- independent stepsizes for convex composite optimization, IEEE Transactions on Signal Processing, 71 (2023), pp. 786–801
work page 2023
-
[17]
K. H UANG , X. L I, AND S. P U, Distributed stochastic optimization under a general variance condition, IEEE Transactions on Automatic Control, (2024)
work page 2024
-
[18]
K. H UANG AND S. P U, Cedas: A compressed decentralized stochastic gradient method with improved conver- gence, 2023, https://arxiv.org/abs/2301.05872
-
[19]
K. H UANG , S. P U, AND A. N EDI ´C, An accelerated distributed stochastic gradient method with momentum, arXiv preprint arXiv:2402.09714, (2024)
-
[20]
K. H UANG , L. Z HOU , AND S. P U, Distributed random reshuffling methods with improved convergence, 2023, https://arxiv.org/abs/2306.12037
- [21]
-
[22]
Y. JI, G. S CUTARI , Y. S UN, AND H. H ONNAPPA , Distributed sparse regression via penalization, Journal of Machine Learning Research, 24 (2023), pp. 1–62
work page 2023
-
[23]
A. K HALED AND P. R ICHTÁRIK , Better theory for SGD in the nonconvex world , Transactions on Machine Learning Research, (2023), https://openreview.net/forum?id=AU4qHN2VkS. Survey Certification
work page 2023
- [24]
-
[25]
Gradient-based learning applied to document recognition,
Y. LECUN , L. B OTTOU , Y. BENGIO , AND P. HAFFNER , Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324, https://doi.org/10.1109/5.726791
-
[26]
Y. LEI, T. H U, G. L I, AND K. T ANG, Stochastic gradient descent for nonconvex learning without bounded gradient assumptions, IEEE transactions on neural networks and learning systems, 31 (2019), pp. 4394–4400
work page 2019
- [27]
- [28]
- [29]
-
[30]
Z. L I, W. S HI, AND M. YAN, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, IEEE Transactions on Signal Processing, 67 (2019), pp. 4494–4506
work page 2019
-
[31]
G. M ANCINO -BALL , S. M IAO, Y. XU, AND J. C HEN, Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 9055–9063
work page 2023
-
[32]
A. M ILZAREK AND J. Q IU, Convergence of a normal map-based prox-sgd method under the kl inequality, arXiv preprint arXiv:2305.05828, (2023)
-
[33]
A. M ILZAREK , X. X IAO, S. C EN, Z. W EN, AND M. U LBRICH , A stochastic semismooth newton method for nonsmooth nonconvex optimization , SIAM Journal on Optimization, 29 (2019), pp. 2916–2948, https: //doi.org/10.1137/18M1181249
-
[34]
A. N EDI ´C, A. O LSHEVSKY , AND M. G. R ABBAT, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE, 106 (2018), pp. 953–976
work page 2018
-
[35]
A. N EDI ´C, A. O LSHEVSKY , AND W. SHI, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, 27 (2017), pp. 2597–2633. 33 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT
work page 2017
-
[36]
A. N EDIC AND A. O ZDAGLAR , Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 54 (2009), pp. 48–61
work page 2009
-
[37]
A. N EDI ´C, A. O ZDAGLAR , AND P. A. P ARRILO , Constrained consensus and optimization in multi-agent networks, IEEE Transactions on Automatic Control, 55 (2010), pp. 922–938
work page 2010
-
[38]
A. O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp. 1–32
work page 2022
-
[39]
W. O UYANG AND A. M ILZAREK , A trust region-type normal map-based semismooth newton method for nonsmooth nonconvex composite optimization, Mathematical Programming, (2024), pp. 1–47
work page 2024
-
[40]
S. P U, A. O LSHEVSKY , AND I. C. P ASCHALIDIS , Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent , IEEE signal processing magazine, 37 (2020), pp. 114–122
work page 2020
-
[41]
C. R AVAZZI, S. M. F OSSON , AND E. M AGLI , Distributed iterative thresholding for ‚Ñì 0/‚Ñì 1-regularized linear inverse problems, IEEE Transactions on Information Theory, 61 (2015), pp. 2081–2100
work page 2015
-
[42]
S. M. R OBINSON , Normal maps induced by linear transformations, Mathematics of Operations Research, 17 (1992), pp. 691–714
work page 1992
-
[43]
W. S HI, Q. L ING , G. W U, AND W. Y IN, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25 (2015), pp. 944–966
work page 2015
-
[44]
W. SHI, Q. L ING , G. W U, AND W. YIN, A proximal gradient algorithm for decentralized composite optimization, IEEE Transactions on Signal Processing, 63 (2015), pp. 6013–6023
work page 2015
- [45]
- [46]
-
[47]
T. XIAO, X. C HEN , K. B ALASUBRAMANIAN , AND S. G HADIMI , A one-sample decentralized proximal algorithm for non-convex stochastic composite optimization, in Uncertainty in Artificial Intelligence, PMLR, 2023, pp. 2324– 2334
work page 2023
- [48]
-
[49]
J. X U, Y. T IAN , Y. S UN, AND G. S CUTARI , Distributed algorithms for composite optimization: Unified framework and convergence analysis, IEEE Transactions on Signal Processing, (2021)
work page 2021
-
[50]
J. X U, S. Z HU, Y. C. S OH, AND L. X IE, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in 2015 54th IEEE Conference on Decision and Control (CDC), IEEE, 2015, pp. 2055–2060
work page 2015
-
[51]
Y. YAN, J. C HEN , P.-Y. C HEN , X. C UI, S. L U, AND Y. XU, Compressed decentralized proximal stochastic gradient method for nonconvex composite problems with heterogeneous data , in International Conference on Machine Learning, PMLR, 2023, pp. 39035–39061
work page 2023
-
[52]
H. Y E, Z. Z HOU , L. L UO, AND T. ZHANG , Decentralized accelerated proximal gradient descent, Advances in Neural Information Processing Systems, 33 (2020), pp. 18308–18317
work page 2020
- [53]
-
[54]
J. Z ENG AND W. YIN, On nonconvex decentralized gradient descent, IEEE Transactions on signal processing, 66 (2018), pp. 2834–2848
work page 2018
-
[55]
H. Z OU AND T. HASTIE , Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B: Statistical Methodology, 67 (2005), pp. 301–320. 34
work page 2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.