pith. sign in

arxiv: 2412.13054 · v3 · submitted 2024-12-17 · 🧮 math.OC

Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks

Pith reviewed 2026-05-23 06:40 UTC · model grok-4.3

classification 🧮 math.OC
keywords distributed optimizationstochastic proximal gradientnormal mapgradient trackingexact diffusionconvergence ratesnetwork topologycomposite objectives
0
0 comments X

The pith

Two distributed stochastic proximal gradient algorithms achieve convergence rates comparable to the centralized method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified framework for distributed agents to minimize the average of local cost functions plus a shared nonsmooth function using stochastic proximal gradient methods based on normal maps. It proposes two algorithms, norM-DSGT and norM-ED, that asymptotically match the convergence rates of the centralized stochastic proximal gradient descent under a general variance condition on the stochastic gradients. This condition does not depend on the network topology or the nonsmooth term. The transient iterations needed are analyzed in terms of network size and spectral gap, matching or approaching those of non-proximal counterparts.

Core claim

By incorporating the normal map update scheme into distributed stochastic gradient tracking and exact diffusion, the methods norM-DSGT and norM-ED attain the same asymptotic convergence rate as the centralized stochastic proximal gradient descent for composite objectives, with transient times of O(n^3/(1-λ)^2) for norM-ED and O(max{n^3/(1-λ)^2, n/(1-λ)^3}) for norM-DSGT, under the general variance condition.

What carries the argument

The normal map update scheme, which reformulates the proximal step to enable distributed implementation while handling the nonsmooth regularizer.

If this is right

  • Both algorithms achieve comparable rates to centralized stochastic proximal gradient descent under the variance condition.
  • norM-ED's transient time matches the non-proximal exact diffusion algorithm.
  • norM-DSGT matches the transient time of non-proximal DSGT when the spectral gap satisfies (1-λ)^{-1} = O(n^2).
  • The results establish state-of-the-art convergence for distributed composite optimization over networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The normal map may serve as a template for extending other gradient-based distributed methods to nonsmooth composite settings.
  • Network connectivity mainly influences the length of the transient phase rather than the final asymptotic rate.
  • The independence of the variance condition from topology suggests the approach could apply directly to time-varying or directed networks with similar mixing properties.

Load-bearing premise

The stochastic gradients satisfy a general variance condition that is independent of the network topology and the nonsmooth term.

What would settle it

Numerical simulation on a fixed network showing the error decay rate deviates from the centralized rate when the variance condition holds but the claimed transient time bound is violated.

Figures

Figures reproduced from arXiv: 2412.13054 by Angelia Nedi\'c, Kun Huang, Shi Pu.

Figure 1
Figure 1. Figure 1: An illustration of the subsequence {kj}k≥0. The goal is to upper bound PK−1 k=0 E[∥F γ nor(¯zk)∥ 2 ] and PK−1 k=0 E[∥Πzk∥ 2 ] according to Lemma 3.5. To achieve the goal, we utilize the multi-step analysis from Section 3. Regarding PK−1 k=0 E[∥F γ nor(¯zk)∥ 2 ], we divide the summation into two parts: one summing over the points k0, k1, . . . , kT , and the other summing over the interval kj , kj + 1, . . … view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of ring graph topology with [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison among norM-ED, norM-DSGT, norM-CSGD, Prox-CSGD, Prox-DASA, and Prox-DASA-GT [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison among norM-ED, norM-DSGT, norM-CSGD, Prox-CSGD, Prox-DASA, and Prox-DASA-GT [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
read the original abstract

Consider $n$ agents connected over a network collaborating to minimize the average of their local cost functions combined with a common nonsmooth function. This paper introduces a unified algorithmic framework for solving such a problem through distributed stochastic proximal gradient methods, leveraging the normal map update scheme. Within this framework, we propose two new algorithms, termed Normal Map-based Distributed Stochastic Gradient Tracking (norM-DSGT) and Normal Map-based Exact Diffusion (norM-ED). We demonstrate that both methods can asymptotically achieve comparable convergence rates to the centralized stochastic proximal gradient descent method under a general variance condition on the stochastic gradients. Additionally, the number of iterations required for norM-ED to achieve such a rate (i.e., the transient time) behaves as $\mathcal{O}(n^{3}/(1-\lambda)^2)$ for minimizing composite objective functions, matching the performance of the non-proximal ED algorithm. Here $1-\lambda$ denotes the spectral gap of the mixing matrix related to the underlying network topology. To our knowledge, such a convergence result is state-of-the-art for the considered composite problem. Under the same condition, norM-DSGT enjoys a transient time of $\mathcal{O}(\max\{n^3/(1-\lambda)^2, n/(1-\lambda)^3\})$, which matches that of the non-proximal DSGT algorithm and norM-ED under the condition $(1-\lambda)^{-1}=\mathcal{O}(n^{2})$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces two distributed stochastic proximal gradient algorithms, norM-DSGT and norM-ED, based on the normal map reformulation for solving composite optimization problems where agents minimize the average of local smooth costs plus a shared nonsmooth function over a network. It claims that both methods asymptotically achieve convergence rates comparable to the centralized stochastic proximal gradient descent under a general variance condition on the stochastic gradients that is independent of the network topology and the nonsmooth term. The transient time for norM-ED is O(n^3 / (1-λ)^2), matching the non-proximal ED, and for norM-DSGT it is O(max{n^3/(1-λ)^2, n/(1-λ)^3}), matching DSGT under certain conditions on the spectral gap.

Significance. If the convergence results hold, this work advances the field by providing state-of-the-art rates for distributed composite stochastic optimization, with the key strength being the extension to proximal settings without degrading the rates or introducing dependence on the nonsmooth term in the variance bound. The matching transient times to non-proximal counterparts is a significant contribution.

minor comments (2)
  1. [Abstract] The abstract is quite dense with technical claims; expanding slightly on the variance condition would improve accessibility.
  2. [Abstract] The notation for the spectral gap is introduced as 1-λ without an immediate definition, which may confuse readers unfamiliar with the context.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces norM-DSGT and norM-ED via normal map reformulation of distributed stochastic proximal gradient methods and derives asymptotic rates matching centralized stochastic proximal gradient descent under a topology-independent variance condition on stochastic gradients. Transient times are stated explicitly in terms of n and the spectral gap 1-λ, with direct comparison to non-proximal ED/DSGT performance; these are external benchmarks rather than reductions to fitted constants or self-referential definitions. No equations or claims in the provided abstract reduce a prediction to an input by construction, and no load-bearing self-citation chain is invoked to establish the central rates. The derivation chain is therefore self-contained against the stated assumptions and external comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; ledger entries are inferred from stated assumptions rather than verified in proofs.

axioms (2)
  • domain assumption Stochastic gradients obey a general variance condition independent of network topology
    Invoked to obtain centralized rates; location: abstract paragraph on convergence.
  • domain assumption Mixing matrix has spectral gap 1-λ > 0 (connected network)
    Used to bound transient time; location: abstract discussion of norM-ED and norM-DSGT.

pith-pipeline@v0.9.0 · 5793 in / 1178 out tokens · 34744 ms · 2026-05-23T06:40:37.824257+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  1. [1]

    A LGHUNAIM , K

    S. A LGHUNAIM , K. Y UAN, AND A. H. S AYED, A linearly convergent proximal gradient algorithm for decentral- ized optimization, Advances in Neural Information Processing Systems, 32 (2019)

  2. [2]

    S. A. A LGHUNAIM , E. K. R YU, K. Y UAN, AND A. H. S AYED, Decentralized proximal gradient algorithms with linear convergence rates, IEEE Transactions on Automatic Control, 66 (2020), pp. 2787–2794

  3. [3]

    S. A. A LGHUNAIM AND K. Y UAN, A unified and refined convergence analysis for non-convex decentralized learning, IEEE Transactions on Signal Processing, 70 (2022), pp. 3264–3279

  4. [4]

    ARJEVANI , Y

    Y. ARJEVANI , Y. CARMON , J. C. D UCHI , D. J. F OSTER , N. S REBRO , AND B. W OODWORTH , Lower bounds for non-convex stochastic optimization, Mathematical Programming, 199 (2023), pp. 165–214

  5. [5]

    ATTOUCH , J

    H. ATTOUCH , J. B OLTE , AND B. F. S VAITER , Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods, Mathematical Programming, 137 (2013), pp. 91–129

  6. [6]

    B ECK, First-order methods in optimization, SIAM, 2017

    A. B ECK, First-order methods in optimization, SIAM, 2017

  7. [7]

    BIANCHI AND J

    P. BIANCHI AND J. J AKUBOWICZ , Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization, IEEE transactions on automatic control, 58 (2012), pp. 391–405

  8. [8]

    B OTTOU , F

    L. B OTTOU , F. E. C URTIS , AND J. N OCEDAL , Optimization methods for large-scale machine learning, Siam Review, 60 (2018), pp. 223–311

  9. [9]

    A. I. C HEN AND A. O ZDAGLAR , A fast distributed proximal-gradient method, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2012, pp. 601–608

  10. [10]

    C HIERCHIA , E

    G. C HIERCHIA , E. C HOUZENOUX , P. L. C OMBETTES , AND J.-C. P ESQUET , The proximity operator repository, User’s guide http://proximity-operator. net/download/guide. pdf. Accessed, 6 (2020)

  11. [11]

    D AVIS AND D

    D. D AVIS AND D. D RUSVYATSKIY , Stochastic model-based minimization of weakly convex functions, SIAM Journal on Optimization, 29 (2019), pp. 207–239. 32 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT

  12. [12]

    DI LORENZO AND G

    P. DI LORENZO AND G. S CUTARI , Next: In-network nonconvex optimization, IEEE Transactions on Signal and Information Processing over Networks, 2 (2016), pp. 120–136

  13. [13]

    D RUSVYATSKIY AND A

    D. D RUSVYATSKIY AND A. S. L EWIS , Error bounds, quadratic growth, and linear convergence of proximal methods, Mathematics of Operations Research, 43 (2018), pp. 919–948

  14. [14]

    E L GHECHE , G

    M. E L GHECHE , G. C HIERCHIA , AND J.-C. P ESQUET , Proximity operators of discrete information divergences, IEEE Transactions on Information Theory, 64 (2017), pp. 1092–1104

  15. [15]

    G HADIMI , G

    S. G HADIMI , G. L AN, AND H. Z HANG , Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Mathematical Programming, 155 (2016), pp. 267–305

  16. [16]

    L. G UO, X. S HI, J. C AO, AND Z. W ANG, Decentralized inexact proximal gradient method with network- independent stepsizes for convex composite optimization, IEEE Transactions on Signal Processing, 71 (2023), pp. 786–801

  17. [17]

    H UANG , X

    K. H UANG , X. L I, AND S. P U, Distributed stochastic optimization under a general variance condition, IEEE Transactions on Automatic Control, (2024)

  18. [18]

    H UANG AND S

    K. H UANG AND S. P U, Cedas: A compressed decentralized stochastic gradient method with improved conver- gence, 2023, https://arxiv.org/abs/2301.05872

  19. [19]

    H UANG , S

    K. H UANG , S. P U, AND A. N EDI ´C, An accelerated distributed stochastic gradient method with momentum, arXiv preprint arXiv:2402.09714, (2024)

  20. [20]

    H UANG , L

    K. H UANG , L. Z HOU , AND S. P U, Distributed random reshuffling methods with improved convergence, 2023, https://arxiv.org/abs/2306.12037

  21. [21]

    B. M. I DREES , S. D. S HARMA , AND K. R AJAWAT, Analysis of decentralized stochastic successive convex approximation for composite non-convex problems, arXiv preprint arXiv:2405.07100, (2024)

  22. [22]

    Y. JI, G. S CUTARI , Y. S UN, AND H. H ONNAPPA , Distributed sparse regression via penalization, Journal of Machine Learning Research, 24 (2023), pp. 1–62

  23. [23]

    K HALED AND P

    A. K HALED AND P. R ICHTÁRIK , Better theory for SGD in the nonconvex world , Transactions on Machine Learning Research, (2023), https://openreview.net/forum?id=AU4qHN2VkS. Survey Certification

  24. [24]

    L ECUN, Y

    Y. L ECUN, Y. BENGIO , AND G. H INTON , Deep learning, nature, 521 (2015), pp. 436–444

  25. [25]

    Gradient-based learning applied to document recognition,

    Y. LECUN , L. B OTTOU , Y. BENGIO , AND P. HAFFNER , Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324, https://doi.org/10.1109/5.726791

  26. [26]

    Y. LEI, T. H U, G. L I, AND K. T ANG, Stochastic gradient descent for nonconvex learning without bounded gradient assumptions, IEEE transactions on neural networks and learning systems, 31 (2019), pp. 4394–4400

  27. [27]

    L I AND A

    X. L I AND A. M ILZAREK , A unified convergence theorem for stochastic optimization methods, Advances in Neural Information Processing Systems, 35 (2022), pp. 33107–33119

  28. [28]

    X. L I, A. M ILZAREK , AND J. Q IU, A new random reshuffling method for nonsmooth nonconvex finite-sum optimization, arXiv preprint arXiv:2312.01047, (2023)

  29. [29]

    Y. LI, X. L IU, J. T ANG , M. YAN, AND K. Y UAN, Decentralized composite optimization with compression, arXiv preprint arXiv:2108.04448, (2021)

  30. [30]

    Z. L I, W. S HI, AND M. YAN, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, IEEE Transactions on Signal Processing, 67 (2019), pp. 4494–4506

  31. [31]

    M ANCINO -BALL , S

    G. M ANCINO -BALL , S. M IAO, Y. XU, AND J. C HEN, Proximal stochastic recursive momentum methods for nonconvex composite decentralized optimization, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 9055–9063

  32. [32]

    M ILZAREK AND J

    A. M ILZAREK AND J. Q IU, Convergence of a normal map-based prox-sgd method under the kl inequality, arXiv preprint arXiv:2305.05828, (2023)

  33. [33]

    M ILZAREK , X

    A. M ILZAREK , X. X IAO, S. C EN, Z. W EN, AND M. U LBRICH , A stochastic semismooth newton method for nonsmooth nonconvex optimization , SIAM Journal on Optimization, 29 (2019), pp. 2916–2948, https: //doi.org/10.1137/18M1181249

  34. [34]

    N EDI ´C, A

    A. N EDI ´C, A. O LSHEVSKY , AND M. G. R ABBAT, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE, 106 (2018), pp. 953–976

  35. [35]

    N EDI ´C, A

    A. N EDI ´C, A. O LSHEVSKY , AND W. SHI, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, 27 (2017), pp. 2597–2633. 33 Distributed Normal Map-based Stochastic Proximal Gradient Methods over Networks A PREPRINT

  36. [36]

    N EDIC AND A

    A. N EDIC AND A. O ZDAGLAR , Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 54 (2009), pp. 48–61

  37. [37]

    N EDI ´C, A

    A. N EDI ´C, A. O ZDAGLAR , AND P. A. P ARRILO , Constrained consensus and optimization in multi-agent networks, IEEE Transactions on Automatic Control, 55 (2010), pp. 922–938

  38. [38]

    O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp

    A. O LSHEVSKY , Asymptotic network independence and step-size for a distributed subgradient method, Journal of machine learning research, 23 (2022), pp. 1–32

  39. [39]

    O UYANG AND A

    W. O UYANG AND A. M ILZAREK , A trust region-type normal map-based semismooth newton method for nonsmooth nonconvex composite optimization, Mathematical Programming, (2024), pp. 1–47

  40. [40]

    S. P U, A. O LSHEVSKY , AND I. C. P ASCHALIDIS , Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent , IEEE signal processing magazine, 37 (2020), pp. 114–122

  41. [41]

    R AVAZZI, S

    C. R AVAZZI, S. M. F OSSON , AND E. M AGLI , Distributed iterative thresholding for ‚Ñì 0/‚Ñì 1-regularized linear inverse problems, IEEE Transactions on Information Theory, 61 (2015), pp. 2081–2100

  42. [42]

    S. M. R OBINSON , Normal maps induced by linear transformations, Mathematics of Operations Research, 17 (1992), pp. 691–714

  43. [43]

    W. S HI, Q. L ING , G. W U, AND W. Y IN, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25 (2015), pp. 944–966

  44. [44]

    W. SHI, Q. L ING , G. W U, AND W. YIN, A proximal gradient algorithm for decentralized composite optimization, IEEE Transactions on Signal Processing, 63 (2015), pp. 6013–6023

  45. [45]

    WANG , L

    L. WANG , L. B AO, AND X. L IU, A decentralized proximal gradient tracking algorithm for composite optimization on riemannian manifolds, arXiv preprint arXiv:2401.11573, (2024)

  46. [46]

    WANG , J

    Z. WANG , J. Z HANG , T.-H. C HANG , J. L I, AND Z.-Q. L UO, Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems, IEEE Transactions on Signal Processing, 69 (2021), pp. 4486– 4501

  47. [47]

    T. XIAO, X. C HEN , K. B ALASUBRAMANIAN , AND S. G HADIMI , A one-sample decentralized proximal algorithm for non-convex stochastic composite optimization, in Uncertainty in Artificial Intelligence, PMLR, 2023, pp. 2324– 2334

  48. [48]

    R. X IN, S. D AS, U. A. K HAN , AND S. K AR, A stochastic proximal gradient framework for decentralized non-convex composite optimization: Topology-independent sample complexity and communication efficiency , arXiv preprint arXiv:2110.01594, (2021)

  49. [49]

    J. X U, Y. T IAN , Y. S UN, AND G. S CUTARI , Distributed algorithms for composite optimization: Unified framework and convergence analysis, IEEE Transactions on Signal Processing, (2021)

  50. [50]

    J. X U, S. Z HU, Y. C. S OH, AND L. X IE, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in 2015 54th IEEE Conference on Decision and Control (CDC), IEEE, 2015, pp. 2055–2060

  51. [51]

    Y. YAN, J. C HEN , P.-Y. C HEN , X. C UI, S. L U, AND Y. XU, Compressed decentralized proximal stochastic gradient method for nonconvex composite problems with heterogeneous data , in International Conference on Machine Learning, PMLR, 2023, pp. 39035–39061

  52. [52]

    H. Y E, Z. Z HOU , L. L UO, AND T. ZHANG , Decentralized accelerated proximal gradient descent, Advances in Neural Information Processing Systems, 33 (2020), pp. 18308–18317

  53. [53]

    Y UAN, B

    K. Y UAN, B. Y ING , X. Z HAO, AND A. H. S AYED, Exact diffusion for distributed optimization and learn- ing‚Äîpart i: Algorithm development, IEEE Transactions on Signal Processing, 67 (2018), pp. 708–723

  54. [54]

    Z ENG AND W

    J. Z ENG AND W. YIN, On nonconvex decentralized gradient descent, IEEE Transactions on signal processing, 66 (2018), pp. 2834–2848

  55. [55]

    Z OU AND T

    H. Z OU AND T. HASTIE , Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B: Statistical Methodology, 67 (2005), pp. 301–320. 34