
arxiv: 2605.17027 · v1 · pith:3DHDKGLGnew · submitted 2026-05-16 · 🧮 math.OC

Clipped Stochastic Gradient Tracking For Locally Smooth Functions

Pith reviewed 2026-05-19 20:06 UTC · model grok-4.3

classification 🧮 math.OC
keywords distributed optimization · stochastic gradient tracking · local smoothness · variance reduction · clipped gradients · relative uniform continuity · finite-sum problems · adaptive stepsizes

The pith

A clipped stochastic gradient tracking method with staggered variance reduction converges using only local smoothness for RUC-regular distributed problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the relative uniform continuity condition to describe allowable growth in local smoothness constants across different sets of points. It then constructs a clipped gradient tracking algorithm that incorporates staggered variance reduction and relies exclusively on these local constants rather than any global bound. The analysis covers finite-sum distributed optimization and yields an explicit complexity bound that scales with local dataset sizes. A sympathetic reader would care because many practical objectives have smoothness that varies sharply or becomes large in some regions, rendering global-smoothness methods either inefficient or inapplicable. The new condition is claimed to encompass the growth rates that arise in most common objective functions.
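To make the idea of a clipped gradient tracking step concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: it omits variance reduction and the staggering schedule, uses a fixed step size and a fixed clipping threshold rather than ones derived from local smoothness, and applies clipping only to the step direction. The ring network, the quartic local objectives f_i(x) = (x - a_i)^4 / 4, and all names (`clipped_gradient_tracking`, `targets`, `W`) are assumptions chosen for illustration. The quartics are locally but not globally smooth, which is the setting the paper targets.

```python
def clip(value, threshold):
    """Scale value so that its magnitude never exceeds threshold."""
    magnitude = abs(value)
    if magnitude <= threshold:
        return value
    return value * (threshold / magnitude)


def mix(weights, values):
    """One consensus averaging round: returns the product weights @ values."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def clipped_gradient_tracking(grads, weights, x0, step, threshold, iters):
    """Scalar gradient tracking in which each agent clips its step direction."""
    x = list(x0)
    y = [g(xi) for g, xi in zip(grads, x)]  # trackers start at local gradients
    for _ in range(iters):
        mixed_x = mix(weights, x)
        new_x = [mx - step * clip(yi, threshold) for mx, yi in zip(mixed_x, y)]
        mixed_y = mix(weights, y)
        # Tracker update: mix, then add the change in the local gradient.
        y = [my + g(nx) - g(ox)
             for my, g, nx, ox in zip(mixed_y, grads, new_x, x)]
        x = new_x
    return x


# Hypothetical setup: 4 agents on a ring, f_i(x) = (x - a_i)**4 / 4.
targets = [-2.0, -1.0, 1.0, 2.0]
grads = [lambda x, a=a: (x - a) ** 3 for a in targets]
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]  # doubly stochastic mixing matrix

final_x = clipped_gradient_tracking(grads, W, [3.0, 2.0, 1.0, 2.0],
                                    step=0.01, threshold=1.0, iters=3000)
print(final_x)
```

Because the targets are symmetric about zero, the minimizer of the average objective is x = 0, so all agents should end close to zero. Clipping only the step direction leaves the tracker recursion untouched, which keeps the average of the trackers equal to the average of the local gradients; whether the paper makes the same choice is not stated in this summary.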

Core claim

For RUC-regular distributed optimization problems with finite-sum structure, we derive a clipped gradient tracking method with staggered variance reduction, which only relies on the local smoothness of objective functions, and an O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) complexity has been established for our algorithm.
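As a rough aid for reading the bound, the sketch below evaluates its two terms for a given set of local sample sizes. It assumes the sum over agents applies to both terms, and it drops every constant hidden by the O(·) notation, so only the scaling trends are meaningful. The function name `complexity_terms` and the sample sizes are hypothetical.

```python
def complexity_terms(local_sizes, eps):
    """Return the two terms of sum_i (n_i**1.5 + n_i**0.5 / eps).

    Constants hidden by the O(.) notation are dropped, so the numbers
    show scaling only and are not actual gradient-evaluation counts.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    setup_term = sum(n ** 1.5 for n in local_sizes)
    accuracy_term = sum(n ** 0.5 / eps for n in local_sizes)
    return setup_term, accuracy_term


# Hypothetical network: three agents holding 100, 400 and 900 samples.
sizes = [100, 400, 900]
for eps in (1e-1, 1e-2, 1e-3):
    setup, acc = complexity_terms(sizes, eps)
    print(f"eps={eps:g}: n^1.5 term={setup:.0f}, n^0.5/eps term={acc:.0f}")
```

For a single agent the two terms are equal when ε = 1/n_i, so under this reading the accuracy-dependent term dominates only once the target accuracy is finer than the inverse of the local sample size.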

What carries the argument

The relative uniform continuity (RUC) condition on the local smoothness constant viewed as a function of sets, which justifies the clipping and staggered variance reduction steps that keep the analysis valid without global constants.

If this is right

  • The method converges without needing a precomputed global smoothness upper bound.
  • It applies when local smoothness grows logarithmically, polynomially, or exponentially with distance or set size.
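  • The total complexity splits into an accuracy-independent term that scales as n_i^{1.5} in each local sample size and a term that scales as n_i^{0.5} ε^{-1}, that is, the square root of each local sample size times the inverse of the target accuracy.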
  • Consensus among agents is preserved even though each agent uses a step size informed only by its own local smoothness.
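The growth regimes mentioned in the list above can be illustrated in one dimension, where the smallest Lipschitz constant of f' on an interval equals the supremum of |f''| there. The sketch below approximates that supremum on [-r, r] for two example functions that are assumptions of this illustration, not taken from the paper: f(x) = x^4 / 4 (polynomial growth, 3r^2) and f(x) = exp(x) (exponential growth, e^r). Neither has a finite global smoothness constant.

```python
import math


def local_smoothness(second_derivative, radius, grid=1000):
    """Approximate the sup of |f''| over [-radius, radius] on a uniform grid."""
    points = [-radius + 2 * radius * k / grid for k in range(grid + 1)]
    return max(abs(second_derivative(p)) for p in points)


def quartic_second_derivative(x):
    """Second derivative of f(x) = x**4 / 4."""
    return 3 * x ** 2


def exponential_second_derivative(x):
    """Second derivative of f(x) = exp(x)."""
    return math.exp(x)


for r in (1.0, 2.0, 4.0, 8.0):
    poly = local_smoothness(quartic_second_derivative, r)
    expo = local_smoothness(exponential_second_derivative, r)
    print(f"r={r:g}: quartic L={poly:.1f}, exponential L={expo:.1f}")
```

A step size tuned to a global bound would have to be arbitrarily small for either function, while a step size tuned to the local constant can stay moderate near the origin.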

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clipping-plus-staggering pattern may transfer to other adaptive distributed schemes that currently assume global Lipschitz constants.
  • Empirical checks of the RUC growth rate on common loss surfaces could indicate which neural-network training tasks are immediately covered.
  • Asynchronous or dynamic-network variants could be analyzed by verifying that the RUC condition still holds along the realized communication pattern.

Load-bearing premise

The problems must obey the relative uniform continuity condition that limits how quickly local smoothness constants can change between nearby sets.

What would settle it

Construct a finite-sum distributed problem whose local smoothness constant grows faster than any RUC-allowed function and observe whether the algorithm still meets the stated iteration bound or diverges.

Figures

Figures reproduced from arXiv: 2605.17027 by Junyu Zhang, Leilei Mei.

Figure 1. (a) illustrates the distance upper bound for two arbitrary local iterates at time [...]
Figure 2. Result of Baboon data. Three columns stand for ring, grid, and random networks.
Figure 3. Result of Barbara data. Three columns stand for ring, grid, and random networks.
Figure 4. Bank Customer Segmentation Data.
Original abstract

Most stochastic gradient tracking (GT) methods adopt pre-scheduled stepsize rules, while a few recent works have studied adaptive stepsizes that attempt to respond to the problem's local landscape. These methods, even the adaptive ones, are typically built upon the problem's global smoothness constant in both analysis and implementation. On the one hand, for many problems the local smoothness constant may vary drastically across the domain and is sometimes even unbounded, so using the global upper bound of the local constants is too conservative. On the other hand, drastic stepsize changes can cause difficulties in the analysis of convergence and consensus of distributed algorithms, making the direct use of local smoothness constants risky and theoretically challenging. In this paper, we propose a Relative Uniform Continuity (RUC) regularity condition for the local smoothness constant as a function of sets. The RUC condition covers most common growth functions for the local smoothness constant, ranging from constant and logarithmic to polynomial and even exponential. For RUC-regular distributed optimization problems with finite-sum structure, we derive a clipped gradient tracking method with staggered variance reduction, which relies only on the local smoothness of the objective functions, and we establish an $\mathcal{O}(\sum_i n_i^{1.5}+n_i^{0.5}\epsilon^{-1})$ complexity for our algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Relative Uniform Continuity (RUC) regularity condition on local smoothness constants viewed as functions of sets. For distributed finite-sum optimization problems satisfying RUC, it proposes a clipped stochastic gradient tracking algorithm that incorporates staggered variance reduction and relies solely on local smoothness information. The central result is an iteration complexity bound of O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) for reaching an ε-stationary point.

Significance. If the analysis is completed rigorously, the work would offer a principled approach to distributed optimization under non-uniform or rapidly growing local smoothness, avoiding overly conservative global Lipschitz assumptions that are common in gradient-tracking literature. The combination of clipping with staggered variance reduction in a distributed GT framework represents a concrete algorithmic contribution that could improve practical step-size adaptation.

major comments (2)
  1. [RUC definition and convergence analysis (likely §4)] Definition of RUC (likely §2 or §3): The condition is stated to apply to local smoothness constants on per-node trajectory sets and to cover exponential growth. However, the gradient-tracking update and consensus error imply that nodes evaluate local functions at points offset by the current disagreement vector. It is not shown that RUC on individual node sets controls the effective Lipschitz constant experienced by the tracking error term when the union of points across nodes is considered; this gap directly affects whether the Lyapunov decrease can simultaneously close both consensus and optimality gaps.
  2. [Theorem 5.1 / complexity analysis] Main complexity theorem (likely Theorem 5.1 or §5): The claimed O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) bound rests on the interaction between the clipping threshold (chosen from local constants) and the staggered variance-reduction steps. Without an explicit accounting of how clipping affects the variance-reduction factor under RUC (especially when local constants differ across nodes), it is unclear whether the n_i^{1.5} term remains valid or whether additional factors appear.
minor comments (2)
  1. [Notation and preliminaries] The notation for the local sample sizes n_i and the precise definition of the RUC function should be introduced with an explicit mathematical statement before the algorithm is presented.
  2. [Algorithm 1 and figures] Figure captions and algorithm pseudocode would benefit from explicit labeling of the clipping threshold and the staggering schedule to improve readability.
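The concern in major comment 2, that clipping may interfere with variance reduction, can be seen on a toy finite sum. The sketch below uses a generic SVRG-type estimator, which is an assumption of this illustration; the paper's staggered scheme and its threshold rule are not described in this report. It enumerates every possible sampled index so that expectations are computed exactly, with hypothetical component functions f_j(x) = (x - a_j)^4 / 4.

```python
def clip(value, threshold):
    """Scale value so that its magnitude never exceeds threshold."""
    magnitude = abs(value)
    if magnitude <= threshold:
        return value
    return value * (threshold / magnitude)


def svrg_estimates(grads, x, snapshot):
    """All possible SVRG-type estimates at x, one per sampled index j."""
    full_at_snapshot = sum(g(snapshot) for g in grads) / len(grads)
    return [g(x) - g(snapshot) + full_at_snapshot for g in grads]


def mean(values):
    return sum(values) / len(values)


# Hypothetical finite sum: f_j(x) = (x - a_j)**4 / 4 with four components.
targets = [-2.0, -1.0, 1.0, 2.0]
grads = [lambda x, a=a: (x - a) ** 3 for a in targets]
x, snapshot = 1.0, 1.2

true_grad = mean([g(x) for g in grads])
estimates = svrg_estimates(grads, x, snapshot)
mean_unclipped = mean(estimates)                        # equals true_grad
mean_clipped = mean([clip(v, 5.0) for v in estimates])  # biased toward zero

print(true_grad, mean_unclipped, mean_clipped)
```

Averaged over the sampled index, the unclipped estimator reproduces the full gradient, while the clipped one does not whenever the threshold is active. Any analysis that combines the two therefore has to account for this bias, which is the accounting the referee asks to see made explicit.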

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below. The concerns primarily involve making certain steps in the existing analysis more explicit; we will incorporate clarifications and supporting lemmas in the revised manuscript.

  1. Referee: [RUC definition and convergence analysis (likely §4)] Definition of RUC (likely §2 or §3): The condition is stated to apply to local smoothness constants on per-node trajectory sets and to cover exponential growth. However, the gradient-tracking update and consensus error imply that nodes evaluate local functions at points offset by the current disagreement vector. It is not shown that RUC on individual node sets controls the effective Lipschitz constant experienced by the tracking error term when the union of points across nodes is considered; this gap directly affects whether the Lyapunov decrease can simultaneously close both consensus and optimality gaps.

    Authors: We agree that the interaction between the consensus error and the effective smoothness under RUC merits an explicit statement. The current proof of the Lyapunov decrease (Section 4) already constructs the relevant sets for each node to include the current disagreement vector when bounding the gradient-tracking term; RUC is then applied to these augmented per-node sets, whose union is controlled by the separate consensus-error bound. This ensures the same RUC growth function governs both the optimality and consensus terms without extra factors. To address the referee’s concern directly, we will insert a short supporting lemma (new Lemma 4.3) that formally defines the augmented sets and verifies that RUC extends to their union under the bounded-disagreement assumption already used in the analysis.
    Revision: partial

  2. Referee: [Theorem 5.1 / complexity analysis] Main complexity theorem (likely Theorem 5.1 or §5): The claimed O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) bound rests on the interaction between the clipping threshold (chosen from local constants) and the staggered variance-reduction steps. Without an explicit accounting of how clipping affects the variance-reduction factor under RUC (especially when local constants differ across nodes), it is unclear whether the n_i^{1.5} term remains valid or whether additional factors appear.

    Authors: The clipping threshold at each node is set using the local RUC value evaluated at the current local point; the staggered variance-reduction schedule is synchronized across nodes so that the variance-reduction factor is bounded by the maximum local RUC constant appearing in any given iteration. Because RUC is a uniform continuity condition on sets, heterogeneity of the local constants does not introduce multiplicative factors beyond those already absorbed into the per-node n_i^{1.5} term. The proof of Theorem 5.1 therefore preserves the stated complexity. We will add a dedicated paragraph immediately after the statement of Theorem 5.1 that derives the variance bound under node-wise differing RUC constants and clipping, making the absence of extra factors fully transparent.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via new RUC condition and algorithm analysis


The paper proposes a new Relative Uniform Continuity (RUC) regularity condition on local smoothness constants as a function of sets, states that it covers common growth functions from constant to exponential, and then analyzes a clipped gradient tracking algorithm with staggered variance reduction for finite-sum distributed problems under this condition. The claimed complexity bound follows from the algorithm design and the RUC assumption rather than any reduction of a prediction or result to a fitted parameter, self-cited uniqueness theorem, or definitional equivalence within the paper's own equations. No load-bearing step is shown to collapse by construction to the inputs; the central claims rest on the independent content of the proposed regularity condition and the convergence analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's central claim depends primarily on the newly introduced RUC regularity condition as a domain assumption to justify using local rather than global smoothness constants in the analysis.

axioms (1)
  • Domain assumption: the objective functions satisfy the Relative Uniform Continuity (RUC) regularity condition for the local smoothness constant as a function of sets.
    This condition is proposed by the paper to cover common growth behaviors of local smoothness and enable the clipped GT analysis.

pith-pipeline@v0.9.0 · 5755 in / 1397 out tokens · 59966 ms · 2026-05-19T20:06:34.601391+00:00 · methodology


