Clipped Stochastic Gradient Tracking For Locally Smooth Functions

Junyu Zhang; Leilei Mei

arxiv: 2605.17027 · v1 · pith:3DHDKGLGnew · submitted 2026-05-16 · 🧮 math.OC

Pith reviewed 2026-05-19 20:06 UTC · model grok-4.3

keywords: distributed optimization, stochastic gradient tracking, local smoothness, variance reduction, clipped gradients, relative uniform continuity, finite-sum problems, adaptive stepsizes

The pith

A clipped stochastic gradient tracking method with staggered variance reduction converges using only local smoothness for RUC-regular distributed problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the relative uniform continuity condition to describe allowable growth in local smoothness constants across different sets of points. It then constructs a clipped gradient tracking algorithm that incorporates staggered variance reduction and relies exclusively on these local constants rather than any global bound. The analysis covers finite-sum distributed optimization and yields an explicit complexity bound that scales with local dataset sizes. A sympathetic reader would care because many practical objectives have smoothness that varies sharply or becomes large in some regions, rendering global-smoothness methods either inefficient or inapplicable. The new condition is claimed to encompass the growth rates that arise in most common objective functions.

Core claim

For RUC-regular distributed optimization problems with finite-sum structure, we derive a clipped gradient tracking method with staggered variance reduction, which only relies on the local smoothness of objective functions, and an O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) complexity has been established for our algorithm.

What carries the argument

The relative uniform continuity (RUC) condition on the local smoothness constant viewed as a function of sets, which justifies the clipping and staggered variance reduction steps that keep the analysis valid without global constants.
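To make the idea of a continuity condition on a set function concrete, the sketch below checks one instance numerically. It assumes the condition has the relative form d_H(X, Y) ≤ δ ⇒ |1 − max{L(X)/L(Y), L(Y)/L(X)}| < ε, that the sets are closed intervals on the real line, and that the local smoothness constant grows exponentially, L(X) = sup over t in X of exp(|t|). The function names and the chosen intervals are hypothetical; none of this is taken from the paper's proofs.

```python
import math

def hausdorff_interval(x, y):
    """Hausdorff distance between closed intervals x = (a, b) and y = (c, d)."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

def local_smoothness_exp(x):
    """Hypothetical local smoothness constant L(X) = sup_{t in X} exp(|t|)."""
    return math.exp(max(abs(x[0]), abs(x[1])))

def relative_gap(x, y, L):
    """|1 - max{L(X)/L(Y), L(Y)/L(X)}|, the quantity the condition bounds."""
    lx, ly = L(x), L(y)
    return abs(1.0 - max(lx / ly, ly / lx))

eps = 0.1
delta = 0.5 * math.log(1.0 + eps)  # any delta < log(1 + eps) suffices for exp growth

worst_gap = 0.0
for centre in [0.0, 5.0, 50.0, 300.0]:
    x = (centre - 1.0, centre + 1.0)
    y = (centre - 1.0 + delta, centre + 1.0 + delta)  # x shifted by delta
    assert hausdorff_interval(x, y) <= delta + 1e-12
    worst_gap = max(worst_gap, relative_gap(x, y, local_smoothness_exp))
```

Under these assumptions the same δ works at every location, even though L itself is unbounded: sup |t| changes by at most δ between the two sets, so the ratio of constants is at most exp(δ).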

If this is right

The method converges without needing a precomputed global smoothness upper bound.
It applies when local smoothness grows logarithmically, polynomially, or exponentially with distance or set size.
The total complexity splits into a per-agent term of order n_i^{1.5} that does not depend on the target accuracy and a term of order n_i^{0.5} ε^{-1} that grows linearly in the inverse of the target accuracy.
Consensus among agents is preserved even though each agent uses a step size informed only by its own local smoothness.
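The two parts of the stated bound can be compared with a few lines of arithmetic. The sketch below reads O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) as a sum over agents of both terms, which is an assumption about the notation, and ignores the constants hidden by the O. The local dataset sizes and the accuracy are hypothetical values.

```python
def complexity_terms(local_sizes, eps):
    """Return (sum_i n_i^1.5, sum_i n_i^0.5 / eps), ignoring hidden constants."""
    setup = sum(n ** 1.5 for n in local_sizes)
    accuracy = sum(n ** 0.5 for n in local_sizes) / eps
    return setup, accuracy

sizes = [100, 400, 900]  # hypothetical local dataset sizes n_i
setup, accuracy = complexity_terms(sizes, eps=1e-2)
```

For these values the accuracy-independent term is about 36000 and the accuracy-dependent term about 6000. For a single agent the two terms balance when ε is of order 1/n_i, so the ε-dependent part only dominates at accuracies finer than that.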

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clipping-plus-staggering pattern may transfer to other adaptive distributed schemes that currently assume global Lipschitz constants.
Empirical checks of the RUC growth rate on common loss surfaces could indicate which neural-network training tasks are immediately covered.
Asynchronous or dynamic-network variants could be analyzed by verifying that the RUC condition still holds along the realized communication pattern.

Load-bearing premise

The problems must obey the relative uniform continuity condition that limits how quickly local smoothness constants can change between nearby sets.

What would settle it

Construct a finite-sum distributed problem whose local smoothness constant grows faster than any RUC-allowed function and observe whether the algorithm still meets the stated iteration bound or diverges.
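One candidate growth rate for such a stress test, offered here as an assumption rather than something the paper states, is double-exponential growth L(t) = exp(exp(t)). If the condition requires the ratio of local constants at points a fixed distance δ apart to stay near 1 uniformly over the domain, this growth violates it: the log of the ratio is exp(t)(exp(δ) − 1), which is unbounded in t. Whether it falls outside the paper's exact definition should be checked against the definition itself.

```python
import math

def log_ratio_double_exp(t, delta):
    """log of L(t + delta) / L(t) for the hypothetical L(t) = exp(exp(t))."""
    return math.exp(t + delta) - math.exp(t)

delta = 1e-3  # a fixed, small shift
log_ratios = [log_ratio_double_exp(t, delta) for t in (0.0, 5.0, 10.0, 20.0)]
```

The working with logarithms avoids overflow: the ratio itself is far beyond floating-point range at t = 20 even for this small shift.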

Figures

Figures reproduced from arXiv: 2605.17027 by Junyu Zhang, Leilei Mei.

**Figure 2.** Result of Baboon data. Three columns stand for ring, grid, and random networks.

**Figure 3.** Result of Barbara data. Three columns stand for ring, grid, and random networks.

**Figure 4.** Bank Customer Segmentation Data.

Original abstract

Most stochastic gradient tracking (GT) methods adopt pre-scheduled stepsize rules, while a few recent works have studied adaptive stepsizes that attempt to respond to the problem's local landscape. These methods, even the adaptive ones, are typically built upon the problem's global smoothness constant in both analysis and implementation. On the one hand, for many problems the local smoothness constant may vary drastically across the domain and may even be unbounded, so using the global upper bound of the local constants is too conservative. On the other hand, drastic stepsize changes can cause difficulties in the analysis of convergence and consensus of distributed algorithms, making the direct use of local smoothness constants risky and theoretically challenging. In this paper, we propose a Relative Uniform Continuity (RUC) regularity condition for the local smoothness constant as a function of sets. The RUC condition covers most common growth functions for the local smoothness constant, ranging from constant and logarithmic to polynomial and even exponential. For RUC-regular distributed optimization problems with finite-sum structure, we derive a clipped gradient tracking method with staggered variance reduction that relies only on the local smoothness of the objective functions, and we establish an $\mathcal{O}(\sum_i n_i^{1.5} + n_i^{0.5}\epsilon^{-1})$ complexity for the algorithm.
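The tension the abstract describes, between steps that follow the local landscape and consensus across agents, can be seen on a toy problem. The sketch below is not the paper's algorithm: it is a plain deterministic gradient tracking loop in which each agent's move is clipped, with no stochastic sampling and no variance reduction. The quartic local objectives (which have no global smoothness constant), the four-node ring, the function names, and the values of alpha, max_move, and iters are all illustrative assumptions.

```python
# Hypothetical toy problem: agent i holds f_i(x) = 0.25 * (x - b_i)**4, whose
# curvature 3 * (x - b_i)**2 is unbounded, so only local smoothness is available.
B = [-2.0, -1.0, 1.0, 2.0]

# Doubly stochastic mixing matrix for a 4-node ring (self 1/2, neighbours 1/4).
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]

def local_grads(x):
    """Gradient of each agent's objective at that agent's own iterate."""
    return [(xi - bi) ** 3 for xi, bi in zip(x, B)]

def mix(v):
    """One round of neighbour averaging: returns W @ v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def clip(value, bound):
    """Clip a scalar to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))

def clipped_gradient_tracking(x0, alpha=0.002, max_move=0.05, iters=3000):
    """Gradient tracking where each agent's move alpha * y_i is clipped."""
    x = list(x0)
    g = local_grads(x)
    y = list(g)  # tracker initialised at the local gradients
    for _ in range(iters):
        moves = [clip(alpha * yi, max_move) for yi in y]
        x_next = [mx - mv for mx, mv in zip(mix(x), moves)]  # mix, then descend
        g_next = local_grads(x_next)
        # Tracker update: mixing plus the change in the local gradient.
        y = [my + gn - go for my, gn, go in zip(mix(y), g_next, g)]
        x, g = x_next, g_next
    return x, y, g

x, y, g = clipped_gradient_tracking([3.0] * 4)
```

Because W is doubly stochastic, the tracker update keeps the average of y equal to the average of the local gradients at every iteration, whatever the clipping does to the individual moves. In this symmetric example the minimizer of the summed objective is x = 0, and clipping is active only while the tracked gradient is large.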

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RUC lets them drop the global Lipschitz assumption in distributed GT via a set-based condition and clipped staggered-VR algorithm, but the consensus-error handling looks like the part to check first.

Hi, the main things here are the Relative Uniform Continuity condition, which treats local smoothness as a function of sets and covers growth up to exponential, plus a clipped gradient-tracking method with staggered variance reduction that targets finite-sum distributed problems and claims the O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) rate without a global bound.

The paper does a clear job explaining why global constants are wasteful and why jumping straight to local ones breaks consensus analysis, then shows how clipping plus staggering can make the steps work under RUC. That combination is the actual new piece. The soft spot is exactly the stress-test point: if RUC is only invoked on each node's own trajectory set, the constants seen by the tracking error can still differ sharply when points are offset by disagreement, especially under exponential growth. The Lyapunov argument that closes both optimality and consensus gaps would need to handle the union of those sets or an equivalent bound; without the proofs it is not obvious this is done.

The work is aimed at people already working on distributed optimization with non-uniform or locally smooth objectives. Someone looking for concrete alternatives to global-Lipschitz GT would find the condition and the algorithm design worth reading. I would send it to referees so the details of how RUC is applied across the network and how the staggered steps close the bounds can be checked directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Relative Uniform Continuity (RUC) regularity condition on local smoothness constants viewed as functions of sets. For distributed finite-sum optimization problems satisfying RUC, it proposes a clipped stochastic gradient tracking algorithm that incorporates staggered variance reduction and relies solely on local smoothness information. The central result is an iteration complexity bound of O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) for reaching an ε-stationary point.

Significance. If the analysis is completed rigorously, the work would offer a principled approach to distributed optimization under non-uniform or rapidly growing local smoothness, avoiding overly conservative global Lipschitz assumptions that are common in gradient-tracking literature. The combination of clipping with staggered variance reduction in a distributed GT framework represents a concrete algorithmic contribution that could improve practical step-size adaptation.

major comments (2)

[RUC definition and convergence analysis (likely §4)] Definition of RUC (likely §2 or §3): The condition is stated to apply to local smoothness constants on per-node trajectory sets and to cover exponential growth. However, the gradient-tracking update and consensus error imply that nodes evaluate local functions at points offset by the current disagreement vector. It is not shown that RUC on individual node sets controls the effective Lipschitz constant experienced by the tracking error term when the union of points across nodes is considered; this gap directly affects whether the Lyapunov decrease can simultaneously close both consensus and optimality gaps.
[Theorem 5.1 / complexity analysis] Main complexity theorem (likely Theorem 5.1 or §5): The claimed O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) bound rests on the interaction between the clipping threshold (chosen from local constants) and the staggered variance-reduction steps. Without an explicit accounting of how clipping affects the variance-reduction factor under RUC (especially when local constants differ across nodes), it is unclear whether the n_i^{1.5} term remains valid or whether additional factors appear.
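The second comment turns on how clipping interacts with a variance-reduced gradient estimator. This page does not describe the paper's staggered schedule, so the sketch below uses a generic SARAH-style recursive estimator as a stand-in, on hypothetical one-dimensional quadratic components f_j(x) = 0.5 a_j x². All names and values are assumptions made for illustration.

```python
import random

def sample_grad(x, a_j):
    """Gradient of the hypothetical component f_j(x) = 0.5 * a_j * x**2."""
    return a_j * x

def full_grad(x, a):
    """Exact gradient of the local average (1/n) * sum_j f_j at x."""
    return sum(sample_grad(x, a_j) for a_j in a) / len(a)

def recursive_estimator(v_prev, x_new, x_old, batch):
    """SARAH-style update: v_new = v_prev + mean_j [grad f_j(x_new) - grad f_j(x_old)]."""
    diff = sum(sample_grad(x_new, a_j) - sample_grad(x_old, a_j) for a_j in batch)
    return v_prev + diff / len(batch)

rng = random.Random(0)
a = [1.0, 2.0, 3.0, 4.0]     # hypothetical local data, n_i = 4
x_old, x_new = 1.0, 0.9      # consecutive iterates, 0.1 apart
v_old = full_grad(x_old, a)  # periodic full-gradient refresh

v_full_batch = recursive_estimator(v_old, x_new, x_old, a)
v_mini_batch = recursive_estimator(v_old, x_new, x_old, rng.sample(a, 2))
```

With the full batch the recursion reproduces the exact gradient at the new point. With a mini-batch the error is proportional to the distance between consecutive iterates, which is the quantity a clipped move bounds. This is one plausible reading of why the clipping threshold and the variance-reduction factor have to be accounted for together.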

minor comments (2)

[Notation and preliminaries] The notation for the local sample sizes n_i and the precise definition of the RUC function should be introduced with an explicit mathematical statement before the algorithm is presented.
[Algorithm 1 and figures] Figure captions and algorithm pseudocode would benefit from explicit labeling of the clipping threshold and the staggering schedule to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below. The concerns primarily involve making certain steps in the existing analysis more explicit; we will incorporate clarifications and supporting lemmas in the revised manuscript.

Referee: [RUC definition and convergence analysis (likely §4)] Definition of RUC (likely §2 or §3): The condition is stated to apply to local smoothness constants on per-node trajectory sets and to cover exponential growth. However, the gradient-tracking update and consensus error imply that nodes evaluate local functions at points offset by the current disagreement vector. It is not shown that RUC on individual node sets controls the effective Lipschitz constant experienced by the tracking error term when the union of points across nodes is considered; this gap directly affects whether the Lyapunov decrease can simultaneously close both consensus and optimality gaps.

Authors: We agree that the interaction between the consensus error and the effective smoothness under RUC merits an explicit statement. The current proof of the Lyapunov decrease (Section 4) already constructs the relevant sets for each node to include the current disagreement vector when bounding the gradient-tracking term; RUC is then applied to these augmented per-node sets, whose union is controlled by the separate consensus-error bound. This ensures the same RUC growth function governs both the optimality and consensus terms without extra factors. To address the referee’s concern directly, we will insert a short supporting lemma (new Lemma 4.3) that formally defines the augmented sets and verifies that RUC extends to their union under the bounded-disagreement assumption already used in the analysis.
Revision: partial
Referee: [Theorem 5.1 / complexity analysis] Main complexity theorem (likely Theorem 5.1 or §5): The claimed O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1}) bound rests on the interaction between the clipping threshold (chosen from local constants) and the staggered variance-reduction steps. Without an explicit accounting of how clipping affects the variance-reduction factor under RUC (especially when local constants differ across nodes), it is unclear whether the n_i^{1.5} term remains valid or whether additional factors appear.

Authors: The clipping threshold at each node is set using the local RUC value evaluated at the current local point; the staggered variance-reduction schedule is synchronized across nodes so that the variance-reduction factor is bounded by the maximum local RUC constant appearing in any given iteration. Because RUC is a uniform continuity condition on sets, heterogeneity of the local constants does not introduce multiplicative factors beyond those already absorbed into the per-node n_i^{1.5} term. The proof of Theorem 5.1 therefore preserves the stated complexity. We will add a dedicated paragraph immediately after the statement of Theorem 5.1 that derives the variance bound under node-wise differing RUC constants and clipping, making the absence of extra factors fully transparent.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via new RUC condition and algorithm analysis

The paper proposes a new Relative Uniform Continuity (RUC) regularity condition on local smoothness constants as a function of sets, states that it covers common growth functions from constant to exponential, and then analyzes a clipped gradient tracking algorithm with staggered variance reduction for finite-sum distributed problems under this condition. The claimed complexity bound follows from the algorithm design and the RUC assumption rather than any reduction of a prediction or result to a fitted parameter, self-cited uniqueness theorem, or definitional equivalence within the paper's own equations. No load-bearing step is shown to collapse by construction to the inputs; the central claims rest on the independent content of the proposed regularity condition and the convergence analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's central claim depends primarily on the newly introduced RUC regularity condition as a domain assumption to justify using local rather than global smoothness constants in the analysis.

axioms (1)

Domain assumption: the objective functions satisfy the Relative Uniform Continuity (RUC) regularity condition for the local smoothness constant as a function of sets.
This condition is proposed by the paper to cover common growth behaviors of local smoothness and enable the clipped GT analysis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: We propose a Relative Uniform Continuity (RUC) regularity condition for the local smoothness constant as a function of sets... covers... exponential... clipped gradient tracking method with staggered variance reduction... O(∑_i n_i^{1.5} + n_i^{0.5} ε^{-1})

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · relation: unclear
Paper passage: dH(X,Y) ≤ δ implies |1 - max{L(X)/L(Y), L(Y)/L(X)}| < ε
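
Written out in full, the quoted inequality can be read as below. This is a reconstruction from the one-line excerpt only: it assumes dH denotes the Hausdorff distance between sets and L(X) the local smoothness constant on the set X, and the quantifier order shown is a guess. The paper's own definition should be consulted before relying on it.

```latex
\[
\forall \epsilon > 0,\ \exists \delta > 0:\quad
d_H(X, Y) \le \delta
\;\Longrightarrow\;
\left| 1 - \max\left\{ \frac{L(X)}{L(Y)},\ \frac{L(Y)}{L(X)} \right\} \right| < \epsilon,
\]
\[
d_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} \|x - y\|,\
\sup_{y \in Y} \inf_{x \in X} \|x - y\| \right\}.
\]
```

Read this way, the condition constrains the ratio of local constants on nearby sets rather than their difference, which is why unbounded growth such as polynomial or exponential can still qualify.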

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 1 internal anchor

[1]

Stochastic gradient push for distributed deep learning

Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning , pages 344–353. PMLR, 2019

work page 2019
[2]

A descent lemma beyond lipschitz gradient continuity: first-order methods revisited and applications

Heinz H Bauschke, Jérome Bolte, and Marc Teboulle. A descent lemma beyond lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research , 42(2):330–348, 2017

work page 2017
[3]

One hundred years since the introduction of the set distance by dimitrie pompeiu

Temistocle Birsan and Dan Tiba. One hundred years since the introduction of the set distance by dimitrie pompeiu. In System Modeling and Optimization: Proceedings of the 22nd IFIP TC7 Conference held from July 18–22, 2005, in Turin, Italy 22 , pages 35–39. Springer, 2006

work page 2005
[4]

First order methods beyond convexity and lipschitz gradient continuity with applications to quadratic inverse problems

Jérome Bolte, Shoham Sabach, Marc Teboulle, and Yakov Vaisbourd. First order methods beyond convexity and lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization , 28(3):2131–2151, 2018. 27

work page 2018
[5]

Phase retrieval via wirtinger flow: Theory and algorithms

Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory , 61(4):1985–2007, 2015

work page 1985
[6]

Diffusion adaptation strategies for distributed optimization and learning over networks

Jianshu Chen and Ali H Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing , 60(8):4289–4305, 2012

work page 2012
[7]

Generalized-smooth nonconvex optimiza- tion is as eﬀicient as smooth nonconvex optimization

Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimiza- tion is as eﬀicient as smooth nonconvex optimization. In International Conference on Machine Learning, pages 5396–5427. PMLR, 2023

work page 2023
[8]

Momentum-based variance reduction in non-convex sgd

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems , 32, 2019

work page 2019
[9]

Saga: A fast incremental gradient method with support for non-strongly convex composite objectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information pro- cessing systems , 27, 2014

work page 2014
[10]

Prox-pda: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks

Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-pda: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning , pages 1529–1538. PMLR, 2017

work page 2017
[11]

On the divergence of decentralized non-convex optimization

Mingyi Hong, Siliang Zeng, Junyu Zhang, and Haoran Sun. On the divergence of decentralized non-convex optimization. arXiv preprint arXiv:2006.11662 , 2020

work page arXiv 2006
[12]

Distributed stochastic gradient tracking al- gorithm with variance reduction for non-convex optimization

Xia Jiang, Xianlin Zeng, Jian Sun, and Jie Chen. Distributed stochastic gradient tracking al- gorithm with variance reduction for non-convex optimization. IEEE Transactions on Neural Networks and Learning Systems , 34(9):5310–5321, 2022

work page 2022
[13]

Non-convex distributionally robust optimization: Non-asymptotic analysis

Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems , 34:2771–2782, 2021

work page 2021
[14]

Accelerating stochastic gradient descent using predictive variance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems , 26, 2013

work page 2013
[15]

Revisiting gradient clipping: Stochastic bias and tight convergence guarantees

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In International Conference on Machine Learn- ing, pages 17343–17363. PMLR, 2023

work page 2023
[16]

An improved analysis of gradient tracking for decentralized machine learning

Anastasiia Koloskova, Tao Lin, and Sebastian U Stich. An improved analysis of gradient tracking for decentralized machine learning. Advances in Neural Information Processing Systems, 34:11422– 11435, 2021

work page 2021
[17]

Communication-eﬀicient distributed opti- mization in networks with gradient tracking and variance reduction

Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-eﬀicient distributed opti- mization in networks with gradient tracking and variance reduction. Journal of Machine Learning Research, 21(180):1–51, 2020

work page 2020
[18]

Convex and non-convex optimization under generalized smoothness

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems , 36:40238–40271, 2023

work page 2023
[19]

A decentralized proximal-gradient method with network inde- pendent step-sizes and separated convergence rates

Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network inde- pendent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing , 67(17):4494–4506, 2019

work page 2019
[20]

Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in neural information processing systems , 30, 2017

work page 2017
[21]

Decentralized jointly sparse optimization by reweighted lq minimization

Qing Ling, Zaiwen Wen, and Wotao Yin. Decentralized jointly sparse optimization by reweighted lq minimization. IEEE Transactions on Signal Processing , 61(5):1165–1170, 2012. 28

work page 2012
[22]

Relatively smooth convex optimization by first-order methods, and applications

Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization , 28(1):333–354, 2018

work page 2018
[23]

Gnsd: A gradient-tracking based non- convex stochastic algorithm for decentralized optimization

Songtao Lu, Xinwei Zhang, Haoran Sun, and Mingyi Hong. Gnsd: A gradient-tracking based non- convex stochastic algorithm for decentralized optimization. In 2019 IEEE Data Science Workshop (DSW), pages 315–321. IEEE, 2019

work page 2019
[24]

Accelerated first-order methods for convex optimization with locally lipschitz continuous gradient

Zhaosong Lu and Sanyou Mei. Accelerated first-order methods for convex optimization with locally lipschitz continuous gradient. SIAM Journal on Optimization , 33(3):2275–2310, 2023

work page 2023
[25]

Primal-dual extrapolation methods for monotone inclusions under local lipschitz continuity

Zhaosong Lu and Sanyou Mei. Primal-dual extrapolation methods for monotone inclusions under local lipschitz continuity. Mathematics of Operations Research , 2024

work page 2024
[26]

Distributed gradient methods for convex machine learning problems in networks

Angelia Nedic. Distributed gradient methods for convex machine learning problems in networks. IEEE Signal Processing Magazine , 10, 2020

work page 2020
[27]

Achieving geometric convergence for distributed optimization over time-varying graphs

Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization , 27(4):2597–2633, 2017

work page 2017
[28]

Distributed subgradient methods for multi-agent optimiza- tion

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimiza- tion. IEEE Transactions on Automatic Control , 54(1):48–61, 2009

work page 2009
[29]

Introductory lectures on convex optimization: A basic course , volume 87

Yurii Nesterov. Introductory lectures on convex optimization: A basic course , volume 87. Springer Science & Business Media, 2013

work page 2013
[30]

Consensus and cooperation in networked multi-agent systems

Reza Olfati-Saber, J Alex Fax, and Richard M Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE , 95(1):215–233, 2007

work page 2007
[31]

A Class of Randomized Primal-Dual Algorithms for Distributed Optimization

Jean-Christophe Pesquet and Audrey Repetti. A class of randomized primal-dual algorithms for distributed optimization. arXiv preprint arXiv:1406.6404 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

Proxsarah: An eﬀicient algorithmic framework for stochastic composite nonconvex optimization

Nhan H Pham, Lam M Nguyen, Dzung T Phan, and Quoc Tran-Dinh. Proxsarah: An eﬀicient algorithmic framework for stochastic composite nonconvex optimization. The Journal of Machine Learning Research, 21(1):4455–4502, 2020

work page 2020
[33]

Distributed stochastic gradient tracking methods

Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Mathematical Programming, 187(1):409–457, 2021

work page 2021
[34]

Harnessing smoothness to accelerate distributed optimization

Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems , 5(3):1245–1260, 2017

work page 2017
[35]

On random graph

Erdos Renyi. On random graph. Publicationes Mathematicate, 6:290–297, 1959

work page 1959
[36]

Extra: An exact first-order algorithm for decen- tralized consensus optimization

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. Extra: An exact first-order algorithm for decen- tralized consensus optimization. SIAM Journal on Optimization , 25(2):944–966, 2015

work page 2015
[37]

Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms

Haoran Sun and Mingyi Hong. Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms. IEEE Transactions on Signal processing, 67(22):5912–5928, 2019

work page 2019
[38]

Improving the sample and communication com- plexity for decentralized non-convex optimization: Joint gradient estimation and tracking

Haoran Sun, Songtao Lu, and Mingyi Hong. Improving the sample and communication com- plexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In International conference on machine learning , pages 9217–9228. PMLR, 2020

work page 2020
[39]

Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation

Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization , 32(2):354–385, 2022

work page 2022
[40]

d2: Decentralized training over decentralized data

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. d2: Decentralized training over decentralized data. In International Conference on Machine Learning , pages 4848–4856. PMLR, 2018

work page 2018
[41]

A simplified view of first order methods for optimization

Marc Teboulle. A simplified view of first order methods for optimization. Mathematical Program- ming, 170(1):67–96, 2018. 29

work page 2018
[42]

A near-optimal stochastic gradient method for decentralized non-convex finite-sum optimization

Ran Xin, Usman A Khan, and Soummya Kar. A near-optimal stochastic gradient method for decentralized non-convex finite-sum optimization. arXiv preprint arXiv:2008.07428 , 2020

work page arXiv 2008
[43]

Variance-reduced decentralized stochastic opti- mization with accelerated convergence

Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic opti- mization with accelerated convergence. IEEE Transactions on Signal Processing , 68:6255–6271, 2020

work page 2020
[44]

A fast randomized incremental gradient method for decentralized nonconvex optimization

Ran Xin, Usman A Khan, and Soummya Kar. A fast randomized incremental gradient method for decentralized nonconvex optimization. IEEE Transactions on Automatic Control , 67(10):5150– 5165, 2021

work page 2021
[45]

Fast decentralized nonconvex finite-sum optimiza- tion with recursive variance reduction

Ran Xin, Usman A Khan, and Soummya Kar. Fast decentralized nonconvex finite-sum optimiza- tion with recursive variance reduction. SIAM Journal on Optimization , 32(1):1–28, 2022

work page 2022
[46]

A general framework for decentralized optimization with first-order methods

Ran Xin, Shi Pu, Angelia Nedić, and Usman A Khan. A general framework for decentralized optimization with first-order methods. Proceedings of the IEEE , 108(11):1869–1889, 2020

work page 2020
[47]

Real analysis: theory of measure and integration second edition

James Yeh. Real analysis: theory of measure and integration second edition . World Scientific Publishing Company, 2006

work page 2006
[48]

On the convergence of decentralized gradient descent

Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization , 26(3):1835–1854, 2016

work page 2016
[49]

On nonconvex decentralized gradient descent

Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on signal processing , 66(11):2834–2848, 2018

work page 2018
[50]

Distributed optimization using the primal-dual method of multipliers

Guoqiang Zhang and Richard Heusdens. Distributed optimization using the primal-dual method of multipliers. IEEE Transactions on Signal and Information Processing over Networks , 4(1):173– 187, 2017

work page 2017
[51]

Decentralized stochastic gradient tracking for non-convex empirical risk minimization

Jiaqi Zhang and Keyou You. Decentralized stochastic gradient tracking for non-convex empirical risk minimization. arXiv preprint arXiv:1909.02712 , 2019

work page arXiv 1909
[52]

Why gradient clipping acceler- ates training: A theoretical justification for adaptivity

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping acceler- ates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019

work page 2019
[53] Junyu Zhang. Stochastic Bregman proximal gradient method revisited: Kernel conditioning and painless variance reduction. Mathematical Programming, pages 1–60, 2025.

[54] Junyu Zhang and Mingyi Hong. First-order algorithms without Lipschitz gradient: A sequential local optimization approach. INFORMS Journal on Optimization, 6(2):118–136, 2024.

[55] Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi Wang, et al. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021.

[56] Ziping Zhao, Songtao Lu, Mingyi Hong, and Daniel P Palomar. Distributed optimization for generalized phase retrieval over networks. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pages 48–52. IEEE, 2018.
