pith. sign in

arxiv: 2605.00281 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.MA· math.OC

High-Probability Convergence in Decentralized Stochastic Optimization with Gradient Tracking

Pith reviewed 2026-05-09 19:48 UTC · model grok-4.3

classification 💻 cs.LG cs.MAmath.OC
keywords decentralized optimizationstochastic gradient descenthigh-probability convergencegradient trackingnon-convex optimizationPolyak-Lojasiewicz conditionmulti-agent learning
0
0 comments X

The pith

Decentralized stochastic optimization with gradient tracking achieves order-optimal high-probability convergence rates under relaxed assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to provide high-probability convergence guarantees for decentralized stochastic optimization using a bias-corrected method. It proves that GT-DSGD, which incorporates gradient tracking, attains order-optimal rates in high probability for both non-convex and Polyak-Lojasiewicz objectives, matching the rates known for centralized or average-case settings. This matters because earlier high-probability analyses in decentralized networks demanded stricter conditions on data or functions, limiting their applicability, while mean-squared error results already allowed more general cases. By closing this gap, the work shows that practical improvements from bias correction also hold when one cares about performance with high probability rather than just on average.

Core claim

We study high-probability convergence in decentralized stochastic optimization with multiple agents collaborating over a network. By analyzing GT-DSGD that uses gradient tracking for bias correction, we establish the first high-probability guarantees for such bias-corrected methods. Specifically, under relaxed sub-Gaussian noise and standard cost assumptions, the method achieves O(log(1/δ)/sqrt(nT)) for non-convex and O(log(1/δ)/(nT)) for Polyak-Lojasiewicz costs, which are order-optimal.

What carries the argument

The gradient tracking technique in GT-DSGD, which maintains a local estimate of the global average gradient to correct bias arising from local stochastic updates and network heterogeneity.

Load-bearing premise

The stochastic noise satisfies a relaxed sub-Gaussian condition and the local cost functions meet the same smoothness and other assumptions already used for the mean-squared error analysis of GT-DSGD.

What would settle it

Experiments or analysis showing that the high-probability error for GT-DSGD grows faster than the claimed order with increasing T or n, or fails entirely under sub-Gaussian noise distributions that are not strongly bounded, would contradict the rates.

Figures

Figures reproduced from arXiv: 2605.00281 by Aleksandar Armacki, Ali H. Sayed, Haoyuan Cai.

Figure 1
Figure 1. Figure 1: Exponential tail decay on synthetic data. Left to right: MSE performance and tail decay with threshold [PITH_FULL_IMAGE:figures/full_fig_p015_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Linear speed-up on synthetic data. Left to right: MSE performance and tail decay with threshold [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Exponential tail decay on real data. Top to bottom: average gradient norm across all runs and its empirical tail [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Linear speed-up on real data. Top to bottom: average gradient norm across all runs and its empirical [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
read the original abstract

We study high-probability (HP) convergence guarantees in decentralized stochastic optimization, where multiple agents collaborate to jointly train a model over a network. Existing HP results in decentralized settings almost exclusively focus on the Decentralized Stochastic Gradient Descent ($\mathtt{DSGD}$) algorithm, which requires strong assumptions, such as bounded data heterogeneity, or strong convexity of each agent's cost. This is contrary to the mean-squared error (MSE) results, where methods incorporating bias-correction techniques are known to converge under relaxed assumptions and achieve better practical performance. In this paper we provide the first step toward bridging the gap, by studying HP convergence of $\mathtt{DSGD}$ incorporating the gradient tracking technique, in the presence of noise satisfying a relaxed sub-Gaussian condition. We show that the resulting method, dubbed $\mathtt{GT-DSGD}$, achieves order-optimal HP convergence rates for both non-convex and Polyak-\L{}ojasiewicz costs, of order $\mathcal{O}\Big(\frac{\log(1/\delta)}{\sqrt{nT}}\Big)$ and $\mathcal{O}\Big(\frac{\log(1/\delta)}{nT}\Big)$, respectively, where $n$ is the number of agents, $T$ is the time horizon and $\delta \in (0,1)$ is the confidence parameter. Our results establish that $\mathtt{GT-DSGD}$ converges in the HP sense under the same conditions on the cost as in the MSE sense, while achieving comparable transient times. To the best of our knowledge, these are the first HP guarantees for decentralized optimization methods incorporating bias-correction. Numerical experiments on real and synthetic data verify our theoretical findings, underlining the superior performance of $\mathtt{GT-DSGD}$ and highlighting that the benefits of incorporating bias-correction are also maintained in the HP sense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to provide the first high-probability (HP) convergence guarantees for the gradient-tracking variant of decentralized SGD (GT-DSGD). It establishes order-optimal rates of O(log(1/δ)/√(nT)) for non-convex costs and O(log(1/δ)/(nT)) for Polyak-Łojasiewicz costs, under a relaxed sub-Gaussian noise condition and exactly the same cost-function assumptions (L-smoothness, bounded variance) previously used for mean-squared-error analysis of the same algorithm. The work positions these results as bridging the gap between existing MSE theory and the stronger assumptions required by prior HP analyses of plain DSGD.

Significance. If the central claims hold, the result is significant because it shows that bias-correction via gradient tracking preserves its advantages in the high-probability regime without needing bounded data heterogeneity or per-agent strong convexity. The rates match the optimal dependence on n, T, and δ known from centralized or MSE settings, and the numerical experiments on real and synthetic data are said to confirm the theoretical transient times and superior performance of GT-DSGD over DSGD.

major comments (1)
  1. [Abstract (and the main convergence theorems)] The abstract asserts that GT-DSGD attains the stated HP rates under the identical cost assumptions used for its MSE analysis plus only a relaxed sub-Gaussian noise condition. High-probability arguments must control the supremum of a martingale (or the coupled tracking errors) over T steps. When the noise variance proxy is permitted to grow with ||∇f(x)||, standard Azuma–Hoeffding or Freedman bounds require either an a-priori uniform bound on the differences or a self-normalized inequality; neither is automatically inherited from the MSE proof. The manuscript must explicitly state the precise noise assumption (e.g., whether the sub-Gaussian parameter is uniform or state-dependent) and identify the lemma that supplies the required tail bound.
minor comments (1)
  1. Ensure that all theorem statements explicitly list every assumption (including network connectivity and the precise form of the relaxed sub-Gaussian condition) so that readers can verify the “same conditions as MSE” claim without cross-referencing prior work.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying an important point of clarification regarding the high-probability analysis. We address the major comment below and commit to revisions that make the noise assumption and supporting lemma fully explicit while preserving the paper's claims.

read point-by-point responses
  1. Referee: [Abstract (and the main convergence theorems)] The abstract asserts that GT-DSGD attains the stated HP rates under the identical cost assumptions used for its MSE analysis plus only a relaxed sub-Gaussian noise condition. High-probability arguments must control the supremum of a martingale (or the coupled tracking errors) over T steps. When the noise variance proxy is permitted to grow with ||∇f(x)||, standard Azuma–Hoeffding or Freedman bounds require either an a-priori uniform bound on the differences or a self-normalized inequality; neither is automatically inherited from the MSE proof. The manuscript must explicitly state the precise noise assumption (e.g., whether the sub-Gaussian parameter is uniform or state-dependent) and identify the lemma that supplies the required tail bound.

    Authors: We appreciate the referee's observation on the technical requirements for high-probability bounds. The relaxed sub-Gaussian noise condition is formally introduced in Assumption 3.1 of the manuscript, which permits a state-dependent sub-Gaussian parameter that scales linearly with the gradient norm (but remains controlled by the smoothness and bounded-variance assumptions already used in the MSE analysis). To bound the supremum of the relevant martingales and coupled tracking errors over T steps, the proofs invoke a self-normalized concentration result stated as Lemma 4.2 (a Freedman-type inequality adapted to our setting). This lemma is applied directly in the proofs of Theorems 4.1 and 4.3. We agree that the abstract and theorem statements would benefit from explicit cross-references to Assumption 3.1 and Lemma 4.2. In the revised manuscript we will (i) update the abstract to name these items, (ii) add a short remark after the statement of the main theorems clarifying the application of the tail bound, and (iii) ensure the dependence on the gradient norm is highlighted in the proof sketch. These changes make the argument self-contained without altering the stated rates or the set of assumptions. revision: yes

Circularity Check

0 steps flagged

Minor self-citation not load-bearing; derivation uses external concentration tools

full rationale

The paper's high-probability analysis for GT-DSGD invokes standard martingale concentration inequalities (e.g., Freedman or Azuma-type bounds) and network mixing properties that are established independently of the target convergence rates. Cost-function assumptions are explicitly carried over from prior MSE analyses of the same algorithm, but the HP proof does not define any quantity (such as a variance proxy or tracking error bound) in terms of the claimed O(log(1/δ)/sqrt(nT)) or O(log(1/δ)/(nT)) rates, nor does it rename a fitted quantity as a prediction. Self-citations to the authors' earlier MSE work on GT-DSGD supply only algorithmic context and basic lemmas; they are not invoked as a uniqueness theorem or to justify the HP rates themselves. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on a relaxed sub-Gaussian noise model and on cost-function assumptions already standard in the MSE analysis of GT-DSGD; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Stochastic gradients satisfy a relaxed sub-Gaussian tail condition
    Explicitly stated in the abstract as the noise model that replaces stronger bounded-variance or bounded-heterogeneity assumptions.
  • domain assumption Cost functions satisfy the same conditions used for MSE convergence of GT-DSGD
    The abstract claims the HP result holds under exactly those conditions.

pith-pipeline@v0.9.0 · 5648 in / 1414 out tokens · 36874 ms · 2026-05-09T19:48:24.720856+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

89 extracted references

  1. [1]

    Fed- erated Learning: Strategies for Improving Communication Efficiency,

    J. Koneˇ cn` y, H. B. McMahan, F. X. Yu, P. Richt´ arik, A. T. Suresh, and D. Bacon, “Fed- erated Learning: Strategies for Improving Communication Efficiency,”NIPS Workshop on Private Multi-Party Machine Learning, 2016

  2. [2]

    Bullo, J

    F. Bullo, J. Cort´ es, and S. Martinez,Distributed Control of Robotic Networks: A Mathe- matical Approach to Motion Coordination Algorithms. Princeton University Press, 2009

  3. [3]

    Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,

    T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,”IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020

  4. [4]

    NEXT: In-Network Nonconvex Optimization,

    P. D. Lorenzo and G. Scutari, “NEXT: In-Network Nonconvex Optimization,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120– 136, 2016. 18

  5. [5]

    Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,

    A. Nedi´ c, A. Olshevsky, and W. Shi, “Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017

  6. [6]

    Harnessing Smoothness to Accelerate Distributed Optimization,

    G. Qu and N. Li, “Harnessing Smoothness to Accelerate Distributed Optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018

  7. [7]

    Exact Diffusion for Distributed Opti- mization and Learning—Part I: Algorithm Development,

    K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact Diffusion for Distributed Opti- mization and Learning—Part I: Algorithm Development,”IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2019

  8. [8]

    Exact Diffusion for Distributed Opti- mization and Learning—Part II: Convergence Analysis,

    K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact Diffusion for Distributed Opti- mization and Learning—Part II: Convergence Analysis,”IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 724–739, 2019

  9. [9]

    D 2: Decentralized Training over Decentralized Data,

    H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D 2: Decentralized Training over Decentralized Data,” inProceedings of the 35th International Conference on Machine Learning, vol. 80 ofProceedings of Machine Learning Research, pp. 4848–4856, PMLR, 2018

  10. [10]

    EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,

    W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

  11. [11]

    A Decentralized Proximal-Gradient Method With Network Independent Step-Sizes and Separated Convergence Rates,

    Z. Li, W. Shi, and M. Yan, “A Decentralized Proximal-Gradient Method With Network Independent Step-Sizes and Separated Convergence Rates,”IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019

  12. [12]

    An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,

    R. Xin, U. A. Khan, and S. Kar, “An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,”IEEE Transactions on Signal Processing, vol. 69, pp. 1842–1858, 2021

  13. [13]

    A Unified and Refined Convergence Analysis for Non-Convex Decentralized Learning,

    S. A. Alghunaim and K. Yuan, “A Unified and Refined Convergence Analysis for Non-Convex Decentralized Learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

  14. [14]

    Adaptation, Learning, and Optimization over Networks,

    A. H. Sayed, “Adaptation, Learning, and Optimization over Networks,”Foundations and Trends®in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014

  15. [15]

    A. H. Sayed,Inference and Learning from Data: Foundations, vol. 1. Cambridge Uni- versity Press, 2023

  16. [16]

    Tight analyses for non-smooth stochastic gradient descent,

    N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa, “Tight analyses for non-smooth stochastic gradient descent,” inProceedings of the Thirty-Second Conference on Learning Theory, vol. 99 ofProceedings of Machine Learning Research, pp. 1579–1613, PMLR, 2019

  17. [17]

    A high probability analysis of adaptive sgd with momentum,

    X. Li and F. Orabona, “A high probability analysis of adaptive sgd with momentum,” Workshop on “Beyond first-order methods in ML systems”, 37th International Confer- ence on Machine Learning, 2020. 19

  18. [18]

    High Probability Con- vergence of Stochastic Gradient Methods,

    Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High Probability Con- vergence of Stochastic Gradient Methods,” inProceedings of the 40th International Con- ference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research, pp. 21884–21914, PMLR, 2023

  19. [19]

    Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,

    T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,” inAdvances in Neural Information Processing Systems, vol. 36, pp. 24191–24222, Curran Associates, Inc., 2023

  20. [20]

    High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,

    A. Sadiev, M. Danilova, E. Gorbunov, S. Horv´ ath, G. Gidel, P. Dvurechensky, A. Gas- nikov, and P. Richt´ arik, “High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,” inProceedings of the 40th Inter- national Conference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research,...

  21. [21]

    Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,

    K. Lu, H. Wang, H. Zhang, and L. Wang, “Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 4, pp. 2189–2204, 2024

  22. [22]

    High-probability Convergence Guarantees of Decentralized SGD,

    A. Armacki and A. H. Sayed, “High-probability Convergence Guarantees of Decentralized SGD,”arXiv preprint, 2025

  23. [23]

    Robust Stochastic Approximation Approach to Stochastic Programming,

    A. Nemirovski˘ ı, A. Juditsky, G. Lan, and A. Shapiro, “Robust Stochastic Approximation Approach to Stochastic Programming,”SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009

  24. [24]

    Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming,

    S. Ghadimi and G. Lan, “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

  25. [25]

    Large deviations rates for stochastic gradient de- scent with strongly convex functions,

    D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large deviations rates for stochastic gradient de- scent with strongly convex functions,” inProceedings of The 26th International Confer- ence on Artificial Intelligence and Statistics, vol. 206 ofProceedings of Machine Learning Research, pp. 10095–10111, PMLR, 2023

  26. [26]

    Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,

    Z. Liu and Z. Zhou, “Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,” inThe Twelfth International Conference on Learning Representations, 2024

  27. [27]

    High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,

    L. Madden, E. Dall’Anese, and S. Becker, “High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,”Journal of Machine Learning Research, vol. 25, no. 241, pp. 1–36, 2024

  28. [28]

    High Probability Guarantees for Nonconvex Stochastic Gradient De- scent with Heavy Tails,

    S. Li and Y. Liu, “High Probability Guarantees for Nonconvex Stochastic Gradient De- scent with Heavy Tails,” inProceedings of the 39th International Conference on Ma- chine Learning, vol. 162 ofProceedings of Machine Learning Research, pp. 12931–12963, PMLR, 2022

  29. [29]

    General Tail Bounds for Non-Smooth Stochastic Mirror Descent,

    K. Eldowa and A. Paudice, “General Tail Bounds for Non-Smooth Stochastic Mirror Descent,” inProceedings of The 27th International Conference on Artificial Intelligence and Statistics, vol. 238 ofProceedings of Machine Learning Research, pp. 3205–3213, PMLR, 2024. 20

  30. [30]

    Stochastic Optimization with Heavy- Tailed Noise via Accelerated Gradient Clipping,

    E. Gorbunov, M. Danilova, and A. Gasnikov, “Stochastic Optimization with Heavy- Tailed Noise via Accelerated Gradient Clipping,” vol. 33, pp. 15042–15053, 2020

  31. [31]

    High probability bounds for stochas- tic subgradient schemes with heavy tailed noise,

    D. A. Parletta, A. Paudice, M. Pontil, and S. Salzo, “High probability bounds for stochas- tic subgradient schemes with heavy tailed noise,”SIAM Journal on Mathematics of Data Science, vol. 6, no. 4, pp. 953–977, 2024

  32. [32]

    High-probability Bounds for Non-Convex Stochastic Op- timization with Heavy Tails,

    A. Cutkosky and H. Mehta, “High-probability Bounds for Non-Convex Stochastic Op- timization with Heavy Tails,” inAdvances in Neural Information Processing Systems, vol. 34, pp. 4883–4895, Curran Associates, Inc., 2021

  33. [33]

    Breaking the Lower Bound with (Little) Structure: Ac- celeration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,

    Z. Liu, J. Zhang, and Z. Zhou, “Breaking the Lower Bound with (Little) Structure: Ac- celeration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,” inProceed- ings of Thirty Sixth Conference on Learning Theory, vol. 195 ofProceedings of Machine Learning Research, pp. 2266–2290, PMLR, 2023

  34. [34]

    From Gradient Clipping to Normalization for Heavy Tailed SGD,

    F. H¨ ubler, I. Fatkhullin, and N. He, “From Gradient Clipping to Normalization for Heavy Tailed SGD,” inProceedings of The 28th International Conference on Artificial Intelli- gence and Statistics, vol. 258 ofProceedings of Machine Learning Research, pp. 2413– 2421, PMLR, 03–05 May 2025

  35. [35]

    Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under (L 0, L1)-Smoothness,

    N. Kornilov, P. Zmushko, A. Semenov, A. Gasnikov, and A. Beznosikov, “Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under (L 0, L1)-Smoothness,”arXiv preprint, 2025

  36. [36]

    Lower bounds for non-convex stochastic optimization,

    Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth, “Lower bounds for non-convex stochastic optimization,”Mathematical Programming, vol. 199, no. 1, pp. 165–214, 2022

  37. [37]

    Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,

    N. Puchkin, E. Gorbunov, N. Kutuzov, and A. Gasnikov, “Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,” inProceedings of The 27th Inter- national Conference on Artificial Intelligence and Statistics, vol. 238 ofProceedings of Machine Learning Research, pp. 856–864, PMLR, 2024

  38. [38]

    Large Deviation Upper Bounds and Improved MSE Rates of Nonlinear SGD: Heavy-Tailed Noise and Power of Symme- try,

    A. Armacki, S. Yu, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large Deviation Upper Bounds and Improved MSE Rates of Nonlinear SGD: Heavy-Tailed Noise and Power of Symme- try,”SIAM Journal on Optimization, vol. 36, no. 1, pp. 32–59, 2026

  39. [39]

    Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization,

    A. Armacki, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization,”arXiv preprint, 2025

  40. [40]

    High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent un- der Heavy-tailed Noise,

    A. Armacki, S. Yu, P. Sharma, G. Joshi, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent un- der Heavy-tailed Noise,” inProceedings of The 28th International Conference on Arti- ficial Intelligence and Statistics, vol. 258 ofProceedings of Machine Learning Research, pp. 1774–1...

  41. [41]

    Better Theory for SGD in the Nonconvex World,

    A. Khaled and P. Richt´ arik, “Better Theory for SGD in the Nonconvex World,”Trans- actions on Machine Learning Research, 2023. 21

  42. [42]

    Convergence Rates for Distributed Stochastic Optimization Over Random Networks,

    D. Jakoveti´ c, D. Bajovi´ c, A. K. Sahu, and S. Kar, “Convergence Rates for Distributed Stochastic Optimization Over Random Networks,” in2018 IEEE Conference on Decision and Control (CDC), pp. 4238–4245, 2018

  43. [43]

    Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,

    S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,”IEEE Transactions on Signal Processing, vol. 69, pp. 1242– 1256, 2021

  44. [44]

    Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,

    S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,”IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021

  45. [45]

    Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,

    J. Wang and G. Joshi, “Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,”Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021

  46. [46]

    Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,

    A. Koloskova, H. Hendrikx, and S. U. Stich, “Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,” inProceedings of the 40th International Con- ference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research, pp. 17343–17363, PMLR, 2023

  47. [47]

    Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,

    R. Xin, U. A. Khan, and S. Kar, “Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,”IEEE Transactions on Signal Processing, vol. 68, pp. 6255–6271, 2020

  48. [48]

    Decentralized Nonconvex Optimization under Heavy- Tailed Noise: Normalization and Optimal Convergence,

    S. Yu, D. Jakoveti´ c, and S. Kar, “Decentralized Nonconvex Optimization under Heavy- Tailed Noise: Normalization and Optimal Convergence,” inThe Fourteenth International Conference on Learning Representations, 2026

  49. [49]

    Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,

    S. Kar, J. M. F. Moura, and K. Ramanan, “Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,”IEEE Trans- actions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012

  50. [50]

    Distributed Pareto Optimization via Diffusion Strategies,

    J. Chen and A. H. Sayed, “Distributed Pareto Optimization via Diffusion Strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, 2013

  51. [51]

    Multitask Learning Over Graphs: An Approach for Distributed, Streaming Machine Learning,

    R. Nassif, S. Vlaski, C. Richard, J. Chen, and A. H. Sayed, “Multitask Learning Over Graphs: An Approach for Distributed, Streaming Machine Learning,”IEEE Signal Pro- cessing Magazine, vol. 37, no. 3, pp. 14–25, 2020

  52. [52]

    Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,

    K. Lu, “Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,”IEEE Transactions on Auto- matic Control, vol. 69, no. 12, pp. 8860–8867, 2024

  53. [53]

    Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,

    H. Xu, K. Lu, and Y.-L. Wang, “Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,”Automatica, vol. 170, p. 111863, 2024

  54. [54]

    High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,

    Y. Qin, K. Lu, H. Xu, and X. Chen, “High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,”IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 4, pp. 2624–2632, 2025. 22

  55. [55]

    High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,

    Y. Yang, K. Lu, and L. Wang, “High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,”Systems & Control Letters, vol. 209, p. 106358, 2026

  56. [56]

    Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,

    Y. Yang, K. Lu, and L. Wang, “Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,”Automatica, vol. 182, p. 112525, 2025

  57. [57]

    Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,

    D. Bajovi´ c, D. Jakoveti´ c, J. Xavier, B. Sinopoli, and J. M. F. Moura, “Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4381–4396, 2011

  58. [58]

    Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,

    D. Bajovi´ c, D. Jakoveti´ c, J. M. F. Moura, J. Xavier, and B. Sinopoli, “Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,”IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5987–6002, 2012

  59. [59]

    Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,

    V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,”IEEE Transac- tions on Information Theory, vol. 62, no. 8, pp. 4710–4732, 2016

  60. [60]

    Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,

    V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 442–460, 2016

  61. [61]

    Adaptive Social Learning,

    V. Bordignon, V. Matta, and A. H. Sayed, “Adaptive Social Learning,”IEEE Transac- tions on Information Theory, vol. 67, no. 9, pp. 6053–6081, 2021

  62. [62]

    Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,

    D. Bajovi´ c, “Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 415–435, 2024

  63. [63]

    Matta, V

    V. Matta, V. Bordignon, and A. H. Sayed,Social Learning: Opinion Formation and Decision-Making over Graphs. Emerald Publishing Limited, 2025

  64. [64]

    R. A. Horn and C. R. Johnson,Matrix Analysis. Cambridge University Press, 2 ed., 2012

  65. [65]

    Linear Convergence of Gradient and Proximal- Gradient Methods Under the Polyak- Lojasiewicz Condition,

    H. Karimi, J. Nutini, and M. Schmidt, “Linear Convergence of Gradient and Proximal- Gradient Methods Under the Polyak- Lojasiewicz Condition,” inMachine Learning and Knowledge Discovery in Databases, pp. 795–811, Springer International, 2016

  66. [66]

    Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science

    R. Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge Uni- versity Press, 2018

  67. [67]

    Distributed stochastic gradient tracking methods,

    S. Pu and A. Nedi´ c, “Distributed stochastic gradient tracking methods,”Mathematical Programming, vol. 187, pp. 409–457, May 2021

  68. [68]

    A Unification and Generalization of Exact Distributed First-Order Meth- ods,

    D. Jakoveti´ c, “A Unification and Generalization of Exact Distributed First-Order Meth- ods,”IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 1, pp. 31–46, 2019. 23

  69. [69]

    A General Framework for Decentralized Optimization With First-Order Methods,

    R. Xin, S. Pu, A. Nedi´ c, and U. A. Khan, “A General Framework for Decentralized Optimization With First-Order Methods,”Proceedings of the IEEE, vol. 108, no. 11, pp. 1869–1889, 2020

  70. [70]

    Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,

    C. G. Lopes and A. H. Sayed, “Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,”IEEE Transactions on Signal Process- ing, vol. 56, no. 7, pp. 3122–3136, 2008

  71. [71]

    Diffusion LMS Strategies for Distributed Estimation,

    F. S. Cattivelli and A. H. Sayed, “Diffusion LMS Strategies for Distributed Estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1035–1048, 2010

  72. [72]

    Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,

    J. Chen and A. H. Sayed, “Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,”IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012

  73. [73]

    A Short Note on Con- centration Inequalities for Random Vectors with SubGaussian Norm,

    C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “A Short Note on Con- centration Inequalities for Random Vectors with SubGaussian Norm,”arXiv preprint, 2019

  74. [74]

    A Unified Theory of De- centralized SGD with Changing Topology and Local Updates,

    A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A Unified Theory of De- centralized SGD with Changing Topology and Local Updates,” inProceedings of the 37th International Conference on Machine Learning, vol. 119 ofProceedings of Machine Learning Research, pp. 5381–5393, PMLR, 2020

  75. [75]

    Fast linear iterations for distributed averaging,

    L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,”Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004

  76. [76]

    Penalized likelihood regression for gener- alized linear models with non-quadratic penalties,

    A. Antoniadis, I. Gijbels, and M. Nikolova, “Penalized likelihood regression for gener- alized linear models with non-quadratic penalties,”Annals of the Institute of Statistical Mathematics, vol. 63, no. 3, pp. 585–615, 2011

  77. [77]

    LIBSVM: A library for support vector machines,

    C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,”ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, pp. 1–27, 2011

  78. [78]

    Gradient Convergence in Gradient methods with Errors,

    D. P. Bertsekas and J. N. Tsitsiklis, “Gradient Convergence in Gradient methods with Errors,”SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000

  79. [79]

    Nesterov,Lectures on Convex Optimization

    Y. Nesterov,Lectures on Convex Optimization. Springer Optimization and Its Applica- tions, Springer Cham, 2nd ed., 2018

  80. [80]

    Distributed Subgradient Methods for Multi-Agent Opti- mization,

    A. Nedi´ c and A. Ozdaglar, “Distributed Subgradient Methods for Multi-Agent Opti- mization,”IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009

Showing first 80 references.