High-probability Convergence Guarantees of Decentralized SGD

Aleksandar Armacki; Ali H. Sayed

arxiv: 2510.06141 · v5 · submitted 2025-10-07 · 💻 cs.LG · cs.MA· math.OC

High-probability Convergence Guarantees of Decentralized SGD

Aleksandar Armacki , Ali H. Sayed This is my paper

Pith reviewed 2026-05-18 08:42 UTC · model grok-4.3

classification 💻 cs.LG cs.MAmath.OC

keywords decentralized SGDhigh-probability convergencestochastic optimizationlinear speedupnon-convex optimizationstrongly convex optimizationlight-tailed noise

0 comments

The pith

Decentralized SGD converges in high probability under the same cost conditions as mean-squared error convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that decentralized stochastic gradient descent reaches high-probability convergence when the objective satisfies exactly the assumptions already used for mean-squared error analysis. Earlier decentralized high-probability results required stronger restrictions such as uniformly bounded gradients or asymptotically vanishing noise. The new bounds remove those restrictions by working with light-tailed noise and deliver order-optimal rates for both non-convex and strongly convex objectives. The same analysis also shows a linear speedup with the number of participating nodes and competitive or better transient times than the mean-squared error counterparts.

Core claim

Decentralized SGD converges in high probability under the same conditions on the cost function as those needed for mean-squared error convergence, yielding order-optimal rates for non-convex and strongly convex problems together with a linear speedup in the number of users, all under light-tailed noise.

What carries the argument

A variance-reduction property of decentralized averaging in the high-probability regime together with a novel moment-generating-function bound for strongly convex costs.

If this is right

Order-optimal high-probability rates hold for both non-convex and strongly convex objectives without extra bounded-gradient assumptions.
Linear speedup in the number of users is obtained in the high-probability sense, matching or improving transient times from mean-squared error analysis.
The same light-tailed noise model suffices for high-probability and mean-squared error guarantees, closing the previous gap.
A variance-reduction effect of decentralized averaging appears directly in the high-probability tail bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same moment-generating-function techniques may transfer to other first-order decentralized methods such as gradient tracking or push-sum variants.
Practical implementations could monitor empirical tail behavior of gradients to verify the light-tailed premise before relying on the high-probability guarantees.
The relaxation of assumptions suggests that high-probability analysis can now be applied to federated or edge-learning settings that already satisfy mean-squared error conditions.

Load-bearing premise

The stochastic gradients possess light tails so that their moment-generating functions exist and remain bounded.

What would settle it

A concrete counter-example or numerical run in which decentralized SGD fails to converge in high probability for a light-tailed but non-sub-Gaussian noise distribution while mean-squared error convergence still holds.

Figures

Figures reproduced from arXiv: 2510.06141 by Aleksandar Armacki, Ali H. Sayed.

**Figure 1.** Figure 1: Performance of DSGD, in the MSE sense (left) and HP sense (right). We can see that DSGD achieves an exponential tail decay for all values of threshold ε. For the threshold ε = 10−4 , the tail probability starts decaying exponentially after approximately t = 6000 iterations, which is consistent with the MSE behaviour, where we can see that DSGD takes around the same number of iterations to reach the average… view at source ↗

**Figure 2.** Figure 2: Linear speed-up of DSGD, in the MSE and HP sense. Left to right and top to bottom: MSE performance and tail decay with threshold ε = 10−2 , 10−3 , 10−4 [PITH_FULL_IMAGE:figures/full_fig_p043_2.png] view at source ↗

read the original abstract

Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching, or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the MGF of strongly convex costs, which is of interest even in centralized settings. Finally, we provide experiments that validate our theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows DSGD gets high-probability convergence and linear speedup under the same cost assumptions as MSE analysis, using light-tailed noise and new variance-reduction and MGF tools.

read the letter

The main point is that this work removes the restrictive assumptions that earlier high-probability analyses of decentralized SGD needed. It shows convergence in high probability for both non-convex and strongly convex costs under the same conditions used for mean-squared error bounds, plus linear speedup in the number of nodes, all with only light-tailed stochastic gradient noise instead of uniform bounds or vanishing noise. The rates are claimed to be order-optimal and the transient times match or improve on the MSE versions, which suggests the analysis is reasonably tight. The new pieces that make this possible are a high-probability variance-reduction result for the consensus error and a fresh MGF bound for strongly convex objectives; both look like they could travel to other settings. The stress-test finds no obvious gap between the stated assumptions and the derived rates, which is the right sign. The light-tailed noise condition is still a real restriction, though clearly weaker than what came before, and it may not cover every practical data distribution. Without the full proofs it is hard to judge the exact constants or whether the induction steps are as clean as they appear in outline, but the abstract and lemmas line up coherently and there is no sign of circularity or self-referential quantities. Experiments are mentioned to check the theory, though their scope is not detailed here. This paper is aimed at researchers already working on distributed optimization who care about tail bounds rather than just average-case performance. A reader familiar with the existing MSE theory for DSGD will see the value in the new high-probability tools and the speedup claim. I would send it to a serious referee; the targeted gap is genuine and the technical steps appear substantive enough to deserve detailed checking.

Referee Report

0 major / 2 minor

Summary. The paper studies high-probability (HP) convergence of Decentralized SGD (DSGD) under light-tailed stochastic gradient noise. It claims that DSGD converges in HP under the same conditions on the cost function as required for mean-squared error (MSE) convergence, without needing uniformly bounded gradients or vanishing noise. The analysis provides order-optimal rates for both non-convex and strongly convex objectives, establishes linear speedup with the number of users, and introduces supporting technical results including a HP variance-reduction lemma and a novel MGF bound for strongly convex costs. Experiments are included to validate the theory.

Significance. If the derivations hold, this work closes a notable gap between HP and MSE analyses in decentralized optimization by relaxing prior restrictive assumptions. The linear speedup result in the HP sense appears to be the first of its kind for DSGD. The variance-reduction lemma and MGF bound are of independent interest and may apply beyond decentralized settings, including to centralized SGD.

minor comments (2)

Abstract: Explicitly stating the precise order-optimal rates (e.g., the dependence on T and number of nodes) would make the contribution clearer to readers.
The assumption of light-tailed noise (existence of MGF) is weaker than uniform bounds but should be contrasted more explicitly with common data distributions in the introduction to highlight practicality.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. We are pleased that the contributions—HP convergence of DSGD under the same conditions as MSE convergence, order-optimal rates, and the first linear speedup result in the HP sense—are recognized as closing an important gap and of independent interest. No major comments were listed in the report, so we have no specific points to address point-by-point. We remain available to incorporate any minor suggestions or clarifications in a revised version.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper establishes high-probability convergence for DSGD by deriving new technical lemmas on variance reduction for consensus error and a novel MGF bound for strongly convex costs, then closing the induction directly from standard non-convex/strongly-convex assumptions plus light-tailed noise. These steps rely on first-principles bounds and graph mixing rates rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. The linear speed-up and order-optimal rates follow from the same mixing factor used in MSE analyses but extended without reducing the final claims to inputs by construction. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard smoothness and convexity assumptions for the objective together with a light-tailed noise condition that enables the moment-generating function bounds; no new entities are postulated and no free parameters are fitted to data.

axioms (2)

domain assumption The objective function satisfies standard smoothness and either non-convexity or strong convexity.
Invoked to obtain the same conditions as MSE convergence.
domain assumption Stochastic gradients have light tails allowing moment-generating function bounds.
Key relaxed assumption replacing uniform boundedness used in earlier HP analyses.

pith-pipeline@v0.9.0 · 5865 in / 1336 out tokens · 47756 ms · 2026-05-18T08:42:09.459510+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 3 internal anchors

[1]

Adaptive Networks,

A. H. Sayed, “Adaptive Networks,”Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014

work page 2014
[2]

Communication- Efficient Learning of Deep Networks from Decentralized Data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication- Efficient Learning of Deep Networks from Decentralized Data,” inProceedings of the 20th International Conference on Artificial Intelligence and Statistics(A. Singh and J. Zhu, eds.), vol. 54 ofProceedings of Machine Learning Research, pp. 1273–1282, PMLR, 20–22 Apr 2017

work page 2017
[3]

Networked Signal and Information Processing: Learning by multiagent systems,

S. Vlaski, S. Kar, A. H. Sayed, and J. M. Moura, “Networked Signal and Information Processing: Learning by multiagent systems,”IEEE Signal Processing Magazine, vol. 40, no. 5, pp. 92–105, 2023

work page 2023
[4]

Practical secure aggregation for privacy-preserving machine learning,

K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ram- age, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” inproceedings of the 2017 ACM SIGSAC Conference on Computer and Com- munications Security, pp. 1175–1191, 2017

work page 2017
[5]

Secure Distributed Optimization Under Gradient Attacks,

S. Yu and S. Kar, “Secure Distributed Optimization Under Gradient Attacks,”IEEE Transactions on Signal Processing, vol. 71, pp. 1802–1816, 2023

work page 2023
[6]

Federated Learning: Strategies for Improving Communication Efficiency

J. Koneˇ cn` y, H. B. McMahan, F. X. Yu, P. Richt´ arik, A. T. Suresh, and D. Bacon, “Fed- erated learning: Strategies for improving communication efficiency,”arXiv:1610.05492, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

Bullo, J

F. Bullo, J. Cort´ es, and S. Martinez,Distributed Control of Robotic Networks: A Mathe- matical Approach to Motion Coordination Algorithms. Princeton University Press, 2009

work page 2009
[8]

Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,

T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,”IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020

work page 2020
[9]

Local SGD Converges Fast and Communicates Little,

S. U. Stich, “Local SGD Converges Fast and Communicates Little,” inInternational Conference on Learning Representations, 2019

work page 2019
[10]

Towards Understanding Biased Client Selection in Federated Learning ,

Y. J. Cho, J. Wang, and G. Joshi, “ Towards Understanding Biased Client Selection in Federated Learning ,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics(G. Camps-Valls, F. J. R. Ruiz, and I. Valera, eds.), vol. 151 ofProceedings of Machine Learning Research, pp. 10351–10375, PMLR, 28–30 Mar 2022. 14

work page 2022
[11]

A Unified Theory of Decen- tralized SGD with Changing Topology and Local Updates,

A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A Unified Theory of Decen- tralized SGD with Changing Topology and Local Updates,” inProceedings of the 37th International Conference on Machine Learning(H. D. III and A. Singh, eds.), vol. 119 ofProceedings of Machine Learning Research, pp. 5381–5393, PMLR, 13–18 Jul 2020

work page 2020
[12]

Asynchronous parallel algo- rithms for nonconvex optimization,

L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algo- rithms for nonconvex optimization,”Mathematical Programming, vol. 184, pp. 121–154, Nov 2020

work page 2020
[13]

Distributed Subgradient Methods for Multi-Agent Opti- mization,

A. Nedi´ c and A. Ozdaglar, “Distributed Subgradient Methods for Multi-Agent Opti- mization,”IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009

work page 2009
[14]

Fast Distributed Gradient Methods,

D. Jakoveti´ c, J. Xavier, and J. M. F. Moura, “Fast Distributed Gradient Methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014

work page 2014
[15]

EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

work page 2015
[16]

NEXT: In-Network Nonconvex Optimization,

P. D. Lorenzo and G. Scutari, “NEXT: In-Network Nonconvex Optimization,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120– 136, 2016

work page 2016
[17]

On the Convergence of Decentralized Gradient Descent,

K. Yuan, Q. Ling, and W. Yin, “On the Convergence of Decentralized Gradient Descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016

work page 2016
[18]

A Unification and Generalization of Exact Distributed First-Order Meth- ods,

D. Jakoveti´ c, “A Unification and Generalization of Exact Distributed First-Order Meth- ods,”IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 1, pp. 31–46, 2019

work page 2019
[19]

Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima,

B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima,”JMLR, vol. 23, jan 2022

work page 2022
[20]

Convergence Rates for Distributed Stochastic Optimization Over Random Networks,

D. Jakoveti´ c, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence Rates for Distributed Stochastic Optimization Over Random Networks,” in2018 IEEE Conference on Decision and Control (CDC), pp. 4238–4245, 2018

work page 2018
[21]

Distributed stochastic gradient tracking methods,

S. Pu and A. Nedi´ c, “Distributed stochastic gradient tracking methods,”Mathematical Programming, vol. 187, pp. 409–457, May 2021

work page 2021
[22]

Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,

S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,”IEEE Transactions on Signal Processing, vol. 69, pp. 1242– 1256, 2021

work page 2021
[23]

Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,

S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,”IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021

work page 2021
[24]

Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,

R. Xin, U. A. Khan, and S. Kar, “Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,”IEEE Transactions on Signal Processing, vol. 68, pp. 6255–6271, 2020. 15

work page 2020
[25]

An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,

R. Xin, U. A. Khan, and S. Kar, “An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,”IEEE Transactions on Signal Processing, vol. 69, pp. 1842–1858, 2021

work page 2021
[26]

Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,

J. Wang and G. Joshi, “Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,”Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021

work page 2021
[27]

Adaptation, Learning, and Optimization over Networks,

A. H. Sayed, “Adaptation, Learning, and Optimization over Networks,”Foundations and Trends®in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014

work page 2014
[28]

Robust stochastic approximation approach to stochastic programming,

A. Nemirovski˘ ı, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,”SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009

work page 2009
[29]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

work page 2013
[30]

On the Convergence of Stochastic Gradient Descent with Adap- tive Stepsizes,

X. Li and F. Orabona, “On the Convergence of Stochastic Gradient Descent with Adap- tive Stepsizes,” inProceedings of the Twenty-Second International Conference on Ar- tificial Intelligence and Statistics(K. Chaudhuri and M. Sugiyama, eds.), vol. 89 of Proceedings of Machine Learning Research, pp. 983–992, PMLR, 16–18 Apr 2019

work page 2019
[31]

Tight analyses for non-smooth stochastic gradient descent,

N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa, “Tight analyses for non-smooth stochastic gradient descent,” inProceedings of the Thirty-Second Conference on Learning Theory(A. Beygelzimer and D. Hsu, eds.), vol. 99 ofProceedings of Machine Learning Research, pp. 1579–1613, PMLR, 25–28 Jun 2019

work page 2019
[32]

Large deviations rates for stochastic gradient descent with strongly convex functions,

D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large deviations rates for stochastic gradient descent with strongly convex functions,” inProceedings of The 26th International Con- ference on Artificial Intelligence and Statistics(F. Ruiz, J. Dy, and J.-W. van de Meent, eds.), vol. 206 ofProceedings of Machine Learning Research, pp. 10095–10111, PMLR, 25–27 Apr 2023

work page 2023
[33]

High Probability Conver- gence of Stochastic Gradient Methods,

Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High Probability Conver- gence of Stochastic Gradient Methods,” inProceedings of the 40th International Confer- ence on Machine Learning(A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.), vol. 202 ofProceedings of Machine Learning Research, pp. 21884– 21914, PMLR, ...

work page 2023
[34]

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,

T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,” inAdvances in Neural Information Processing Systems(A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds.), vol. 36, pp. 24191–24222, Curran Associates, Inc., 2023

work page 2023
[35]

Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,

Z. Liu, J. Zhang, and Z. Zhou, “Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,” inPro- ceedings of Thirty Sixth Conference on Learning Theory(G. Neu and L. Rosasco, eds.), vol. 195 ofProceedings of Machine Learning Research, pp. 2266–2290, PMLR, 12–15 Jul 2023. 16

work page 2023
[36]

From Gradient Clipping to Normalization for Heavy Tailed SGD,

F. H¨ ubler, I. Fatkhullin, and N. He, “From Gradient Clipping to Normalization for Heavy Tailed SGD,” inThe 28th International Conference on Artificial Intelligence and Statistics, vol. 258, 2025

work page 2025
[37]

Sign operator for coping with heavy-tailed noise in non-convex optimization: High probability bounds under(l 0, l1)-smoothness.arXiv preprint arXiv:2502.07923,

N. Kornilov, P. Zmushko, A. Semenov, A. Gasnikov, and A. Beznosikov, “Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Exten- sions to Distributed Optimization and Comparison Oracle,”arXiv:2502.07923, 2025

work page arXiv 2025
[38]

High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent under Heavy-tailed Noise,

A. Armacki, S. Yu, P. Sharma, G. Joshi, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent under Heavy-tailed Noise,” inThe 28th International Conference on Artificial Intelligence and Statistics(Y. Li, S. Mandt, S. Agrawal, and E. Khan, eds.), vol. 258 ofProceedings of Machine ...

work page 2025
[39]

A high probability analysis of adaptive sgd with momentum,

X. Li and F. Orabona, “A high probability analysis of adaptive sgd with momentum,” Workshop on “Beyond first-order methods in ML systems”, 37th International Confer- ence on Machine Learning, 2020

work page 2020
[40]

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,

Z. Liu and Z. Zhou, “Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,” inThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[41]

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,

L. Madden, E. Dall’Anese, and S. Becker, “High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,”Journal of Machine Learning Research, vol. 25, no. 241, pp. 1–36, 2024

work page 2024
[42]

J. Nair, A. Wierman, and B. Zwart,The Fundamentals of Heavy Tails: Properties, Emer- gence, and Estimation. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2022

work page 2022
[43]

High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,

A. Sadiev, M. Danilova, E. Gorbunov, S. Horv´ ath, G. Gidel, P. Dvurechensky, A. Gas- nikov, and P. Richt´ arik, “High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,” inInternational Conference on Machine Learning, pp. 29563–29648, PMLR, 2023

work page 2023
[44]

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,

N. Puchkin, E. Gorbunov, N. Kutuzov, and A. Gasnikov, “Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,”arXiv:2311.04161, 2023

work page arXiv 2023
[45]

Large Deviations and Im- proved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry,

A. Armacki, S. Yu, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large Deviations and Im- proved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry,”arXiv:2410.15637, 2024

work page arXiv 2024
[46]

Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization,

A. Armacki, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization,” arXiv:2507.09093, 2025

work page arXiv 2025
[47]

Better Theory for SGD in the Nonconvex World,

A. Khaled and P. Richt´ arik, “Better Theory for SGD in the Nonconvex World,”Trans- actions on Machine Learning Research, 2023

work page 2023
[48]

Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,

A. Nedi´ c, A. Olshevsky, and W. Shi, “Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. 17

work page 2017
[49]

Harnessing Smoothness to Accelerate Distributed Optimization,

G. Qu and N. Li, “Harnessing Smoothness to Accelerate Distributed Optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018

work page 2018
[50]

Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization,

S. Vlaski and A. H. Sayed, “Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization,”IEEE Transactions on Automatic Control, vol. 67, no. 12, pp. 6489–6504, 2022

work page 2022
[51]

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,

A. Koloskova, H. Hendrikx, and S. U. Stich, “Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,”arXiv:2305.01588, 2023

work page arXiv 2023
[52]

Linear convergence of gradient and proximal- gradient methods under the Polyak- Lojasiewicz condition,

H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal- gradient methods under the Polyak- Lojasiewicz condition,” inMachine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795–811, Springer, 2016

work page 2016
[53]

Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence

S. Yu, D. Jakoveti´ c, and S. Kar, “Decentralized Nonconvex Optimization under Heavy- Tailed Noise: Normalization and Optimal Convergence,”arXiv:2505.03736, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,

S. Kar, J. M. F. Moura, and K. Ramanan, “Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,”IEEE Trans- actions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012

work page 2012
[55]

Distributed Pareto Optimization via Diffusion Strategies,

J. Chen and A. H. Sayed, “Distributed Pareto Optimization via Diffusion Strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, 2013

work page 2013
[56]

Multitask learning over graphs: An approach for distributed, streaming machine learning,

R. Nassif, S. Vlaski, C. Richard, J. Chen, and A. H. Sayed, “Multitask learning over graphs: An approach for distributed, streaming machine learning,”IEEE Signal Pro- cessing Magazine, vol. 37, no. 3, pp. 14–25, 2020

work page 2020
[57]

Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,

K. Lu, H. Wang, H. Zhang, and L. Wang, “Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 4, pp. 2189–2204, 2024

work page 2024
[58]

Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,

K. Lu, “Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,”IEEE Transactions on Auto- matic Control, vol. 69, no. 12, pp. 8860–8867, 2024

work page 2024
[59]

Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,

H. Xu, K. Lu, and Y.-L. Wang, “Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,”Automatica, vol. 170, p. 111863, 2024

work page 2024
[60]

High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,

Y. Qin, K. Lu, H. Xu, and X. Chen, “High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,”IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 4, pp. 2624–2632, 2025

work page 2025
[61]

Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,

Y. Yang, K. Lu, and L. Wang, “Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,”Automatica, vol. 182, p. 112525, 2025

work page 2025
[62]

High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,

Y. Yang, K. Lu, and L. Wang, “High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,”arXiv:2506.11647, 2025. 18

work page arXiv 2025
[63]

Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,

D. Bajovi´ c, D. Jakoveti´ c, J. Xavier, B. Sinopoli, and J. M. F. Moura, “Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4381–4396, 2011

work page 2011
[64]

Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,

D. Bajovi´ c, D. Jakoveti´ c, J. M. F. Moura, J. Xavier, and B. Sinopoli, “Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,”IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5987–6002, 2012

work page 2012
[65]

Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,

V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,”IEEE Transac- tions on Information Theory, vol. 62, no. 8, pp. 4710–4732, 2016

work page 2016
[66]

Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,

V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 442–460, 2016

work page 2016
[67]

Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,

D. Bajovi´ c, “Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 415–435, 2024

work page 2024
[68]

A unified and refined convergence analysis for non-convex decentralized learning,

S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

work page 2022
[69]

Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,

C. G. Lopes and A. H. Sayed, “Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,”IEEE Transactions on Signal Process- ing, vol. 56, no. 7, pp. 3122–3136, 2008

work page 2008
[70]

Diffusion LMS Strategies for Distributed Estimation,

F. S. Cattivelli and A. H. Sayed, “Diffusion LMS Strategies for Distributed Estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1035–1048, 2010

work page 2010
[71]

Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,

J. Chen and A. H. Sayed, “Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,”IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012

work page 2012
[72]

Consensus + innovations distributed inference over net- works: cooperation and sensing in networked systems,

S. Kar and J. M. Moura, “Consensus + innovations distributed inference over net- works: cooperation and sensing in networked systems,”IEEE Signal Processing Mag- azine, vol. 30, no. 3, pp. 99–109, 2013

work page 2013
[73]

R. A. Horn and C. R. Johnson,Matrix Analysis. Cambridge University Press, 2 ed., 2012

work page 2012
[74]

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,

X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,” inAdvances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 3...

work page 2017
[75]

Nesterov,Lectures on Convex Optimization

Y. Nesterov,Lectures on Convex Optimization. Springer Publishing Company, Incorpo- rated, 2nd ed., 2018. 19

work page 2018
[76]

A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “A Short Note on Concen- tration Inequalities for Random Vectors with SubGaussian Norm,”arXiv:1902.03736, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[77]

Gradient Convergence in Gradient methods with Errors,

D. P. Bertsekas and J. N. Tsitsiklis, “Gradient Convergence in Gradient methods with Errors,”SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000

work page 2000
[78]

Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science

R. Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge Uni- versity Press, 2018

work page 2018
[79]

Fast linear iterations for distributed averaging,

L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,”Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004

work page 2004
[80]

Cvetkovic, P

D. Cvetkovic, P. Rowlinson, and S. Simic,Eigenspaces of Graphs. Encyclopedia of Math- ematics and its Applications, Cambridge University Press, 1997

work page 1997

Showing first 80 references.

[1] [1]

Adaptive Networks,

A. H. Sayed, “Adaptive Networks,”Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014

work page 2014

[2] [2]

Communication- Efficient Learning of Deep Networks from Decentralized Data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication- Efficient Learning of Deep Networks from Decentralized Data,” inProceedings of the 20th International Conference on Artificial Intelligence and Statistics(A. Singh and J. Zhu, eds.), vol. 54 ofProceedings of Machine Learning Research, pp. 1273–1282, PMLR, 20–22 Apr 2017

work page 2017

[3] [3]

Networked Signal and Information Processing: Learning by multiagent systems,

S. Vlaski, S. Kar, A. H. Sayed, and J. M. Moura, “Networked Signal and Information Processing: Learning by multiagent systems,”IEEE Signal Processing Magazine, vol. 40, no. 5, pp. 92–105, 2023

work page 2023

[4] [4]

Practical secure aggregation for privacy-preserving machine learning,

K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ram- age, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” inproceedings of the 2017 ACM SIGSAC Conference on Computer and Com- munications Security, pp. 1175–1191, 2017

work page 2017

[5] [5]

Secure Distributed Optimization Under Gradient Attacks,

S. Yu and S. Kar, “Secure Distributed Optimization Under Gradient Attacks,”IEEE Transactions on Signal Processing, vol. 71, pp. 1802–1816, 2023

work page 2023

[6] [6]

Federated Learning: Strategies for Improving Communication Efficiency

J. Koneˇ cn` y, H. B. McMahan, F. X. Yu, P. Richt´ arik, A. T. Suresh, and D. Bacon, “Fed- erated learning: Strategies for improving communication efficiency,”arXiv:1610.05492, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[7] [7]

Bullo, J

F. Bullo, J. Cort´ es, and S. Martinez,Distributed Control of Robotic Networks: A Mathe- matical Approach to Motion Coordination Algorithms. Princeton University Press, 2009

work page 2009

[8] [8]

Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,

T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,”IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020

work page 2020

[9] [9]

Local SGD Converges Fast and Communicates Little,

S. U. Stich, “Local SGD Converges Fast and Communicates Little,” inInternational Conference on Learning Representations, 2019

work page 2019

[10] [10]

Towards Understanding Biased Client Selection in Federated Learning ,

Y. J. Cho, J. Wang, and G. Joshi, “ Towards Understanding Biased Client Selection in Federated Learning ,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics(G. Camps-Valls, F. J. R. Ruiz, and I. Valera, eds.), vol. 151 ofProceedings of Machine Learning Research, pp. 10351–10375, PMLR, 28–30 Mar 2022. 14

work page 2022

[11] [11]

A Unified Theory of Decen- tralized SGD with Changing Topology and Local Updates,

A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A Unified Theory of Decen- tralized SGD with Changing Topology and Local Updates,” inProceedings of the 37th International Conference on Machine Learning(H. D. III and A. Singh, eds.), vol. 119 ofProceedings of Machine Learning Research, pp. 5381–5393, PMLR, 13–18 Jul 2020

work page 2020

[12] [12]

Asynchronous parallel algo- rithms for nonconvex optimization,

L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algo- rithms for nonconvex optimization,”Mathematical Programming, vol. 184, pp. 121–154, Nov 2020

work page 2020

[13] [13]

Distributed Subgradient Methods for Multi-Agent Opti- mization,

A. Nedi´ c and A. Ozdaglar, “Distributed Subgradient Methods for Multi-Agent Opti- mization,”IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009

work page 2009

[14] [14]

Fast Distributed Gradient Methods,

D. Jakoveti´ c, J. Xavier, and J. M. F. Moura, “Fast Distributed Gradient Methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014

work page 2014

[15] [15]

EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

work page 2015

[16] [16]

NEXT: In-Network Nonconvex Optimization,

P. D. Lorenzo and G. Scutari, “NEXT: In-Network Nonconvex Optimization,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120– 136, 2016

work page 2016

[17] [17]

On the Convergence of Decentralized Gradient Descent,

K. Yuan, Q. Ling, and W. Yin, “On the Convergence of Decentralized Gradient Descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016

work page 2016

[18] [18]

A Unification and Generalization of Exact Distributed First-Order Meth- ods,

D. Jakoveti´ c, “A Unification and Generalization of Exact Distributed First-Order Meth- ods,”IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 1, pp. 31–46, 2019

work page 2019

[19] [19]

Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima,

B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima,”JMLR, vol. 23, jan 2022

work page 2022

[20] [20]

Convergence Rates for Distributed Stochastic Optimization Over Random Networks,

D. Jakoveti´ c, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence Rates for Distributed Stochastic Optimization Over Random Networks,” in2018 IEEE Conference on Decision and Control (CDC), pp. 4238–4245, 2018

work page 2018

[21] [21]

Distributed stochastic gradient tracking methods,

S. Pu and A. Nedi´ c, “Distributed stochastic gradient tracking methods,”Mathematical Programming, vol. 187, pp. 409–457, May 2021

work page 2021

[22] [22]

Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,

S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,”IEEE Transactions on Signal Processing, vol. 69, pp. 1242– 1256, 2021

work page 2021

[23] [23]

Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,

S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,”IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021

work page 2021

[24] [24]

Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,

R. Xin, U. A. Khan, and S. Kar, “Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,”IEEE Transactions on Signal Processing, vol. 68, pp. 6255–6271, 2020. 15

work page 2020

[25] [25]

An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,

R. Xin, U. A. Khan, and S. Kar, “An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,”IEEE Transactions on Signal Processing, vol. 69, pp. 1842–1858, 2021

work page 2021

[26] [26]

Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,

J. Wang and G. Joshi, “Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,”Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021

work page 2021

[27] [27]

Adaptation, Learning, and Optimization over Networks,

A. H. Sayed, “Adaptation, Learning, and Optimization over Networks,”Foundations and Trends®in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014

work page 2014

[28] [28]

Robust stochastic approximation approach to stochastic programming,

A. Nemirovski˘ ı, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,”SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009

work page 2009

[29] [29]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

work page 2013

[30] [30]

On the Convergence of Stochastic Gradient Descent with Adap- tive Stepsizes,

X. Li and F. Orabona, “On the Convergence of Stochastic Gradient Descent with Adap- tive Stepsizes,” inProceedings of the Twenty-Second International Conference on Ar- tificial Intelligence and Statistics(K. Chaudhuri and M. Sugiyama, eds.), vol. 89 of Proceedings of Machine Learning Research, pp. 983–992, PMLR, 16–18 Apr 2019

work page 2019

[31] [31]

Tight analyses for non-smooth stochastic gradient descent,

N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa, “Tight analyses for non-smooth stochastic gradient descent,” inProceedings of the Thirty-Second Conference on Learning Theory(A. Beygelzimer and D. Hsu, eds.), vol. 99 ofProceedings of Machine Learning Research, pp. 1579–1613, PMLR, 25–28 Jun 2019

work page 2019

[32] [32]

Large deviations rates for stochastic gradient descent with strongly convex functions,

D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large deviations rates for stochastic gradient descent with strongly convex functions,” inProceedings of The 26th International Con- ference on Artificial Intelligence and Statistics(F. Ruiz, J. Dy, and J.-W. van de Meent, eds.), vol. 206 ofProceedings of Machine Learning Research, pp. 10095–10111, PMLR, 25–27 Apr 2023

work page 2023

[33] [33]

High Probability Conver- gence of Stochastic Gradient Methods,

Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High Probability Conver- gence of Stochastic Gradient Methods,” inProceedings of the 40th International Confer- ence on Machine Learning(A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.), vol. 202 ofProceedings of Machine Learning Research, pp. 21884– 21914, PMLR, ...

work page 2023

[34] [34]

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,

T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,” inAdvances in Neural Information Processing Systems(A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds.), vol. 36, pp. 24191–24222, Curran Associates, Inc., 2023

work page 2023

[35] [35]

Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,

Z. Liu, J. Zhang, and Z. Zhou, “Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,” inPro- ceedings of Thirty Sixth Conference on Learning Theory(G. Neu and L. Rosasco, eds.), vol. 195 ofProceedings of Machine Learning Research, pp. 2266–2290, PMLR, 12–15 Jul 2023. 16

work page 2023

[36] [36]

From Gradient Clipping to Normalization for Heavy Tailed SGD,

F. H¨ ubler, I. Fatkhullin, and N. He, “From Gradient Clipping to Normalization for Heavy Tailed SGD,” inThe 28th International Conference on Artificial Intelligence and Statistics, vol. 258, 2025

work page 2025

[37] [37]

Sign operator for coping with heavy-tailed noise in non-convex optimization: High probability bounds under(l 0, l1)-smoothness.arXiv preprint arXiv:2502.07923,

N. Kornilov, P. Zmushko, A. Semenov, A. Gasnikov, and A. Beznosikov, “Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Exten- sions to Distributed Optimization and Comparison Oracle,”arXiv:2502.07923, 2025

work page arXiv 2025

[38] [38]

High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent under Heavy-tailed Noise,

A. Armacki, S. Yu, P. Sharma, G. Joshi, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent under Heavy-tailed Noise,” inThe 28th International Conference on Artificial Intelligence and Statistics(Y. Li, S. Mandt, S. Agrawal, and E. Khan, eds.), vol. 258 ofProceedings of Machine ...

work page 2025

[39] [39]

A high probability analysis of adaptive sgd with momentum,

X. Li and F. Orabona, “A high probability analysis of adaptive sgd with momentum,” Workshop on “Beyond first-order methods in ML systems”, 37th International Confer- ence on Machine Learning, 2020

work page 2020

[40] [40]

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,

Z. Liu and Z. Zhou, “Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,” inThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[41] [41]

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,

L. Madden, E. Dall’Anese, and S. Becker, “High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,”Journal of Machine Learning Research, vol. 25, no. 241, pp. 1–36, 2024

work page 2024

[42] [42]

J. Nair, A. Wierman, and B. Zwart,The Fundamentals of Heavy Tails: Properties, Emer- gence, and Estimation. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2022

work page 2022

[43] [43]

High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,

A. Sadiev, M. Danilova, E. Gorbunov, S. Horv´ ath, G. Gidel, P. Dvurechensky, A. Gas- nikov, and P. Richt´ arik, “High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,” inInternational Conference on Machine Learning, pp. 29563–29648, PMLR, 2023

work page 2023

[44] [44]

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,

N. Puchkin, E. Gorbunov, N. Kutuzov, and A. Gasnikov, “Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,”arXiv:2311.04161, 2023

work page arXiv 2023

[45] [45]

Large Deviations and Im- proved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry,

A. Armacki, S. Yu, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large Deviations and Im- proved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry,”arXiv:2410.15637, 2024

work page arXiv 2024

[46] [46]

Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization,

A. Armacki, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization,” arXiv:2507.09093, 2025

work page arXiv 2025

[47] [47]

Better Theory for SGD in the Nonconvex World,

A. Khaled and P. Richt´ arik, “Better Theory for SGD in the Nonconvex World,”Trans- actions on Machine Learning Research, 2023

work page 2023

[48] [48]

Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,

A. Nedi´ c, A. Olshevsky, and W. Shi, “Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. 17

work page 2017

[49] [49]

Harnessing Smoothness to Accelerate Distributed Optimization,

G. Qu and N. Li, “Harnessing Smoothness to Accelerate Distributed Optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018

work page 2018

[50] [50]

Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization,

S. Vlaski and A. H. Sayed, “Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization,”IEEE Transactions on Automatic Control, vol. 67, no. 12, pp. 6489–6504, 2022

work page 2022

[51] [51]

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,

A. Koloskova, H. Hendrikx, and S. U. Stich, “Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,”arXiv:2305.01588, 2023

work page arXiv 2023

[52] [52]

Linear convergence of gradient and proximal- gradient methods under the Polyak- Lojasiewicz condition,

H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal- gradient methods under the Polyak- Lojasiewicz condition,” inMachine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795–811, Springer, 2016

work page 2016

[53] [53]

Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence

S. Yu, D. Jakoveti´ c, and S. Kar, “Decentralized Nonconvex Optimization under Heavy- Tailed Noise: Normalization and Optimal Convergence,”arXiv:2505.03736, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,

S. Kar, J. M. F. Moura, and K. Ramanan, “Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,”IEEE Trans- actions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012

work page 2012

[55] [55]

Distributed Pareto Optimization via Diffusion Strategies,

J. Chen and A. H. Sayed, “Distributed Pareto Optimization via Diffusion Strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, 2013

work page 2013

[56] [56]

Multitask learning over graphs: An approach for distributed, streaming machine learning,

R. Nassif, S. Vlaski, C. Richard, J. Chen, and A. H. Sayed, “Multitask learning over graphs: An approach for distributed, streaming machine learning,”IEEE Signal Pro- cessing Magazine, vol. 37, no. 3, pp. 14–25, 2020

work page 2020

[57] [57]

Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,

K. Lu, H. Wang, H. Zhang, and L. Wang, “Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 4, pp. 2189–2204, 2024

work page 2024

[58] [58]

Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,

K. Lu, “Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,”IEEE Transactions on Auto- matic Control, vol. 69, no. 12, pp. 8860–8867, 2024

work page 2024

[59] [59]

Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,

H. Xu, K. Lu, and Y.-L. Wang, “Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,”Automatica, vol. 170, p. 111863, 2024

work page 2024

[60] [60]

High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,

Y. Qin, K. Lu, H. Xu, and X. Chen, “High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,”IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 4, pp. 2624–2632, 2025

work page 2025

[61] [61]

Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,

Y. Yang, K. Lu, and L. Wang, “Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,”Automatica, vol. 182, p. 112525, 2025

work page 2025

[62] [62]

High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,

Y. Yang, K. Lu, and L. Wang, “High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,”arXiv:2506.11647, 2025. 18

work page arXiv 2025

[63] [63]

Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,

D. Bajovi´ c, D. Jakoveti´ c, J. Xavier, B. Sinopoli, and J. M. F. Moura, “Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4381–4396, 2011

work page 2011

[64] [64]

Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,

D. Bajovi´ c, D. Jakoveti´ c, J. M. F. Moura, J. Xavier, and B. Sinopoli, “Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,”IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5987–6002, 2012

work page 2012

[65] [65]

Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,

V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,”IEEE Transac- tions on Information Theory, vol. 62, no. 8, pp. 4710–4732, 2016

work page 2016

[66] [66]

Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,

V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 442–460, 2016

work page 2016

[67] [67]

Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,

D. Bajovi´ c, “Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 415–435, 2024

work page 2024

[68] [68]

A unified and refined convergence analysis for non-convex decentralized learning,

S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

work page 2022

[69] [69]

Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,

C. G. Lopes and A. H. Sayed, “Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,”IEEE Transactions on Signal Process- ing, vol. 56, no. 7, pp. 3122–3136, 2008

work page 2008

[70] [70]

Diffusion LMS Strategies for Distributed Estimation,

F. S. Cattivelli and A. H. Sayed, “Diffusion LMS Strategies for Distributed Estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1035–1048, 2010

work page 2010

[71] [71]

Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,

J. Chen and A. H. Sayed, “Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,”IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012

work page 2012

[72] [72]

Consensus + innovations distributed inference over net- works: cooperation and sensing in networked systems,

S. Kar and J. M. Moura, “Consensus + innovations distributed inference over net- works: cooperation and sensing in networked systems,”IEEE Signal Processing Mag- azine, vol. 30, no. 3, pp. 99–109, 2013

work page 2013

[73] [73]

R. A. Horn and C. R. Johnson,Matrix Analysis. Cambridge University Press, 2 ed., 2012

work page 2012

[74] [74]

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,

X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,” inAdvances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 3...

work page 2017

[75] [75]

Nesterov,Lectures on Convex Optimization

Y. Nesterov,Lectures on Convex Optimization. Springer Publishing Company, Incorpo- rated, 2nd ed., 2018. 19

work page 2018

[76] [76]

A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “A Short Note on Concen- tration Inequalities for Random Vectors with SubGaussian Norm,”arXiv:1902.03736, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[77] [77]

Gradient Convergence in Gradient methods with Errors,

D. P. Bertsekas and J. N. Tsitsiklis, “Gradient Convergence in Gradient methods with Errors,”SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000

work page 2000

[78] [78]

Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science

R. Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge Uni- versity Press, 2018

work page 2018

[79] [79]

Fast linear iterations for distributed averaging,

L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,”Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004

work page 2004

[80] [80]

Cvetkovic, P

D. Cvetkovic, P. Rowlinson, and S. Simic,Eigenspaces of Graphs. Encyclopedia of Math- ematics and its Applications, Cambridge University Press, 1997

work page 1997