Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence

arxiv: 2505.03736 · v2 · submitted 2025-05-06 · 🧮 math.OC · cs.DC

Shuhua Yu, Dusan Jakovetic, Soummya Kar

keywords gradient · noise · heavy-tailed · gt-nsgdm · nonconvex · stochastic · decentralized · optimization

Heavy-tailed noise in nonconvex stochastic optimization has garnered increasing research interest, as empirical studies, including those on training attention models, suggest it is a more realistic gradient noise condition. This paper studies first-order nonconvex stochastic optimization under heavy-tailed gradient noise in a decentralized setup, where each node can only communicate with its direct neighbors in a predefined graph. Specifically, we consider a class of heavy-tailed gradient noise that is zero-mean and has only a finite $p$-th moment for $p \in (1, 2]$. We propose GT-NSGDm, Gradient Tracking based Normalized Stochastic Gradient Descent with momentum, which uses normalization, in conjunction with gradient tracking and momentum, to cope with heavy-tailed noise on distributed nodes. We show that, when the communication graph admits primitive and doubly stochastic weights, GT-NSGDm guarantees, for the first time in the literature, that the expected gradient norm converges at an optimal non-asymptotic rate $O\big(1/T^{(p-1)/(3p-2)}\big)$, which matches the lower bound in the centralized setup. When the tail index $p$ is unknown, GT-NSGDm attains a non-asymptotic rate $O\big( 1/T^{(p-1)/(2p)} \big)$ that is, for $p < 2$, topology independent and has a speedup factor $n^{1-1/p}$ in terms of the number of nodes $n$. Finally, experiments on nonconvex linear regression with tokenized synthetic data and decentralized training of language models on a real-world corpus demonstrate that GT-NSGDm is more robust and efficient than baselines.
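
To make the two rates in the abstract easier to compare, the following minimal sketch evaluates the exponents $(p-1)/(3p-2)$ and $(p-1)/(2p)$ and the speedup factor $n^{1-1/p}$ for a few tail indices. The function names are hypothetical and chosen only for illustration; the sketch evaluates the formulas quoted in the abstract and does not implement GT-NSGDm itself, whose update rule is not given here.

```python
def _check_tail_index(p: float) -> None:
    # The abstract restricts the tail index to p in (1, 2].
    if not (1.0 < p <= 2.0):
        raise ValueError("tail index p must lie in (1, 2]")


def rate_exponent_known_p(p: float) -> float:
    """Exponent a in the rate O(1/T^a) when p is known: (p-1)/(3p-2)."""
    _check_tail_index(p)
    return (p - 1.0) / (3.0 * p - 2.0)


def rate_exponent_unknown_p(p: float) -> float:
    """Exponent a in the rate O(1/T^a) when p is unknown: (p-1)/(2p)."""
    _check_tail_index(p)
    return (p - 1.0) / (2.0 * p)


def node_speedup_factor(n: int, p: float) -> float:
    """Speedup factor n^(1-1/p) quoted for the unknown-p rate."""
    _check_tail_index(p)
    if n < 1:
        raise ValueError("number of nodes n must be at least 1")
    return float(n) ** (1.0 - 1.0 / p)


if __name__ == "__main__":
    # Illustrative values only; n = 16 nodes is an arbitrary choice.
    for p in (1.25, 1.5, 2.0):
        print(
            f"p={p}: known-p exponent={rate_exponent_known_p(p):.4f}, "
            f"unknown-p exponent={rate_exponent_unknown_p(p):.4f}, "
            f"speedup factor (n=16)={node_speedup_factor(16, p):.4f}"
        )
```

At $p = 2$ both exponents equal $1/4$. For $p < 2$ the known-$p$ exponent is strictly larger, since $3p - 2 < 2p$ exactly when $p < 2$, so knowing the tail index yields the faster rate.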

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-probability Convergence Guarantees of Decentralized SGD
cs.LG · 2025-10 · unverdicted · novelty 7.0

Decentralized SGD achieves high-probability convergence with order-optimal rates and linear speedup in the number of users under standard smoothness and convexity conditions on the cost function.