Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates

Dmitry Kovalev; Ekaterina Borodich

arxiv: 2511.11466 · v2 · pith:ODUSAXQJnew · submitted 2025-11-14 · 🧮 math.OC · cs.LG

keywords convergence · non-euclidean · optimization · analysis · developing · gradient · noise · rates

Abstract

Recently, several instances of non-Euclidean SGD, including SignSGD, Lion, and Muon, have attracted significant interest from the optimization community due to their practical success in training deep neural networks. Consequently, a number of works have attempted to explain this success by developing theoretical convergence analyses. Unfortunately, these results cannot properly justify the superior performance of these methods, as they could not beat the convergence rate of vanilla Euclidean SGD. We resolve this important open problem by developing a new unified convergence analysis under the structured smoothness and gradient noise assumption. In particular, our results indicate that non-Euclidean SGD (i) can exploit the sparsity or low-rank structure of the upper bounds on the Hessian and gradient noise, (ii) can provably benefit from popular algorithmic tools such as extrapolation or momentum variance reduction, and (iii) can match the state-of-the-art convergence rates of adaptive and more complex optimization algorithms such as AdaGrad and Shampoo.
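
The family of methods named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm (which this page does not reproduce); it only assumes the standard view that a non-Euclidean SGD step moves along the steepest-descent direction of a chosen norm. The function names (`lmo_sign`, `lmo_spectral`, `non_euclidean_sgd_step`) and the hyperparameters `lr` and `beta` are hypothetical, and the exact SVD stands in for the cheaper approximate orthogonalization that practical Muon implementations use.

```python
import numpy as np

def lmo_sign(g):
    """Steepest-descent direction under the l-infinity norm: entrywise sign (SignSGD-style)."""
    return np.sign(g)

def lmo_spectral(g):
    """Steepest-descent direction under the spectral norm: U @ V^T from the SVD (Muon-style).

    Assumption: an exact SVD is used here purely for exposition.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def non_euclidean_sgd_step(x, grad, momentum, lr=0.1, beta=0.9, lmo=lmo_sign):
    """One illustrative step: average the stochastic gradient, then move along the norm's direction."""
    # Exponential moving average of stochastic gradients (plain momentum, an assumption).
    momentum = beta * momentum + (1.0 - beta) * grad
    # The step length is set by lr alone; the direction depends only on the chosen norm.
    x_new = x - lr * lmo(momentum)
    return x_new, momentum

# Usage on a 2x2 parameter matrix with a fixed stand-in gradient.
x0 = np.zeros((2, 2))
g0 = np.array([[2.0, -3.0], [0.5, -0.1]])
x1, m1 = non_euclidean_sgd_step(x0, g0, np.zeros_like(x0), lmo=lmo_sign)
x2, m2 = non_euclidean_sgd_step(x0, g0, np.zeros_like(x0), lmo=lmo_spectral)
```

Swapping `lmo` is the only change needed to move between the sign-based and the spectral variant, which is the sense in which such methods admit a unified treatment.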

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
math.OC 2026-05 unverdicted novelty 8.0

Introduces a generalized Lipschitz class and shows that clipped AdamW outperforms SGD and AdaGrad for stochastic convex optimization under this and related assumptions.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC 2026-05 conditional novelty 7.0

Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case
math.OC 2026-06 unverdicted novelty 6.0

Convergence rates are derived for Muon-type methods with inexact LMO in the degenerate case under novel assumptions and layer-wise (L^0, L^1)-smoothness for non-convex and star-convex objectives with weight decay.

Communication-Efficient Gluon in Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.