SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under ( L0, L1)-Smoothness
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4roles
background 2polarities
background 2representative citing papers
Decentralized SGD achieves high-probability convergence with order-optimal rates and linear speedup under standard cost assumptions matching those for MSE convergence.
LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict the optimal period.
Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.
citing papers explorer
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
High-Probability Convergence Guarantees of Decentralized SGD
Decentralized SGD achieves high-probability convergence with order-optimal rates and linear speedup under standard cost assumptions matching those for MSE convergence.
-
LionMuon: Alternating Spectral and Sign Descent for Efficient Training
LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict the optimal period.
-
Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives
Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.