3 Pith papers cite this work.
citing papers explorer
- Muon Does Not Converge on Convex Lipschitz Functions
  Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
- Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives
  Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.
- DADA: Dual Averaging with Distance Adaptation
  DADA is a parameter-free dual averaging method for convex optimization that adapts to local function growth and applies to nonsmooth, smooth, Hölder-smooth, and other classes for both constrained and unbounded domains without prior knowledge of iteration count or accuracy.