Optimizers like Adam reduce to steepest descent under particular norms, opening a design space of norm assignments tailored to layer roles.
arXiv preprint arXiv:2304.05187 , year =
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2verdicts
UNVERDICTED 2representative citing papers
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
citing papers explorer
-
Old Optimizer, New Norm: An Anthology
Optimizers like Adam reduce to steepest descent under particular norms, opening a design space of norm assignments tailored to layer roles.
-
Muon Does Not Converge on Convex Lipschitz Functions
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.