A High Probability Analysis of Adaptive SGD with Momentum
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
  OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
- Robust and Fast Training via Per-Sample Clipping
  PS-Clip-SGD achieves optimal in-expectation convergence rates for non-convex optimization under heavy-tailed gradient noise, with matching high-probability guarantees, and outperforms standard methods on AlexNet trained on CIFAR-100.
- MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization
  MGUP augments momentum optimizers with selective larger steps on a fixed proportion of parameters per iteration, claiming convergence guarantees for MGUP-AdamW and superior empirical performance on pretraining and fine-tuning.