On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Abhishek Panigrahi; Kaifeng Lyu; Sadhika Malladi; Sanjeev Arora

arxiv: 2205.10287 · v3 · pith:QBJWHBWPnew · submitted 2022-05-20 · 💻 cs.LG

Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora

classification 💻 cs.LG

keywords: adam, gradient, rmsprop, adaptive, approximations, methods, optimization, scaling

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
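The abstract does not spell out the square root scaling rule, so the following is only a minimal sketch of how such a rule might be applied to Adam. It assumes the form usually attributed to the paper: when the batch size is multiplied by a factor κ, the learning rate is multiplied by √κ, each of 1 − β1 and 1 − β2 is multiplied by κ, and ε is divided by √κ. The function name `sqrt_scale_adam` and the argument `kappa` are hypothetical names chosen for illustration, not identifiers from the paper or from any library.

```python
import math


def sqrt_scale_adam(lr, beta1, beta2, eps, kappa):
    """Rescale Adam hyperparameters for a batch size multiplied by `kappa`.

    Assumed rule (an illustration, not taken from the abstract):
      lr        -> lr * sqrt(kappa)
      1 - beta1 -> kappa * (1 - beta1)
      1 - beta2 -> kappa * (1 - beta2)
      eps       -> eps / sqrt(kappa)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")

    new_beta1 = 1.0 - kappa * (1.0 - beta1)
    new_beta2 = 1.0 - kappa * (1.0 - beta2)

    # A large kappa can push the rescaled betas outside [0, 1),
    # where they no longer define a valid moving average.
    if not (0.0 <= new_beta1 < 1.0 and 0.0 <= new_beta2 < 1.0):
        raise ValueError("kappa is too large for the given beta1/beta2")

    root = math.sqrt(kappa)
    return {
        "lr": lr * root,
        "beta1": new_beta1,
        "beta2": new_beta2,
        "eps": eps / root,
    }


if __name__ == "__main__":
    # Example: quadrupling the batch size from some baseline configuration.
    print(sqrt_scale_adam(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, kappa=4))
```

Under these assumptions, quadrupling the batch size doubles the learning rate and halves ε, while β1 = 0.9 becomes roughly 0.6 and β2 = 0.999 becomes roughly 0.996. The range check matters in practice: the rescaled betas stop being valid once κ(1 − β1) reaches 1, which bounds how far the batch size can be scaled under this form of the rule.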

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
math.OC 2026-04 unverdicted novelty 8.0

Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise
cs.LG 2026-06 unverdicted novelty 7.0

Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.