SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
Less regret via online conditioning.arXiv preprint arXiv:1002.4862
3 Pith papers cite this work. Polarity classification is still indexing.
abstract
We analyze and evaluate an online gradient descent algorithm with adaptive per-coordinate adjustment of learning rates. Our algorithm can be thought of as an online version of batch gradient descent with a diagonal preconditioner. This approach leads to regret bounds that are stronger than those of standard online gradient descent for general online convex optimization problems. Experimentally, we show that our algorithm is competitive with state-of-the-art algorithms for large scale machine learning problems.
years
2026 3representative citing papers
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
citing papers explorer
-
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Stochastic Non-Smooth Convex Optimization with Unbounded Gradients