Stochastic Gradient Descent as Approximate Bayesian Inference
read the original abstract
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
This paper has not been read by Pith yet.
Forward citations
Cited by 5 Pith papers
-
Contribution of task-irrelevant stimuli to drift of neural representations
Task-irrelevant stimuli create long-term representational drift in task-relevant features, with drift rate increasing with variance and dimension of the irrelevant subspace, across Hebbian and gradient-based learning.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Scaling Laws for Transfer
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
-
Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks
At the critical step-size scaling for SGD in high-dimensional single-layer networks, effective dynamics gain a diffusive correction term that changes the phase diagram and reduces to an Ornstein-Uhlenbeck process near...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.