Stochastic Gradient Descent as Approximate Bayesian Inference

David M. Blei; Matthew D. Hoffman; Stephan Mandt

arXiv: 1704.04289 · v2 · pith:UTTGU3KInew · submitted 2017-04-13 · stat.ML · cs.LG

Stephan Mandt, Matthew D. Hoffman, David M. Blei

keywords: constant, stochastic, gradient, algorithm, approximate, adjust, bayesian, descent

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
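
The first claim of the abstract, that constant SGD simulates a Markov chain whose stationary distribution can be tuned toward a posterior, can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a one-dimensional Gaussian mean model with known unit noise variance and a flat prior, so the posterior over the mean is approximately N(mean(data), 1/N). For this quadratic loss the SGD iterates form an AR(1) process whose stationary variance can be written in closed form, and the step size 2S/(N+S) used here is derived for this toy recursion only. All function and variable names are hypothetical.

```python
import random
import statistics


def constant_sgd_chain(data, step_size, batch_size, num_steps, burn_in, seed=0):
    """Run SGD with a fixed step size on the loss 0.5 * mean((theta - x_i)^2).

    Returns the iterates recorded after burn-in. Minibatches are drawn
    with replacement, which keeps the minibatch means i.i.d.
    """
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for t in range(num_steps):
        batch = rng.choices(data, k=batch_size)
        grad = theta - sum(batch) / batch_size  # minibatch gradient of the loss
        theta -= step_size * grad
        if t >= burn_in:
            samples.append(theta)
    return samples


def stationary_variance(data, step_size, batch_size):
    """Exact stationary variance of the AR(1) recursion in this toy model.

    theta' = (1 - eps) * theta + eps * m, where m is a minibatch mean with
    variance var(data) / S, gives V = eps * var(data) / (S * (2 - eps)).
    """
    data_var = statistics.pvariance(data)
    return step_size * data_var / (batch_size * (2.0 - step_size))


# Synthetic data (assumption: unit-variance Gaussian observations).
data_rng = random.Random(1)
N, S = 1000, 10
data = [data_rng.gauss(2.0, 1.0) for _ in range(N)]

# Step size chosen so that the predicted stationary variance is var(data) / N,
# which is close to the flat-prior posterior variance 1 / N in this toy model.
step_size = 2.0 * S / (N + S)

samples = constant_sgd_chain(data, step_size, S, num_steps=200_000, burn_in=5_000)
empirical_var = statistics.pvariance(samples)
predicted_var = stationary_variance(data, step_size, S)

print("empirical variance of iterates:", empirical_var)
print("predicted stationary variance: ", predicted_var)
print("flat-prior posterior variance: ", 1.0 / N)
```

In this sketch a larger step size or a smaller minibatch widens the stationary distribution, which is the knob the abstract refers to when it speaks of adjusting the tuning parameters to match a posterior. The general, multivariate treatment is the subject of the paper itself and is not reproduced here.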

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contribution of task-irrelevant stimuli to drift of neural representations
q-bio.NC 2025-10 unverdicted novelty 6.0

Task-irrelevant stimuli create long-term representational drift in task-relevant features, with drift rate increasing with variance and dimension of the irrelevant subspace, across Hebbian and gradient-based learning.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks
stat.ML 2025-11 unverdicted novelty 5.0

At the critical step-size scaling for SGD in high-dimensional single-layer networks, effective dynamics gain a diffusive correction term that changes the phase diagram and reduces to an Ornstein-Uhlenbeck process near...