Stochastic Gradient Descent as Approximate Bayesian Inference

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
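The Markov-chain view in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm or code: it assumes a one-dimensional quadratic loss (estimating a Gaussian mean), minibatches sampled with replacement, and NumPy. The function names and the parameter choices (`S = 10`, 1000 data points, 200,000 steps) are hypothetical. In this setting each update is a linear autoregressive step, so the stationary variance of the iterates has the closed form `eps * var(data) / (S * (2 - eps))`, which the simulation checks empirically.

```python
import numpy as np


def constant_sgd_chain(data, step_size, batch_size, num_steps, seed=0):
    """Run constant-step SGD on the loss 0.5 * mean((theta - x_i)^2).

    Toy assumption: minibatches are drawn with replacement, so the
    minibatch gradient is theta - mean(batch).
    """
    rng = np.random.default_rng(seed)
    # Pre-sample all minibatch indices: shape (num_steps, batch_size).
    idx = rng.integers(0, len(data), size=(num_steps, batch_size))
    batch_means = data[idx].mean(axis=1)

    theta = float(data.mean())  # start at the optimum to shorten burn-in
    chain = np.empty(num_steps)
    for t in range(num_steps):
        grad = theta - batch_means[t]   # stochastic gradient of the loss
        theta -= step_size * grad       # constant learning rate, no decay
        chain[t] = theta
    return chain


def predicted_stationary_variance(data, step_size, batch_size):
    """Closed-form stationary variance of the AR(1) recursion
    theta' = (1 - eps) * theta + eps * batch_mean."""
    batch_mean_var = data.var() / batch_size
    return step_size * batch_mean_var / (2.0 - step_size)


data = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=1000)
S = 10                       # hypothetical minibatch size
eps = 2 * S / len(data)      # step size chosen so the variance is about var(data) / N

chain = constant_sgd_chain(data, eps, S, num_steps=200_000)
samples = chain[1000:]       # discard a short burn-in
empirical = samples.var()
predicted = predicted_stationary_variance(data, eps, S)

print(f"empirical stationary variance: {empirical:.3e}")
print(f"predicted stationary variance: {predicted:.3e}")
```

With `eps = 2 * S / N`, the predicted variance is close to `var(data) / N`, the same scale as the posterior variance of a Gaussian mean with known noise variance under a flat prior. That is the sense in which, in this toy case, the step size tunes how closely the iterates resemble posterior samples.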

representative citing papers

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

  • Contribution of task-irrelevant stimuli to drift of neural representations · q-bio.NC · 2025-10-24 · unverdicted · none · ref 28 · internal anchor

    Task-irrelevant stimuli create long-term representational drift in task-relevant features, with drift rate increasing with variance and dimension of the irrelevant subspace, across Hebbian and gradient-based learning.

  • Scaling Laws for Transfer · cs.LG · 2021-02-02 · unverdicted · none · ref 150 · internal anchor

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  • Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 270

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • A General Language Assistant as a Laboratory for Alignment · cs.CL · 2021-12-01 · conditional · none · ref 192

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  • Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks · stat.ML · 2025-11-04 · unverdicted · none · ref 21 · internal anchor

    At the critical step-size scaling for SGD in high-dimensional single-layer networks, effective dynamics gain a diffusive correction term that changes the phase diagram and reduces to an Ornstein-Uhlenbeck process near fixed points, with the information exponent governing sample complexity.