Stochastic Gradient Descent as Approximate Bayesian Inference

Stephan Mandt, Matthew D · 2017 · stat.ML · arXiv 1704.04289

9 Pith papers cite this work. Polarity classification is still indexing.

abstract

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
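The abstract's first claim, that constant SGD behaves like a Markov chain with a stationary distribution, can be illustrated with a minimal sketch. The setup here is an assumption made for illustration and is not taken from the paper: a one-dimensional quadratic loss 0.5 * a * theta^2 with additive Gaussian gradient noise. The function names and parameters (`simulate_constant_sgd`, `a`, `noise_std`, `lr`) are hypothetical. For this toy chain the stationary variance is known in closed form, lr * noise_std^2 / (a * (2 - lr * a)), and it approaches the Ornstein-Uhlenbeck value lr * noise_std^2 / (2 * a) as the learning rate shrinks.

```python
import numpy as np


def simulate_constant_sgd(a=1.0, noise_std=1.0, lr=0.1,
                          n_steps=200_000, burn_in=10_000, seed=0):
    """Constant-learning-rate SGD on the toy loss 0.5 * a * theta**2.

    The stochastic gradient is modelled (as an assumption) as the true
    gradient a * theta plus zero-mean Gaussian noise with std noise_std.
    Returns the iterates after discarding the burn-in period.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=n_steps)
    theta = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        grad = a * theta + noise[t]   # noisy gradient estimate
        theta -= lr * grad            # learning rate is never decayed
        samples[t] = theta
    return samples[burn_in:]


def stationary_variance_exact(a, noise_std, lr):
    """Exact stationary variance of the discrete-time chain above."""
    return lr * noise_std**2 / (a * (2.0 - lr * a))


def stationary_variance_ou(a, noise_std, lr):
    """Continuous-time (Ornstein-Uhlenbeck) approximation, valid for small lr."""
    return lr * noise_std**2 / (2.0 * a)


if __name__ == "__main__":
    a, noise_std, lr = 1.0, 1.0, 0.1
    samples = simulate_constant_sgd(a=a, noise_std=noise_std, lr=lr)
    print("empirical variance:", samples.var())
    print("exact discrete    :", stationary_variance_exact(a, noise_std, lr))
    print("OU approximation  :", stationary_variance_ou(a, noise_std, lr))
```

The iterates never converge to the minimum; they keep fluctuating around it with a spread set by the learning rate and the gradient noise. In this toy setting, tuning `lr` changes the width of the stationary distribution, which is the knob the abstract describes adjusting to match a posterior.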

representative citing papers

Balancing structure and randomness: maximum entropy networks for context-dependent computations

q-bio.NC · 2026-05-25 · unverdicted · novelty 7.0

Maximum entropy inference on weight distributions under context-dependent task constraints produces neuron populations with contextual gain modulation whose connectivity matches gradient-descent trained networks, with transitions to random structure as context count or weight scale increases.

Spectral phase transitions and trainability in neural network learning dynamics

cond-mat.dis-nn · 2026-06-26 · unverdicted · novelty 6.0

SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.

Contribution of task-irrelevant stimuli to drift of neural representations

q-bio.NC · 2025-10-24 · unverdicted · novelty 6.0

Task-irrelevant stimuli create long-term representational drift in task-relevant features, with drift rate increasing with variance and dimension of the irrelevant subspace, across Hebbian and gradient-based learning.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

q-bio.NC · 2026-06-01 · unverdicted · novelty 5.0

Derives optimality constraints for nonnegative joint dictionary learning that explain observed SAE behaviors such as feature splitting, absorption, and dense antipodal features.

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

stat.ML · 2025-11-04 · unverdicted · novelty 5.0

At the critical step-size scaling for SGD in high-dimensional single-layer networks, effective dynamics gain a diffusive correction term that changes the phase diagram and reduces to an Ornstein-Uhlenbeck process near fixed points, with the information exponent governing sample complexity.

Statistical Properties of Training & Generalization

stat.ML · 2026-06-18 · unverdicted · novelty 1.0 · 2 refs

Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Balancing structure and randomness: maximum entropy networks for context-dependent computations

q-bio.NC · 2026-05-25 · unverdicted · none · ref 64 · internal anchor

Maximum entropy inference on weight distributions under context-dependent task constraints produces neuron populations with contextual gain modulation whose connectivity matches gradient-descent trained networks, with transitions to random structure as context count or weight scale increases.

Spectral phase transitions and trainability in neural network learning dynamics

cond-mat.dis-nn · 2026-06-26 · unverdicted · none · ref 58 · internal anchor

SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

q-bio.NC · 2026-06-01 · unverdicted · none · ref 130 · internal anchor

Derives optimality constraints for nonnegative joint dictionary learning that explain observed SAE behaviors such as feature splitting, absorption, and dense antipodal features.

Statistical Properties of Training & Generalization

stat.ML · 2026-06-18 · unverdicted · none · ref 170 · 2 links · internal anchor

Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.
