Stochastic Gradient Descent as Approximate Bayesian Inference
9 Pith papers cite this work.
abstract
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
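As an illustration of the first claim in the abstract, the following minimal sketch runs constant-step SGD on a one-dimensional quadratic loss and checks that the iterates settle into a stationary distribution around the optimum rather than converging to a point. The toy model, the function name `constant_sgd`, and all parameter values are assumptions chosen for illustration; they are not taken from the paper. The closed-form variance used for comparison holds only for this toy problem (minibatches drawn with replacement), where the iterates follow an AR(1) recursion.

```python
import random
import statistics


def constant_sgd(data, step_size, batch_size, n_steps, seed=0):
    """Constant-step SGD on L(theta) = mean(0.5 * (theta - x_i)**2).

    Hypothetical helper for illustration; returns every iterate.
    """
    rng = random.Random(seed)
    theta = 0.0
    iterates = []
    for _ in range(n_steps):
        batch = rng.choices(data, k=batch_size)  # minibatch, with replacement
        grad = theta - statistics.fmean(batch)   # stochastic gradient
        theta -= step_size * grad                # learning rate never decays
        iterates.append(theta)
    return iterates


def population_variance(values):
    """Variance with divisor len(values)."""
    mean = statistics.fmean(values)
    return statistics.fmean((v - mean) ** 2 for v in values)


# Synthetic data (assumed setup): 1000 draws from N(2, 1).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
step_size, batch_size = 0.1, 10

iterates = constant_sgd(data, step_size, batch_size, n_steps=100_000)
samples = iterates[1_000:]  # discard burn-in

optimum = statistics.fmean(data)  # minimizer of the full-data loss
empirical_var = population_variance(samples)

# For this toy model the recursion is
#   theta' - optimum = (1 - eps) * (theta - optimum) + eps * noise,
# with Var(noise) = Var(data) / batch_size, so the stationary variance is
#   eps * Var(data) / (batch_size * (2 - eps)).
predicted_var = step_size * population_variance(data) / (
    batch_size * (2.0 - step_size)
)

# Averaging the iterates recovers the optimum far more accurately than
# any single iterate does.
iterate_average = statistics.fmean(samples)

print(f"empirical stationary variance: {empirical_var:.5f}")
print(f"predicted stationary variance: {predicted_var:.5f}")
print(f"iterate average minus optimum: {iterate_average - optimum:.5f}")
```

In this sketch the stationary variance grows with the step size and shrinks with the minibatch size. These are the kinds of tuning parameters that the abstract describes adjusting in order to match the stationary distribution to a posterior.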
representative citing papers
SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.
Task-irrelevant stimuli create long-term representational drift in task-relevant features, with drift rate increasing with variance and dimension of the irrelevant subspace, across Hebbian and gradient-based learning.
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Derives optimality constraints for nonnegative joint dictionary learning that explain observed SAE behaviors such as feature splitting, absorption, and dense antipodal features.
At the critical step-size scaling for SGD in high-dimensional single-layer networks, effective dynamics gain a diffusive correction term that changes the phase diagram and reduces to an Ornstein-Uhlenbeck process near fixed points, with the information exponent governing sample complexity.
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.
citing papers explorer
- Balancing structure and randomness: maximum entropy networks for context-dependent computations
Maximum entropy inference on weight distributions under context-dependent task constraints produces neuron populations with contextual gain modulation whose connectivity matches gradient-descent trained networks, with transitions to random structure as context count or weight scale increases.
- Spectral phase transitions and trainability in neural network learning dynamics
SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.
- How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations
Derives optimality constraints for nonnegative joint dictionary learning that explain observed SAE behaviors such as feature splitting, absorption, and dense antipodal features.
- Statistical Properties of Training & Generalization
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.