Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
Pith reviewed 2026-05-22 08:02 UTC · model grok-4.3
The pith
SGD in flat directions produces growing variance and diffusion proportional to the learning rate instead of reaching a stationary distribution like Brownian motion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting directly from the discrete SGD update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order eta squared. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate.
What carries the argument
Master equation for the parameter distribution obtained directly from the discrete SGD step in a minibatch-induced fluctuating loss landscape, which yields a discrete Fokker-Planck equation differing from Langevin dynamics at order eta squared.
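For orientation, the generic one-step master equation for a discrete update of this kind can be written as follows. The notation is an assumption made here, not taken from the paper: \theta_t are the parameters, \xi_t is the minibatch drawn at step t (assumed i.i.d. and independent of \theta_t), \eta is the learning rate, L(\theta;\xi) is the minibatch loss, and P_t is the parameter density at step t.

```latex
\begin{align}
\theta_{t+1} &= \theta_t - \eta\,\nabla_\theta L(\theta_t;\xi_t), \\
P_{t+1}(\theta) &= \mathbb{E}_{\xi}\!\left[\int \mathrm{d}\theta'\;
  \delta\!\big(\theta - \theta' + \eta\,\nabla_{\theta'} L(\theta';\xi)\big)\,
  P_t(\theta')\right].
\end{align}
```

The second line only restates that \theta_{t+1} is a deterministic function of \theta_t and \xi_t. How the paper expands it in \eta to reach its discrete Fokker-Planck equation is not reproduced on this page.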
If this is right
- Along eigenvectors with near-zero curvature the parameter variance increases linearly with time at a rate set by the learning rate.
- The dynamics split into confined motion in directions of negative or positive curvature and unbounded diffusion in nearly flat directions.
- Standard continuous-time Langevin simulations omit the eta-squared corrections that control the diffusive regime.
- Empirical runs on vision and language models exhibit a clear separation between confined and diffusive eigenmodes consistent with the derived equation.
Where Pith is reading between the lines
- The same discrete derivation could be applied to other first-order optimizers to obtain their own eta-squared corrections.
- If flat-direction diffusion scales with learning rate, then larger rates may systematically increase exploration along valleys even after the loss has largely flattened.
- The framework suggests testing whether the observed separation of modes persists when the minibatch size or the curvature of the loss is varied in a controlled quadratic setting.
Load-bearing premise
Minibatch sampling can be modeled as producing a fluctuating loss landscape whose statistics allow a master equation to be written for the parameter distribution and approximated by a discrete Fokker-Planck equation that deviates from the continuous Langevin equation at second order in the learning rate.
What would settle it
Run SGD on a quadratic loss possessing one exactly flat direction and measure whether the variance along that direction grows linearly with iteration count at a slope proportional to the learning rate or instead saturates to a finite stationary value.
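A minimal sketch of such a test, under stated assumptions: the loss is a two-dimensional quadratic with one stiff and one exactly flat direction, and minibatch noise is modeled as i.i.d. Gaussian gradient noise rather than true minibatch sampling. The function name `simulate_sgd_quadratic` and all of its parameters are hypothetical. In this toy model the flat-direction variance after t steps is exactly η²σ²t, which is η σ² per unit of rescaled time τ = ηt. Whether the paper's "proportional to the learning rate" is stated per iteration or per unit of rescaled time is not clear from this page.

```python
import numpy as np

def simulate_sgd_quadratic(eta, n_steps, n_runs=2000, lam_stiff=1.0,
                           noise_std=1.0, seed=0):
    """Run independent SGD chains on L(x, y) = 0.5 * lam_stiff * x**2 + 0 * y**2.

    Assumption: minibatch noise is i.i.d. Gaussian noise added to the gradient.
    Returns an array of shape (n_steps + 1, 2) holding the variance across
    runs at each step; column 0 is the stiff direction, column 1 the flat one.
    """
    rng = np.random.default_rng(seed)
    curvature = np.array([lam_stiff, 0.0])   # Hessian eigenvalues
    theta = np.zeros((n_runs, 2))            # all chains start at the critical point
    variances = np.empty((n_steps + 1, 2))
    variances[0] = theta.var(axis=0)
    for t in range(n_steps):
        noise = noise_std * rng.standard_normal(theta.shape)
        grad = curvature * theta + noise     # noisy gradient of the quadratic
        theta = theta - eta * grad           # plain SGD step
        variances[t + 1] = theta.var(axis=0)
    return variances
```

Calling `simulate_sgd_quadratic(0.1, 400)` and plotting both columns against the step index should show the stiff direction levelling off near η σ² / (λ (2 − η λ)) while the flat direction keeps growing linearly. Replacing the Gaussian noise with gradients from real minibatches would be the next step toward the experiment described above.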
Original abstract
Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a √η noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order η². Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.
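The scaling mismatch mentioned in the abstract can be made concrete by putting one SGD step next to one Euler-Maruyama step of a Langevin equation with time step \Delta t = \eta. The symbols are assumptions introduced here: \bar L is the mean loss, \epsilon_t is the zero-mean minibatch gradient noise with covariance \Sigma, D is the Langevin diffusion constant, and z_t is a standard normal vector.

```latex
\begin{align}
\text{SGD:}\quad
  \theta_{t+1} &= \theta_t - \eta\,\nabla \bar L(\theta_t) - \eta\,\epsilon_t,
  & \mathrm{Cov}(\eta\,\epsilon_t) &= \eta^2\,\Sigma, \\
\text{Langevin (Euler--Maruyama):}\quad
  \theta_{t+1} &= \theta_t - \eta\,\nabla \bar L(\theta_t) + \sqrt{2 D\,\eta}\;z_t,
  & \mathrm{Cov}\big(\sqrt{2D\eta}\,z_t\big) &= 2 D\,\eta\,I .
\end{align}
```

The per-step noise in SGD scales as \eta, while the Langevin step scales as \sqrt{\eta}. The two covariances agree only if the diffusion constant is itself made learning-rate dependent, 2D\,I = \eta\,\Sigma, which is the usual device behind SDE models of SGD.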
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the standard Langevin approximation for SGD is inaccurate at finite learning rates because it relies on a continuous-time limit and √η noise scaling that mismatches the discrete update. Starting from the discrete SGD step, the authors derive a master equation for the parameter distribution under minibatch-induced fluctuating loss landscapes, yielding a discrete Fokker-Planck equation that differs from the Langevin form at O(η²). Near critical points, dynamics are decomposed along the eigenbasis of the mean Hessian; nearly-flat directions lack a stationary distribution, with variance growing linearly in time at a rate set by an effective diffusion coefficient proportional to the learning rate. Empirical support is shown on vision and language models, with observed separation between confined and diffusive modes.
Significance. If the derivation and eigenbasis decomposition are robust, the work supplies a concrete alternative to Brownian-motion models of SGD that makes falsifiable predictions about linear variance growth in flat directions. This could clarify mechanisms behind SGD's ability to traverse valleys and has potential implications for generalization and optimization theory. The direct derivation from discrete updates and the empirical tests on real models are strengths; however, the significance is tempered by the need to confirm that fluctuation-induced mode couplings do not alter the long-time behavior at the retained perturbative order.
major comments (2)
- [Derivation of discrete Fokker-Planck equation and analysis near critical points] The section deriving the discrete Fokker-Planck equation from the master equation averages the transition kernel over minibatch fluctuations but does not explicitly demonstrate that the eigenbasis of the mean Hessian remains invariant under the retained O(η²) terms. Instantaneous Hessian fluctuations can generate off-diagonal couplings or time-dependent eigenvalues at the same order, which would mix nominally flat modes with stiffer directions and potentially saturate variance growth; this assumption is load-bearing for the central claim of unbounded diffusion in near-zero eigenvalue directions.
- [Analysis near critical points] In the eigenbasis decomposition (the paragraph beginning 'we show that the behavior decomposes along the eigenbasis of the mean Hessian'), the paper retains η² corrections from the discrete update but does not bound the error arising from non-commuting fluctuation operators. A concrete test—e.g., computing the leading correction to the variance evolution equation when the instantaneous Hessian is expanded around the mean—would be required to confirm that the qualitative separation between confined and diffusive regimes survives.
minor comments (2)
- [Empirical evidence] The empirical section would benefit from explicit description of how variance is measured (e.g., which parameters or layers are tracked, number of independent runs, and how 'nearly-flat' directions are identified from the Hessian spectrum).
- [Preliminaries] Notation for the fluctuating loss landscape L(θ; ξ) and the transition kernel could be introduced with a single displayed equation early in the derivation to improve readability.
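One possible measurement protocol of the kind the first minor comment asks for is sketched below. It is an illustration, not the authors' procedure: the function name `mode_variances`, the relative threshold `flat_tol`, and the choice to measure variance across independent runs at a fixed step are all assumptions.

```python
import numpy as np

def mode_variances(final_params, hessian, flat_tol=1e-3):
    """Project an ensemble of parameter vectors onto Hessian eigenvectors.

    final_params: (n_runs, d) array, one row per independent SGD run.
    hessian:      (d, d) symmetric estimate of the mean Hessian.
    flat_tol:     hypothetical threshold on |eigenvalue| / max |eigenvalue|
                  below which a direction is labeled nearly flat.
    Returns (eigenvalues, variance across runs per eigen-direction, flat mask).
    """
    eigvals, eigvecs = np.linalg.eigh(hessian)        # ascending eigenvalues
    centered = final_params - final_params.mean(axis=0)
    coords = centered @ eigvecs                       # column k: coordinate along eigenvector k
    variances = coords.var(axis=0)
    is_flat = np.abs(eigvals) <= flat_tol * np.abs(eigvals).max()
    return eigvals, variances, is_flat
```

Repeating the call at several training steps and comparing the variance of the flagged directions with the rest would give the confined-versus-diffusive comparison directly. For large networks a full Hessian is impractical, so the eigen-directions would have to come from an approximate method instead.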
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments raise important questions about the perturbative consistency of the eigenbasis decomposition and the robustness of the variance growth prediction. We address each point below and will revise the manuscript accordingly to strengthen the derivation.
Point-by-point responses
-
Referee: The section deriving the discrete Fokker-Planck equation from the master equation averages the transition kernel over minibatch fluctuations but does not explicitly demonstrate that the eigenbasis of the mean Hessian remains invariant under the retained O(η²) terms. Instantaneous Hessian fluctuations can generate off-diagonal couplings or time-dependent eigenvalues at the same order, which would mix nominally flat modes with stiffer directions and potentially saturate variance growth; this assumption is load-bearing for the central claim of unbounded diffusion in near-zero eigenvalue directions.
Authors: We agree that an explicit demonstration of eigenbasis invariance at the retained order would clarify the argument. The master equation is constructed by averaging the transition kernel over the zero-mean minibatch fluctuations, so the mean Hessian enters as the first moment. The O(η²) corrections to the discrete Fokker-Planck equation are obtained by expanding the update and retaining terms up to second order in the fluctuation moments; these corrections are then expressed in the eigenbasis of the mean Hessian. Off-diagonal contributions from instantaneous Hessian fluctuations average to zero at this order because the minibatch samples are drawn independently at each step. We will revise the derivation section to include a short expansion explicitly showing that non-commuting fluctuation operators contribute only at O(η³) to the variance evolution equation within the approximation kept in the paper.
Revision: yes
-
Referee: In the eigenbasis decomposition (the paragraph beginning 'we show that the behavior decomposes along the eigenbasis of the mean Hessian'), the paper retains η² corrections from the discrete update but does not bound the error arising from non-commuting fluctuation operators. A concrete test—e.g., computing the leading correction to the variance evolution equation when the instantaneous Hessian is expanded around the mean—would be required to confirm that the qualitative separation between confined and diffusive regimes survives.
Authors: We thank the referee for suggesting this concrete test. Expanding the instantaneous Hessian as H = H_mean + δH and inserting into the variance evolution, the cross terms involving δH average to zero upon taking the expectation over independent minibatches. The leading correction to the diffusion coefficient along near-zero eigenvalues remains proportional to η and does not introduce saturation at the perturbative order retained. We will add this explicit calculation, together with the resulting bound on the error, as a new subsection or appendix to confirm that the separation between confined (stiff) and diffusive (flat) regimes is preserved.
Revision: yes
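A scalar toy version of this calculation shows which terms drop out on averaging. It is a sketch under assumptions made here, not the paper's derivation: x_t is the coordinate along one eigen-direction with mean curvature h, \delta h_t is a zero-mean curvature fluctuation with variance s^2, g_t is zero-mean gradient noise with variance \sigma^2, \delta h_t and g_t are uncorrelated, and both are independent of x_t.

```latex
\begin{align}
x_{t+1} &= \big(1 - \eta\,(h + \delta h_t)\big)\,x_t - \eta\,g_t, \\
\mathbb{E}\big[x_{t+1}^2\big] &= \big[(1 - \eta h)^2 + \eta^2 s^2\big]\,
  \mathbb{E}\big[x_t^2\big] + \eta^2 \sigma^2 .
\end{align}
```

The terms linear in \delta h_t vanish because \mathbb{E}[\delta h_t] = 0, but its second moment survives as \eta^2 s^2, the same order as the retained corrections. At h = 0 the recursion has no stationary solution, so in this scalar toy the fluctuations do not saturate the variance. In the matrix case the analogous second-moment term can couple different eigen-directions, and a scalar model cannot say anything about that, so the referee's question about non-commuting operators still needs the full calculation.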
Circularity Check
Derivation self-contained from discrete SGD update with no reduction to inputs
The paper begins from the discrete SGD update rule and derives a master equation and discrete Fokker-Planck equation at order eta^2 without any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The subsequent decomposition along the eigenbasis of the mean Hessian follows directly from the derived equation as an analysis step rather than a circular premise. No quoted reduction equates any claimed result to its own inputs by construction, and the framework remains independent of its outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Minibatch sampling induces a fluctuating loss landscape from which a master equation can be derived directly from the discrete SGD update rule.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Paper passage: "Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order η²."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · tag: unclear
  Paper passage: "nearly-flat directions do not admit a stationary distribution: the variance grows over time"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.