Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

Aleksandr Beznosikov; Aleksandr Kovalenko; Aleksandr Shestakov; Andrew Semenov; Andrey Leonidov; Andrey Veprikov; Anna Radovskaya; Egor Lopatin; Igor Ignashin; Stanislav Potapov

arxiv: 2605.22644 · v1 · pith:P64LSTQDnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 08:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords SGD · Fokker-Planck equation · flat directions · stochastic dynamics · diffusion · learning rate · neural network optimization · Hessian eigenmodes

The pith

In flat directions, SGD produces variance that keeps growing, with a diffusion coefficient proportional to the learning rate, instead of settling into the stationary distribution predicted by the standard Langevin (Brownian-noise) model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the common modeling of SGD as a continuous-time Langevin process driven by Brownian noise. It instead treats SGD as deterministic motion on a loss surface that fluctuates because of minibatch sampling. Starting from the exact discrete parameter update, the authors derive a master equation whose continuum limit yields a Fokker-Planck equation that agrees with the usual Langevin form through order η and deviates at order η². This difference produces qualitatively new behavior near critical points: when the mean Hessian has a near-zero eigenvalue along some eigenvector, the variance of the parameters along that direction grows without bound rather than saturating. The result matters for neural-network training because it indicates that common continuous approximations miss the mechanism by which SGD can keep diffusing along valleys at a rate set directly by the learning rate.

Core claim

Starting directly from the discrete SGD update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order eta squared. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate.
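
As a minimal worked illustration of the flat-versus-sharp distinction, consider a one-dimensional toy model (an assumption made here, not the paper's master-equation derivation): a single Hessian eigendirection with curvature $\lambda$ and additive zero-mean gradient noise $\varepsilon_k$ of variance $\sigma^2$, independent across steps. The symbols $\sigma^2$, $V_k$ and $\tau$ are introduced only for this sketch.

```latex
\begin{align}
\theta_{k+1} &= \theta_k - \eta\,(\lambda\,\theta_k + \varepsilon_k), \\
V_{k+1} &= (1-\eta\lambda)^2\,V_k + \eta^2\sigma^2,
  \qquad V_k := \operatorname{Var}(\theta_k), \\
\lambda = 0:\quad V_k &= V_0 + \eta^2\sigma^2\,k = V_0 + \eta\,\sigma^2\,\tau,
  \qquad \tau := \eta k, \\
0 < \eta\lambda < 2:\quad V_\infty &= \frac{\eta\,\sigma^2}{\lambda\,(2-\eta\lambda)}
  = \frac{\eta\,\sigma^2}{2\lambda}\Bigl(1 + \tfrac{\eta\lambda}{2} + O(\eta^2\lambda^2)\Bigr).
\end{align}
```

In this toy model the flat direction has no stationary variance, and in rescaled time $\tau$ its growth rate is $\eta\sigma^2$. The leading factor of $V_\infty$ is the stationary variance of the Ornstein-Uhlenbeck process $d\theta = -\lambda\theta\,d\tau + \sqrt{\eta}\,\sigma\,dW$, and the bracket is a finite-$\eta$ correction to it.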

What carries the argument

Master equation for the parameter distribution obtained directly from the discrete SGD step in a minibatch-induced fluctuating loss landscape, which yields a discrete Fokker-Planck equation differing from Langevin dynamics at order eta squared.

If this is right

Along eigenvectors with near-zero curvature the parameter variance increases linearly with time at a rate set by the learning rate.
The dynamics split into confined motion in directions of appreciable positive curvature and unbounded diffusion in nearly flat directions.
Standard continuous-time Langevin simulations omit the eta-squared corrections that control the diffusive regime.
Empirical runs on vision and language models exhibit a clear separation between confined and diffusive eigenmodes consistent with the derived equation.
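
The confined-versus-diffusive split listed above can be reproduced in a small simulation. This is a sketch under stated assumptions, not the paper's experiment: a diagonal quadratic loss with one sharp and one exactly flat direction, additive Gaussian gradient noise of fixed variance, and an ensemble of independent runs started at the origin. The function name `simulate_modes` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_modes(eta=0.05, lambdas=(1.0, 0.0), sigma=1.0,
                   steps=2000, runs=4000, seed=0):
    """Ensemble variance per eigendirection for noisy SGD on a diagonal quadratic.

    Assumed toy update: theta <- theta - eta * (lambda * theta + noise),
    with i.i.d. Gaussian noise of standard deviation sigma at every step.
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(lambdas, dtype=float)
    theta = np.zeros((runs, lam.size))       # every run starts at the critical point
    var = np.empty((steps, lam.size))
    for k in range(steps):
        noise = sigma * rng.standard_normal(theta.shape)
        theta = theta - eta * (lam * theta + noise)
        var[k] = theta.var(axis=0)           # variance across the ensemble
    return var

var = simulate_modes()
sharp_plateau = var[-500:, 0].mean()         # sharp direction: averaged late-time variance
flat_final = var[-1, 1]                      # flat direction: variance at the last step
print("sharp plateau:", sharp_plateau)
print("flat variance at quarter time and at the end:", var[len(var) // 4, 1], flat_final)
```

With these toy settings the sharp direction should level off near eta * sigma**2 / (lambda * (2 - eta * lambda)), while the flat direction keeps growing roughly as eta**2 * sigma**2 per iteration.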

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same discrete derivation could be applied to other first-order optimizers to obtain their own eta-squared corrections.
If flat-direction diffusion scales with learning rate, then larger rates may systematically increase exploration along valleys even after the loss has largely flattened.
The framework suggests testing whether the observed separation of modes persists when the minibatch size or the curvature of the loss is varied in a controlled quadratic setting.

Load-bearing premise

Minibatch sampling can be modeled as producing a fluctuating loss landscape whose statistics allow a master equation to be written for the parameter distribution and approximated by a discrete Fokker-Planck equation that deviates from the continuous Langevin equation at second order in the learning rate.

What would settle it

Run SGD on a quadratic loss possessing one exactly flat direction and measure whether the variance along that direction grows linearly with iteration count at a slope proportional to the learning rate or instead saturates to a finite stationary value.
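
A minimal sketch of such a test, assuming additive Gaussian gradient noise of fixed variance along an exactly flat direction (the noise model, the function name `flat_variance_slope`, and the parameter values are assumptions of this sketch, not taken from the paper). In this toy model the fitted slope per iteration is eta**2 * sigma**2, which corresponds to eta * sigma**2 per unit of rescaled time tau = eta * k; which normalization the paper uses is not stated on this page.

```python
import numpy as np

def flat_variance_slope(eta, sigma=1.0, steps=1000, runs=5000, seed=0):
    """Fit the per-iteration growth rate of the ensemble variance in a flat direction."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(runs)
    var = np.empty(steps)
    for k in range(steps):
        # Zero curvature: the update is a pure noise step.
        theta -= eta * sigma * rng.standard_normal(runs)
        var[k] = theta.var()
    iters = np.arange(1, steps + 1)
    slope, _ = np.polyfit(iters, var, 1)     # linear fit: variance = slope * k + intercept
    return slope

etas = [0.01, 0.02, 0.04]
# The same seed is reused for every eta (common random numbers), so only eta changes.
slopes = [flat_variance_slope(eta) for eta in etas]
per_time = [s / eta for s, eta in zip(slopes, etas)]   # slope per unit of tau = eta * k

for eta, s, p in zip(etas, slopes, per_time):
    print(f"eta={eta}: slope per iteration={s:.3e}, slope per unit tau={p:.3e}")
```

A saturating variance would show up as a fitted slope that shrinks toward zero when `steps` is increased; sustained linear growth keeps the slope stable.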

Figures

Figures reproduced from arXiv: 2605.22644 by Aleksandr Beznosikov, Aleksandr Kovalenko, Aleksandr Shestakov, Andrew Semenov, Andrey Leonidov, Andrey Veprikov, Anna Radovskaya, Egor Lopatin, Igor Ignashin, Stanislav Potapov.

**Figure 2.** Compares the theoretical diagonal covariance profile from Eq. (19) with the empirical covariance in the mean-Hessian eigenbasis. Since the overall multiplicative constant γ is not fixed by the theory, the comparison is structural rather than absolute. We observe good agreement, including the predicted separation between saturating directions and directions that remain non-stationary over the observed ti…

**Figure 3.** Discrete SGD vs. Langevin approximation: saturation level (NanoGPT 6.6M). Empirical plateau $\hat{\Pi}_i^{\infty}$ vs. Hessian eigenvalue $\lambda_i$ for the top-20 sharp directions and three learning rates. Circles: empirical means; crosses: discrete prediction (Eq. (19)); diamonds: Langevin prediction (Eq. (15)). The Langevin approximation increasingly underestimates the plateau at larger η, while the discrete prediction remains c…

**Figure 4.** MLP experiment. Left: eigenvalue spectrum of the mean Hessian. Right: variance matrix of SGD…

**Figure 5.** MLP experiment. Left: theoretical prediction for the diagonal elements of the variance matrix from…

**Figure 6.** NanoGPT language model. Left: eigenvalue spectrum of the mean Hessian. Right: variance matrix of…

**Figure 7.** MLP experiment: trajectories in distinct eigendirections of the mean Hessian.

**Figure 8.** Shakespeare model: trajectories in distinct eigendirections of the mean Hessian.

**Figure 9.** NanoGPT 6.6M on WikiText-2. Empirical variance…

**Figure 10.** Mean saturation level $\hat{\Pi}^{\infty}$ (averaged over 20 sharp directions) as a function of learning rate η. Left: linear axes. Right: log–log axes with the theoretical slope-1 line. The single $\hat{\gamma}$ estimated from η = 0.001 predicts the other two points without refitting ($R^2 = 0.90$).

**Figure 11.** Quantitative validation of the discrete theory on MLP-386/MNIST. Predicted and measured variances…

**Figure 12.** Quantitative validation of the discrete theory on NanoGPT/Shakespeare. The same covariance predic…

**Figure 13.** Effect of sampling strategy on SGD dynamics. Both ensembles start from the same reference point and…
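
The trend described in the Figure 3 caption, with the Langevin prediction falling below the discrete one as η grows, can be illustrated with closed forms from a one-dimensional toy model. This is a sketch under assumptions: additive gradient noise of fixed variance along a single sharp direction. The functions below are not the paper's Eq. (15) or Eq. (19), which are not reproduced on this page, and their names are hypothetical.

```python
import numpy as np

def discrete_plateau(eta, lam, sigma=1.0):
    """Fixed point of V -> (1 - eta*lam)**2 * V + eta**2 * sigma**2 (needs 0 < eta*lam < 2)."""
    return eta * sigma**2 / (lam * (2.0 - eta * lam))

def langevin_plateau(eta, lam, sigma=1.0):
    """Stationary variance of d(theta) = -lam * theta dt + sqrt(eta) * sigma dW."""
    return eta * sigma**2 / (2.0 * lam)

lam = np.array([1.0, 5.0, 20.0])             # toy curvatures of sharp directions
for eta in (0.001, 0.01, 0.05):
    ratio = langevin_plateau(eta, lam) / discrete_plateau(eta, lam)
    print(f"eta={eta}: Langevin / discrete plateau = {np.round(ratio, 4)}")
```

In this toy model the ratio equals 1 - eta * lam / 2, so the two predictions agree as eta * lam approaches zero and separate as the learning rate or the curvature increases.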

Original abstract

Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a √η noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order η². Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SGD flat directions show growing variance from discrete eta^2 effects, with experiments backing the confined vs diffusive split.

Your colleague should know that this paper derives a Fokker-Planck equation for SGD that includes corrections at order eta squared from the discrete updates, and uses it to predict that nearly flat directions will show growing variance over time rather than a stationary distribution. They start from the discrete parameter update with minibatch noise and build a master equation for the distribution. This leads to a continuous description that differs from the standard Langevin dynamics used in much of the literature. The analysis then projects onto the eigenbasis of the mean Hessian, separating stiff directions that stay confined from flat ones where diffusion dominates with strength proportional to the learning rate.

What stands out is the empirical check on actual neural nets for vision and language tasks. They observe modes that remain localized versus ones that spread, which matches the predicted split.

The potential issue is with the Hessian fluctuations. Since each minibatch sees a slightly different curvature, those variations could introduce mixing between modes at the perturbative order kept in the equation. That might prevent the clean separation and cap the variance growth in what look like flat directions. The paper averages the kernel, but without seeing how they justify keeping the eigenbasis fixed, it's hard to know if this is fully controlled.

This kind of work is for people doing theoretical analysis of optimizers who are okay moving past the Brownian motion shortcut. It gives a way to think about why SGD might explore valleys differently than expected. I would recommend sending it for peer review. The idea is solid enough and the experiments give something concrete to evaluate, even if the fluctuation analysis needs more detail to close the loop.

Referee Report

2 major / 2 minor

Summary. The paper claims that the standard Langevin approximation for SGD is inaccurate at finite learning rates because it relies on a continuous-time limit and √η noise scaling that mismatches the discrete update. Starting from the discrete SGD step, the authors derive a master equation for the parameter distribution under minibatch-induced fluctuating loss landscapes, yielding a discrete Fokker-Planck equation that differs from the Langevin form at O(η²). Near critical points, dynamics are decomposed along the eigenbasis of the mean Hessian; nearly-flat directions lack a stationary distribution, with variance growing linearly in time at a rate set by an effective diffusion coefficient proportional to the learning rate. Empirical support is shown on vision and language models, with observed separation between confined and diffusive modes.

Significance. If the derivation and eigenbasis decomposition are robust, the work supplies a concrete alternative to Brownian-motion models of SGD that makes falsifiable predictions about linear variance growth in flat directions. This could clarify mechanisms behind SGD's ability to traverse valleys and has potential implications for generalization and optimization theory. The direct derivation from discrete updates and the empirical tests on real models are strengths; however, the significance is tempered by the need to confirm that fluctuation-induced mode couplings do not alter the long-time behavior at the retained perturbative order.

major comments (2)

[Derivation of discrete Fokker-Planck equation and analysis near critical points] The section deriving the discrete Fokker-Planck equation from the master equation averages the transition kernel over minibatch fluctuations but does not explicitly demonstrate that the eigenbasis of the mean Hessian remains invariant under the retained O(η²) terms. Instantaneous Hessian fluctuations can generate off-diagonal couplings or time-dependent eigenvalues at the same order, which would mix nominally flat modes with stiffer directions and potentially saturate variance growth; this assumption is load-bearing for the central claim of unbounded diffusion in near-zero eigenvalue directions.
[Analysis near critical points] In the eigenbasis decomposition (the paragraph beginning 'we show that the behavior decomposes along the eigenbasis of the mean Hessian'), the paper retains η² corrections from the discrete update but does not bound the error arising from non-commuting fluctuation operators. A concrete test—e.g., computing the leading correction to the variance evolution equation when the instantaneous Hessian is expanded around the mean—would be required to confirm that the qualitative separation between confined and diffusive regimes survives.

minor comments (2)

[Empirical evidence] The empirical section would benefit from explicit description of how variance is measured (e.g., which parameters or layers are tracked, number of independent runs, and how 'nearly-flat' directions are identified from the Hessian spectrum).
[Preliminaries] Notation for the fluctuating loss landscape L(θ; ξ) and the transition kernel could be introduced with a single displayed equation early in the derivation to improve readability.
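
One way the variance measurement requested in the first minor comment could be specified is sketched below. It is an assumption about a reasonable procedure, not a description of what the authors did: parameter snapshots from independent runs are projected onto the eigenvectors of a mean-Hessian estimate, and the covariance is computed across runs. The function names, the synthetic data, and the flatness threshold `tol` are hypothetical.

```python
import numpy as np

def variance_in_hessian_eigenbasis(thetas, mean_hessian):
    """Covariance of parameter snapshots expressed in the mean-Hessian eigenbasis.

    thetas: (runs, dim) array, one flattened parameter vector per independent run,
            all taken at the same training step.
    mean_hessian: (dim, dim) symmetric matrix (for example, a full-batch estimate).
    """
    eigvals, eigvecs = np.linalg.eigh(mean_hessian)   # ascending eigenvalues, orthonormal columns
    centered = thetas - thetas.mean(axis=0)
    coords = centered @ eigvecs                       # coordinates along each eigenvector
    cov = coords.T @ coords / (thetas.shape[0] - 1)   # unbiased covariance across runs
    return eigvals, cov

def nearly_flat(eigvals, eta, tol=1e-2):
    """Flag directions with |eta * lambda| < tol (the threshold is an illustrative choice)."""
    return np.abs(eta * eigvals) < tol

# Synthetic check: known eigenbasis Q, curvatures lam_true, per-direction variances v_true.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
lam_true = np.array([0.0, 1.0, 4.0])
v_true = np.array([3.0, 0.5, 0.1])
H = Q @ np.diag(lam_true) @ Q.T
thetas = (rng.standard_normal((20000, 3)) * np.sqrt(v_true)) @ Q.T

eigvals, cov = variance_in_hessian_eigenbasis(thetas, H)
flat_mask = nearly_flat(eigvals, eta=0.05)
print("eigenvalues:", np.round(eigvals, 3))
print("variance per eigendirection:", np.round(np.diag(cov), 3))
print("flagged as nearly flat:", flat_mask)
```

Reporting the number of runs, the step at which snapshots are taken, and the threshold used to call a direction nearly flat would make such a measurement reproducible.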

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments raise important questions about the perturbative consistency of the eigenbasis decomposition and the robustness of the variance growth prediction. We address each point below and will revise the manuscript accordingly to strengthen the derivation.

Referee: The section deriving the discrete Fokker-Planck equation from the master equation averages the transition kernel over minibatch fluctuations but does not explicitly demonstrate that the eigenbasis of the mean Hessian remains invariant under the retained O(η²) terms. Instantaneous Hessian fluctuations can generate off-diagonal couplings or time-dependent eigenvalues at the same order, which would mix nominally flat modes with stiffer directions and potentially saturate variance growth; this assumption is load-bearing for the central claim of unbounded diffusion in near-zero eigenvalue directions.

Authors: We agree that an explicit demonstration of eigenbasis invariance at the retained order would clarify the argument. The master equation is constructed by averaging the transition kernel over the zero-mean minibatch fluctuations, so the mean Hessian enters as the first moment. The O(η²) corrections to the discrete Fokker-Planck equation are obtained by expanding the update and retaining terms up to second order in the fluctuation moments; these corrections are then expressed in the eigenbasis of the mean Hessian. Off-diagonal contributions from instantaneous Hessian fluctuations average to zero at this order because the minibatch samples are drawn independently at each step. We will revise the derivation section to include a short expansion explicitly showing that non-commuting fluctuation operators contribute only at O(η³) to the variance evolution equation within the approximation kept in the paper. revision: yes
Referee: In the eigenbasis decomposition (the paragraph beginning 'we show that the behavior decomposes along the eigenbasis of the mean Hessian'), the paper retains η² corrections from the discrete update but does not bound the error arising from non-commuting fluctuation operators. A concrete test—e.g., computing the leading correction to the variance evolution equation when the instantaneous Hessian is expanded around the mean—would be required to confirm that the qualitative separation between confined and diffusive regimes survives.

Authors: We thank the referee for suggesting this concrete test. Expanding the instantaneous Hessian as H = H_mean + δH and inserting into the variance evolution, the cross terms involving δH average to zero upon taking the expectation over independent minibatches. The leading correction to the diffusion coefficient along near-zero eigenvalues remains proportional to η and does not introduce saturation at the perturbative order retained. We will add this explicit calculation, together with the resulting bound on the error, as a new subsection or appendix to confirm that the separation between confined (stiff) and diffusive (flat) regimes is preserved. revision: yes
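
The expansion discussed in this exchange can be illustrated in a scalar toy model. This is an assumption made here for illustration, not the calculation the authors promise: one eigendirection with mean curvature $\lambda$, a zero-mean curvature fluctuation $\delta h_k$ with variance $s^2$, and zero-mean additive gradient noise $\varepsilon_k$ with variance $\sigma^2$, all independent of each other and of $\theta_k$. The symbols $\delta h_k$, $s^2$, $\sigma^2$ and $M_k$ are introduced only for this sketch.

```latex
\begin{align}
\theta_{k+1} &= \bigl(1 - \eta\lambda - \eta\,\delta h_k\bigr)\,\theta_k - \eta\,\varepsilon_k, \\
M_{k+1} &= \Bigl[(1-\eta\lambda)^2 - 2\eta(1-\eta\lambda)\,\mathbb{E}[\delta h_k]
           + \eta^2\,\mathbb{E}[\delta h_k^2]\Bigr] M_k + \eta^2\sigma^2 \\
        &= \bigl[(1-\eta\lambda)^2 + \eta^2 s^2\bigr]\,M_k + \eta^2\sigma^2,
           \qquad M_k := \mathbb{E}[\theta_k^2].
\end{align}
```

In this toy model the term linear in $\delta h_k$ vanishes on averaging, in line with the statement about cross terms, while the term quadratic in $\delta h_k$ survives with a prefactor $\eta^2$. Whether and how an analogous term enters the multi-dimensional variance equation is the kind of detail the explicit calculation would need to show.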

Circularity Check

0 steps flagged

Derivation self-contained from discrete SGD update with no reduction to inputs

The paper begins from the discrete SGD update rule and derives a master equation and discrete Fokker-Planck equation at order eta^2 without any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The subsequent decomposition along the eigenbasis of the mean Hessian follows directly from the derived equation as an analysis step rather than a circular premise. No quoted reduction equates any claimed result to its own inputs by construction, and the framework remains independent of its outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that minibatch sampling produces a fluctuating landscape permitting a master equation derivation from the discrete update, plus the decomposition of dynamics along the mean Hessian eigenbasis; no free parameters or invented entities are indicated in the abstract.

axioms (1)

Domain assumption: minibatch sampling induces a fluctuating loss landscape from which a master equation can be derived directly from the discrete SGD update rule.
This premise enables the alternative formulation and the subsequent discrete Fokker-Planck equation as stated in the abstract.

pith-pipeline@v0.9.0 · 5770 in / 1429 out tokens · 73786 ms · 2026-05-22T08:02:25.076447+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order η².

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

nearly-flat directions do not admit a stationary distribution: the variance grows over time

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

233 extracted references · 233 canonical work pages · 18 internal anchors

[1]

Advances in neural information processing systems , volume=

Principles of risk minimization for learning theory , author=. Advances in neural information processing systems , volume=

work page
[2]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[3]

2018 IEEE/ACM 26th international symposium on quality of service (IWQoS) , pages=

Improved adam optimizer for deep neural networks , author=. 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS) , pages=. 2018 , organization=

work page 2018
[5]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Neurocomputing , volume=

Backpropagation and stochastic gradient descent method , author=. Neurocomputing , volume=. 1993 , publisher=

work page 1993
[7]

arXiv preprint arXiv:2306.06101 , year=

Prodigy: An expeditiously adaptive parameter-free learner , author=. arXiv preprint arXiv:2306.06101 , year=

work page arXiv
[8]

SOAP: Improving and Stabilizing Shampoo using Adam

Soap: Improving and stabilizing shampoo using adam , author=. arXiv preprint arXiv:2409.11321 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2105.02470 , year=

Generalized multimodal ELBO , author=. arXiv preprint arXiv:2105.02470 , year=

work page arXiv
[10]

Auto-Encoding Variational Bayes

Auto-encoding variational bayes , author=. arXiv preprint arXiv:1312.6114 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

, author=

On general minimax theorems. , author=

work page
[12]

Proceedings of 2nd Berkeley Symposium , pages=

Proceedings of 2nd berkeley symposium , author=. Proceedings of 2nd Berkeley Symposium , pages=

work page
[13]

Advances in Neural Information Processing Systems , volume=

On the convergence of single-call stochastic extra-gradient methods , author=. Advances in Neural Information Processing Systems , volume=

work page
[14]

Comptes Rendus Hebdomadaires Des Seances De L Academie Des Sciences , volume=

Formes bilineaires coercitives sur les ensembles convexes , author=. Comptes Rendus Hebdomadaires Des Seances De L Academie Des Sciences , volume=. 1964 , publisher=

work page 1964
[15]

Some problems and results in fixed point theory , author=. Contemp. Math , volume=

work page
[16]

Journal of Scientific Computing , volume=

Inertial-type algorithm for solving split common fixed point problems in Banach spaces , author=. Journal of Scientific Computing , volume=. 2021 , publisher=

work page 2021
[17]

1984 , publisher=

Linear and nonlinear programming , author=. 1984 , publisher=

work page 1984
[18]

Proceedings of the National Academy of Sciences , volume=

Existence and approximation of solutions of nonlinear variational inequalities , author=. Proceedings of the National Academy of Sciences , volume=. 1966 , publisher=

work page 1966
[19]

Theory and applications of monotone operators , pages=

Convex functions, monotone operators and variational inequalities , author=. Theory and applications of monotone operators , pages=. 1969 , organization=

work page 1969
[20]

International Journal of Information Management Data Insights , volume=

Generative adversarial network: An overview of theory and applications , author=. International Journal of Information Management Data Insights , volume=. 2021 , publisher=

work page 2021
[21]

International conference on machine learning , pages=

Wasserstein generative adversarial networks , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017
[22]

Communications of the ACM , volume=

Generative adversarial networks , author=. Communications of the ACM , volume=. 2020 , publisher=

work page 2020
[23]

International Conference on Machine Learning , pages=

Deep decentralized multi-task multi-agent reinforcement learning under partial observability , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017
[24]

stat , volume=

Towards deep learning models resistant to adversarial attacks , author=. stat , volume=

work page
[25]

Princeton University Press google schola , volume=

Robust Optimization , author=. Princeton University Press google schola , volume=

work page
[26]

International Conference on Machine Learning , pages=

Efficiently solving MDPs with stochastic mirror descent , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[27]

Mathematical programming , volume=

Smooth minimization of non-smooth functions , author=. Mathematical programming , volume=. 2005 , publisher=

work page 2005
[28]

Convex Sparse Matrix Factorizations

Convex sparse matrix factorizations , author=. arXiv preprint arXiv:0812.1869 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Journal of mathematical imaging and vision , volume=

A first-order primal-dual algorithm for convex problems with applications to imaging , author=. Journal of mathematical imaging and vision , volume=. 2011 , publisher=

work page 2011
[30]

SIAM Journal on Imaging Sciences , volume=

A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science , author=. SIAM Journal on Imaging Sciences , volume=. 2010 , publisher=

work page 2010
[31]

Proceedings of the 22nd international conference on Machine learning , pages=

A support vector method for multivariate performance measures , author=. Proceedings of the 22nd international conference on Machine learning , pages=

work page
[32]

Matecon , volume=

The extragradient method for finding saddle points and other problems , author=. Matecon , volume=

work page
[33]

Journal of Computational and Applied Mathematics , volume=

On linear convergence of iterative methods for the variational inequality problem , author=. Journal of Computational and Applied Mathematics , volume=. 1995 , publisher=

work page 1995
[34]

USSR Computational Mathematics and Mathematical Physics , volume=

Modification of the extra-gradient method for solving variational inequalities and certain optimization problems , author=. USSR Computational Mathematics and Mathematical Physics , volume=. 1987 , publisher=

work page 1987
[35]

International Conference on Artificial Intelligence and Statistics , pages=

A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

work page 2020
[36]

G. M. Korpelevich , title =. Ekonomika Mat. Metody , year =

work page
[37]

Sibony, Mo. M. Calcolo , volume=. 1970 , publisher=

work page 1970
[38]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

work page 1998
[39]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

work page
[40]

Advances in neural information processing systems , volume=

Improved techniques for training gans , author=. Advances in neural information processing systems , volume=

work page
[41]

Advances in neural information processing systems , volume=

Gans trained by a two time-scale update rule converge to a local nash equilibrium , author=. Advances in neural information processing systems , volume=

work page
[42]

A modification of the Arrow-Hurwitz method of search for saddle points , author=. Mat. Zametki , volume=

work page
[43]

arXiv preprint arXiv:1802.10551 , year=

A variational inequality perspective on generative adversarial networks , author=. arXiv preprint arXiv:1802.10551 , year=

work page arXiv
[44]

Stochastic Systems , volume=

Solving variational inequalities with stochastic mirror-prox algorithm , author=. Stochastic Systems , volume=. 2011 , publisher=

work page 2011
[45]

arXiv preprint arXiv:2010.13112 , year=

Distributed saddle-point problems: Lower bounds, near-optimal and robust algorithms , author=. arXiv preprint arXiv:2010.13112 , year=

work page arXiv 2010
[46]

Mathematical Programming , volume=

On lower iteration complexity bounds for the convex concave saddle point problems , author=. Mathematical Programming , volume=. 2022 , publisher=

work page 2022
[47]

Computational Mathematics and Mathematical Physics , volume=

A unified analysis of variational inequality methods: Variance reduction, sampling, quantization, and coordinate descent , author=. Computational Mathematics and Mathematical Physics , volume=. 2023 , publisher=

work page 2023
[48]

SIAM Journal on Optimization , volume=

Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation , author=. SIAM Journal on Optimization , volume=. 2022 , publisher=

work page 2022
[49]

Advances in Neural Information Processing Systems , volume=

Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling , author=. Advances in Neural Information Processing Systems , volume=

work page
[50]

International Conference on Artificial Intelligence and Statistics , pages=

Stochastic extragradient: General analysis and improved rates , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

work page 2022
[51]

International Conference on Artificial Intelligence and Statistics , pages=

Revisiting stochastic extragradient , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

work page 2020
[52]

Wiley statsRef: Statistics reference online , pages=

Variance reduction , author=. Wiley statsRef: Statistics reference online , pages=. 2017 , publisher=

work page 2017
[53]

Pattern Recognition , volume=

Dynamics-aware loss for learning with label noise , author=. Pattern Recognition , volume=. 2023 , publisher=

work page 2023
[54]

arXiv preprint arXiv:2111.05428 , year=

Constrained instance and class reweighting for robust learning under label noise , author=. arXiv preprint arXiv:2111.05428 , year=

work page arXiv
[55]

arXiv preprint arXiv:2211.02556 , year=

Pangu-weather: A 3d high-resolution model for fast and accurate global weather forecast , author=. arXiv preprint arXiv:2211.02556 , year=

work page arXiv
[56]

IEEE Transactions on knowledge and data engineering , volume=

Learning from imbalanced data , author=. IEEE Transactions on knowledge and data engineering , volume=. 2009 , publisher=

work page 2009
[57]

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection , author=. arXiv preprint arXiv:1708.02002 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58] Libra R-CNN: Towards balanced learning for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[59] What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems.
[60] Learning to reweight examples for robust deep learning. International Conference on Machine Learning, 2018.
[61] The Monte Carlo method. Journal of the American Statistical Association, 1949.
[62] Stratified sampling. Elements of Survey Sampling, 1996.
[63] On variance estimation of the inverse probability-of-treatment weighting estimator: A tutorial for different types of propensity score weights. Statistics in Medicine, 2024.
[64] U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015): 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III.
[65] Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems.
[66] On embeddings for numerical features in tabular deep learning. Advances in Neural Information Processing Systems.
[67] TabR: Tabular Deep Learning Meets Nearest Neighbors. The Twelfth International Conference on Learning Representations.
[68] TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks. arXiv preprint arXiv:2406.19380.
[69] TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling. arXiv preprint arXiv:2410.24210.
[70] Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[71] Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[72] On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms. arXiv preprint arXiv:2411.14601.
[73] Optimal algorithm with complexity separation for strongly convex-strongly concave composite saddle point problems. arXiv preprint arXiv:2307.12946.
[74] On accelerated methods for saddle-point problems with composite structure. arXiv preprint arXiv:2103.09344.
[75] New aspects of black box conditional gradient: Variance reduction and one point feedback. Chaos, Solitons & Fractals, 2024.
[76] Methods for Optimization Problems with Markovian Stochasticity and Non-Euclidean Geometry. arXiv preprint arXiv:2408.01848.
[77] Class rectification hard mining for imbalanced deep learning. Proceedings of the IEEE International Conference on Computer Vision.
[78] MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. International Conference on Machine Learning, 2018.
[79] Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1953.
[80] A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
[81] Learning multiple layers of features from tiny images. 2009.

