Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

Micha{\l} Derezi\'nski; Sachin Garg

arxiv: 2605.18609 · v1 · pith:3EHDOES6new · submitted 2026-05-18 · 💻 cs.LG

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

Sachin Garg , Micha{\l} Derezi\'nski This is my paper

Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords mini-batch SGDclassical momentumaccelerationparallelizationquadratic optimizationinterpolation regime

0 comments

The pith

Classical momentum acceleration in mini-batch SGD scales proportionally with batch size

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when using classical momentum with mini-batch stochastic gradient descent on quadratic functions in the interpolation regime, the acceleration benefit is directly proportional to the mini-batch size up to a saturation point. This is established under minimal assumptions on the stochastic noise and for both heavy ball and Nesterov momentum. A reader would care because it theoretically justifies the practical success of large-batch training with momentum and indicates how to achieve perfect parallelization of computations. The work also provides a simple effective choice for the momentum parameter.

Core claim

In the interpolation regime for quadratic optimization, the acceleration from classical momentum is directly proportional to the gradient mini-batch size, up to a natural saturation point. This enables perfect parallelization of mini-batch computations. The theory makes minimal assumptions on stochastic noise and applies to arbitrary mini-batch sizes, including special cases like randomized Kaczmarz and coordinate descent.

What carries the argument

The general theory of stochastic momentum acceleration that demonstrates the direct proportionality of acceleration to mini-batch size.

If this is right

The acceleration scales linearly with mini-batch size until saturation.
This scaling enables perfect parallelization of mini-batch computations.
The results hold for arbitrary mini-batch sizes with minimal noise assumptions.
A simple momentum parameter choice is effective empirically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the proportionality holds approximately in non-quadratic settings, it could inform batch size selection in deep learning.
The saturation point might be characterized further to optimize resource use in parallel training.
Connections to other stochastic methods could be explored for similar acceleration properties.

Load-bearing premise

The analysis is limited to quadratic functions in the interpolation regime.

What would settle it

Observing that the effective acceleration does not increase proportionally as the mini-batch size grows in quadratic interpolation problems would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.18609 by Micha{\l} Derezi\'nski, Sachin Garg.

**Figure 2.** Figure 2: Convergence behavior for CDM with different [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗

**Figure 3.** Figure 3: Convergence of block coordinate descent with NAG momentum and baselines. Top [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗

read the original abstract

Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows classical momentum acceleration scales directly with mini-batch size for quadratics in the interpolation regime under minimal noise, enabling the parallelization claim there.

read the letter

The core result is that for quadratic objectives in the interpolation regime, the acceleration from heavy-ball or Nesterov momentum grows proportionally with the mini-batch size up to saturation. This is the punchline the authors deliver cleanly from the abstract onward. They build a framework that covers arbitrary batch sizes without the extreme-batch or strong-noise requirements of earlier analyses, and they connect it to methods like randomized Kaczmarz and coordinate descent. The simple momentum-parameter rule they suggest also gets some empirical support in the work. That combination is the useful piece: a more general theory than prior stochastic momentum papers under the stated assumptions. The math appears to derive the bound from the quadratic interpolation setup rather than reverse-engineering it, which keeps the circularity low. No machine-checked proof is mentioned, but the derivation is presented as parameter-free within the regime. The main limitation is exactly the one the stress-test flags. Everything rests on the quadratic geometry plus noise that vanishes at the optimum. Once the loss is non-quadratic or the interpolation condition fails, the noise covariance no longer decouples the same way and the direct proportionality can disappear. The paper is upfront about working only in this setting, so it does not overclaim, but that still caps how far the parallelization insight travels to real deep-learning losses. Readers working on theoretical SGD analysis or on scaling rules for distributed training will find the clean proportionality result worth their time. It is narrow enough that most practitioners will treat it as a special case rather than a general recipe. The work shows clear engagement with the literature and honest scope, so it deserves a serious referee. I would send it out for review rather than desk-reject, expecting the referees to press on extensions beyond quadratics.

Referee Report

2 major / 2 minor

Summary. The paper develops a general theory of stochastic momentum acceleration (heavy ball and Nesterov-style) for mini-batch SGD on quadratic objectives in the interpolation regime. Under minimal assumptions on stochastic noise, it establishes that the acceleration factor from classical momentum is directly proportional to the mini-batch size (up to a natural saturation point), which enables perfect parallelization of mini-batch gradient computations. The framework also supplies a simple, effective choice for the momentum parameter, supported by empirical validation.

Significance. If the central proportionality result holds, the work offers a clean theoretical explanation for the empirical success of combining classical momentum with large mini-batches in deep learning training. By restricting to the interpolation regime (a standard abstraction that includes randomized Kaczmarz and coordinate descent), the analysis achieves minimal noise assumptions while still recovering the practical benefit of batch-size-proportional acceleration. This strengthens the case for momentum as a tool for hardware-efficient parallelization rather than merely a heuristic.

major comments (2)

[Main theorem on acceleration proportionality] The proportionality result (central claim) is derived under the interpolation regime where stochastic noise vanishes at the minimizer. While the paper correctly restricts its scope to quadratics, the derivation relies on the specific noise covariance structure induced by this regime (proportional to the Hessian residual). A brief remark on whether the same linear scaling survives when the loss does not reach zero at the optimum would clarify the boundary of applicability.
[Discussion of saturation and parallelization] The saturation point beyond which additional batch size yields no further acceleration is described as 'natural,' yet the precise condition (in terms of problem parameters or noise level) is not stated explicitly. This quantity is load-bearing for the 'perfect parallelization' interpretation, as practitioners need to know when the proportionality ceases to hold.

minor comments (2)

[Abstract and §1] The abstract and introduction would benefit from a one-sentence statement of the precise momentum parameter choice derived in the theory, rather than deferring it entirely to the empirical section.
[Preliminaries] Notation for the mini-batch gradient estimator and the noise term should be introduced once and used consistently; occasional redefinition of symbols across sections slightly reduces readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and constructive comments on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Main theorem on acceleration proportionality] The proportionality result (central claim) is derived under the interpolation regime where stochastic noise vanishes at the minimizer. While the paper correctly restricts its scope to quadratics, the derivation relies on the specific noise covariance structure induced by this regime (proportional to the Hessian residual). A brief remark on whether the same linear scaling survives when the loss does not reach zero at the optimum would clarify the boundary of applicability.

Authors: We appreciate the referee's suggestion to clarify the boundary of our results. Our analysis is intentionally scoped to the interpolation regime, which permits minimal noise assumptions and encompasses important special cases such as randomized Kaczmarz and coordinate descent. The noise covariance structure (proportional to the Hessian residual) is essential for obtaining the clean batch-size proportionality. In non-interpolation regimes, where stochastic noise does not vanish at the optimum, the same linear scaling would generally require stronger noise assumptions that we deliberately avoided. We will add a concise remark in the introduction and discussion sections noting this scope limitation and the conditions under which the result may fail to extend. revision: partial
Referee: [Discussion of saturation and parallelization] The saturation point beyond which additional batch size yields no further acceleration is described as 'natural,' yet the precise condition (in terms of problem parameters or noise level) is not stated explicitly. This quantity is load-bearing for the 'perfect parallelization' interpretation, as practitioners need to know when the proportionality ceases to hold.

Authors: We agree that an explicit characterization of the saturation point would improve the practical utility of the 'perfect parallelization' claim. In our derivations, saturation occurs when the mini-batch size exceeds a threshold determined by the momentum parameter, the condition number of the Hessian, and the noise variance; beyond this point the effective acceleration factor no longer increases. We will revise the main theorem statement and add a short paragraph in the discussion to provide the precise condition in terms of these problem parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under quadratic interpolation assumptions

full rationale

The paper develops a theoretical framework for momentum acceleration specifically restricted to quadratics in the interpolation regime with minimal noise assumptions. The claimed direct proportionality of acceleration to mini-batch size follows from analysis of the stochastic dynamics in this model, without any quoted reduction of the central result to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. No equations or steps in the provided abstract or description exhibit the result being equivalent to its inputs by construction. The analysis is presented as first-principles within the stated setting and is externally falsifiable outside the fitted values.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quadratic interpolation regime and minimal stochastic noise assumptions; no free parameters beyond the simple momentum choice are mentioned, and no new entities are introduced.

free parameters (1)

momentum parameter
Simple choice provided by the theory and shown effective empirically; treated as a tunable but not data-fitted constant in the abstract.

axioms (2)

domain assumption Optimization occurs over quadratics in the interpolation regime
Stated as the setting that includes randomized Kaczmarz and coordinate descent; central to the proportionality result.
domain assumption Minimal assumptions on the stochastic noise
Explicitly contrasted with prior works requiring strong noise assumptions.

pith-pipeline@v0.9.0 · 5713 in / 1242 out tokens · 28518 ms · 2026-05-20T12:21:19.658366+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

[1]

Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024

Seth J Alderman, Roan W Luikart, and Nicholas F Marshall. Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024. 15

work page 2024
[2]

Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018

work page 2018
[3]

Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

work page arXiv 2025
[4]

Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018

Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018

work page 2018
[5]

On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025

Raghu Bollapragada, Tyler Chen, and Rachel Ward. On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025

work page 2025
[6]

A geometric alternative to Nesterov's accelerated gradient descent

Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to nesterov’s accelerated gradient descent.arXiv preprint arXiv:1506.08187, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012

Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012

work page 2012
[8]

Accelerated linear convergence of stochastic momentum methods in wasserstein distances

Bugra Can, Mert Gurbuzbalaban, and Lingjiong Zhu. Accelerated linear convergence of stochastic momentum methods in wasserstein distances. InInternational Conference on Machine Learning, pages 891–901. PMLR, 2019

work page 2019
[9]

Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kaczmarz and sgd with greedy step size.arXiv preprint arXiv:2604.09909, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025

Michal Dereziński, Daniel LeJeune, Deanna Needell, and Elizaveta Rebrova. Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025

work page 2025
[11]

Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

work page 2025
[12]

Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024

Michał Dereziński and Elizaveta Rebrova. Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024

work page 2024
[13]

Solving dense linear systems faster than via preconditioning

Michał Dereziński and Jiaming Yang. Solving dense linear systems faster than via preconditioning. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1118–1129, 2024. 16

work page 2024
[14]

Bert: Pre- training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computa- tional linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019
[15]

From averaging to acceleration, there is only a step-size

Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. InConference on learning theory, pages 658–695. PMLR, 2015

work page 2015
[16]

When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023

Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Nanning Zheng. When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023

work page arXiv 2023
[17]

Stochastic heavy ball

Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. 2018

work page 2018
[18]

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent.arXiv preprint arXiv:1811.12941, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018

Robert Gower, Filip Hanzely, Peter Richtárik, and Sebastian U Stich. Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[20]

Randomized iterative methods for linear systems

Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

work page 2015
[21]

Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024

Kanan Gupta, Jonathan W Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024

work page 2024
[22]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–

work page
[23]

On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025

Deren Han and Jiaxin Xie. On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025

work page 2025
[24]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[25]

Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022

De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward. Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022. 17

work page 2022
[26]

Accelerating Stochastic Gradient Descent For Least Squares Regression

PrateekJain, ShamMKakade, RahulKidambi, PraneethNetrapalli, andAaronSidford. Accelerating stochastic gradient descent.arXiv preprint arXiv:1704.08227, 3:8, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Accelerated gradient descent es- capes saddle points faster than gradient descent

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent es- capes saddle points faster than gradient descent. InConference on learning theory, pages 1042–1085. PMLR, 2018

work page 2018
[28]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

work page 2024
[29]

On the insuffi- ciency of existing momentum schemes for stochastic optimization

Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insuffi- ciency of existing momentum schemes for stochastic optimization. In2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018

work page 2018
[30]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[31]

Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

work page 2012
[32]

Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022

Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022

work page 2022
[33]

Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

work page 2010
[34]

Efficient mini-batch train- ing for stochastic optimization

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch train- ing for stochastic optimization. InProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670, 2014

work page 2014
[35]

An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016

Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016

work page 2016
[36]

Linearly convergent stochastic heavy ball method for minimizing generalization error

Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error.arXiv preprint arXiv:1710.10737, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020

Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020

work page 2020
[38]

The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018. 18

work page 2018
[39]

Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

work page 2014
[40]

Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2)

YuriiNesterov. Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2). InDokl akad nauk Sssr, volume 269, page 543, 1983

work page 1983
[41]

Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023

Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023

work page arXiv 2023
[42]

Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

work page 2011
[43]

Matrix analysis (roger a

Robert J Plemmons. Matrix analysis (roger a. horn and charles r. johnson), 1988

work page 1988
[44]

Some methods of speeding up the convergence of iteration methods

Boris T Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964

work page 1964
[45]

Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026

Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michal Derezin- ski, and Madeleine Udell. Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026

work page 2026
[46]

Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

Pratik Rathore, Zachary Frangella, Jiaming Yang, Michał Dereziński, and Madeleine Udell. Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

work page arXiv 2024
[47]

A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

work page 1951
[48]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

work page 2018
[49]

A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009

Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009

work page 2009
[50]

On the importance ofinitializationandmomentumindeeplearning

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance ofinitializationandmomentumindeeplearning. InInternational conference on machine learning, pages 1139–1147. pmlr, 2013

work page 2013
[51]

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023

Kejie Tang, Weidong Liu, Yichen Zhang, and Xi Chen. Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023. 19

work page arXiv 2023
[52]

The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017

Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017

work page 2017
[53]

Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

JoaquinVanschoren, JanNVanRijn, BerndBischl, andLuisTorgo. Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

work page 2014
[54]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[55]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfon- brener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023

work page arXiv 2023
[57]

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning.arXiv preprint arXiv:1808.10396, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[58]

On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024

Yun Zeng, Deren Han, Yansheng Su, and Jiaxin Xie. On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024. A Convergence analysis for expected iterates We start by recalling the matrix transition rule (11), V⊤∆t+1 V⊤∆t = (1 +β)I−(1 +βω)V ⊤ Πt V−β·(I−ωV ⊤ Πt−1 V) I 0 | {z...

work page 2024
[59]

We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4

Recall the bound onpt+1 in (26). We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4 . Furthermore,α i(t+ 1−j) = (T⊤ i )t+1−jTt+1−j i , and we know from Lemma 9 that (T ⊤ i )t+1−jTt+1−j i ≤ 3(t+ 1−j) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 ·ρ t−j. Therefore, pt+1 < ρ t+1 + (t+ 1)· 3(t+ 1) 2 + ...

work page
[60]

In particular,f(x)< 3 2 · (1−β)2 (1+β)2

Over the set of allx≥1satisfying1− 1 x ≤f(x), the functionf(x)is maximized at x= 1. In particular,f(x)< 3 2 · (1−β)2 (1+β)2

work page
[61]

Over the set of allx≥1satisfying1− 1 x = (1−βx)2 (1+βx)2, we have (1−βx)2 (1+βx)2 ≤ (1−β)2 (1+β)2

work page
[62]

Over the set of allx≥1satisfying1− 1 x < (1−βx)2 (1+βx)2,inf x g(x)> 4 5 · (1−β)2 (1+β)2

work page
[63]

Over all the set of allx≥1satisfying1− 1 x < g(x),inf x (1−βx)2 (1+βx)2 > 1 2 · (1−β)2 (1+β)2. C Additional experiments In this section, we provide additional details about our experimental setup, and include numerical results on two datasets to complement the results in Section 5. Finally, we also explore the dependence of classical momentum on theωparam...

work page

[1] [1]

Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024

Seth J Alderman, Roan W Luikart, and Nicholas F Marshall. Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024. 15

work page 2024

[2] [2]

Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018

work page 2018

[3] [3]

Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

work page arXiv 2025

[4] [4]

Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018

Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018

work page 2018

[5] [5]

On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025

Raghu Bollapragada, Tyler Chen, and Rachel Ward. On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025

work page 2025

[6] [6]

A geometric alternative to Nesterov's accelerated gradient descent

Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to nesterov’s accelerated gradient descent.arXiv preprint arXiv:1506.08187, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012

Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012

work page 2012

[8] [8]

Accelerated linear convergence of stochastic momentum methods in wasserstein distances

Bugra Can, Mert Gurbuzbalaban, and Lingjiong Zhu. Accelerated linear convergence of stochastic momentum methods in wasserstein distances. InInternational Conference on Machine Learning, pages 891–901. PMLR, 2019

work page 2019

[9] [9]

Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kaczmarz and sgd with greedy step size.arXiv preprint arXiv:2604.09909, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025

Michal Dereziński, Daniel LeJeune, Deanna Needell, and Elizaveta Rebrova. Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025

work page 2025

[11] [11]

Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

work page 2025

[12] [12]

Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024

Michał Dereziński and Elizaveta Rebrova. Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024

work page 2024

[13] [13]

Solving dense linear systems faster than via preconditioning

Michał Dereziński and Jiaming Yang. Solving dense linear systems faster than via preconditioning. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1118–1129, 2024. 16

work page 2024

[14] [14]

Bert: Pre- training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computa- tional linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019

[15] [15]

From averaging to acceleration, there is only a step-size

Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. InConference on learning theory, pages 658–695. PMLR, 2015

work page 2015

[16] [16]

When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023

Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Nanning Zheng. When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023

work page arXiv 2023

[17] [17]

Stochastic heavy ball

Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. 2018

work page 2018

[18] [18]

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent.arXiv preprint arXiv:1811.12941, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[19] [19]

Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018

Robert Gower, Filip Hanzely, Peter Richtárik, and Sebastian U Stich. Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[20] [20]

Randomized iterative methods for linear systems

Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

work page 2015

[21] [21]

Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024

Kanan Gupta, Jonathan W Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024

work page 2024

[22] [22]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–

work page

[23] [23]

On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025

Deren Han and Jiaxin Xie. On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025

work page 2025

[24] [24]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[25] [25]

Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022

De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward. Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022. 17

work page 2022

[26] [26]

Accelerating Stochastic Gradient Descent For Least Squares Regression

PrateekJain, ShamMKakade, RahulKidambi, PraneethNetrapalli, andAaronSidford. Accelerating stochastic gradient descent.arXiv preprint arXiv:1704.08227, 3:8, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[27] [27]

Accelerated gradient descent es- capes saddle points faster than gradient descent

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent es- capes saddle points faster than gradient descent. InConference on learning theory, pages 1042–1085. PMLR, 2018

work page 2018

[28] [28]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

work page 2024

[29] [29]

On the insuffi- ciency of existing momentum schemes for stochastic optimization

Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insuffi- ciency of existing momentum schemes for stochastic optimization. In2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018

work page 2018

[30] [30]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[31] [31]

Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

work page 2012

[32] [32]

Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022

Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022

work page 2022

[33] [33]

Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

work page 2010

[34] [34]

Efficient mini-batch train- ing for stochastic optimization

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch train- ing for stochastic optimization. InProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670, 2014

work page 2014

[35] [35]

An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016

Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016

work page 2016

[36] [36]

Linearly convergent stochastic heavy ball method for minimizing generalization error

Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error.arXiv preprint arXiv:1710.10737, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[37] [37]

Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020

Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020

work page 2020

[38] [38]

The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018. 18

work page 2018

[39] [39]

Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

work page 2014

[40] [40]

Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2)

YuriiNesterov. Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2). InDokl akad nauk Sssr, volume 269, page 543, 1983

work page 1983

[41] [41]

Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023

Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023

work page arXiv 2023

[42] [42]

Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

work page 2011

[43] [43]

Matrix analysis (roger a

Robert J Plemmons. Matrix analysis (roger a. horn and charles r. johnson), 1988

work page 1988

[44] [44]

Some methods of speeding up the convergence of iteration methods

Boris T Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964

work page 1964

[45] [45]

Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026

Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michal Derezin- ski, and Madeleine Udell. Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026

work page 2026

[46] [46]

Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

Pratik Rathore, Zachary Frangella, Jiaming Yang, Michał Dereziński, and Madeleine Udell. Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

work page arXiv 2024

[47] [47]

A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

work page 1951

[48] [48]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

work page 2018

[49] [49]

A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009

Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009

work page 2009

[50] [50]

On the importance ofinitializationandmomentumindeeplearning

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance ofinitializationandmomentumindeeplearning. InInternational conference on machine learning, pages 1139–1147. pmlr, 2013

work page 2013

[51] [51]

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023

Kejie Tang, Weidong Liu, Yichen Zhang, and Xi Chen. Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023. 19

work page arXiv 2023

[52] [52]

The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017

Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017

work page 2017

[53] [53]

Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

JoaquinVanschoren, JanNVanRijn, BerndBischl, andLuisTorgo. Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

work page 2014

[54] [54]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017

[55] [55]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfon- brener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [56]

The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023

work page arXiv 2023

[57] [57]

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning.arXiv preprint arXiv:1808.10396, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[58] [58]

On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024

Yun Zeng, Deren Han, Yansheng Su, and Jiaxin Xie. On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024. A Convergence analysis for expected iterates We start by recalling the matrix transition rule (11), V⊤∆t+1 V⊤∆t = (1 +β)I−(1 +βω)V ⊤ Πt V−β·(I−ωV ⊤ Πt−1 V) I 0 | {z...

work page 2024

[59] [59]

We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4

Recall the bound onpt+1 in (26). We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4 . Furthermore,α i(t+ 1−j) = (T⊤ i )t+1−jTt+1−j i , and we know from Lemma 9 that (T ⊤ i )t+1−jTt+1−j i ≤ 3(t+ 1−j) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 ·ρ t−j. Therefore, pt+1 < ρ t+1 + (t+ 1)· 3(t+ 1) 2 + ...

work page

[60] [60]

In particular,f(x)< 3 2 · (1−β)2 (1+β)2

Over the set of allx≥1satisfying1− 1 x ≤f(x), the functionf(x)is maximized at x= 1. In particular,f(x)< 3 2 · (1−β)2 (1+β)2

work page

[61] [61]

Over the set of allx≥1satisfying1− 1 x = (1−βx)2 (1+βx)2, we have (1−βx)2 (1+βx)2 ≤ (1−β)2 (1+β)2

work page

[62] [62]

Over the set of allx≥1satisfying1− 1 x < (1−βx)2 (1+βx)2,inf x g(x)> 4 5 · (1−β)2 (1+β)2

work page

[63] [63]

Over all the set of allx≥1satisfying1− 1 x < g(x),inf x (1−βx)2 (1+βx)2 > 1 2 · (1−β)2 (1+β)2. C Additional experiments In this section, we provide additional details about our experimental setup, and include numerical results on two datasets to complement the results in Section 5. Finally, we also explore the dependence of classical momentum on theωparam...

work page