pith. sign in

arxiv: 2605.18609 · v1 · pith:3EHDOES6new · submitted 2026-05-18 · 💻 cs.LG

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords mini-batch SGDclassical momentumaccelerationparallelizationquadratic optimizationinterpolation regime
0
0 comments X

The pith

Classical momentum acceleration in mini-batch SGD scales proportionally with batch size

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when using classical momentum with mini-batch stochastic gradient descent on quadratic functions in the interpolation regime, the acceleration benefit is directly proportional to the mini-batch size up to a saturation point. This is established under minimal assumptions on the stochastic noise and for both heavy ball and Nesterov momentum. A reader would care because it theoretically justifies the practical success of large-batch training with momentum and indicates how to achieve perfect parallelization of computations. The work also provides a simple effective choice for the momentum parameter.

Core claim

In the interpolation regime for quadratic optimization, the acceleration from classical momentum is directly proportional to the gradient mini-batch size, up to a natural saturation point. This enables perfect parallelization of mini-batch computations. The theory makes minimal assumptions on stochastic noise and applies to arbitrary mini-batch sizes, including special cases like randomized Kaczmarz and coordinate descent.

What carries the argument

The general theory of stochastic momentum acceleration that demonstrates the direct proportionality of acceleration to mini-batch size.

If this is right

  • The acceleration scales linearly with mini-batch size until saturation.
  • This scaling enables perfect parallelization of mini-batch computations.
  • The results hold for arbitrary mini-batch sizes with minimal noise assumptions.
  • A simple momentum parameter choice is effective empirically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the proportionality holds approximately in non-quadratic settings, it could inform batch size selection in deep learning.
  • The saturation point might be characterized further to optimize resource use in parallel training.
  • Connections to other stochastic methods could be explored for similar acceleration properties.

Load-bearing premise

The analysis is limited to quadratic functions in the interpolation regime.

What would settle it

Observing that the effective acceleration does not increase proportionally as the mini-batch size grows in quadratic interpolation problems would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.18609 by Micha{\l} Derezi\'nski, Sachin Garg.

Figure 1
Figure 1. Figure 1: Convergence of block coordinate descent with NAG momentum and baselines. Top [PITH_FULL_IMAGE:figures/full_fig_p014_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Convergence behavior for CDM with different [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Convergence of block coordinate descent with NAG momentum and baselines. Top [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗
read the original abstract

Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a general theory of stochastic momentum acceleration (heavy ball and Nesterov-style) for mini-batch SGD on quadratic objectives in the interpolation regime. Under minimal assumptions on stochastic noise, it establishes that the acceleration factor from classical momentum is directly proportional to the mini-batch size (up to a natural saturation point), which enables perfect parallelization of mini-batch gradient computations. The framework also supplies a simple, effective choice for the momentum parameter, supported by empirical validation.

Significance. If the central proportionality result holds, the work offers a clean theoretical explanation for the empirical success of combining classical momentum with large mini-batches in deep learning training. By restricting to the interpolation regime (a standard abstraction that includes randomized Kaczmarz and coordinate descent), the analysis achieves minimal noise assumptions while still recovering the practical benefit of batch-size-proportional acceleration. This strengthens the case for momentum as a tool for hardware-efficient parallelization rather than merely a heuristic.

major comments (2)
  1. [Main theorem on acceleration proportionality] The proportionality result (central claim) is derived under the interpolation regime where stochastic noise vanishes at the minimizer. While the paper correctly restricts its scope to quadratics, the derivation relies on the specific noise covariance structure induced by this regime (proportional to the Hessian residual). A brief remark on whether the same linear scaling survives when the loss does not reach zero at the optimum would clarify the boundary of applicability.
  2. [Discussion of saturation and parallelization] The saturation point beyond which additional batch size yields no further acceleration is described as 'natural,' yet the precise condition (in terms of problem parameters or noise level) is not stated explicitly. This quantity is load-bearing for the 'perfect parallelization' interpretation, as practitioners need to know when the proportionality ceases to hold.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction would benefit from a one-sentence statement of the precise momentum parameter choice derived in the theory, rather than deferring it entirely to the empirical section.
  2. [Preliminaries] Notation for the mini-batch gradient estimator and the noise term should be introduced once and used consistently; occasional redefinition of symbols across sections slightly reduces readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and constructive comments on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Main theorem on acceleration proportionality] The proportionality result (central claim) is derived under the interpolation regime where stochastic noise vanishes at the minimizer. While the paper correctly restricts its scope to quadratics, the derivation relies on the specific noise covariance structure induced by this regime (proportional to the Hessian residual). A brief remark on whether the same linear scaling survives when the loss does not reach zero at the optimum would clarify the boundary of applicability.

    Authors: We appreciate the referee's suggestion to clarify the boundary of our results. Our analysis is intentionally scoped to the interpolation regime, which permits minimal noise assumptions and encompasses important special cases such as randomized Kaczmarz and coordinate descent. The noise covariance structure (proportional to the Hessian residual) is essential for obtaining the clean batch-size proportionality. In non-interpolation regimes, where stochastic noise does not vanish at the optimum, the same linear scaling would generally require stronger noise assumptions that we deliberately avoided. We will add a concise remark in the introduction and discussion sections noting this scope limitation and the conditions under which the result may fail to extend. revision: partial

  2. Referee: [Discussion of saturation and parallelization] The saturation point beyond which additional batch size yields no further acceleration is described as 'natural,' yet the precise condition (in terms of problem parameters or noise level) is not stated explicitly. This quantity is load-bearing for the 'perfect parallelization' interpretation, as practitioners need to know when the proportionality ceases to hold.

    Authors: We agree that an explicit characterization of the saturation point would improve the practical utility of the 'perfect parallelization' claim. In our derivations, saturation occurs when the mini-batch size exceeds a threshold determined by the momentum parameter, the condition number of the Hessian, and the noise variance; beyond this point the effective acceleration factor no longer increases. We will revise the main theorem statement and add a short paragraph in the discussion to provide the precise condition in terms of these problem parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under quadratic interpolation assumptions

full rationale

The paper develops a theoretical framework for momentum acceleration specifically restricted to quadratics in the interpolation regime with minimal noise assumptions. The claimed direct proportionality of acceleration to mini-batch size follows from analysis of the stochastic dynamics in this model, without any quoted reduction of the central result to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. No equations or steps in the provided abstract or description exhibit the result being equivalent to its inputs by construction. The analysis is presented as first-principles within the stated setting and is externally falsifiable outside the fitted values.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the quadratic interpolation regime and minimal stochastic noise assumptions; no free parameters beyond the simple momentum choice are mentioned, and no new entities are introduced.

free parameters (1)
  • momentum parameter
    Simple choice provided by the theory and shown effective empirically; treated as a tunable but not data-fitted constant in the abstract.
axioms (2)
  • domain assumption Optimization occurs over quadratics in the interpolation regime
    Stated as the setting that includes randomized Kaczmarz and coordinate descent; central to the proportionality result.
  • domain assumption Minimal assumptions on the stochastic noise
    Explicitly contrasted with prior works requiring strong noise assumptions.

pith-pipeline@v0.9.0 · 5713 in / 1242 out tokens · 28518 ms · 2026-05-20T12:21:19.658366+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

  1. [1]

    Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024

    Seth J Alderman, Roan W Luikart, and Nicholas F Marshall. Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024. 15

  2. [2]

    Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018

    Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018

  3. [3]

    Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

    Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025

  4. [4]

    Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018

    Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018

  5. [5]

    On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025

    Raghu Bollapragada, Tyler Chen, and Rachel Ward. On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025

  6. [6]

    A geometric alternative to Nesterov's accelerated gradient descent

    Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to nesterov’s accelerated gradient descent.arXiv preprint arXiv:1506.08187, 2015

  7. [7]

    Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012

    Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012

  8. [8]

    Accelerated linear convergence of stochastic momentum methods in wasserstein distances

    Bugra Can, Mert Gurbuzbalaban, and Lingjiong Zhu. Accelerated linear convergence of stochastic momentum methods in wasserstein distances. InInternational Conference on Machine Learning, pages 891–901. PMLR, 2019

  9. [9]

    Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

    Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kaczmarz and sgd with greedy step size.arXiv preprint arXiv:2604.09909, 2026

  10. [10]

    Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025

    Michal Dereziński, Daniel LeJeune, Deanna Needell, and Elizaveta Rebrova. Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025

  11. [11]

    Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

    Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

  12. [12]

    Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024

    Michał Dereziński and Elizaveta Rebrova. Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024

  13. [13]

    Solving dense linear systems faster than via preconditioning

    Michał Dereziński and Jiaming Yang. Solving dense linear systems faster than via preconditioning. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1118–1129, 2024. 16

  14. [14]

    Bert: Pre- training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computa- tional linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  15. [15]

    From averaging to acceleration, there is only a step-size

    Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. InConference on learning theory, pages 658–695. PMLR, 2015

  16. [16]

    When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023

    Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Nanning Zheng. When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023

  17. [17]

    Stochastic heavy ball

    Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. 2018

  18. [18]

    On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

    Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent.arXiv preprint arXiv:1811.12941, 2018

  19. [19]

    Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018

    Robert Gower, Filip Hanzely, Peter Richtárik, and Sebastian U Stich. Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018

  20. [20]

    Randomized iterative methods for linear systems

    Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

  21. [21]

    Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024

    Kanan Gupta, Jonathan W Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024

  22. [22]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–

  23. [23]

    On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025

    Deren Han and Jiaxin Xie. On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  25. [25]

    Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022

    De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward. Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022. 17

  26. [26]

    Accelerating Stochastic Gradient Descent For Least Squares Regression

    PrateekJain, ShamMKakade, RahulKidambi, PraneethNetrapalli, andAaronSidford. Accelerating stochastic gradient descent.arXiv preprint arXiv:1704.08227, 3:8, 2017

  27. [27]

    Accelerated gradient descent es- capes saddle points faster than gradient descent

    Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent es- capes saddle points faster than gradient descent. InConference on learning theory, pages 1042–1085. PMLR, 2018

  28. [28]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  29. [29]

    On the insuffi- ciency of existing momentum schemes for stochastic optimization

    Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insuffi- ciency of existing momentum schemes for stochastic optimization. In2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018

  30. [30]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  31. [31]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

  32. [32]

    Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022

    Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022

  33. [33]

    Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

    Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

  34. [34]

    Efficient mini-batch train- ing for stochastic optimization

    Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch train- ing for stochastic optimization. InProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670, 2014

  35. [35]

    An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016

    Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016

  36. [36]

    Linearly convergent stochastic heavy ball method for minimizing generalization error

    Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error.arXiv preprint arXiv:1710.10737, 2017

  37. [37]

    Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020

    Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020

  38. [38]

    The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018. 18

  39. [39]

    Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

    Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

  40. [40]

    Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2)

    YuriiNesterov. Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2). InDokl akad nauk Sssr, volume 269, page 543, 1983

  41. [41]

    Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023

    Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023

  42. [42]

    Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

  43. [43]

    Matrix analysis (roger a

    Robert J Plemmons. Matrix analysis (roger a. horn and charles r. johnson), 1988

  44. [44]

    Some methods of speeding up the convergence of iteration methods

    Boris T Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964

  45. [45]

    Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026

    Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michal Derezin- ski, and Madeleine Udell. Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026

  46. [46]

    Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

    Pratik Rathore, Zachary Frangella, Jiaming Yang, Michał Dereziński, and Madeleine Udell. Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024

  47. [47]

    A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

  48. [48]

    Mobilenetv2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

  49. [49]

    A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009

    Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009

  50. [50]

    On the importance ofinitializationandmomentumindeeplearning

    Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance ofinitializationandmomentumindeeplearning. InInternational conference on machine learning, pages 1139–1147. pmlr, 2013

  51. [51]

    Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023

    Kejie Tang, Weidong Liu, Yichen Zhang, and Xi Chen. Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023. 19

  52. [52]

    The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017

    Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017

  53. [53]

    Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

    JoaquinVanschoren, JanNVanRijn, BerndBischl, andLuisTorgo. Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014

  54. [54]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  55. [55]

    SOAP: Improving and Stabilizing Shampoo using Adam

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfon- brener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024

  56. [56]

    The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023

    Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023

  57. [57]

    A Unified Analysis of Stochastic Momentum Methods for Deep Learning

    Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning.arXiv preprint arXiv:1808.10396, 2018

  58. [58]

    On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024

    Yun Zeng, Deren Han, Yansheng Su, and Jiaxin Xie. On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024. A Convergence analysis for expected iterates We start by recalling the matrix transition rule (11), V⊤∆t+1 V⊤∆t = (1 +β)I−(1 +βω)V ⊤ Πt V−β·(I−ωV ⊤ Πt−1 V) I 0 | {z...

  59. [59]

    We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4

    Recall the bound onpt+1 in (26). We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4 . Furthermore,α i(t+ 1−j) = (T⊤ i )t+1−jTt+1−j i , and we know from Lemma 9 that (T ⊤ i )t+1−jTt+1−j i ≤ 3(t+ 1−j) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 ·ρ t−j. Therefore, pt+1 < ρ t+1 + (t+ 1)· 3(t+ 1) 2 + ...

  60. [60]

    In particular,f(x)< 3 2 · (1−β)2 (1+β)2

    Over the set of allx≥1satisfying1− 1 x ≤f(x), the functionf(x)is maximized at x= 1. In particular,f(x)< 3 2 · (1−β)2 (1+β)2

  61. [61]

    Over the set of allx≥1satisfying1− 1 x = (1−βx)2 (1+βx)2, we have (1−βx)2 (1+βx)2 ≤ (1−β)2 (1+β)2

  62. [62]

    Over the set of allx≥1satisfying1− 1 x < (1−βx)2 (1+βx)2,inf x g(x)> 4 5 · (1−β)2 (1+β)2

  63. [63]

    Over all the set of allx≥1satisfying1− 1 x < g(x),inf x (1−βx)2 (1+βx)2 > 1 2 · (1−β)2 (1+β)2. C Additional experiments In this section, we provide additional details about our experimental setup, and include numerical results on two datasets to complement the results in Section 5. Finally, we also explore the dependence of classical momentum on theωparam...