Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration
Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3
The pith
Classical momentum acceleration in mini-batch SGD scales proportionally with batch size
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the interpolation regime for quadratic optimization, the acceleration from classical momentum is directly proportional to the gradient mini-batch size, up to a natural saturation point. This enables perfect parallelization of mini-batch computations. The theory makes minimal assumptions on stochastic noise and applies to arbitrary mini-batch sizes, including special cases like randomized Kaczmarz and coordinate descent.
What carries the argument
The general theory of stochastic momentum acceleration that demonstrates the direct proportionality of acceleration to mini-batch size.
If this is right
- The acceleration scales linearly with mini-batch size until saturation.
- This scaling enables perfect parallelization of mini-batch computations.
- The results hold for arbitrary mini-batch sizes with minimal noise assumptions.
- A simple momentum parameter choice is effective empirically.
Where Pith is reading between the lines
- If the proportionality holds approximately in non-quadratic settings, it could inform batch size selection in deep learning.
- The saturation point might be characterized further to optimize resource use in parallel training.
- Connections to other stochastic methods could be explored for similar acceleration properties.
Load-bearing premise
The analysis is limited to quadratic functions in the interpolation regime.
What would settle it
Observing that the effective acceleration does not increase proportionally as the mini-batch size grows in quadratic interpolation problems would falsify the claim.
Figures
read the original abstract
Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a general theory of stochastic momentum acceleration (heavy ball and Nesterov-style) for mini-batch SGD on quadratic objectives in the interpolation regime. Under minimal assumptions on stochastic noise, it establishes that the acceleration factor from classical momentum is directly proportional to the mini-batch size (up to a natural saturation point), which enables perfect parallelization of mini-batch gradient computations. The framework also supplies a simple, effective choice for the momentum parameter, supported by empirical validation.
Significance. If the central proportionality result holds, the work offers a clean theoretical explanation for the empirical success of combining classical momentum with large mini-batches in deep learning training. By restricting to the interpolation regime (a standard abstraction that includes randomized Kaczmarz and coordinate descent), the analysis achieves minimal noise assumptions while still recovering the practical benefit of batch-size-proportional acceleration. This strengthens the case for momentum as a tool for hardware-efficient parallelization rather than merely a heuristic.
major comments (2)
- [Main theorem on acceleration proportionality] The proportionality result (central claim) is derived under the interpolation regime where stochastic noise vanishes at the minimizer. While the paper correctly restricts its scope to quadratics, the derivation relies on the specific noise covariance structure induced by this regime (proportional to the Hessian residual). A brief remark on whether the same linear scaling survives when the loss does not reach zero at the optimum would clarify the boundary of applicability.
- [Discussion of saturation and parallelization] The saturation point beyond which additional batch size yields no further acceleration is described as 'natural,' yet the precise condition (in terms of problem parameters or noise level) is not stated explicitly. This quantity is load-bearing for the 'perfect parallelization' interpretation, as practitioners need to know when the proportionality ceases to hold.
minor comments (2)
- [Abstract and §1] The abstract and introduction would benefit from a one-sentence statement of the precise momentum parameter choice derived in the theory, rather than deferring it entirely to the empirical section.
- [Preliminaries] Notation for the mini-batch gradient estimator and the noise term should be introduced once and used consistently; occasional redefinition of symbols across sections slightly reduces readability.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive comments on our manuscript. We address each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Main theorem on acceleration proportionality] The proportionality result (central claim) is derived under the interpolation regime where stochastic noise vanishes at the minimizer. While the paper correctly restricts its scope to quadratics, the derivation relies on the specific noise covariance structure induced by this regime (proportional to the Hessian residual). A brief remark on whether the same linear scaling survives when the loss does not reach zero at the optimum would clarify the boundary of applicability.
Authors: We appreciate the referee's suggestion to clarify the boundary of our results. Our analysis is intentionally scoped to the interpolation regime, which permits minimal noise assumptions and encompasses important special cases such as randomized Kaczmarz and coordinate descent. The noise covariance structure (proportional to the Hessian residual) is essential for obtaining the clean batch-size proportionality. In non-interpolation regimes, where stochastic noise does not vanish at the optimum, the same linear scaling would generally require stronger noise assumptions that we deliberately avoided. We will add a concise remark in the introduction and discussion sections noting this scope limitation and the conditions under which the result may fail to extend. revision: partial
-
Referee: [Discussion of saturation and parallelization] The saturation point beyond which additional batch size yields no further acceleration is described as 'natural,' yet the precise condition (in terms of problem parameters or noise level) is not stated explicitly. This quantity is load-bearing for the 'perfect parallelization' interpretation, as practitioners need to know when the proportionality ceases to hold.
Authors: We agree that an explicit characterization of the saturation point would improve the practical utility of the 'perfect parallelization' claim. In our derivations, saturation occurs when the mini-batch size exceeds a threshold determined by the momentum parameter, the condition number of the Hessian, and the noise variance; beyond this point the effective acceleration factor no longer increases. We will revise the main theorem statement and add a short paragraph in the discussion to provide the precise condition in terms of these problem parameters. revision: yes
Circularity Check
No significant circularity; derivation self-contained under quadratic interpolation assumptions
full rationale
The paper develops a theoretical framework for momentum acceleration specifically restricted to quadratics in the interpolation regime with minimal noise assumptions. The claimed direct proportionality of acceleration to mini-batch size follows from analysis of the stochastic dynamics in this model, without any quoted reduction of the central result to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. No equations or steps in the provided abstract or description exhibit the result being equivalent to its inputs by construction. The analysis is presented as first-principles within the stated setting and is externally falsifiable outside the fitted values.
Axiom & Free-Parameter Ledger
free parameters (1)
- momentum parameter
axioms (2)
- domain assumption Optimization occurs over quadratics in the interpolation regime
- domain assumption Minimal assumptions on the stochastic noise
Reference graph
Works this paper leans on
-
[1]
Seth J Alderman, Roan W Luikart, and Nicholas F Marshall. Randomized kaczmarz with geometrically smoothed momentum.SIAM Journal on Matrix Analysis and Ap- plications, 45(4):2287–2313, 2024. 15
work page 2024
-
[2]
Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient meth- ods.Journal of Machine Learning Research, 18(221):1–51, 2018
work page 2018
-
[3]
Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergenceofsgdinthesmoothinterpolationregime.arXiv preprint arXiv:2507.11274, 2025
-
[4]
Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate.Advances in neural in- formation processing systems, 31, 2018
work page 2018
-
[5]
Raghu Bollapragada, Tyler Chen, and Rachel Ward. On the fast convergence of mini- batch heavy ball momentum.IMA Journal of Numerical Analysis, 45(3):1397–1424, 2025
work page 2025
-
[6]
A geometric alternative to Nesterov's accelerated gradient descent
Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to nesterov’s accelerated gradient descent.arXiv preprint arXiv:1506.08187, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[7]
Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning.Mathematical programming, 134(1):127– 155, 2012
work page 2012
-
[8]
Accelerated linear convergence of stochastic momentum methods in wasserstein distances
Bugra Can, Mert Gurbuzbalaban, and Lingjiong Zhu. Accelerated linear convergence of stochastic momentum methods in wasserstein distances. InInternational Conference on Machine Learning, pages 891–901. PMLR, 2019
work page 2019
-
[9]
Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size
Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kaczmarz and sgd with greedy step size.arXiv preprint arXiv:2604.09909, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Michal Dereziński, Daniel LeJeune, Deanna Needell, and Elizaveta Rebrova. Fine- grained analysis and faster algorithms for iteratively solving linear systems.Journal of Machine Learning Research, 26(144):1–49, 2025
work page 2025
-
[11]
Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Randomized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025
work page 2025
-
[12]
Michał Dereziński and Elizaveta Rebrova. Sharp analysis of sketch-and-project meth- ods via a connection to randomized singular value decomposition.SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024
work page 2024
-
[13]
Solving dense linear systems faster than via preconditioning
Michał Dereziński and Jiaming Yang. Solving dense linear systems faster than via preconditioning. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1118–1129, 2024. 16
work page 2024
-
[14]
Bert: Pre- training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computa- tional linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019
work page 2019
-
[15]
From averaging to acceleration, there is only a step-size
Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. InConference on learning theory, pages 658–695. PMLR, 2015
work page 2015
-
[16]
When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023
Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Nanning Zheng. When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000, 2023
-
[17]
Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. 2018
work page 2018
-
[18]
On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent.arXiv preprint arXiv:1811.12941, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[19]
Robert Gower, Filip Hanzely, Peter Richtárik, and Sebastian U Stich. Accelerated stochastic matrix inversion: general theory and speeding up bfgs rules for faster second- order optimization.Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[20]
Randomized iterative methods for linear systems
Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015
work page 2015
-
[21]
Kanan Gupta, Jonathan W Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients.Advances in Neural Information Processing Systems, 37:20694–20744, 2024
work page 2024
-
[22]
Shampoo: Preconditioned stochastic tensor optimization
Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–
-
[23]
Deren Han and Jiaxin Xie. On pseudoinverse-free randomized methods for linear sys- tems: Unified framework and acceleration.Optimization Methods and Software, pages 1–36, 2025
work page 2025
-
[24]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[25]
Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022
De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward. Matrix concentration for products.Foundations of Computational Mathematics, 22(6):1767–1799, 2022. 17
work page 2022
-
[26]
Accelerating Stochastic Gradient Descent For Least Squares Regression
PrateekJain, ShamMKakade, RahulKidambi, PraneethNetrapalli, andAaronSidford. Accelerating stochastic gradient descent.arXiv preprint arXiv:1704.08227, 3:8, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
Accelerated gradient descent es- capes saddle points faster than gradient descent
Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent es- capes saddle points faster than gradient descent. InConference on learning theory, pages 1042–1085. PMLR, 2018
work page 2018
-
[28]
Muon: An optimizer for hidden layers in neural networks, 2024
Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024
work page 2024
-
[29]
On the insuffi- ciency of existing momentum schemes for stochastic optimization
Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insuffi- ciency of existing momentum schemes for stochastic optimization. In2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018
work page 2018
-
[30]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[31]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012
work page 2012
-
[32]
Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini- batch momentum: batch size saturation and convergence in high dimensions.Advances in Neural Information Processing Systems, 35:36944–36957, 2022
work page 2022
-
[33]
Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010
work page 2010
-
[34]
Efficient mini-batch train- ing for stochastic optimization
Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch train- ing for stochastic optimization. InProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670, 2014
work page 2014
-
[35]
An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016
Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Mathe- matics of Computation, 85(297):153–178, 2016
work page 2016
-
[36]
Linearly convergent stochastic heavy ball method for minimizing generalization error
Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error.arXiv preprint arXiv:1710.10737, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[37]
Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods.Computational Optimization and Applications, 77(3):653–710, 2020
work page 2020
-
[38]
Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understand- ing the effectiveness of sgd in modern over-parametrized learning. InInternational Conference on Machine Learning, pages 3325–3334. PMLR, 2018. 18
work page 2018
-
[39]
Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a randomized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014
work page 2014
-
[40]
Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2)
YuriiNesterov. Amethodforsolvingtheconvexprogrammingproblemwithconvergence rate o (1/k2). InDokl akad nauk Sssr, volume 269, page 543, 1983
work page 1983
-
[41]
Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567, 2023
-
[42]
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011
work page 2011
-
[43]
Robert J Plemmons. Matrix analysis (roger a. horn and charles r. johnson), 1988
work page 1988
-
[44]
Some methods of speeding up the convergence of iteration methods
Boris T Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964
work page 1964
-
[45]
Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michal Derezin- ski, and Madeleine Udell. Turbocharging gaussian process inference with approximate sketch-and-project.Advances in Neural Information Processing Systems, 38:141691– 141728, 2026
work page 2026
-
[46]
Pratik Rathore, Zachary Frangella, Jiaming Yang, Michał Dereziński, and Madeleine Udell. Have askotch: A neat solution for large-scale kernel ridge regression.arXiv preprint arXiv:2407.10070, 2024
-
[47]
A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951
work page 1951
-
[48]
Mobilenetv2: Inverted residuals and linear bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018
work page 2018
-
[49]
Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with ex- ponential convergence.Journal of Fourier Analysis and Applications, 15(2):262–278, 2009
work page 2009
-
[50]
On the importance ofinitializationandmomentumindeeplearning
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance ofinitializationandmomentumindeeplearning. InInternational conference on machine learning, pages 1139–1147. pmlr, 2013
work page 2013
-
[51]
Kejie Tang, Weidong Liu, Yichen Zhang, and Xi Chen. Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665, 2023. 19
-
[52]
Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2017
work page 2017
-
[53]
Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014
JoaquinVanschoren, JanNVanRijn, BerndBischl, andLuisTorgo. Openml: networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014
work page 2014
-
[54]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
work page 2017
-
[55]
SOAP: Improving and Stabilizing Shampoo using Adam
Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfon- brener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023
Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196, 2023
-
[57]
A Unified Analysis of Stochastic Momentum Methods for Deep Learning
Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning.arXiv preprint arXiv:1808.10396, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[58]
Yun Zeng, Deren Han, Yansheng Su, and Jiaxin Xie. On adaptive stochastic heavy ball momentum for solving linear systems.SIAM Journal on Matrix Analysis and Applications, 45(3):1259–1286, 2024. A Convergence analysis for expected iterates We start by recalling the matrix transition rule (11), V⊤∆t+1 V⊤∆t = (1 +β)I−(1 +βω)V ⊤ Πt V−β·(I−ωV ⊤ Πt−1 V) I 0 | {z...
work page 2024
-
[59]
We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4
Recall the bound onpt+1 in (26). We had, pt+1 ≤f t+1 < ρ t+1 + tX j=0 ρj ·max i αi(t+ 1−j) Due to Lemma 13 we know, ρ <1− ϕλr 4 . Furthermore,α i(t+ 1−j) = (T⊤ i )t+1−jTt+1−j i , and we know from Lemma 9 that (T ⊤ i )t+1−jTt+1−j i ≤ 3(t+ 1−j) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 · |γi1|t−j < 3(t+ 1) 2 + 2 ·ρ t−j. Therefore, pt+1 < ρ t+1 + (t+ 1)· 3(t+ 1) 2 + ...
-
[60]
In particular,f(x)< 3 2 · (1−β)2 (1+β)2
Over the set of allx≥1satisfying1− 1 x ≤f(x), the functionf(x)is maximized at x= 1. In particular,f(x)< 3 2 · (1−β)2 (1+β)2
-
[61]
Over the set of allx≥1satisfying1− 1 x = (1−βx)2 (1+βx)2, we have (1−βx)2 (1+βx)2 ≤ (1−β)2 (1+β)2
-
[62]
Over the set of allx≥1satisfying1− 1 x < (1−βx)2 (1+βx)2,inf x g(x)> 4 5 · (1−β)2 (1+β)2
-
[63]
Over all the set of allx≥1satisfying1− 1 x < g(x),inf x (1−βx)2 (1+βx)2 > 1 2 · (1−β)2 (1+β)2. C Additional experiments In this section, we provide additional details about our experimental setup, and include numerical results on two datasets to complement the results in Section 5. Finally, we also explore the dependence of classical momentum on theωparam...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.