Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise
Pith reviewed 2026-05-19 16:07 UTC · model grok-4.3
The pith
Normalized stochastic gradient descent with momentum converges under BG-0 noise with O(ε^{-6}) oracle complexity using one gradient per step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under BG-0 noise with oracle complexity O(ε^{-6}). This rate holds both for standard smoothness and for α-symmetric generalized smoothness. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal O(ε^{-4}) complexity, matching the lower bound. Under expected α-symmetric generalized smoothness, the STORM recursion leads to complexity O(ε^{-(4+α)}) for α in (0,1) and O(ε^{-5}) for α=1.
What carries the argument
Normalized stochastic gradient descent with momentum under the Blum-Gladyshev BG-0 noise model, in which stochastic gradient variance grows quadratically with distance from initialization.
If this is right
- The convergence rate is identical under standard smoothness and under α-symmetric generalized smoothness.
- Variance-reduced normalized STORM achieves O(ε^{-4}) complexity, which is minimax optimal under mean-square smoothness.
- All stated rates reduce to the classical bounded-variance guarantees when the quadratic growth parameter in the noise model is set to zero.
- The results require neither bounded domains nor increasing batch sizes nor explicit anchoring.
Where Pith is reading between the lines
- The robustness of normalized momentum to distance-dependent noise may explain its practical success in large-scale training where parameter drift can increase gradient variance.
- Adaptive step-size rules that explicitly estimate the quadratic growth parameter could further tighten the rates in practice.
- The same noise model and normalization approach may be applicable to other first-order methods such as Adam or Nesterov acceleration.
Load-bearing premise
The stochastic gradients satisfy the Blum-Gladyshev BG-0 noise model in which variance grows quadratically with distance from the initialization point.
What would settle it
A simulation on a nonconvex test function where gradient variance is forced to grow quadratically from initialization but the normalized momentum method fails to reach an ε-stationary point within the stated O(ε^{-6}) steps would falsify the convergence claim.
Figures
read the original abstract
We study nonconvex stochastic optimization under the Blum-Gladyshev ($\mathsf{BG}$-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization. We consider this problem under both standard smoothness and the symmetric generalized-smoothness framework, which captures objectives whose local curvature can scale with the gradient norm. We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under $\mathsf{BG}$-0 noise with oracle complexity $O(\varepsilon^{-6})$. This rate holds both for standard smoothness and for $\alpha$-symmetric generalized smoothness, showing that generalized smoothness is rate-neutral for normalized momentum in this setting. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal $O(\varepsilon^{-4})$ complexity, matching the lower bound. Under expected $\alpha$-symmetric generalized smoothness, the STORM recursion couples gradient-dependent smoothness with distance-dependent noise, leading to complexity $O(\varepsilon^{-(4+\alpha)})$ for $\alpha\in(0,1)$ and $O(\varepsilon^{-5})$ for $\alpha=1$. When the distance-growth parameter in the noise model vanishes, our guarantees recover the standard bounded-variance rates: $O(\varepsilon^{-4})$ for momentum, $O(\varepsilon^{-3})$ for variance reduction, and $O(\varepsilon^{-2})$ in the deterministic case. To our knowledge, these are the first convergence guarantees for normalized methods in non-convex stochastic optimization under $\mathsf{BG}$-0 noise without bounded domains, increasing batch sizes, or explicit anchoring, covering both standard and generalized smoothness regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies nonconvex stochastic optimization under the Blum-Gladyshev (BG-0) noise model, where stochastic gradient variance grows quadratically with distance from initialization. It proves that normalized SGD with momentum (single stochastic gradient per iteration) achieves O(ε^{-6}) oracle complexity under both standard smoothness and α-symmetric generalized smoothness. It further analyzes a variance-reduced normalized STORM variant, obtaining O(ε^{-4}) under mean-square smoothness with sharp initialization, O(ε^{-(4+α)}) for α∈(0,1), and O(ε^{-5}) for α=1 under expected α-symmetric generalized smoothness. All rates recover the classical bounded-variance results when the distance-growth parameter is zero.
Significance. If the stated proofs are correct, the work is significant for providing the first convergence guarantees for normalized methods under BG-0 noise without bounded domains, batch-size growth, or anchoring. The demonstration that generalized smoothness is rate-neutral for normalized momentum, together with near-optimal rates for the variance-reduced case, extends the literature on relaxing bounded-variance assumptions. The explicit recovery of known rates as a special case is a strength.
major comments (2)
- [Main convergence theorem for normalized momentum (likely §3)] The central claim of O(ε^{-6}) for normalized momentum under BG-0 relies on controlling the quadratic distance-dependent variance via normalization. The analysis must be checked for whether the Lyapunov or recursion argument explicitly bounds the interaction between this noise growth and the momentum buffer without introducing initialization-dependent factors that are not stated in the complexity.
- [Analysis of variance-reduced STORM (likely §4)] For the STORM variant under expected α-symmetric generalized smoothness, the stated O(ε^{-(4+α)}) complexity arises from coupling gradient-dependent smoothness with distance-dependent noise. Verify that the recursion does not require additional assumptions on iterate boundedness or produce hidden constants depending on initialization sharpness beyond those already declared.
minor comments (2)
- [Problem setup and assumptions] Clarify the precise definition of 'sharp initialization' used for the O(ε^{-4}) claim and confirm it is consistent with the BG-0 model throughout the proofs.
- [Preliminaries] Add a brief remark on how the α-symmetric generalized smoothness reduces to standard smoothness when α=0, to aid readability.
Simulated Author's Rebuttal
We thank the referee for their thorough review and positive evaluation of our work on nonconvex optimization under Blum-Gladyshev noise. We address each major comment below with references to the relevant sections and proofs in the manuscript. We believe the existing analysis already resolves the points raised, but we are willing to incorporate minor clarifications for readability.
read point-by-point responses
-
Referee: [Main convergence theorem for normalized momentum (likely §3)] The central claim of O(ε^{-6}) for normalized momentum under BG-0 relies on controlling the quadratic distance-dependent variance via normalization. The analysis must be checked for whether the Lyapunov or recursion argument explicitly bounds the interaction between this noise growth and the momentum buffer without introducing initialization-dependent factors that are not stated in the complexity.
Authors: We appreciate the referee drawing attention to this aspect of the proof. In the analysis of Theorem 3.1 (Section 3), we employ a Lyapunov function combining the objective value with a term tracking the momentum buffer. Normalization bounds the effective step size independently of the gradient magnitude, which directly controls the quadratic variance growth term (proportional to distance from initialization) in the recursion. The momentum buffer interaction is bounded separately via a standard contraction argument on the momentum parameter, and any initialization-dependent quantities are absorbed into the step-size and momentum choices without appearing in the final O(ε^{-6}) complexity. The proof does not introduce unstated factors; the distance-dependent noise is mitigated precisely by the normalization, recovering the bounded-variance rate when the growth parameter is zero. We can add a short clarifying remark after the main recursion inequality if the referee finds it helpful. revision: partial
-
Referee: [Analysis of variance-reduced STORM (likely §4)] For the STORM variant under expected α-symmetric generalized smoothness, the stated O(ε^{-(4+α)}) complexity arises from coupling gradient-dependent smoothness with distance-dependent noise. Verify that the recursion does not require additional assumptions on iterate boundedness or produce hidden constants depending on initialization sharpness beyond those already declared.
Authors: Thank you for this verification request. The proof of Theorem 4.3 (Section 4) for the variance-reduced normalized STORM under expected α-symmetric generalized smoothness constructs a recursion that explicitly couples the α-dependent smoothness with the BG-0 noise term. No additional iterate boundedness assumptions are imposed; the combination of normalization, variance reduction, and the sharp initialization condition (explicitly stated in the theorem) suffices to control the distance-dependent variance without requiring bounded domains or extra constraints. All constants in the O(ε^{-(4+α)}) bound (and the special case O(ε^{-5}) for α=1) are expressed in terms of the declared parameters, including initialization sharpness, with no hidden factors. The analysis remains valid under the stated assumptions. We are happy to expand the discussion of this point in the proof for added transparency. revision: partial
Circularity Check
No significant circularity; derivation is self-contained from stated assumptions
full rationale
The paper states the BG-0 noise model (quadratic variance growth from initialization) and smoothness conditions as explicit assumptions in the problem setup. It then derives convergence of normalized SGD with momentum and the STORM variant by using normalization to bound the distance-dependent variance term, with Lyapunov-style recursions that recover the classical bounded-variance rates exactly when the growth parameter is set to zero. No parameter is fitted to data and then relabeled as a prediction, no self-citation is invoked as a load-bearing uniqueness theorem, and the analysis does not define any quantity in terms of the target rate. The claimed O(ε^{-6}) and O(ε^{-4}) complexities follow directly from the given noise model and normalization step without reduction to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Stochastic gradients obey the Blum-Gladyshev (BG-0) condition: variance grows quadratically with distance from initialization.
- domain assumption The objective satisfies either standard L-smoothness or α-symmetric generalized smoothness.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
normalized stochastic gradient descent with momentum ... converges under BG-0 noise with oracle complexity O(ε^{-6}) ... NSTORM ... O(ε^{-4}) ... under mean-square smoothness
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Blum–Gladyshev (BG-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025
-
[2]
Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023
Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023
work page 2023
-
[3]
Non-asymptotic analysis of stochastic approximation algorithms for machine learning
Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems, volume 24, 2011
work page 2011
-
[4]
Julius R Blum. Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954
work page 1954
-
[5]
Optimization methods for large-scale learning
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale learning. SIAM Review, 60(2):223–311, 2018
work page 2018
-
[6]
Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019
work page 2019
-
[7]
Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization
Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization. InInternational Conference on Machine Learning, pages 5396–5427. PMLR, 2023. 11
work page 2023
-
[8]
Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022
work page 2022
-
[9]
Momentum improves normalized sgd
Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020
work page 2020
-
[10]
Momentum-based variance reduction in non-convex SGD
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. InAdvances in Neural Information Processing Systems, 2019
work page 2019
-
[11]
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014
work page 2014
-
[12]
SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator
Cong Fang, Chris Jiao Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator. InAdvances in Neural Information Processing Systems, pages 689–699, 2018
work page 2018
-
[13]
Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance
Arda Fazla, Ege C Kaya, Antesh Upadhyay, and Abolfazl Hashemi. Lower bounds and proximally anchored sgd for non-convex minimization under unbounded variance.arXiv preprint arXiv:2604.16620, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[14]
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013
work page 2013
-
[15]
On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965
EG Gladyshev. On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965
work page 1965
-
[16]
A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent
Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. InInternational Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020
work page 2020
-
[17]
Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019
work page 2019
-
[18]
Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967
Benjamin Halpern. Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967
work page 1967
-
[19]
Parameter-agnostic optimization under relaxed smoothness
Florian Hübler, Junchi Yang, Xiang Li, and Niao He. Parameter-agnostic optimization under relaxed smoothness. InInternational Conference on Artificial Intelligence and Statistics, pages 4861–4869. PMLR, 2024
work page 2024
-
[20]
Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, and Tianjiao Li. Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023
-
[21]
Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.Journal of machine learning research, 18(223):1–42, 2018
work page 2018
-
[22]
Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021. 12
work page 2021
-
[23]
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013
work page 2013
-
[24]
Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023
Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023
work page 2023
-
[25]
Unified analysis of stochastic gradient methods for composite convex and smooth optimization
Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2):499–540, 2023
work page 2023
-
[26]
Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum
Sarit Khirirat, Abdurakhmon Sadiev, Artem Riabinin, Eduard Gorbunov, and Peter Richtárik. Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps:// openreview.net/forum?id=LmcTbBvgjP
work page 2026
-
[27]
Revisiting gradient clipping: Stochastic bias and tight convergence guarantees
Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. InInternational Conference on Machine Learning, pages 17343–17363. PMLR, 2023
work page 2023
-
[28]
Two remarks on the method of successive approximations
Mark Aleksandrovich Krasnosel’ski˘ ı. Two remarks on the method of successive approximations. Uspekhi matematicheskikh nauk, 10(1):123–127, 1955
work page 1955
-
[29]
Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020
work page 2020
-
[30]
Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023
work page 2023
-
[31]
Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization
Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization. InInternational conference on machine learning, pages 6286–6295. PMLR, 2021
work page 2021
-
[32]
Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953
W Robert Mann. Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953
work page 1953
-
[33]
Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009
work page 2009
-
[34]
Springer Science & Business Media, 2004
Yurii Nesterov.Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2004
work page 2004
-
[35]
Yurii Nesterov.Lectures on convex optimization, volume 137. Springer, 2018
work page 2018
-
[36]
Sarah: A novel method for machine learning problems using stochastic recursive gradient
Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. InInternational conference on machine learning, pages 2613–2621. PMLR, 2017
work page 2017
-
[37]
Making gradient descent optimal for strongly convex stochastic optimization
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. InProceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012. 13
work page 2012
-
[38]
Variance-reduced clipping for non-convex optimization
Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025
work page 2025
-
[39]
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InInternational Conference on Machine Learning, pages 71–79. PMLR, 2013
work page 2013
-
[40]
Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, and Makoto Yamada. Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024
work page 2024
-
[41]
Provable adaptivity of adam under non-uniform smoothness
Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, and Wei Chen. Provable adaptivity of adam under non-uniform smoothness. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2960–2969, 2024
work page 2024
-
[42]
SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization.arXiv preprint arXiv:1810.10690, 2018
-
[43]
Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020
work page 2020
-
[44]
arXiv preprint arXiv:1905.11881 (2019)
Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity.arXiv preprint arXiv:1905.11881, 2019
-
[45]
Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li. On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021
work page 2021
-
[46]
4∆ γ0 + 8 η0 + 8(1−α)α α 1−α 2α/2K1γ0λ − α 1−α 0√η0 + 8·2 α/2K1Bαγ1+α 0√η0 + 8√η0Bγ0 # T −θ +
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Journal of Machine Learning Research, 21(103):1–63, 2020. 14 Appendix Table of Contents A A Short Primer on Smoothness Classes 16 B Proofs for Normalized Momentum underBG-0 Noise 18 B.1 Auxiliary inequalities . . . . . . . . . . . . . . . . . . . . . . ....
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.