pith. sign in

arxiv: 2605.15314 · v1 · pith:6NXGUGCSnew · submitted 2026-05-14 · 💻 cs.LG · math.OC

Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise

Pith reviewed 2026-05-19 16:07 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords nonconvex stochastic optimizationBlum-Gladyshev noisenormalized SGDmomentumvariance reductiongeneralized smoothnessoracle complexity
0
0 comments X

The pith

Normalized stochastic gradient descent with momentum converges under BG-0 noise with O(ε^{-6}) oracle complexity using one gradient per step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that normalized SGD with momentum finds approximate stationary points in nonconvex stochastic optimization even when the variance of the stochastic gradients grows quadratically with distance from the initialization point. This holds with a single gradient evaluation per iteration and without assuming bounded domains or growing batch sizes. The same O(ε^{-6}) rate is shown to apply under both standard smoothness and α-symmetric generalized smoothness, where local curvature can scale with gradient norm. A variance-reduced normalized STORM method improves this to the minimax optimal O(ε^{-4}) rate under mean-square smoothness with sharp initialization. When the quadratic growth parameter in the noise model is zero, the bounds recover the classical rates known for bounded-variance settings.

Core claim

We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under BG-0 noise with oracle complexity O(ε^{-6}). This rate holds both for standard smoothness and for α-symmetric generalized smoothness. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal O(ε^{-4}) complexity, matching the lower bound. Under expected α-symmetric generalized smoothness, the STORM recursion leads to complexity O(ε^{-(4+α)}) for α in (0,1) and O(ε^{-5}) for α=1.

What carries the argument

Normalized stochastic gradient descent with momentum under the Blum-Gladyshev BG-0 noise model, in which stochastic gradient variance grows quadratically with distance from initialization.

If this is right

  • The convergence rate is identical under standard smoothness and under α-symmetric generalized smoothness.
  • Variance-reduced normalized STORM achieves O(ε^{-4}) complexity, which is minimax optimal under mean-square smoothness.
  • All stated rates reduce to the classical bounded-variance guarantees when the quadratic growth parameter in the noise model is set to zero.
  • The results require neither bounded domains nor increasing batch sizes nor explicit anchoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The robustness of normalized momentum to distance-dependent noise may explain its practical success in large-scale training where parameter drift can increase gradient variance.
  • Adaptive step-size rules that explicitly estimate the quadratic growth parameter could further tighten the rates in practice.
  • The same noise model and normalization approach may be applicable to other first-order methods such as Adam or Nesterov acceleration.

Load-bearing premise

The stochastic gradients satisfy the Blum-Gladyshev BG-0 noise model in which variance grows quadratically with distance from the initialization point.

What would settle it

A simulation on a nonconvex test function where gradient variance is forced to grow quadratically from initialization but the normalized momentum method fails to reach an ε-stationary point within the stated O(ε^{-6}) steps would falsify the convergence claim.

Figures

Figures reproduced from arXiv: 2605.15314 by Abolfazl Hashemi, Antesh Upadhyay, Arda Fazla.

Figure 1
Figure 1. Figure 1: Phase retrieval experiment under the BG-0 oracle (α= 2/3). Gradient norm vs. SFO calls (left), drift ∥x k−x 0∥ 2 (center), and batch size (right) [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Cubic polynomial under the BG-0 oracle (α= 1/2). Same layout as [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
read the original abstract

We study nonconvex stochastic optimization under the Blum-Gladyshev ($\mathsf{BG}$-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization. We consider this problem under both standard smoothness and the symmetric generalized-smoothness framework, which captures objectives whose local curvature can scale with the gradient norm. We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under $\mathsf{BG}$-0 noise with oracle complexity $O(\varepsilon^{-6})$. This rate holds both for standard smoothness and for $\alpha$-symmetric generalized smoothness, showing that generalized smoothness is rate-neutral for normalized momentum in this setting. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal $O(\varepsilon^{-4})$ complexity, matching the lower bound. Under expected $\alpha$-symmetric generalized smoothness, the STORM recursion couples gradient-dependent smoothness with distance-dependent noise, leading to complexity $O(\varepsilon^{-(4+\alpha)})$ for $\alpha\in(0,1)$ and $O(\varepsilon^{-5})$ for $\alpha=1$. When the distance-growth parameter in the noise model vanishes, our guarantees recover the standard bounded-variance rates: $O(\varepsilon^{-4})$ for momentum, $O(\varepsilon^{-3})$ for variance reduction, and $O(\varepsilon^{-2})$ in the deterministic case. To our knowledge, these are the first convergence guarantees for normalized methods in non-convex stochastic optimization under $\mathsf{BG}$-0 noise without bounded domains, increasing batch sizes, or explicit anchoring, covering both standard and generalized smoothness regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies nonconvex stochastic optimization under the Blum-Gladyshev (BG-0) noise model, where stochastic gradient variance grows quadratically with distance from initialization. It proves that normalized SGD with momentum (single stochastic gradient per iteration) achieves O(ε^{-6}) oracle complexity under both standard smoothness and α-symmetric generalized smoothness. It further analyzes a variance-reduced normalized STORM variant, obtaining O(ε^{-4}) under mean-square smoothness with sharp initialization, O(ε^{-(4+α)}) for α∈(0,1), and O(ε^{-5}) for α=1 under expected α-symmetric generalized smoothness. All rates recover the classical bounded-variance results when the distance-growth parameter is zero.

Significance. If the stated proofs are correct, the work is significant for providing the first convergence guarantees for normalized methods under BG-0 noise without bounded domains, batch-size growth, or anchoring. The demonstration that generalized smoothness is rate-neutral for normalized momentum, together with near-optimal rates for the variance-reduced case, extends the literature on relaxing bounded-variance assumptions. The explicit recovery of known rates as a special case is a strength.

major comments (2)
  1. [Main convergence theorem for normalized momentum (likely §3)] The central claim of O(ε^{-6}) for normalized momentum under BG-0 relies on controlling the quadratic distance-dependent variance via normalization. The analysis must be checked for whether the Lyapunov or recursion argument explicitly bounds the interaction between this noise growth and the momentum buffer without introducing initialization-dependent factors that are not stated in the complexity.
  2. [Analysis of variance-reduced STORM (likely §4)] For the STORM variant under expected α-symmetric generalized smoothness, the stated O(ε^{-(4+α)}) complexity arises from coupling gradient-dependent smoothness with distance-dependent noise. Verify that the recursion does not require additional assumptions on iterate boundedness or produce hidden constants depending on initialization sharpness beyond those already declared.
minor comments (2)
  1. [Problem setup and assumptions] Clarify the precise definition of 'sharp initialization' used for the O(ε^{-4}) claim and confirm it is consistent with the BG-0 model throughout the proofs.
  2. [Preliminaries] Add a brief remark on how the α-symmetric generalized smoothness reduces to standard smoothness when α=0, to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive evaluation of our work on nonconvex optimization under Blum-Gladyshev noise. We address each major comment below with references to the relevant sections and proofs in the manuscript. We believe the existing analysis already resolves the points raised, but we are willing to incorporate minor clarifications for readability.

read point-by-point responses
  1. Referee: [Main convergence theorem for normalized momentum (likely §3)] The central claim of O(ε^{-6}) for normalized momentum under BG-0 relies on controlling the quadratic distance-dependent variance via normalization. The analysis must be checked for whether the Lyapunov or recursion argument explicitly bounds the interaction between this noise growth and the momentum buffer without introducing initialization-dependent factors that are not stated in the complexity.

    Authors: We appreciate the referee drawing attention to this aspect of the proof. In the analysis of Theorem 3.1 (Section 3), we employ a Lyapunov function combining the objective value with a term tracking the momentum buffer. Normalization bounds the effective step size independently of the gradient magnitude, which directly controls the quadratic variance growth term (proportional to distance from initialization) in the recursion. The momentum buffer interaction is bounded separately via a standard contraction argument on the momentum parameter, and any initialization-dependent quantities are absorbed into the step-size and momentum choices without appearing in the final O(ε^{-6}) complexity. The proof does not introduce unstated factors; the distance-dependent noise is mitigated precisely by the normalization, recovering the bounded-variance rate when the growth parameter is zero. We can add a short clarifying remark after the main recursion inequality if the referee finds it helpful. revision: partial

  2. Referee: [Analysis of variance-reduced STORM (likely §4)] For the STORM variant under expected α-symmetric generalized smoothness, the stated O(ε^{-(4+α)}) complexity arises from coupling gradient-dependent smoothness with distance-dependent noise. Verify that the recursion does not require additional assumptions on iterate boundedness or produce hidden constants depending on initialization sharpness beyond those already declared.

    Authors: Thank you for this verification request. The proof of Theorem 4.3 (Section 4) for the variance-reduced normalized STORM under expected α-symmetric generalized smoothness constructs a recursion that explicitly couples the α-dependent smoothness with the BG-0 noise term. No additional iterate boundedness assumptions are imposed; the combination of normalization, variance reduction, and the sharp initialization condition (explicitly stated in the theorem) suffices to control the distance-dependent variance without requiring bounded domains or extra constraints. All constants in the O(ε^{-(4+α)}) bound (and the special case O(ε^{-5}) for α=1) are expressed in terms of the declared parameters, including initialization sharpness, with no hidden factors. The analysis remains valid under the stated assumptions. We are happy to expand the discussion of this point in the proof for added transparency. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from stated assumptions

full rationale

The paper states the BG-0 noise model (quadratic variance growth from initialization) and smoothness conditions as explicit assumptions in the problem setup. It then derives convergence of normalized SGD with momentum and the STORM variant by using normalization to bound the distance-dependent variance term, with Lyapunov-style recursions that recover the classical bounded-variance rates exactly when the growth parameter is set to zero. No parameter is fitted to data and then relabeled as a prediction, no self-citation is invoked as a load-bearing uniqueness theorem, and the analysis does not define any quantity in terms of the target rate. The claimed O(ε^{-6}) and O(ε^{-4}) complexities follow directly from the given noise model and normalization step without reduction to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the BG-0 noise model (variance quadratic in distance from initialization) and standard or generalized smoothness assumptions; no free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption Stochastic gradients obey the Blum-Gladyshev (BG-0) condition: variance grows quadratically with distance from initialization.
    This is the defining noise model of the paper and is invoked in every convergence statement.
  • domain assumption The objective satisfies either standard L-smoothness or α-symmetric generalized smoothness.
    Used to bound the progress of normalized updates and to derive the stated complexity exponents.

pith-pipeline@v0.9.0 · 5855 in / 1482 out tokens · 44515 ms · 2026-05-19T16:07:08.135682+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

    Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

  2. [2]

    Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

    Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

  3. [3]

    Non-asymptotic analysis of stochastic approximation algorithms for machine learning

    Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems, volume 24, 2011

  4. [4]

    Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

    Julius R Blum. Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

  5. [5]

    Optimization methods for large-scale learning

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale learning. SIAM Review, 60(2):223–311, 2018

  6. [6]

    Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019

    Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019

  7. [7]

    Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization

    Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization. InInternational Conference on Machine Learning, pages 5396–5427. PMLR, 2023. 11

  8. [8]

    Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022

    Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022

  9. [9]

    Momentum improves normalized sgd

    Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

  10. [10]

    Momentum-based variance reduction in non-convex SGD

    Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. InAdvances in Neural Information Processing Systems, 2019

  11. [11]

    Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

  12. [12]

    SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator

    Cong Fang, Chris Jiao Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator. InAdvances in Neural Information Processing Systems, pages 689–699, 2018

  13. [13]

    Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance

    Arda Fazla, Ege C Kaya, Antesh Upadhyay, and Abolfazl Hashemi. Lower bounds and proximally anchored sgd for non-convex minimization under unbounded variance.arXiv preprint arXiv:2604.16620, 2026

  14. [14]

    Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013

  15. [15]

    On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965

    EG Gladyshev. On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965

  16. [16]

    A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent

    Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. InInternational Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020

  17. [17]

    Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

    Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

  18. [18]

    Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967

    Benjamin Halpern. Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967

  19. [19]

    Parameter-agnostic optimization under relaxed smoothness

    Florian Hübler, Junchi Yang, Xiang Li, and Niao He. Parameter-agnostic optimization under relaxed smoothness. InInternational Conference on Artificial Intelligence and Statistics, pages 4861–4869. PMLR, 2024

  20. [20]

    Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

    Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, and Tianjiao Li. Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

  21. [21]

    Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.Journal of machine learning research, 18(223):1–42, 2018

  22. [22]

    Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021

    Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021. 12

  23. [23]

    Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013

  24. [24]

    Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

    Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

  25. [25]

    Unified analysis of stochastic gradient methods for composite convex and smooth optimization

    Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2):499–540, 2023

  26. [26]

    Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum

    Sarit Khirirat, Abdurakhmon Sadiev, Artem Riabinin, Eduard Gorbunov, and Peter Richtárik. Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps:// openreview.net/forum?id=LmcTbBvgjP

  27. [27]

    Revisiting gradient clipping: Stochastic bias and tight convergence guarantees

    Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. InInternational Conference on Machine Learning, pages 17343–17363. PMLR, 2023

  28. [28]

    Two remarks on the method of successive approximations

    Mark Aleksandrovich Krasnosel’ski˘ ı. Two remarks on the method of successive approximations. Uspekhi matematicheskikh nauk, 10(1):123–127, 1955

  29. [29]

    Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020

    Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020

  30. [30]

    Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023

    Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023

  31. [31]

    Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization

    Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization. InInternational conference on machine learning, pages 6286–6295. PMLR, 2021

  32. [32]

    Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953

    W Robert Mann. Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953

  33. [33]

    Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009

    Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009

  34. [34]

    Springer Science & Business Media, 2004

    Yurii Nesterov.Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2004

  35. [35]

    Springer, 2018

    Yurii Nesterov.Lectures on convex optimization, volume 137. Springer, 2018

  36. [36]

    Sarah: A novel method for machine learning problems using stochastic recursive gradient

    Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. InInternational conference on machine learning, pages 2613–2621. PMLR, 2017

  37. [37]

    Making gradient descent optimal for strongly convex stochastic optimization

    Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. InProceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012. 13

  38. [38]

    Variance-reduced clipping for non-convex optimization

    Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  39. [39]

    Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

    Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InInternational Conference on Machine Learning, pages 71–79. PMLR, 2013

  40. [40]

    Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024

    Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, and Makoto Yamada. Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024

  41. [41]

    Provable adaptivity of adam under non-uniform smoothness

    Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, and Wei Chen. Provable adaptivity of adam under non-uniform smoothness. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2960–2969, 2024

  42. [42]

    arXiv preprint arXiv:1810.10690 , year=

    Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization.arXiv preprint arXiv:1810.10690, 2018

  43. [43]

    Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020

    Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020

  44. [44]

    arXiv preprint arXiv:1905.11881 (2019)

    Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity.arXiv preprint arXiv:1905.11881, 2019

  45. [45]

    On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021

    Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li. On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021

  46. [46]

    4∆ γ0 + 8 η0 + 8(1−α)α α 1−α 2α/2K1γ0λ − α 1−α 0√η0 + 8·2 α/2K1Bαγ1+α 0√η0 + 8√η0Bγ0 # T −θ +

    Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Journal of Machine Learning Research, 21(103):1–63, 2020. 14 Appendix Table of Contents A A Short Primer on Smoothness Classes 16 B Proofs for Normalized Momentum underBG-0 Noise 18 B.1 Auxiliary inequalities . . . . . . . . . . . . . . . . . . . . . . ....