Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise

Abolfazl Hashemi; Antesh Upadhyay; Arda Fazla

arxiv: 2605.15314 · v1 · pith:6NXGUGCSnew · submitted 2026-05-14 · 💻 cs.LG · math.OC

Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise

Antesh Upadhyay , Arda Fazla , Abolfazl Hashemi This is my paper

Pith reviewed 2026-05-19 16:07 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords nonconvex stochastic optimizationBlum-Gladyshev noisenormalized SGDmomentumvariance reductiongeneralized smoothnessoracle complexity

0 comments

The pith

Normalized stochastic gradient descent with momentum converges under BG-0 noise with O(ε^{-6}) oracle complexity using one gradient per step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that normalized SGD with momentum finds approximate stationary points in nonconvex stochastic optimization even when the variance of the stochastic gradients grows quadratically with distance from the initialization point. This holds with a single gradient evaluation per iteration and without assuming bounded domains or growing batch sizes. The same O(ε^{-6}) rate is shown to apply under both standard smoothness and α-symmetric generalized smoothness, where local curvature can scale with gradient norm. A variance-reduced normalized STORM method improves this to the minimax optimal O(ε^{-4}) rate under mean-square smoothness with sharp initialization. When the quadratic growth parameter in the noise model is zero, the bounds recover the classical rates known for bounded-variance settings.

Core claim

We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under BG-0 noise with oracle complexity O(ε^{-6}). This rate holds both for standard smoothness and for α-symmetric generalized smoothness. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal O(ε^{-4}) complexity, matching the lower bound. Under expected α-symmetric generalized smoothness, the STORM recursion leads to complexity O(ε^{-(4+α)}) for α in (0,1) and O(ε^{-5}) for α=1.

What carries the argument

Normalized stochastic gradient descent with momentum under the Blum-Gladyshev BG-0 noise model, in which stochastic gradient variance grows quadratically with distance from initialization.

If this is right

The convergence rate is identical under standard smoothness and under α-symmetric generalized smoothness.
Variance-reduced normalized STORM achieves O(ε^{-4}) complexity, which is minimax optimal under mean-square smoothness.
All stated rates reduce to the classical bounded-variance guarantees when the quadratic growth parameter in the noise model is set to zero.
The results require neither bounded domains nor increasing batch sizes nor explicit anchoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The robustness of normalized momentum to distance-dependent noise may explain its practical success in large-scale training where parameter drift can increase gradient variance.
Adaptive step-size rules that explicitly estimate the quadratic growth parameter could further tighten the rates in practice.
The same noise model and normalization approach may be applicable to other first-order methods such as Adam or Nesterov acceleration.

Load-bearing premise

The stochastic gradients satisfy the Blum-Gladyshev BG-0 noise model in which variance grows quadratically with distance from the initialization point.

What would settle it

A simulation on a nonconvex test function where gradient variance is forced to grow quadratically from initialization but the normalized momentum method fails to reach an ε-stationary point within the stated O(ε^{-6}) steps would falsify the convergence claim.

Figures

Figures reproduced from arXiv: 2605.15314 by Abolfazl Hashemi, Antesh Upadhyay, Arda Fazla.

**Figure 1.** Figure 1: Phase retrieval experiment under the BG-0 oracle (α= 2/3). Gradient norm vs. SFO calls (left), drift ∥x k−x 0∥ 2 (center), and batch size (right) [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗

**Figure 2.** Figure 2: Cubic polynomial under the BG-0 oracle (α= 1/2). Same layout as [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

read the original abstract

We study nonconvex stochastic optimization under the Blum-Gladyshev ($\mathsf{BG}$-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization. We consider this problem under both standard smoothness and the symmetric generalized-smoothness framework, which captures objectives whose local curvature can scale with the gradient norm. We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under $\mathsf{BG}$-0 noise with oracle complexity $O(\varepsilon^{-6})$. This rate holds both for standard smoothness and for $\alpha$-symmetric generalized smoothness, showing that generalized smoothness is rate-neutral for normalized momentum in this setting. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal $O(\varepsilon^{-4})$ complexity, matching the lower bound. Under expected $\alpha$-symmetric generalized smoothness, the STORM recursion couples gradient-dependent smoothness with distance-dependent noise, leading to complexity $O(\varepsilon^{-(4+\alpha)})$ for $\alpha\in(0,1)$ and $O(\varepsilon^{-5})$ for $\alpha=1$. When the distance-growth parameter in the noise model vanishes, our guarantees recover the standard bounded-variance rates: $O(\varepsilon^{-4})$ for momentum, $O(\varepsilon^{-3})$ for variance reduction, and $O(\varepsilon^{-2})$ in the deterministic case. To our knowledge, these are the first convergence guarantees for normalized methods in non-convex stochastic optimization under $\mathsf{BG}$-0 noise without bounded domains, increasing batch sizes, or explicit anchoring, covering both standard and generalized smoothness regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Normalized momentum gets O(ε^{-6}) under BG-0 noise without domain bounds or batch growth, with STORM improving to near-optimal under standard smoothness.

read the letter

This paper shows that normalized stochastic gradient descent with momentum converges at O(ε^{-6}) under the Blum-Gladyshev BG-0 noise model in nonconvex optimization, using only one stochastic gradient per iteration. The same framework covers both standard smoothness and α-symmetric generalized smoothness. A variance-reduced normalized STORM variant reaches O(ε^{-4}) under mean-square smoothness with sharp initialization, and O(ε^{-(4+α)}) or O(ε^{-5}) under generalized smoothness depending on α. When the noise growth parameter is zero, the rates fall back to the usual bounded-variance results, which is a clean consistency check. The main technical move is using normalization to keep the quadratic distance-dependent variance under control without assuming bounded domains or forcing batch sizes to grow. The analysis appears to handle the coupling between gradient-dependent smoothness and distance-dependent noise through careful recursion arguments rather than hidden boundedness assumptions. That is the actual advance over prior work that needed those crutches. The rates themselves are the clearest limitation. O(ε^{-6}) for plain momentum is slow, as expected with this noise model, but it still leaves the method far from practical. For STORM the generalized-smoothness case adds an extra factor in the exponent that reduces its appeal unless α is small. The sharp-initialization requirement for the best STORM rate is also restrictive and may not hold in typical nonconvex problems. The proofs are not something I re-derived step by step, so there is always room for overlooked constant dependence or edge cases in the Lyapunov construction, though nothing in the stated recursions looks circular. This work is for researchers who care about relaxing the bounded-variance assumption in stochastic nonconvex optimization. A reader already working on normalized methods or generalized smoothness would find the rates and the recovery of classical cases useful. I would send it to peer review because it supplies the first stated guarantees in this setting and the internal logic holds together on the terms given.

Referee Report

2 major / 2 minor

Summary. The manuscript studies nonconvex stochastic optimization under the Blum-Gladyshev (BG-0) noise model, where stochastic gradient variance grows quadratically with distance from initialization. It proves that normalized SGD with momentum (single stochastic gradient per iteration) achieves O(ε^{-6}) oracle complexity under both standard smoothness and α-symmetric generalized smoothness. It further analyzes a variance-reduced normalized STORM variant, obtaining O(ε^{-4}) under mean-square smoothness with sharp initialization, O(ε^{-(4+α)}) for α∈(0,1), and O(ε^{-5}) for α=1 under expected α-symmetric generalized smoothness. All rates recover the classical bounded-variance results when the distance-growth parameter is zero.

Significance. If the stated proofs are correct, the work is significant for providing the first convergence guarantees for normalized methods under BG-0 noise without bounded domains, batch-size growth, or anchoring. The demonstration that generalized smoothness is rate-neutral for normalized momentum, together with near-optimal rates for the variance-reduced case, extends the literature on relaxing bounded-variance assumptions. The explicit recovery of known rates as a special case is a strength.

major comments (2)

[Main convergence theorem for normalized momentum (likely §3)] The central claim of O(ε^{-6}) for normalized momentum under BG-0 relies on controlling the quadratic distance-dependent variance via normalization. The analysis must be checked for whether the Lyapunov or recursion argument explicitly bounds the interaction between this noise growth and the momentum buffer without introducing initialization-dependent factors that are not stated in the complexity.
[Analysis of variance-reduced STORM (likely §4)] For the STORM variant under expected α-symmetric generalized smoothness, the stated O(ε^{-(4+α)}) complexity arises from coupling gradient-dependent smoothness with distance-dependent noise. Verify that the recursion does not require additional assumptions on iterate boundedness or produce hidden constants depending on initialization sharpness beyond those already declared.

minor comments (2)

[Problem setup and assumptions] Clarify the precise definition of 'sharp initialization' used for the O(ε^{-4}) claim and confirm it is consistent with the BG-0 model throughout the proofs.
[Preliminaries] Add a brief remark on how the α-symmetric generalized smoothness reduces to standard smoothness when α=0, to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive evaluation of our work on nonconvex optimization under Blum-Gladyshev noise. We address each major comment below with references to the relevant sections and proofs in the manuscript. We believe the existing analysis already resolves the points raised, but we are willing to incorporate minor clarifications for readability.

read point-by-point responses

Referee: [Main convergence theorem for normalized momentum (likely §3)] The central claim of O(ε^{-6}) for normalized momentum under BG-0 relies on controlling the quadratic distance-dependent variance via normalization. The analysis must be checked for whether the Lyapunov or recursion argument explicitly bounds the interaction between this noise growth and the momentum buffer without introducing initialization-dependent factors that are not stated in the complexity.

Authors: We appreciate the referee drawing attention to this aspect of the proof. In the analysis of Theorem 3.1 (Section 3), we employ a Lyapunov function combining the objective value with a term tracking the momentum buffer. Normalization bounds the effective step size independently of the gradient magnitude, which directly controls the quadratic variance growth term (proportional to distance from initialization) in the recursion. The momentum buffer interaction is bounded separately via a standard contraction argument on the momentum parameter, and any initialization-dependent quantities are absorbed into the step-size and momentum choices without appearing in the final O(ε^{-6}) complexity. The proof does not introduce unstated factors; the distance-dependent noise is mitigated precisely by the normalization, recovering the bounded-variance rate when the growth parameter is zero. We can add a short clarifying remark after the main recursion inequality if the referee finds it helpful. revision: partial
Referee: [Analysis of variance-reduced STORM (likely §4)] For the STORM variant under expected α-symmetric generalized smoothness, the stated O(ε^{-(4+α)}) complexity arises from coupling gradient-dependent smoothness with distance-dependent noise. Verify that the recursion does not require additional assumptions on iterate boundedness or produce hidden constants depending on initialization sharpness beyond those already declared.

Authors: Thank you for this verification request. The proof of Theorem 4.3 (Section 4) for the variance-reduced normalized STORM under expected α-symmetric generalized smoothness constructs a recursion that explicitly couples the α-dependent smoothness with the BG-0 noise term. No additional iterate boundedness assumptions are imposed; the combination of normalization, variance reduction, and the sharp initialization condition (explicitly stated in the theorem) suffices to control the distance-dependent variance without requiring bounded domains or extra constraints. All constants in the O(ε^{-(4+α)}) bound (and the special case O(ε^{-5}) for α=1) are expressed in terms of the declared parameters, including initialization sharpness, with no hidden factors. The analysis remains valid under the stated assumptions. We are happy to expand the discussion of this point in the proof for added transparency. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from stated assumptions

full rationale

The paper states the BG-0 noise model (quadratic variance growth from initialization) and smoothness conditions as explicit assumptions in the problem setup. It then derives convergence of normalized SGD with momentum and the STORM variant by using normalization to bound the distance-dependent variance term, with Lyapunov-style recursions that recover the classical bounded-variance rates exactly when the growth parameter is set to zero. No parameter is fitted to data and then relabeled as a prediction, no self-citation is invoked as a load-bearing uniqueness theorem, and the analysis does not define any quantity in terms of the target rate. The claimed O(ε^{-6}) and O(ε^{-4}) complexities follow directly from the given noise model and normalization step without reduction to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the BG-0 noise model (variance quadratic in distance from initialization) and standard or generalized smoothness assumptions; no free parameters are fitted and no new entities are postulated.

axioms (2)

domain assumption Stochastic gradients obey the Blum-Gladyshev (BG-0) condition: variance grows quadratically with distance from initialization.
This is the defining noise model of the paper and is invoked in every convergence statement.
domain assumption The objective satisfies either standard L-smoothness or α-symmetric generalized smoothness.
Used to bound the progress of normalized updates and to derive the stated complexity exponents.

pith-pipeline@v0.9.0 · 5855 in / 1482 out tokens · 44515 ms · 2026-05-19T16:07:08.135682+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

normalized stochastic gradient descent with momentum ... converges under BG-0 noise with oracle complexity O(ε^{-6}) ... NSTORM ... O(ε^{-4}) ... under mean-square smoothness
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Blum–Gladyshev (BG-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

[1]

Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

work page arXiv 2025
[2]

Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

work page 2023
[3]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems, volume 24, 2011

work page 2011
[4]

Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

Julius R Blum. Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

work page 1954
[5]

Optimization methods for large-scale learning

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale learning. SIAM Review, 60(2):223–311, 2018

work page 2018
[6]

Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019

work page 2019
[7]

Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization

Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization. InInternational Conference on Machine Learning, pages 5396–5427. PMLR, 2023. 11

work page 2023
[8]

Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022

Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022

work page 2022
[9]

Momentum improves normalized sgd

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

work page 2020
[10]

Momentum-based variance reduction in non-convex SGD

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. InAdvances in Neural Information Processing Systems, 2019

work page 2019
[11]

Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

work page 2014
[12]

SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator

Cong Fang, Chris Jiao Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator. InAdvances in Neural Information Processing Systems, pages 689–699, 2018

work page 2018
[13]

Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance

Arda Fazla, Ege C Kaya, Antesh Upadhyay, and Abolfazl Hashemi. Lower bounds and proximally anchored sgd for non-convex minimization under unbounded variance.arXiv preprint arXiv:2604.16620, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013

work page 2013
[15]

On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965

EG Gladyshev. On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965

work page 1965
[16]

A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. InInternational Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020

work page 2020
[17]

Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

work page 2019
[18]

Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967

Benjamin Halpern. Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967

work page 1967
[19]

Parameter-agnostic optimization under relaxed smoothness

Florian Hübler, Junchi Yang, Xiang Li, and Niao He. Parameter-agnostic optimization under relaxed smoothness. InInternational Conference on Artificial Intelligence and Statistics, pages 4861–4869. PMLR, 2024

work page 2024
[20]

Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, and Tianjiao Li. Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

work page arXiv 2023
[21]

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.Journal of machine learning research, 18(223):1–42, 2018

work page 2018
[22]

Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021

Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021. 12

work page 2021
[23]

Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013

work page 2013
[24]

Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

work page 2023
[25]

Unified analysis of stochastic gradient methods for composite convex and smooth optimization

Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2):499–540, 2023

work page 2023
[26]

Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum

Sarit Khirirat, Abdurakhmon Sadiev, Artem Riabinin, Eduard Gorbunov, and Peter Richtárik. Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps:// openreview.net/forum?id=LmcTbBvgjP

work page 2026
[27]

Revisiting gradient clipping: Stochastic bias and tight convergence guarantees

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. InInternational Conference on Machine Learning, pages 17343–17363. PMLR, 2023

work page 2023
[28]

Two remarks on the method of successive approximations

Mark Aleksandrovich Krasnosel’ski˘ ı. Two remarks on the method of successive approximations. Uspekhi matematicheskikh nauk, 10(1):123–127, 1955

work page 1955
[29]

Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020

Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020

work page 2020
[30]

Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023

work page 2023
[31]

Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization

Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization. InInternational conference on machine learning, pages 6286–6295. PMLR, 2021

work page 2021
[32]

Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953

W Robert Mann. Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953

work page 1953
[33]

Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009

work page 2009
[34]

Springer Science & Business Media, 2004

Yurii Nesterov.Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2004

work page 2004
[35]

Springer, 2018

Yurii Nesterov.Lectures on convex optimization, volume 137. Springer, 2018

work page 2018
[36]

Sarah: A novel method for machine learning problems using stochastic recursive gradient

Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. InInternational conference on machine learning, pages 2613–2621. PMLR, 2017

work page 2017
[37]

Making gradient descent optimal for strongly convex stochastic optimization

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. InProceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012. 13

work page 2012
[38]

Variance-reduced clipping for non-convex optimization

Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025
[39]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InInternational Conference on Machine Learning, pages 71–79. PMLR, 2013

work page 2013
[40]

Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024

Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, and Makoto Yamada. Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024

work page 2024
[41]

Provable adaptivity of adam under non-uniform smoothness

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, and Wei Chen. Provable adaptivity of adam under non-uniform smoothness. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2960–2969, 2024

work page 2024
[42]

SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization

Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization.arXiv preprint arXiv:1810.10690, 2018

work page arXiv 2018
[43]

Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020

Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020

work page 2020
[44]

arXiv preprint arXiv:1905.11881 (2019)

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity.arXiv preprint arXiv:1905.11881, 2019

work page arXiv 1905
[45]

On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021

Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li. On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021

work page 2021
[46]

4∆ γ0 + 8 η0 + 8(1−α)α α 1−α 2α/2K1γ0λ − α 1−α 0√η0 + 8·2 α/2K1Bαγ1+α 0√η0 + 8√η0Bγ0 # T −θ +

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Journal of Machine Learning Research, 21(103):1–63, 2020. 14 Appendix Table of Contents A A Short Primer on Smoothness Classes 16 B Proofs for Normalized Momentum underBG-0 Noise 18 B.1 Auxiliary inequalities . . . . . . . . . . . . . . . . . . . . . . ....

work page 2020

[1] [1]

Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

Ahmet Alacaoglu, Yura Malitsky, and Stephen J Wright. Towards weaker variance assumptions for stochastic optimization.arXiv preprint arXiv:2504.09951, 2025

work page arXiv 2025

[2] [2]

Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

work page 2023

[3] [3]

Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems, volume 24, 2011

work page 2011

[4] [4]

Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

Julius R Blum. Approximation methods which converge with probability one.The Annals of Mathematical Statistics, pages 382–386, 1954

work page 1954

[5] [5]

Optimization methods for large-scale learning

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale learning. SIAM Review, 60(2):223–311, 2018

work page 2018

[6] [6]

Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I.Mathematical Programming, 177:193–230, 2019

work page 2019

[7] [7]

Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization

Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimiza- tion is as efficient as smooth nonconvex optimization. InInternational Conference on Machine Learning, pages 5396–5427. PMLR, 2023. 11

work page 2023

[8] [8]

Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022

Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Ro- bustness to unbounded smoothness of generalized signsgd.Advances in neural information processing systems, 35:9955–9968, 2022

work page 2022

[9] [9]

Momentum improves normalized sgd

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

work page 2020

[10] [10]

Momentum-based variance reduction in non-convex SGD

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. InAdvances in Neural Information Processing Systems, 2019

work page 2019

[11] [11]

Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradi- ent method with support for non-strongly convex composite objectives.Advances in neural information processing systems, 27, 2014

work page 2014

[12] [12]

SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator

Cong Fang, Chris Jiao Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non- convex optimization via stochastic path-integrated differential estimator. InAdvances in Neural Information Processing Systems, pages 689–699, 2018

work page 2018

[13] [13]

Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance

Arda Fazla, Ege C Kaya, Antesh Upadhyay, and Abolfazl Hashemi. Lower bounds and proximally anchored sgd for non-convex minimization under unbounded variance.arXiv preprint arXiv:2604.16620, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] [14]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013

work page 2013

[15] [15]

On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965

EG Gladyshev. On stochastic approximation.Theory of Probability & Its Applications, 10(2): 275–278, 1965

work page 1965

[16] [16]

A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. InInternational Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020

work page 2020

[17] [17]

Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity.SIAM Journal on Optimization, 29(2):1350–1365, 2019

work page 2019

[18] [18]

Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967

Benjamin Halpern. Fixed points of nonexpanding maps.Bulletin of the American Mathematical Society, 73(6):957–961, 1967

work page 1967

[19] [19]

Parameter-agnostic optimization under relaxed smoothness

Florian Hübler, Junchi Yang, Xiang Li, and Niao He. Parameter-agnostic optimization under relaxed smoothness. InInternational Conference on Artificial Intelligence and Statistics, pages 4861–4869. PMLR, 2024

work page 2024

[20] [20]

Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan, and Tianjiao Li. Accelerated stochastic approximation with state-dependent noise.arXiv preprint arXiv:2307.01497, 2023

work page arXiv 2023

[21] [21]

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.Journal of machine learning research, 18(223):1–42, 2018

work page 2018

[22] [22]

Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021

Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis.Advances in Neural Information Processing Systems, 34:2771–2782, 2021. 12

work page 2021

[23] [23]

Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.Advances in neural information processing systems, 26, 2013

work page 2013

[24] [24]

Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world.Transactions on Machine Learning Research, 2023

work page 2023

[25] [25]

Unified analysis of stochastic gradient methods for composite convex and smooth optimization

Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2):499–540, 2023

work page 2023

[26] [26]

Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum

Sarit Khirirat, Abdurakhmon Sadiev, Artem Riabinin, Eduard Gorbunov, and Peter Richtárik. Error feedback under $(l_0,l_1)$-smoothness: Normalization and momentum. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps:// openreview.net/forum?id=LmcTbBvgjP

work page 2026

[27] [27]

Revisiting gradient clipping: Stochastic bias and tight convergence guarantees

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. InInternational Conference on Machine Learning, pages 17343–17363. PMLR, 2023

work page 2023

[28] [28]

Two remarks on the method of successive approximations

Mark Aleksandrovich Krasnosel’ski˘ ı. Two remarks on the method of successive approximations. Uspekhi matematicheskikh nauk, 10(1):123–127, 1955

work page 1955

[29] [29]

Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020

Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization.Advances in neural information processing systems, 33: 8847–8860, 2020

work page 2020

[30] [30]

Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023

Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions.Advances in Neural Information Processing Systems, 36:52166–52196, 2023

work page 2023

[31] [31]

Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization

Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization. InInternational conference on machine learning, pages 6286–6295. PMLR, 2021

work page 2021

[32] [32]

Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953

W Robert Mann. Mean value methods in iteration.Proceedings of the American Mathematical Society, 4(3):506–510, 1953

work page 1953

[33] [33]

Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4): 1574–1609, 2009

work page 2009

[34] [34]

Springer Science & Business Media, 2004

Yurii Nesterov.Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2004

work page 2004

[35] [35]

Springer, 2018

Yurii Nesterov.Lectures on convex optimization, volume 137. Springer, 2018

work page 2018

[36] [36]

Sarah: A novel method for machine learning problems using stochastic recursive gradient

Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. InInternational conference on machine learning, pages 2613–2621. PMLR, 2017

work page 2017

[37] [37]

Making gradient descent optimal for strongly convex stochastic optimization

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. InProceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012. 13

work page 2012

[38] [38]

Variance-reduced clipping for non-convex optimization

Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025

[39] [39]

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. InInternational Conference on Machine Learning, pages 71–79. PMLR, 2013

work page 2013

[40] [40]

Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024

Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, and Makoto Yamada. Parameter-free clipped gradient descent meets polyak.Advances in Neural Information Processing Systems, 37: 44575–44599, 2024

work page 2024

[41] [41]

Provable adaptivity of adam under non-uniform smoothness

Bohan Wang, Yushun Zhang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, Tie-Yan Liu, Zhi-Quan Luo, and Wei Chen. Provable adaptivity of adam under non-uniform smoothness. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2960–2969, 2024

work page 2024

[42] [42]

SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization

Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization.arXiv preprint arXiv:1810.10690, 2018

work page arXiv 2018

[43] [43]

Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020

Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. Improved analysis of clipping algorithms for non-convex optimization.Advances in Neural Information Processing Systems, 33:15511– 15521, 2020

work page 2020

[44] [44]

arXiv preprint arXiv:1905.11881 (2019)

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity.arXiv preprint arXiv:1905.11881, 2019

work page arXiv 1905

[45] [45]

On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021

Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li. On the convergence and improvement of stochastic normalized gradient descent.Science China Information Sciences, 64(3):132103, 2021

work page 2021

[46] [46]

4∆ γ0 + 8 η0 + 8(1−α)α α 1−α 2α/2K1γ0λ − α 1−α 0√η0 + 8·2 α/2K1Bαγ1+α 0√η0 + 8√η0Bγ0 # T −θ +

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Journal of Machine Learning Research, 21(103):1–63, 2020. 14 Appendix Table of Contents A A Short Primer on Smoothness Classes 16 B Proofs for Normalized Momentum underBG-0 Noise 18 B.1 Auxiliary inequalities . . . . . . . . . . . . . . . . . . . . . . ....

work page 2020