On the Limits of Biased Derivative Information for Nonconvex Stochastic Optimization

Anant Shyam; Brian Bullins

arxiv: 2606.19553 · v1 · pith:UE7UP6SKnew · submitted 2026-06-17 · 🧮 math.OC

On the Limits of Biased Derivative Information for Nonconvex Stochastic Optimization

Anant Shyam , Brian Bullins This is my paper

Pith reviewed 2026-06-26 19:38 UTC · model grok-4.3

classification 🧮 math.OC

keywords nonconvex optimizationstochastic optimizationbiased oracleslower boundsstationary pointshigher-order methodsvariance reductiontrust region methods

0 comments

The pith

Biased first-order oracles prevent nonconvex stochastic optimization from reaching better than O((ε + B²)^{1/2})-stationary points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes tight lower bounds showing that with gradient oracles whose bias is bounded by B, no algorithm can guarantee a stationary point better than O((ε + B²)^{1/2}) in smooth nonconvex problems. It then derives bias-dependent lower bounds when higher-order derivative oracles are available, showing that the achievable stationarity is limited to O(ε + B_max). To match these lower bounds the authors construct trust-region algorithms and a higher-order variance reduction scheme that reduces oracle calls in high-bias regimes. These findings matter because they delineate when bias fundamentally restricts progress and when higher-order information becomes advantageous, a distinction absent from unbiased stochastic settings.

Core claim

In the first-order setting, tight lower bounds are given for finding an O((ε + B²)^{1/2})-stationary point with biased gradient oracles, matching the upper bounds of Ajalloeian and Stich (2020). Bias-dependent lower bounds are established for higher-order oracles targeting O(ε + B_max)-stationary points. Trust-region based methods are developed to match the lower bounds for certain bias ranges, and a higher-order variance reduction scheme improves oracle complexity in high bias regimes, demonstrating benefits of higher-order information not seen in unbiased cases.

What carries the argument

Lower-bound constructions on families of smooth nonconvex functions paired with biased stochastic derivative oracles, together with trust-region algorithms and higher-order variance reduction that attain matching upper bounds.

If this is right

First-order methods cannot guarantee stationarity better than O((ε + B²)^{1/2}) when gradient bias is bounded by B.
Higher-order oracles remain limited to O(ε + B_max) stationarity under a uniform bias bound B_max.
Trust-region methods achieve the matching upper bounds for ranges of the bias parameter.
Higher-order variance reduction yields strictly better oracle complexity than first-order methods when bias is large.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithm designers could use the bias threshold that separates first-order from higher-order regimes to decide which oracle to query in applications with known approximation error.
The same hard-instance technique might be adapted to derive limits for biased oracles under additional structure such as convexity or manifold constraints.
Empirical checks on real biased oracles (for example, finite-difference gradients) could verify whether observed iteration counts scale with sqrt(B) as predicted.

Load-bearing premise

The bias of every derivative oracle is bounded by a fixed constant and the objective is smooth and non-convex.

What would settle it

An explicit smooth non-convex function together with a first-order algorithm that, despite gradient bias bounded by B, returns a point whose gradient norm is o((ε + B²)^{1/2}) for arbitrarily small ε would falsify the first-order lower bound.

read the original abstract

We consider the problem of finding $\delta$-stationary points for $\delta > 0$, i.e., $x \in \mathbb{R}^d$ such that $||\nabla F(x)|| \le \delta$, for smooth, non-convex objectives, where the derivative oracles are not only stochastic but also biased. In the first-order setting, we provide tight lower bounds for finding an $O((\epsilon + B^2)^{1/2})$-stationary point, for $\epsilon > 0$ and where $B$ is a bound on the gradient bias, matching the upper bounds of Ajalloeian and Stich (2020). We then establish bias-dependent lower bounds for algorithms that use higher-order derivative information for finding $O(\epsilon + B_{\max})$-stationary points, where $B_{\max}$ is a bound on the maximum bias for all derivatives. To complement these lower bounds, we develop trust-region based methods that, for certain ranges of bias, provide guarantees that match the corresponding lower bounds. We further improve upon the oracle complexity in high bias settings through a higher-order variance reduction scheme, in particular demonstrating the benefits, in some cases, of using higher-order derivative information, whereas such improvements are known to be unattainable for stochastic unbiased settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers tight bias-dependent lower bounds for nonconvex stochastic optimization that match known upper bounds, plus a variance reduction scheme that helps under high bias.

read the letter

The main thing here is that the lower bounds are tight and match existing upper bounds for the first-order biased case, reaching O((ε + B²)^{1/2})-stationary points, and they extend to higher-order oracles with O(ε + B_max) stationarity. The trust-region algorithms achieve these rates for some bias ranges, and the higher-order variance reduction improves complexity when bias is large, which is a useful contrast to the unbiased setting where higher-order information does not help.

The constructions look independent of the upper bounds, and the assumptions (smooth nonconvex functions, bounded bias on all derivatives) are stated clearly. The work stays within standard oracle models and does not appear to have hidden fitting or circular arguments.

The soft spots are modest. Everything hinges on the bias being uniformly bounded; if that fails the rates no longer apply, but that is an explicit modeling choice rather than a flaw. The paper is purely theoretical, so there is no empirical check on how often the high-bias regime occurs in practice.

This is for people working on oracle complexity in stochastic nonconvex optimization. A reader who cares about guiding algorithm choice when gradients are biased will get concrete rates and a new variance-reduction idea. It is solid enough to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The manuscript considers finding δ-stationary points of smooth non-convex objectives when derivative oracles are stochastic and biased. It establishes tight lower bounds showing that first-order oracles with bias bound B cannot achieve better than O((ε + B²)^{1/2})-stationarity (matching the upper bounds of Ajalloeian and Stich 2020), and that higher-order oracles with maximum bias B_max cannot achieve better than O(ε + B_max)-stationarity. The authors complement these with trust-region algorithms that match the bounds for certain bias regimes and a higher-order variance-reduction scheme that improves oracle complexity in high-bias regimes, showing that higher-order information can be beneficial under bias (unlike the unbiased case).

Significance. If the matching lower and upper bounds hold under the stated smoothness and bounded-bias assumptions, the work provides a clear characterization of the fundamental limits imposed by bias on achievable stationarity and clarifies when higher-order derivative information yields improvements. The explicit matching to prior upper bounds and the contrast with unbiased stochastic settings are strengths that would make the results a useful reference for the field.

major comments (2)

[§4] §4 (higher-order lower bound): the construction establishes an O(ε + B_max) barrier on stationarity but does not appear to rule out the possibility that a carefully chosen combination of orders could achieve a strictly better bias dependence than B_max; if the proof relies on a single worst-case order, this should be stated explicitly as it affects the interpretation of the 'benefits in some cases' claim for higher-order methods.
[§5] §5 (trust-region algorithm): the matching upper bound is stated to hold 'for certain ranges of bias'; the precise condition on B relative to ε and smoothness constants under which the guarantee becomes tight should be given explicitly, as the current phrasing leaves open whether the algorithm recovers the lower-bound rate only in a measure-zero regime.

minor comments (2)

[§1] Notation for the bias parameters (B vs. B_max) is introduced in the abstract and §1 but the precise relationship between per-order biases and the max is not restated in the theorem statements; a short clarifying sentence would help.
[Figure 1] Figure 1 (complexity comparison) uses log-log scale without labeling the slope of the reference lines; adding the theoretical exponents would make the visual match to the derived rates immediate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive suggestions. We address each major comment below.

read point-by-point responses

Referee: [§4] §4 (higher-order lower bound): the construction establishes an O(ε + B_max) barrier on stationarity but does not appear to rule out the possibility that a carefully chosen combination of orders could achieve a strictly better bias dependence than B_max; if the proof relies on a single worst-case order, this should be stated explicitly as it affects the interpretation of the 'benefits in some cases' claim for higher-order methods.

Authors: We thank the referee for highlighting this point. The lower-bound construction in §4 applies to any algorithm that may access an arbitrary combination of higher-order oracles, each with bias bounded by B_max. The worst-case instance is built so that the bias term uniformly limits the achievable stationarity for every order; consequently, no selection or combination of orders can improve upon the O(ε + B_max) barrier. We will add an explicit clarifying sentence in the revised §4 to make this scope of the result unambiguous. revision: yes
Referee: [§5] §5 (trust-region algorithm): the matching upper bound is stated to hold 'for certain ranges of bias'; the precise condition on B relative to ε and smoothness constants under which the guarantee becomes tight should be given explicitly, as the current phrasing leaves open whether the algorithm recovers the lower-bound rate only in a measure-zero regime.

Authors: We agree that the phrasing is imprecise. The trust-region analysis yields a matching rate precisely when B = Ω(ε) (with the implicit constant depending on the smoothness parameters L and the trust-region radius schedule). We will replace the phrase “for certain ranges of bias” with the explicit condition B ≥ cε (c depending on L) in the revised §5 and in the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives lower bounds on the stationarity achievable with biased first- and higher-order oracles under standard smoothness and bounded-bias assumptions. These bounds are shown to match upper bounds from Ajalloeian and Stich (2020), an independent external reference, and the paper separately constructs trust-region and variance-reduced algorithms whose guarantees align with the new lower bounds. No step reduces by the paper's own equations to a fitted parameter renamed as a prediction, a self-definitional relation, or a load-bearing self-citation chain; the constructions remain independent of the target rates and do not import uniqueness results from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, invented entities, or additional axioms beyond standard smoothness and bounded-bias assumptions are visible.

axioms (2)

domain assumption Objective functions are smooth and non-convex
Stated as the problem setting for which δ-stationary points are sought.
domain assumption Derivative oracles have bounded bias B or B_max
Central to all stated lower and upper bounds.

pith-pipeline@v0.9.1-grok · 5764 in / 1310 out tokens · 23736 ms · 2026-06-26T19:38:56.512468+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 2 linked inside Pith

[1]

Convex optimization with p-norm oracles

Deeksha Adil, Brian Bullins, Arun Jambulapati, and Aaron Sidford. Convex optimization with p-norm oracles. In37th International Conference on Algorithmic Learning Theory, 2026. (Cited on page 6.)

2026
[2]

Finding approximate local minima faster than gradient descent.Symposium on Theory of Computing,

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent.Symposium on Theory of Computing,
[3]

(Cited on pages 1 and 3.)
[4]

Beyond First Order Methods in ML Systems

Ahmad Ajalloeian and Sebastian U. Stich. On the convergence of sgd with biased gradients. International Conference on Machine Learning, Workshop on "Beyond First Order Methods in ML Systems", 2020. (Cited on pages 2, 4, 10, and 16.)

2020
[5]

Sparse communication for distributed gradient descent

Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. InProceedings of the 2017 conference on empirical methods in natural language processing, pages 440–445, 2017. (Cited on page 2.)

2017
[6]

The convergence of sparsified gradient methods.Advances in Neural Information Processing Systems, 31, 2018

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods.Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 2.)

2018
[7]

How to make the gradients small stochastically: Even faster convex and nonconvex sgd.Advances in Neural Information Processing Systems, 31, 2018

Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex sgd.Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 1.)

2018
[8]

Natasha 2: Faster non-convex optimization than sgd.Advances in neural information processing systems, 31, 2018

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd.Advances in neural information processing systems, 31, 2018. (Cited on page 1.)

2018
[9]

Duchi, Dylan J

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Srid- haran. Second-order information in non-convex stochastic optimization: Power and limitations. Conference on Learning Theory, 2020. (Cited on pages 1, 2, 5, 6, 8, and 15.)

2020
[10]

Duchi, Dylan J

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming,
[11]

(Cited on pages 1 and 6.)
[12]

signsgd: Compressed optimisation for non-convex problems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. InInternational conference on machine learning, pages 560–569. PMLR, 2018. (Cited on page 1.)

2018
[13]

Jia Bi and Steve R. Gunn. A stochastic gradient method with biased estimation for faster nonconvex optimization.arXiv, 2019. (Cited on page 3.) 11

2019
[14]

Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017

Ernesto G Birgin, JL Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Ph L Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017. (Cited on page 3.)

2017
[15]

convex until proven guilty

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. "convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions.International Confer- ence on Machine Learning, 2017. (Cited on pages 1 and 3.)

2017
[16]

Accelerated methods for nonconvex optimization.SIAM Journal on Optimization, 28(2):1751–1772, 2018

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization.SIAM Journal on Optimization, 28(2):1751–1772, 2018. (Cited on page 1.)

2018
[17]

Lower bounds for finding stationary points i.Mathematical Programming, 2019

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i.Mathematical Programming, 2019. (Cited on pages 1, 3, 5, 6, and 15.)

2019
[18]

Lower bounds for finding stationary points ii: First-order methods.Mathematical Programming, 2019

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points ii: First-order methods.Mathematical Programming, 2019. (Cited on pages 1, 3, and 6.)

2019
[19]

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms.Advances in Neural Information Processing Systems, 2023. (Cited on page 1.)

2023
[20]

A guide through the zoo of biased sgd.Advances in Neural Information Processing Systems, 36, 2024

Yury Demidovich, Grigory Malinovsky, Igor Sokolov, and Peter Richtárik. A guide through the zoo of biased sgd.Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 3.)

2024
[21]

On biased stochastic gradient estimation.Journal of Machine Learning Research, 2022

Derek Driggs, Jingwei Liang, and Carola-Bibiane Schönlieb. On biased stochastic gradient estimation.Journal of Machine Learning Research, 2022. (Cited on page 3.)

2022
[22]

Communication quantization for data-parallel training of deep neural networks

Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In2016 2nd Workshop on machine learning in hpc environments (MLHPC), pages 1–8. IEEE, 2016. (Cited on page 2.)

2016
[23]

Spider: Near-optimal non- convex optimization via stochastic path integrated differential estimator.Advances in Neural Information Processing Systems, 2018

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path integrated differential estimator.Advances in Neural Information Processing Systems, 2018. (Cited on pages 1, 2, and 8.)

2018
[24]

Sharp analysis for nonconvex sgd escaping from saddle points.Conference on Learning Theory, 2019

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex sgd escaping from saddle points.Conference on Learning Theory, 2019. (Cited on page 1.)

2019
[25]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 2013. (Cited on page 1.)

2013
[26]

(bandit) convex optimization with biased noisy gradient oracles

Xiaowei Hu, LA Prashanth, András György, and Csaba Szepesvari. (bandit) convex optimization with biased noisy gradient oracles. InArtificial Intelligence and Statistics, pages 819–828. PMLR,
[27]

Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning.Advances in Neural Information Processing Systems, 2020

Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning.Advances in Neural Information Processing Systems, 2020. (Cited on page 3.) 12

2020
[28]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.International Conference on Learning Representations, 2015. (Cited on page 1.)

2015
[29]

An optimal method for stochastic composite optimization.Mathematical Programming, 2011

Guanghui Lan. An optimal method for stochastic composite optimization.Mathematical Programming, 2011. (Cited on page 19.)

2011
[30]

Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the o(ϵ−7/4)complexity.Journal of Machine Learning Research, 24(157):1–37, 2023

Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the o(ϵ−7/4)complexity.Journal of Machine Learning Research, 24(157):1–37, 2023. (Cited on page 3.)

2023
[31]

Sophia: A scalable stochastic second-order optimizer for language model pre-training.International Conference on Learning Representations, 2024

Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training.International Conference on Learning Representations, 2024. (Cited on page 1.)

2024
[32]

Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025. (Cited on page 1.)

Pith/arXiv arXiv 2025
[33]

Stochastic optimization algorithms for problems with controllable biased oracles.arXiv, 2026

Yin Liu and Sam Davanloo Tajbakhsh. Stochastic optimization algorithms for problems with controllable biased oracles.arXiv, 2026. (Cited on page 3.)

2026
[34]

Stacey: Promoting stochastic steepest descent via acceleratedℓp-smooth nonconvex optimization.Inter- national Conference on Machine Learning, 2025

Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, and Brian Bullins. Stacey: Promoting stochastic steepest descent via acceleratedℓp-smooth nonconvex optimization.Inter- national Conference on Machine Learning, 2025. (Cited on page 1.)

2025
[35]

Jordan, Richard Y

Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs.The Annals of Probability,
[36]

How to make the gradients small.Optima

Yurii Nesterov. How to make the gradients small.Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012. (Cited on page 1.)

2012
[37]

Random gradient-free minimization of convex functions

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017. (Cited on page 2.)

2017
[38]

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization.Advances in Neural Information Processing Systems, 2017. (Cited on page 1.)

2017
[39]

Sharp first-order lower bounds for higher-order smooth nonconvex optimization

Dongruo Zhou. Sharp first-order lower bounds for higher-order smooth nonconvex optimization. arXiv preprint arXiv:2606.05438, 2026. (Cited on page 3.)

Pith/arXiv arXiv 2026
[40]

Stochastic nested variance reduction for nonconvex optimization.Advances in Neural Information Processing Systems, 2018

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Advances in Neural Information Processing Systems, 2018. (Cited on page 1.) 13 A Theorems 1 and 2 Proofs A.1 Preliminaries We first introduce some important notational conventions used throughout this section. Given apth order tensorT∈R d×...,×d, we defi...

2018
[41]

,1 We again set β= L1 2(ϵ+B 2 1) 1 2 ℓ1 and T=⌊ ∆ α∆0 ⌋=⌊ ∆β 2(ϵ+B 2 1) 1 2 ∆0 ⌋ 18 By Lemma 1, we have that T−2 2ρ = 1 2ρ(⌊ ∆β 2(ϵ+B 2 1) 1 2 ∆0 ⌋ −2) ≥ 1 2ρ · ∆β 4∆0(ϵ+B 2 1) 1 2 ≥Ω ∆β ∆0(ϵ+B 2 1) 1 2 ! since for allB1 ≤O(1),ρ= Θ(1). When we further lower bound this expression, we have that Ω ∆β ∆0(ϵ+B 2 1) 1 2 ! ≥Ω ∆L1 ∆0(ϵ+B 2 1)ℓ1 = Ω ∆L1 ϵ+B 2 1 Com...

[1] [1]

Convex optimization with p-norm oracles

Deeksha Adil, Brian Bullins, Arun Jambulapati, and Aaron Sidford. Convex optimization with p-norm oracles. In37th International Conference on Algorithmic Learning Theory, 2026. (Cited on page 6.)

2026

[2] [2]

Finding approximate local minima faster than gradient descent.Symposium on Theory of Computing,

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent.Symposium on Theory of Computing,

[3] [3]

(Cited on pages 1 and 3.)

[4] [4]

Beyond First Order Methods in ML Systems

Ahmad Ajalloeian and Sebastian U. Stich. On the convergence of sgd with biased gradients. International Conference on Machine Learning, Workshop on "Beyond First Order Methods in ML Systems", 2020. (Cited on pages 2, 4, 10, and 16.)

2020

[5] [5]

Sparse communication for distributed gradient descent

Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. InProceedings of the 2017 conference on empirical methods in natural language processing, pages 440–445, 2017. (Cited on page 2.)

2017

[6] [6]

The convergence of sparsified gradient methods.Advances in Neural Information Processing Systems, 31, 2018

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods.Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 2.)

2018

[7] [7]

How to make the gradients small stochastically: Even faster convex and nonconvex sgd.Advances in Neural Information Processing Systems, 31, 2018

Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex sgd.Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 1.)

2018

[8] [8]

Natasha 2: Faster non-convex optimization than sgd.Advances in neural information processing systems, 31, 2018

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd.Advances in neural information processing systems, 31, 2018. (Cited on page 1.)

2018

[9] [9]

Duchi, Dylan J

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Srid- haran. Second-order information in non-convex stochastic optimization: Power and limitations. Conference on Learning Theory, 2020. (Cited on pages 1, 2, 5, 6, 8, and 15.)

2020

[10] [10]

Duchi, Dylan J

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming,

[11] [11]

(Cited on pages 1 and 6.)

[12] [12]

signsgd: Compressed optimisation for non-convex problems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. InInternational conference on machine learning, pages 560–569. PMLR, 2018. (Cited on page 1.)

2018

[13] [13]

Jia Bi and Steve R. Gunn. A stochastic gradient method with biased estimation for faster nonconvex optimization.arXiv, 2019. (Cited on page 3.) 11

2019

[14] [14]

Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017

Ernesto G Birgin, JL Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Ph L Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017. (Cited on page 3.)

2017

[15] [15]

convex until proven guilty

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. "convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions.International Confer- ence on Machine Learning, 2017. (Cited on pages 1 and 3.)

2017

[16] [16]

Accelerated methods for nonconvex optimization.SIAM Journal on Optimization, 28(2):1751–1772, 2018

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization.SIAM Journal on Optimization, 28(2):1751–1772, 2018. (Cited on page 1.)

2018

[17] [17]

Lower bounds for finding stationary points i.Mathematical Programming, 2019

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i.Mathematical Programming, 2019. (Cited on pages 1, 3, 5, 6, and 15.)

2019

[18] [18]

Lower bounds for finding stationary points ii: First-order methods.Mathematical Programming, 2019

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points ii: First-order methods.Mathematical Programming, 2019. (Cited on pages 1, 3, and 6.)

2019

[19] [19]

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms.Advances in Neural Information Processing Systems, 2023. (Cited on page 1.)

2023

[20] [20]

A guide through the zoo of biased sgd.Advances in Neural Information Processing Systems, 36, 2024

Yury Demidovich, Grigory Malinovsky, Igor Sokolov, and Peter Richtárik. A guide through the zoo of biased sgd.Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 3.)

2024

[21] [21]

On biased stochastic gradient estimation.Journal of Machine Learning Research, 2022

Derek Driggs, Jingwei Liang, and Carola-Bibiane Schönlieb. On biased stochastic gradient estimation.Journal of Machine Learning Research, 2022. (Cited on page 3.)

2022

[22] [22]

Communication quantization for data-parallel training of deep neural networks

Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In2016 2nd Workshop on machine learning in hpc environments (MLHPC), pages 1–8. IEEE, 2016. (Cited on page 2.)

2016

[23] [23]

Spider: Near-optimal non- convex optimization via stochastic path integrated differential estimator.Advances in Neural Information Processing Systems, 2018

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path integrated differential estimator.Advances in Neural Information Processing Systems, 2018. (Cited on pages 1, 2, and 8.)

2018

[24] [24]

Sharp analysis for nonconvex sgd escaping from saddle points.Conference on Learning Theory, 2019

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex sgd escaping from saddle points.Conference on Learning Theory, 2019. (Cited on page 1.)

2019

[25] [25]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 2013. (Cited on page 1.)

2013

[26] [26]

(bandit) convex optimization with biased noisy gradient oracles

Xiaowei Hu, LA Prashanth, András György, and Csaba Szepesvari. (bandit) convex optimization with biased noisy gradient oracles. InArtificial Intelligence and Statistics, pages 819–828. PMLR,

[27] [27]

Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning.Advances in Neural Information Processing Systems, 2020

Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning.Advances in Neural Information Processing Systems, 2020. (Cited on page 3.) 12

2020

[28] [28]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.International Conference on Learning Representations, 2015. (Cited on page 1.)

2015

[29] [29]

An optimal method for stochastic composite optimization.Mathematical Programming, 2011

Guanghui Lan. An optimal method for stochastic composite optimization.Mathematical Programming, 2011. (Cited on page 19.)

2011

[30] [30]

Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the o(ϵ−7/4)complexity.Journal of Machine Learning Research, 24(157):1–37, 2023

Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the o(ϵ−7/4)complexity.Journal of Machine Learning Research, 24(157):1–37, 2023. (Cited on page 3.)

2023

[31] [31]

Sophia: A scalable stochastic second-order optimizer for language model pre-training.International Conference on Learning Representations, 2024

Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training.International Conference on Learning Representations, 2024. (Cited on page 1.)

2024

[32] [32]

Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025. (Cited on page 1.)

Pith/arXiv arXiv 2025

[33] [33]

Stochastic optimization algorithms for problems with controllable biased oracles.arXiv, 2026

Yin Liu and Sam Davanloo Tajbakhsh. Stochastic optimization algorithms for problems with controllable biased oracles.arXiv, 2026. (Cited on page 3.)

2026

[34] [34]

Stacey: Promoting stochastic steepest descent via acceleratedℓp-smooth nonconvex optimization.Inter- national Conference on Machine Learning, 2025

Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, and Brian Bullins. Stacey: Promoting stochastic steepest descent via acceleratedℓp-smooth nonconvex optimization.Inter- national Conference on Machine Learning, 2025. (Cited on page 1.)

2025

[35] [35]

Jordan, Richard Y

Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs.The Annals of Probability,

[36] [36]

How to make the gradients small.Optima

Yurii Nesterov. How to make the gradients small.Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012. (Cited on page 1.)

2012

[37] [37]

Random gradient-free minimization of convex functions

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017. (Cited on page 2.)

2017

[38] [38]

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization.Advances in Neural Information Processing Systems, 2017. (Cited on page 1.)

2017

[39] [39]

Sharp first-order lower bounds for higher-order smooth nonconvex optimization

Dongruo Zhou. Sharp first-order lower bounds for higher-order smooth nonconvex optimization. arXiv preprint arXiv:2606.05438, 2026. (Cited on page 3.)

Pith/arXiv arXiv 2026

[40] [40]

Stochastic nested variance reduction for nonconvex optimization.Advances in Neural Information Processing Systems, 2018

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Advances in Neural Information Processing Systems, 2018. (Cited on page 1.) 13 A Theorems 1 and 2 Proofs A.1 Preliminaries We first introduce some important notational conventions used throughout this section. Given apth order tensorT∈R d×...,×d, we defi...

2018

[41] [41]

,1 We again set β= L1 2(ϵ+B 2 1) 1 2 ℓ1 and T=⌊ ∆ α∆0 ⌋=⌊ ∆β 2(ϵ+B 2 1) 1 2 ∆0 ⌋ 18 By Lemma 1, we have that T−2 2ρ = 1 2ρ(⌊ ∆β 2(ϵ+B 2 1) 1 2 ∆0 ⌋ −2) ≥ 1 2ρ · ∆β 4∆0(ϵ+B 2 1) 1 2 ≥Ω ∆β ∆0(ϵ+B 2 1) 1 2 ! since for allB1 ≤O(1),ρ= Θ(1). When we further lower bound this expression, we have that Ω ∆β ∆0(ϵ+B 2 1) 1 2 ! ≥Ω ∆L1 ∆0(ϵ+B 2 1)ℓ1 = Ω ∆L1 ϵ+B 2 1 Com...