On the Limits of Biased Derivative Information for Nonconvex Stochastic Optimization
Pith reviewed 2026-06-26 19:38 UTC · model grok-4.3
The pith
Biased first-order oracles prevent nonconvex stochastic optimization from reaching better than O((ε + B²)^{1/2})-stationary points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the first-order setting, tight lower bounds are given for finding an O((ε + B²)^{1/2})-stationary point with biased gradient oracles, matching the upper bounds of Ajalloeian and Stich (2020). Bias-dependent lower bounds are established for higher-order oracles targeting O(ε + B_max)-stationary points. Trust-region based methods are developed to match the lower bounds for certain bias ranges, and a higher-order variance reduction scheme improves oracle complexity in high bias regimes, demonstrating benefits of higher-order information not seen in unbiased cases.
What carries the argument
Lower-bound constructions on families of smooth nonconvex functions paired with biased stochastic derivative oracles, together with trust-region algorithms and higher-order variance reduction that attain matching upper bounds.
If this is right
- First-order methods cannot guarantee stationarity better than O((ε + B²)^{1/2}) when gradient bias is bounded by B.
- Higher-order oracles remain limited to O(ε + B_max) stationarity under a uniform bias bound B_max.
- Trust-region methods achieve the matching upper bounds for ranges of the bias parameter.
- Higher-order variance reduction yields strictly better oracle complexity than first-order methods when bias is large.
Where Pith is reading between the lines
- Algorithm designers could use the bias threshold that separates first-order from higher-order regimes to decide which oracle to query in applications with known approximation error.
- The same hard-instance technique might be adapted to derive limits for biased oracles under additional structure such as convexity or manifold constraints.
- Empirical checks on real biased oracles (for example, finite-difference gradients) could verify whether observed iteration counts scale with sqrt(B) as predicted.
Load-bearing premise
The bias of every derivative oracle is bounded by a fixed constant and the objective is smooth and non-convex.
What would settle it
An explicit smooth non-convex function together with a first-order algorithm that, despite gradient bias bounded by B, returns a point whose gradient norm is o((ε + B²)^{1/2}) for arbitrarily small ε would falsify the first-order lower bound.
read the original abstract
We consider the problem of finding $\delta$-stationary points for $\delta > 0$, i.e., $x \in \mathbb{R}^d$ such that $||\nabla F(x)|| \le \delta$, for smooth, non-convex objectives, where the derivative oracles are not only stochastic but also biased. In the first-order setting, we provide tight lower bounds for finding an $O((\epsilon + B^2)^{1/2})$-stationary point, for $\epsilon > 0$ and where $B$ is a bound on the gradient bias, matching the upper bounds of Ajalloeian and Stich (2020). We then establish bias-dependent lower bounds for algorithms that use higher-order derivative information for finding $O(\epsilon + B_{\max})$-stationary points, where $B_{\max}$ is a bound on the maximum bias for all derivatives. To complement these lower bounds, we develop trust-region based methods that, for certain ranges of bias, provide guarantees that match the corresponding lower bounds. We further improve upon the oracle complexity in high bias settings through a higher-order variance reduction scheme, in particular demonstrating the benefits, in some cases, of using higher-order derivative information, whereas such improvements are known to be unattainable for stochastic unbiased settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript considers finding δ-stationary points of smooth non-convex objectives when derivative oracles are stochastic and biased. It establishes tight lower bounds showing that first-order oracles with bias bound B cannot achieve better than O((ε + B²)^{1/2})-stationarity (matching the upper bounds of Ajalloeian and Stich 2020), and that higher-order oracles with maximum bias B_max cannot achieve better than O(ε + B_max)-stationarity. The authors complement these with trust-region algorithms that match the bounds for certain bias regimes and a higher-order variance-reduction scheme that improves oracle complexity in high-bias regimes, showing that higher-order information can be beneficial under bias (unlike the unbiased case).
Significance. If the matching lower and upper bounds hold under the stated smoothness and bounded-bias assumptions, the work provides a clear characterization of the fundamental limits imposed by bias on achievable stationarity and clarifies when higher-order derivative information yields improvements. The explicit matching to prior upper bounds and the contrast with unbiased stochastic settings are strengths that would make the results a useful reference for the field.
major comments (2)
- [§4] §4 (higher-order lower bound): the construction establishes an O(ε + B_max) barrier on stationarity but does not appear to rule out the possibility that a carefully chosen combination of orders could achieve a strictly better bias dependence than B_max; if the proof relies on a single worst-case order, this should be stated explicitly as it affects the interpretation of the 'benefits in some cases' claim for higher-order methods.
- [§5] §5 (trust-region algorithm): the matching upper bound is stated to hold 'for certain ranges of bias'; the precise condition on B relative to ε and smoothness constants under which the guarantee becomes tight should be given explicitly, as the current phrasing leaves open whether the algorithm recovers the lower-bound rate only in a measure-zero regime.
minor comments (2)
- [§1] Notation for the bias parameters (B vs. B_max) is introduced in the abstract and §1 but the precise relationship between per-order biases and the max is not restated in the theorem statements; a short clarifying sentence would help.
- [Figure 1] Figure 1 (complexity comparison) uses log-log scale without labeling the slope of the reference lines; adding the theoretical exponents would make the visual match to the derived rates immediate.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment, and constructive suggestions. We address each major comment below.
read point-by-point responses
-
Referee: [§4] §4 (higher-order lower bound): the construction establishes an O(ε + B_max) barrier on stationarity but does not appear to rule out the possibility that a carefully chosen combination of orders could achieve a strictly better bias dependence than B_max; if the proof relies on a single worst-case order, this should be stated explicitly as it affects the interpretation of the 'benefits in some cases' claim for higher-order methods.
Authors: We thank the referee for highlighting this point. The lower-bound construction in §4 applies to any algorithm that may access an arbitrary combination of higher-order oracles, each with bias bounded by B_max. The worst-case instance is built so that the bias term uniformly limits the achievable stationarity for every order; consequently, no selection or combination of orders can improve upon the O(ε + B_max) barrier. We will add an explicit clarifying sentence in the revised §4 to make this scope of the result unambiguous. revision: yes
-
Referee: [§5] §5 (trust-region algorithm): the matching upper bound is stated to hold 'for certain ranges of bias'; the precise condition on B relative to ε and smoothness constants under which the guarantee becomes tight should be given explicitly, as the current phrasing leaves open whether the algorithm recovers the lower-bound rate only in a measure-zero regime.
Authors: We agree that the phrasing is imprecise. The trust-region analysis yields a matching rate precisely when B = Ω(ε) (with the implicit constant depending on the smoothness parameters L and the trust-region radius schedule). We will replace the phrase “for certain ranges of bias” with the explicit condition B ≥ cε (c depending on L) in the revised §5 and in the abstract. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper derives lower bounds on the stationarity achievable with biased first- and higher-order oracles under standard smoothness and bounded-bias assumptions. These bounds are shown to match upper bounds from Ajalloeian and Stich (2020), an independent external reference, and the paper separately constructs trust-region and variance-reduced algorithms whose guarantees align with the new lower bounds. No step reduces by the paper's own equations to a fitted parameter renamed as a prediction, a self-definitional relation, or a load-bearing self-citation chain; the constructions remain independent of the target rates and do not import uniqueness results from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Objective functions are smooth and non-convex
- domain assumption Derivative oracles have bounded bias B or B_max
Reference graph
Works this paper leans on
-
[1]
Convex optimization with p-norm oracles
Deeksha Adil, Brian Bullins, Arun Jambulapati, and Aaron Sidford. Convex optimization with p-norm oracles. In37th International Conference on Algorithmic Learning Theory, 2026. (Cited on page 6.)
2026
-
[2]
Finding approximate local minima faster than gradient descent.Symposium on Theory of Computing,
Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent.Symposium on Theory of Computing,
-
[3]
(Cited on pages 1 and 3.)
-
[4]
Beyond First Order Methods in ML Systems
Ahmad Ajalloeian and Sebastian U. Stich. On the convergence of sgd with biased gradients. International Conference on Machine Learning, Workshop on "Beyond First Order Methods in ML Systems", 2020. (Cited on pages 2, 4, 10, and 16.)
2020
-
[5]
Sparse communication for distributed gradient descent
Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. InProceedings of the 2017 conference on empirical methods in natural language processing, pages 440–445, 2017. (Cited on page 2.)
2017
-
[6]
The convergence of sparsified gradient methods.Advances in Neural Information Processing Systems, 31, 2018
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods.Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 2.)
2018
-
[7]
How to make the gradients small stochastically: Even faster convex and nonconvex sgd.Advances in Neural Information Processing Systems, 31, 2018
Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex sgd.Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 1.)
2018
-
[8]
Natasha 2: Faster non-convex optimization than sgd.Advances in neural information processing systems, 31, 2018
Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd.Advances in neural information processing systems, 31, 2018. (Cited on page 1.)
2018
-
[9]
Duchi, Dylan J
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Srid- haran. Second-order information in non-convex stochastic optimization: Power and limitations. Conference on Learning Theory, 2020. (Cited on pages 1, 2, 5, 6, 8, and 15.)
2020
-
[10]
Duchi, Dylan J
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming,
-
[11]
(Cited on pages 1 and 6.)
-
[12]
signsgd: Compressed optimisation for non-convex problems
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. InInternational conference on machine learning, pages 560–569. PMLR, 2018. (Cited on page 1.)
2018
-
[13]
Jia Bi and Steve R. Gunn. A stochastic gradient method with biased estimation for faster nonconvex optimization.arXiv, 2019. (Cited on page 3.) 11
2019
-
[14]
Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017
Ernesto G Birgin, JL Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Ph L Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017. (Cited on page 3.)
2017
-
[15]
convex until proven guilty
Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. "convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions.International Confer- ence on Machine Learning, 2017. (Cited on pages 1 and 3.)
2017
-
[16]
Accelerated methods for nonconvex optimization.SIAM Journal on Optimization, 28(2):1751–1772, 2018
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization.SIAM Journal on Optimization, 28(2):1751–1772, 2018. (Cited on page 1.)
2018
-
[17]
Lower bounds for finding stationary points i.Mathematical Programming, 2019
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i.Mathematical Programming, 2019. (Cited on pages 1, 3, 5, 6, and 15.)
2019
-
[18]
Lower bounds for finding stationary points ii: First-order methods.Mathematical Programming, 2019
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points ii: First-order methods.Mathematical Programming, 2019. (Cited on pages 1, 3, and 6.)
2019
-
[19]
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms.Advances in Neural Information Processing Systems, 2023. (Cited on page 1.)
2023
-
[20]
A guide through the zoo of biased sgd.Advances in Neural Information Processing Systems, 36, 2024
Yury Demidovich, Grigory Malinovsky, Igor Sokolov, and Peter Richtárik. A guide through the zoo of biased sgd.Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 3.)
2024
-
[21]
On biased stochastic gradient estimation.Journal of Machine Learning Research, 2022
Derek Driggs, Jingwei Liang, and Carola-Bibiane Schönlieb. On biased stochastic gradient estimation.Journal of Machine Learning Research, 2022. (Cited on page 3.)
2022
-
[22]
Communication quantization for data-parallel training of deep neural networks
Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In2016 2nd Workshop on machine learning in hpc environments (MLHPC), pages 1–8. IEEE, 2016. (Cited on page 2.)
2016
-
[23]
Spider: Near-optimal non- convex optimization via stochastic path integrated differential estimator.Advances in Neural Information Processing Systems, 2018
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non- convex optimization via stochastic path integrated differential estimator.Advances in Neural Information Processing Systems, 2018. (Cited on pages 1, 2, and 8.)
2018
-
[24]
Sharp analysis for nonconvex sgd escaping from saddle points.Conference on Learning Theory, 2019
Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex sgd escaping from saddle points.Conference on Learning Theory, 2019. (Cited on page 1.)
2019
-
[25]
Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 2013
Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 2013. (Cited on page 1.)
2013
-
[26]
(bandit) convex optimization with biased noisy gradient oracles
Xiaowei Hu, LA Prashanth, András György, and Csaba Szepesvari. (bandit) convex optimization with biased noisy gradient oracles. InArtificial Intelligence and Statistics, pages 819–828. PMLR,
-
[27]
Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning.Advances in Neural Information Processing Systems, 2020
Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning.Advances in Neural Information Processing Systems, 2020. (Cited on page 3.) 12
2020
-
[28]
Kingma and Jimmy Ba
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.International Conference on Learning Representations, 2015. (Cited on page 1.)
2015
-
[29]
An optimal method for stochastic composite optimization.Mathematical Programming, 2011
Guanghui Lan. An optimal method for stochastic composite optimization.Mathematical Programming, 2011. (Cited on page 19.)
2011
-
[30]
Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the o(ϵ−7/4)complexity.Journal of Machine Learning Research, 24(157):1–37, 2023
Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the o(ϵ−7/4)complexity.Journal of Machine Learning Research, 24(157):1–37, 2023. (Cited on page 3.)
2023
-
[31]
Sophia: A scalable stochastic second-order optimizer for language model pre-training.International Conference on Learning Representations, 2024
Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training.International Conference on Learning Representations, 2024. (Cited on page 1.)
2024
-
[32]
Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025. (Cited on page 1.)
Pith/arXiv arXiv 2025
-
[33]
Stochastic optimization algorithms for problems with controllable biased oracles.arXiv, 2026
Yin Liu and Sam Davanloo Tajbakhsh. Stochastic optimization algorithms for problems with controllable biased oracles.arXiv, 2026. (Cited on page 3.)
2026
-
[34]
Stacey: Promoting stochastic steepest descent via acceleratedℓp-smooth nonconvex optimization.Inter- national Conference on Machine Learning, 2025
Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, and Brian Bullins. Stacey: Promoting stochastic steepest descent via acceleratedℓp-smooth nonconvex optimization.Inter- national Conference on Machine Learning, 2025. (Cited on page 1.)
2025
-
[35]
Jordan, Richard Y
Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs.The Annals of Probability,
-
[36]
How to make the gradients small.Optima
Yurii Nesterov. How to make the gradients small.Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012. (Cited on page 1.)
2012
-
[37]
Random gradient-free minimization of convex functions
Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017. (Cited on page 2.)
2017
-
[38]
Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization.Advances in Neural Information Processing Systems, 2017. (Cited on page 1.)
2017
-
[39]
Sharp first-order lower bounds for higher-order smooth nonconvex optimization
Dongruo Zhou. Sharp first-order lower bounds for higher-order smooth nonconvex optimization. arXiv preprint arXiv:2606.05438, 2026. (Cited on page 3.)
Pith/arXiv arXiv 2026
-
[40]
Stochastic nested variance reduction for nonconvex optimization.Advances in Neural Information Processing Systems, 2018
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization.Advances in Neural Information Processing Systems, 2018. (Cited on page 1.) 13 A Theorems 1 and 2 Proofs A.1 Preliminaries We first introduce some important notational conventions used throughout this section. Given apth order tensorT∈R d×...,×d, we defi...
2018
-
[41]
,1 We again set β= L1 2(ϵ+B 2 1) 1 2 ℓ1 and T=⌊ ∆ α∆0 ⌋=⌊ ∆β 2(ϵ+B 2 1) 1 2 ∆0 ⌋ 18 By Lemma 1, we have that T−2 2ρ = 1 2ρ(⌊ ∆β 2(ϵ+B 2 1) 1 2 ∆0 ⌋ −2) ≥ 1 2ρ · ∆β 4∆0(ϵ+B 2 1) 1 2 ≥Ω ∆β ∆0(ϵ+B 2 1) 1 2 ! since for allB1 ≤O(1),ρ= Θ(1). When we further lower bound this expression, we have that Ω ∆β ∆0(ϵ+B 2 1) 1 2 ! ≥Ω ∆L1 ∆0(ϵ+B 2 1)ℓ1 = Ω ∆L1 ϵ+B 2 1 Com...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.