High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent
Pith reviewed 2026-07-01 09:26 UTC · model grok-4.3
The pith
Random zeroth-order gradient descent reaches ε-suboptimal solutions with probability 1-δ using only an additive log(1/δ) term in query complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For deterministic L-smooth and μ-strongly convex d-dimensional objectives the two-query random zeroth-order method finds an ε-suboptimal solution with probability at least 1-δ using O((dL/μ) log(1/ε) + log(1/δ)) function queries; for stochastic objectives under bounded noise the corresponding random zeroth-order SGD achieves the same guarantee with O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries.
What carries the argument
The two-point random zeroth-order gradient estimator formed by querying the objective at a random point and at a nearby perturbed point on the unit sphere, then scaling the difference by dimension.
If this is right
- Only an additive logarithmic number of extra queries is required to upgrade expectation bounds to high-probability bounds.
- The stochastic result holds without assuming uniformly bounded stochastic gradients.
- The same high-probability analysis applies directly to the classical two-query estimator without changing its per-step cost.
- High-confidence performance guarantees become available for black-box and bandit settings at modest extra cost.
Where Pith is reading between the lines
- Concentration tools can be applied to the zeroth-order direction estimates without inflating the leading dimension dependence.
- The same log(1/δ) overhead may appear when high-probability analysis is carried out for other zeroth-order estimators that use a constant number of queries per step.
- The bounded-noise assumption could be relaxed to sub-Gaussian noise while preserving the stated form of the bound.
Load-bearing premise
The objective must be L-smooth and μ-strongly convex in the deterministic setting and must satisfy a bounded-noise condition in the stochastic setting.
What would settle it
An explicit L-smooth μ-strongly convex function and choice of ε, δ such that after the stated number of queries the algorithm returns a point farther than ε from the optimum with probability greater than δ.
read the original abstract
Zeroth-order optimization aims to minimize an objective function using only function evaluations, and is therefore fundamental in black-box optimization, hyperparameter tuning, bandit learning, and adversarial machine learning. While classical zeroth-order methods are well understood in expectation, much less is known about their high-probability behavior, especially for smooth and strongly convex objectives. In this paper, we establish high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For deterministic $L$-smooth and $\mu$-strongly convex objectives of $d$-dimension, we show that the classical two-query random zeroth-order method finds an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{dL}{\mu}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \] function queries. Thus, compared with the standard in-expectation complexity, only an additive logarithmic dependence on the confidence parameter is needed. For stochastic objectives, under a bounded-noise condition and without assuming uniformly bounded stochastic gradients, we prove that random zeroth-order stochastic gradient descent achieves an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{ d\log(1/\varepsilon) \left(\log(1/\varepsilon)+\log(1/\delta)\right) }{\varepsilon} \right) \] queries. Our results provide high-confidence counterparts to classical expectation-based zeroth-order convergence guarantees and clarify the additional cost required to obtain reliable performance guarantees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper establishes high-probability convergence guarantees for the classical two-query random zeroth-order gradient descent method. For deterministic L-smooth and μ-strongly convex d-dimensional objectives, it shows that an ε-suboptimal solution is found with probability at least 1-δ using O((dL/μ) log(1/ε) + log(1/δ)) function queries. For stochastic objectives under a bounded-noise condition (without uniformly bounded stochastic gradients), random zeroth-order SGD achieves the same guarantee with O(d log(1/ε) (log(1/ε) + log(1/δ))/ε) queries.
Significance. If the stated rates hold, the results are significant because they supply high-probability counterparts to classical in-expectation zeroth-order analyses, requiring only an additive logarithmic dependence on 1/δ. This is useful for reliable black-box optimization, hyperparameter tuning, and adversarial settings. The relaxation to bounded noise without uniform gradient bounds is a positive feature of the stochastic analysis.
minor comments (2)
- The abstract states the main theorems clearly but does not indicate the section numbers where the full proofs appear; adding explicit theorem references would improve readability.
- Notation for the random-direction estimator and the precise form of the bounded-noise assumption should be defined at first use in the introduction for readers unfamiliar with the ZO literature.
Simulated Author's Rebuttal
We thank the referee for their positive summary of our contributions and for recommending minor revision. The referee's description of the results is accurate. No major comments were provided in the report, so we have no specific points to address.
Circularity Check
No circularity; derivation self-contained under standard assumptions
full rationale
The paper derives high-probability query complexities for random ZO-GD and ZO-SGD directly from L-smoothness, μ-strong convexity, and bounded-noise conditions using standard concentration arguments on random-direction gradient estimators. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present. The stated rates follow from applying martingale or concentration bounds to the trajectory under the explicitly listed assumptions, without any step that reduces the output to the input by construction. This is the normal case of a self-contained theoretical analysis.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption L-smoothness and μ-strong convexity of the objective
- domain assumption Bounded-noise condition without uniformly bounded stochastic gradients
Reference graph
Works this paper leans on
-
[1]
Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)
2010
-
[2]
E., & Nocedal, J
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311
2018
-
[3]
Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM
2009
-
[4]
Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072
2015
-
[5]
D., Kalai, A
Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM
2005
-
[6]
& Lan, G
Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23, 2341--2368
2013
-
[7]
Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21
2008
-
[8]
& Massart, P
Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338
2000
- [9]
-
[10]
D., Chen, D., & Arora, S
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075
2023
-
[11]
& Spokoiny, V
Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566
2017
-
[12]
Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning
2012
-
[13]
& Monro, S
Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407
1951
-
[14]
Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)
2013
-
[15]
Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11
2017
-
[16]
Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues
2024
-
[17]
Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [18]
-
[19]
J., & Zhang, T
Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence
- [20]
-
[21]
Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.