High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent
Pith reviewed 2026-05-08 05:45 UTC · model grok-4.3
The pith
Random zeroth-order gradient descent reaches ε-suboptimality with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries for smooth strongly convex functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For deterministic L-smooth and μ-strongly convex objectives of d-dimension, the classical two-query random zeroth-order method finds an ε-suboptimal solution with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) function queries. For stochastic objectives under a bounded-noise condition, random zeroth-order stochastic gradient descent achieves an ε-suboptimal solution with probability at least 1-δ using O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries without assuming uniformly bounded stochastic gradients.
What carries the argument
The two-query random zeroth-order gradient estimator, which approximates the gradient via finite differences along a random unit vector and enables concentration analysis for high-probability bounds.
Load-bearing premise
The objective is L-smooth and μ-strongly convex in the deterministic case or satisfies a bounded-noise condition in the stochastic case.
What would settle it
A concrete L-smooth strongly convex function on which the two-query random zeroth-order method requires more than O((dL/μ)log(1/ε) + log(1/δ)) queries to reach ε-suboptimality with probability 1-δ would falsify the deterministic claim.
read the original abstract
Zeroth-order optimization aims to minimize an objective function using only function evaluations, and is therefore fundamental in black-box optimization, hyperparameter tuning, bandit learning, and adversarial machine learning. While classical zeroth-order methods are well understood in expectation, much less is known about their high-probability behavior, especially for smooth and strongly convex objectives. In this paper, we establish high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For deterministic $L$-smooth and $\mu$-strongly convex objectives of $d$-dimension, we show that the classical two-query random zeroth-order method finds an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{dL}{\mu}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \] function queries. Thus, compared with the standard in-expectation complexity, only an additive logarithmic dependence on the confidence parameter is needed. For stochastic objectives, under a bounded-noise condition and without assuming uniformly bounded stochastic gradients, we prove that random zeroth-order stochastic gradient descent achieves an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{ d\log(1/\varepsilon) \left(\log(1/\varepsilon)+\log(1/\delta)\right) }{\varepsilon} \right) \] queries. Our results provide high-confidence counterparts to classical expectation-based zeroth-order convergence guarantees and clarify the additional cost required to obtain reliable performance guarantees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper establishes high-probability convergence guarantees for the classical two-query random zeroth-order gradient descent method. For deterministic L-smooth and μ-strongly convex objectives in d dimensions, it claims that O((dL/μ) log(1/ε) + log(1/δ)) function queries suffice to reach an ε-suboptimal point with probability at least 1-δ. For stochastic objectives under a bounded-noise assumption (without requiring uniformly bounded stochastic gradients), it claims that random ZO-SGD achieves the same guarantee with O(d log(1/ε) (log(1/ε) + log(1/δ)) / ε) queries. The results are positioned as high-probability analogues of classical in-expectation ZO rates.
Significance. If the stochastic high-probability bound holds under only bounded noise, the result would be significant because it relaxes a common assumption in ZO-SGD analyses and shows that only logarithmic overhead in 1/δ is needed beyond the expectation rate. The deterministic bound is a modest but clean improvement over standard analyses. The paper supplies explicit complexity expressions rather than asymptotic notation alone.
major comments (2)
- [Abstract / stochastic theorem] Abstract and the stochastic main result (presumably Theorem 4 or §4): the claimed O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) query bound under bounded noise alone is load-bearing for the paper's contribution, yet the variance of each two-query random-direction ZO estimator scales with the local gradient magnitude. Without uniform bounds on stochastic gradients, it is unclear whether the martingale increments satisfy the sub-exponential or bounded-difference conditions required for the concentration argument (e.g., Freedman or Bernstein inequality) used to convert the expectation bound into a high-probability statement. A concrete counter-example or additional moment assumption may be needed.
- [§3] §3 (deterministic analysis) and the transition to high probability: while the deterministic rate is standard, the proof sketch must explicitly show how the additive log(1/δ) term arises from a single application of a concentration inequality without inflating the leading dL/μ log(1/ε) term; if the analysis uses a union bound over iterations, the dependence on log(1/δ) should be verified to remain additive rather than multiplicative.
minor comments (2)
- [§2] Notation for the random direction u (uniform on the sphere) and the two-query estimator should be introduced once in §2 and used consistently; the current abstract uses “random zeroth-order” without reminding the reader of the exact estimator form.
- [Problem formulation] The bounded-noise assumption is stated only in the abstract; a precise mathematical formulation (e.g., E[|f(x,ξ) - f(x)|] ≤ σ or similar) should appear in the problem statement section before the theorems.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important points about the concentration arguments in both the deterministic and stochastic settings. We address each major comment below and will revise the manuscript to improve clarity.
read point-by-point responses
-
Referee: [Abstract / stochastic theorem] Abstract and the stochastic main result (presumably Theorem 4 or §4): the claimed O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) query bound under bounded noise alone is load-bearing for the paper's contribution, yet the variance of each two-query random-direction ZO estimator scales with the local gradient magnitude. Without uniform bounds on stochastic gradients, it is unclear whether the martingale increments satisfy the sub-exponential or bounded-difference conditions required for the concentration argument (e.g., Freedman or Bernstein inequality) used to convert the expectation bound into a high-probability statement. A concrete counter-example or additional moment assumption may be needed.
Authors: We thank the referee for raising this technical point. Under the bounded-noise assumption, the stochastic perturbation in each function evaluation is bounded by a constant B independent of the gradient. Consequently, the additive noise term in the two-query ZO estimator is bounded by O(B/h), while the deterministic component scales with the current gradient norm. We apply Freedman's inequality to the resulting martingale, where the variance proxy at each step is proportional to the squared norm of the current gradient (plus the bounded noise term). Because the analysis already controls the sum of squared gradient norms via the expected progress (which is O(1/ε) in the convex case), the concentration holds with only an additive logarithmic factor in 1/δ. We will add a detailed derivation of this application of Freedman's inequality to the revised §4. revision: yes
-
Referee: [§3] §3 (deterministic analysis) and the transition to high probability: while the deterministic rate is standard, the proof sketch must explicitly show how the additive log(1/δ) term arises from a single application of a concentration inequality without inflating the leading dL/μ log(1/ε) term; if the analysis uses a union bound over iterations, the dependence on log(1/δ) should be verified to remain additive rather than multiplicative.
Authors: The deterministic analysis uses a union bound over the T = O((d L / μ) log(1/ε)) iterations. We set the per-iteration failure probability to δ/T, so that the total failure probability is at most δ. This produces an additive term log(T/δ) = log(1/δ) + log T. Since log T = O(log(d L / μ) + log log(1/ε)), the extra logarithmic factor is absorbed into the claimed additive O(log(1/δ)) term and does not inflate the leading (d L / μ) log(1/ε) factor. We will revise the proof sketch in §3 to state this calculation explicitly. revision: yes
Circularity Check
No circularity in claimed derivation chain
full rationale
The paper states high-probability query complexity bounds derived from standard assumptions (L-smoothness and μ-strong convexity for deterministic case; bounded noise without uniform gradient bounds for stochastic case). These assumptions are external to the claimed results and not defined in terms of the output complexities. No equations or steps reduce the final bounds to fitted parameters, self-referential definitions, or load-bearing self-citations by construction. The analysis invokes independent concentration inequalities for the high-probability statements, keeping the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Objective function is L-smooth and μ-strongly convex
- domain assumption Stochastic noise is bounded
Reference graph
Works this paper leans on
-
[1]
Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)
work page 2010
-
[2]
Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311
work page 2018
-
[3]
Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM
work page 2009
-
[4]
Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072
work page 2015
-
[5]
Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM
work page 2005
- [6]
-
[7]
Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21
work page 2008
-
[8]
Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338
work page 2000
- [9]
-
[10]
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075
work page 2023
-
[11]
Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566
work page 2017
-
[12]
Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning
work page 2012
-
[13]
Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407
work page 1951
-
[14]
Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)
work page 2013
-
[15]
Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11
work page 2017
-
[16]
Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues
work page 2024
-
[17]
Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [18]
-
[19]
Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence
- [20]
-
[21]
Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.