High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent

Haishan Ye

arxiv: 2604.23613 · v1 · submitted 2026-04-26 · 🧮 math.OC

High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent

Haishan Ye This is my paper

Pith reviewed 2026-05-08 05:45 UTC · model grok-4.3

classification 🧮 math.OC

keywords zeroth-order optimizationhigh-probability convergencegradient descentstochastic optimizationsmooth convex functionsblack-box optimizationquery complexity

0 comments

The pith

Random zeroth-order gradient descent reaches ε-suboptimality with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries for smooth strongly convex functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For L-smooth and μ-strongly convex deterministic objectives in d dimensions, the classical two-query method reaches an ε-suboptimal solution with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) function queries. In the stochastic case under bounded noise, random zeroth-order SGD achieves the same guarantee with O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries without needing uniformly bounded stochastic gradients. These results supply high-confidence counterparts to standard in-expectation bounds and quantify the extra cost for reliable performance in black-box settings.

Core claim

For deterministic L-smooth and μ-strongly convex objectives of d-dimension, the classical two-query random zeroth-order method finds an ε-suboptimal solution with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) function queries. For stochastic objectives under a bounded-noise condition, random zeroth-order stochastic gradient descent achieves an ε-suboptimal solution with probability at least 1-δ using O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries without assuming uniformly bounded stochastic gradients.

What carries the argument

The two-query random zeroth-order gradient estimator, which approximates the gradient via finite differences along a random unit vector and enables concentration analysis for high-probability bounds.

Load-bearing premise

The objective is L-smooth and μ-strongly convex in the deterministic case or satisfies a bounded-noise condition in the stochastic case.

What would settle it

A concrete L-smooth strongly convex function on which the two-query random zeroth-order method requires more than O((dL/μ)log(1/ε) + log(1/δ)) queries to reach ε-suboptimality with probability 1-δ would falsify the deterministic claim.

read the original abstract

Zeroth-order optimization aims to minimize an objective function using only function evaluations, and is therefore fundamental in black-box optimization, hyperparameter tuning, bandit learning, and adversarial machine learning. While classical zeroth-order methods are well understood in expectation, much less is known about their high-probability behavior, especially for smooth and strongly convex objectives. In this paper, we establish high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For deterministic $L$-smooth and $\mu$-strongly convex objectives of $d$-dimension, we show that the classical two-query random zeroth-order method finds an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{dL}{\mu}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \] function queries. Thus, compared with the standard in-expectation complexity, only an additive logarithmic dependence on the confidence parameter is needed. For stochastic objectives, under a bounded-noise condition and without assuming uniformly bounded stochastic gradients, we prove that random zeroth-order stochastic gradient descent achieves an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{ d\log(1/\varepsilon) \left(\log(1/\varepsilon)+\log(1/\delta)\right) }{\varepsilon} \right) \] queries. Our results provide high-confidence counterparts to classical expectation-based zeroth-order convergence guarantees and clarify the additional cost required to obtain reliable performance guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds high-probability rates to random zeroth-order descent with only log(1/δ) overhead in the deterministic case, but the stochastic bound under bounded noise alone is the part that needs checking.

read the letter

The main takeaway is that this work supplies explicit high-probability query complexities for the usual random zeroth-order gradient method. In the deterministic L-smooth μ-strongly convex setting the bound is O((dL/μ) log(1/ε) + log(1/δ)), which is the standard expectation rate plus an additive log term in the failure probability. The stochastic version under bounded noise gives O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries for the same guarantee. That is the concrete new piece relative to the classical in-expectation literature they cite. The deterministic statement is clean and matches what one would expect from standard concentration arguments on the estimator. The paper states the results plainly and notes the practical motivation for high-confidence bounds in black-box settings like hyperparameter tuning. The stochastic claim is where the evidence is thinner. The zeroth-order estimator is formed from two function queries along a random direction, so its variance scales with the local gradient magnitude. Bounded noise on the stochastic objective does not by itself control the tails of those increments if the underlying stochastic gradients can grow. Without an extra moment or bounded-gradient assumption, standard martingale concentration tools may not deliver the claimed probability. The abstract does not indicate that the proof adds such a condition, so the high-probability guarantee rests on whether the derivation implicitly assumes more than stated. This paper is for readers who already know the expectation-based zeroth-order results and want the high-probability counterparts for applications that need reliability rather than just average performance. A theory-oriented reader working on black-box or adversarial optimization would get direct use from the explicit rates if the stochastic analysis holds. It deserves a serious referee to verify the concentration steps in the stochastic case and confirm whether the assumptions are sufficient as written. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper establishes high-probability convergence guarantees for the classical two-query random zeroth-order gradient descent method. For deterministic L-smooth and μ-strongly convex objectives in d dimensions, it claims that O((dL/μ) log(1/ε) + log(1/δ)) function queries suffice to reach an ε-suboptimal point with probability at least 1-δ. For stochastic objectives under a bounded-noise assumption (without requiring uniformly bounded stochastic gradients), it claims that random ZO-SGD achieves the same guarantee with O(d log(1/ε) (log(1/ε) + log(1/δ)) / ε) queries. The results are positioned as high-probability analogues of classical in-expectation ZO rates.

Significance. If the stochastic high-probability bound holds under only bounded noise, the result would be significant because it relaxes a common assumption in ZO-SGD analyses and shows that only logarithmic overhead in 1/δ is needed beyond the expectation rate. The deterministic bound is a modest but clean improvement over standard analyses. The paper supplies explicit complexity expressions rather than asymptotic notation alone.

major comments (2)

[Abstract / stochastic theorem] Abstract and the stochastic main result (presumably Theorem 4 or §4): the claimed O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) query bound under bounded noise alone is load-bearing for the paper's contribution, yet the variance of each two-query random-direction ZO estimator scales with the local gradient magnitude. Without uniform bounds on stochastic gradients, it is unclear whether the martingale increments satisfy the sub-exponential or bounded-difference conditions required for the concentration argument (e.g., Freedman or Bernstein inequality) used to convert the expectation bound into a high-probability statement. A concrete counter-example or additional moment assumption may be needed.
[§3] §3 (deterministic analysis) and the transition to high probability: while the deterministic rate is standard, the proof sketch must explicitly show how the additive log(1/δ) term arises from a single application of a concentration inequality without inflating the leading dL/μ log(1/ε) term; if the analysis uses a union bound over iterations, the dependence on log(1/δ) should be verified to remain additive rather than multiplicative.

minor comments (2)

[§2] Notation for the random direction u (uniform on the sphere) and the two-query estimator should be introduced once in §2 and used consistently; the current abstract uses “random zeroth-order” without reminding the reader of the exact estimator form.
[Problem formulation] The bounded-noise assumption is stated only in the abstract; a precise mathematical formulation (e.g., E[|f(x,ξ) - f(x)|] ≤ σ or similar) should appear in the problem statement section before the theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about the concentration arguments in both the deterministic and stochastic settings. We address each major comment below and will revise the manuscript to improve clarity.

read point-by-point responses

Referee: [Abstract / stochastic theorem] Abstract and the stochastic main result (presumably Theorem 4 or §4): the claimed O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) query bound under bounded noise alone is load-bearing for the paper's contribution, yet the variance of each two-query random-direction ZO estimator scales with the local gradient magnitude. Without uniform bounds on stochastic gradients, it is unclear whether the martingale increments satisfy the sub-exponential or bounded-difference conditions required for the concentration argument (e.g., Freedman or Bernstein inequality) used to convert the expectation bound into a high-probability statement. A concrete counter-example or additional moment assumption may be needed.

Authors: We thank the referee for raising this technical point. Under the bounded-noise assumption, the stochastic perturbation in each function evaluation is bounded by a constant B independent of the gradient. Consequently, the additive noise term in the two-query ZO estimator is bounded by O(B/h), while the deterministic component scales with the current gradient norm. We apply Freedman's inequality to the resulting martingale, where the variance proxy at each step is proportional to the squared norm of the current gradient (plus the bounded noise term). Because the analysis already controls the sum of squared gradient norms via the expected progress (which is O(1/ε) in the convex case), the concentration holds with only an additive logarithmic factor in 1/δ. We will add a detailed derivation of this application of Freedman's inequality to the revised §4. revision: yes
Referee: [§3] §3 (deterministic analysis) and the transition to high probability: while the deterministic rate is standard, the proof sketch must explicitly show how the additive log(1/δ) term arises from a single application of a concentration inequality without inflating the leading dL/μ log(1/ε) term; if the analysis uses a union bound over iterations, the dependence on log(1/δ) should be verified to remain additive rather than multiplicative.

Authors: The deterministic analysis uses a union bound over the T = O((d L / μ) log(1/ε)) iterations. We set the per-iteration failure probability to δ/T, so that the total failure probability is at most δ. This produces an additive term log(T/δ) = log(1/δ) + log T. Since log T = O(log(d L / μ) + log log(1/ε)), the extra logarithmic factor is absorbed into the claimed additive O(log(1/δ)) term and does not inflate the leading (d L / μ) log(1/ε) factor. We will revise the proof sketch in §3 to state this calculation explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation chain

full rationale

The paper states high-probability query complexity bounds derived from standard assumptions (L-smoothness and μ-strong convexity for deterministic case; bounded noise without uniform gradient bounds for stochastic case). These assumptions are external to the claimed results and not defined in terms of the output complexities. No equations or steps reduce the final bounds to fitted parameters, self-referential definitions, or load-bearing self-citations by construction. The analysis invokes independent concentration inequalities for the high-probability statements, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions of smoothness and strong convexity plus bounded noise; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Objective function is L-smooth and μ-strongly convex
Invoked for the deterministic high-probability bound.
domain assumption Stochastic noise is bounded
Used for the stochastic setting without uniform gradient boundedness.

pith-pipeline@v0.9.0 · 5606 in / 1427 out tokens · 49423 ms · 2026-05-08T05:45:08.341843+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)

work page 2010
[2]

E., & Nocedal, J

Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311

work page 2018
[3]

Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM

work page 2009
[4]

Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072

work page 2015
[5]

D., Kalai, A

Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM

work page 2005
[6]

& Lan, G

Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23, 2341--2368

work page 2013
[7]

Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21

work page 2008
[8]

& Massart, P

Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338

work page 2000
[9]

& Zhou, Z

Liu, Z. & Zhou, Z. (2023). Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531

work page arXiv 2023
[10]

D., Chen, D., & Arora, S

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075

work page 2023
[11]

& Spokoiny, V

Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566

work page 2017
[12]

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning

work page 2012
[13]

& Monro, S

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407

work page 1951
[14]

Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)

work page 2013
[15]

Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11

work page 2017
[16]

Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues

work page 2024
[17]

Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Ye, H., Chang, X., & Chen, X. (2025a). A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945

work page arXiv
[19]

J., & Zhang, T

Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence

work page
[20]

Yu, H., Yan, Y.-H., & Zhao, P. (2026). Improved dimension dependence for bandit convex optimization with gradient variations. arXiv preprint arXiv:2602.04761

work page arXiv 2026
[21]

Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations

work page 2020

[1] [1]

Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)

work page 2010

[2] [2]

E., & Nocedal, J

Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311

work page 2018

[3] [3]

Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM

work page 2009

[4] [4]

Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072

work page 2015

[5] [5]

D., Kalai, A

Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM

work page 2005

[6] [6]

& Lan, G

Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23, 2341--2368

work page 2013

[7] [7]

Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21

work page 2008

[8] [8]

& Massart, P

Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338

work page 2000

[9] [9]

& Zhou, Z

Liu, Z. & Zhou, Z. (2023). Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531

work page arXiv 2023

[10] [10]

D., Chen, D., & Arora, S

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075

work page 2023

[11] [11]

& Spokoiny, V

Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566

work page 2017

[12] [12]

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning

work page 2012

[13] [13]

& Monro, S

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407

work page 1951

[14] [14]

Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)

work page 2013

[15] [15]

Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11

work page 2017

[16] [16]

Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues

work page 2024

[17] [17]

Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Ye, H., Chang, X., & Chen, X. (2025a). A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945

work page arXiv

[19] [19]

J., & Zhang, T

Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence

work page

[20] [20]

Yu, H., Yan, Y.-H., & Zhao, P. (2026). Improved dimension dependence for bandit convex optimization with gradient variations. arXiv preprint arXiv:2602.04761

work page arXiv 2026

[21] [21]

Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations

work page 2020