High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent

Haishan Ye

arxiv: 2604.23613 · v2 · pith:LBCTHNMWnew · submitted 2026-04-26 · 🧮 math.OC

High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent

Haishan Ye This is my paper

Pith reviewed 2026-07-01 09:26 UTC · model grok-4.3

classification 🧮 math.OC

keywords zeroth-order optimizationhigh-probability convergencesmooth strongly convex functionsstochastic gradient descentblack-box optimizationquery complexity

0 comments

The pith

Random zeroth-order gradient descent reaches ε-suboptimal solutions with probability 1-δ using only an additive log(1/δ) term in query complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the classical two-query random zeroth-order method attains high-probability convergence for smooth strongly convex objectives. In the deterministic case the query count is O((dL/μ) log(1/ε) + log(1/δ)), matching the usual expectation bound except for the extra logarithmic dependence on the failure probability δ. In the stochastic case with bounded noise but no uniform bound on gradients, the same probability guarantee holds with O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries. These results supply high-confidence counterparts to the classical in-expectation analyses for black-box optimization.

Core claim

For deterministic L-smooth and μ-strongly convex d-dimensional objectives the two-query random zeroth-order method finds an ε-suboptimal solution with probability at least 1-δ using O((dL/μ) log(1/ε) + log(1/δ)) function queries; for stochastic objectives under bounded noise the corresponding random zeroth-order SGD achieves the same guarantee with O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries.

What carries the argument

The two-point random zeroth-order gradient estimator formed by querying the objective at a random point and at a nearby perturbed point on the unit sphere, then scaling the difference by dimension.

If this is right

Only an additive logarithmic number of extra queries is required to upgrade expectation bounds to high-probability bounds.
The stochastic result holds without assuming uniformly bounded stochastic gradients.
The same high-probability analysis applies directly to the classical two-query estimator without changing its per-step cost.
High-confidence performance guarantees become available for black-box and bandit settings at modest extra cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Concentration tools can be applied to the zeroth-order direction estimates without inflating the leading dimension dependence.
The same log(1/δ) overhead may appear when high-probability analysis is carried out for other zeroth-order estimators that use a constant number of queries per step.
The bounded-noise assumption could be relaxed to sub-Gaussian noise while preserving the stated form of the bound.

Load-bearing premise

The objective must be L-smooth and μ-strongly convex in the deterministic setting and must satisfy a bounded-noise condition in the stochastic setting.

What would settle it

An explicit L-smooth μ-strongly convex function and choice of ε, δ such that after the stated number of queries the algorithm returns a point farther than ε from the optimum with probability greater than δ.

read the original abstract

Zeroth-order optimization aims to minimize an objective function using only function evaluations, and is therefore fundamental in black-box optimization, hyperparameter tuning, bandit learning, and adversarial machine learning. While classical zeroth-order methods are well understood in expectation, much less is known about their high-probability behavior, especially for smooth and strongly convex objectives. In this paper, we establish high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For deterministic $L$-smooth and $\mu$-strongly convex objectives of $d$-dimension, we show that the classical two-query random zeroth-order method finds an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{dL}{\mu}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \] function queries. Thus, compared with the standard in-expectation complexity, only an additive logarithmic dependence on the confidence parameter is needed. For stochastic objectives, under a bounded-noise condition and without assuming uniformly bounded stochastic gradients, we prove that random zeroth-order stochastic gradient descent achieves an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{ d\log(1/\varepsilon) \left(\log(1/\varepsilon)+\log(1/\delta)\right) }{\varepsilon} \right) \] queries. Our results provide high-confidence counterparts to classical expectation-based zeroth-order convergence guarantees and clarify the additional cost required to obtain reliable performance guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

High-probability rates for random ZO-GD add only log(1/δ) over expectation bounds under standard assumptions.

read the letter

The key takeaway is that this paper supplies high-probability query complexities for the classical random zeroth-order gradient method. For deterministic L-smooth μ-strongly convex problems the two-query estimator reaches ε-suboptimality with probability 1-δ using O((dL/μ) log(1/ε) + log(1/δ)) queries. The stochastic version under bounded noise gives O(d log(1/ε) (log(1/ε) + log(1/δ))/ε) queries.

It does well by keeping the rates close to the known expectation results and by avoiding a uniform bound on stochastic gradients. The deterministic bound in particular is clean, with the extra cost appearing only as an additive log term.

The soft spots are minor. The stochastic rate carries an extra log(1/ε) factor that may be an artifact of the concentration argument rather than necessary. Full proof details would be needed to check exactly how the trajectory-wide high-probability events are controlled, but the abstract states the assumptions explicitly and the claims line up with standard tools.

This paper is for researchers already working on zeroth-order methods who need high-confidence guarantees for black-box settings such as hyperparameter tuning. A reader familiar with the expectation analysis will see the extension immediately.

It deserves peer review because the contribution is focused, the rates are explicit, and no internal inconsistency appears in the stated results.

Referee Report

0 major / 2 minor

Summary. The paper establishes high-probability convergence guarantees for the classical two-query random zeroth-order gradient descent method. For deterministic L-smooth and μ-strongly convex d-dimensional objectives, it shows that an ε-suboptimal solution is found with probability at least 1-δ using O((dL/μ) log(1/ε) + log(1/δ)) function queries. For stochastic objectives under a bounded-noise condition (without uniformly bounded stochastic gradients), random zeroth-order SGD achieves the same guarantee with O(d log(1/ε) (log(1/ε) + log(1/δ))/ε) queries.

Significance. If the stated rates hold, the results are significant because they supply high-probability counterparts to classical in-expectation zeroth-order analyses, requiring only an additive logarithmic dependence on 1/δ. This is useful for reliable black-box optimization, hyperparameter tuning, and adversarial settings. The relaxation to bounded noise without uniform gradient bounds is a positive feature of the stochastic analysis.

minor comments (2)

The abstract states the main theorems clearly but does not indicate the section numbers where the full proofs appear; adding explicit theorem references would improve readability.
Notation for the random-direction estimator and the precise form of the bounded-noise assumption should be defined at first use in the introduction for readers unfamiliar with the ZO literature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our contributions and for recommending minor revision. The referee's description of the results is accurate. No major comments were provided in the report, so we have no specific points to address.

Circularity Check

0 steps flagged

No circularity; derivation self-contained under standard assumptions

full rationale

The paper derives high-probability query complexities for random ZO-GD and ZO-SGD directly from L-smoothness, μ-strong convexity, and bounded-noise conditions using standard concentration arguments on random-direction gradient estimators. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present. The stated rates follow from applying martingale or concentration bounds to the trajectory under the explicitly listed assumptions, without any step that reduces the output to the input by construction. This is the normal case of a self-contained theoretical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The results depend on standard domain assumptions of smoothness, strong convexity, and bounded noise in the stochastic setting; no free parameters or invented entities are introduced.

axioms (2)

domain assumption L-smoothness and μ-strong convexity of the objective
Standard assumption for convergence rates in optimization, invoked to obtain the query bounds.
domain assumption Bounded-noise condition without uniformly bounded stochastic gradients
Key for the stochastic case analysis as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5837 in / 1309 out tokens · 36743 ms · 2026-07-01T09:26:01.322829+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 4 canonical work pages · 1 internal anchor

[1]

Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)

2010
[2]

E., & Nocedal, J

Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311

2018
[3]

Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM

2009
[4]

Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072

2015
[5]

D., Kalai, A

Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM

2005
[6]

& Lan, G

Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23, 2341--2368

2013
[7]

Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21

2008
[8]

& Massart, P

Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338

2000
[9]

& Zhou, Z

Liu, Z. & Zhou, Z. (2023). Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531

work page arXiv 2023
[10]

D., Chen, D., & Arora, S

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075

2023
[11]

& Spokoiny, V

Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566

2017
[12]

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning

2012
[13]

& Monro, S

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407

1951
[14]

Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)

2013
[15]

Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11

2017
[16]

Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues

2024
[17]

Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Ye, H., Chang, X., & Chen, X. (2025a). A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945

work page arXiv
[19]

J., & Zhang, T

Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence
[20]

Yu, H., Yan, Y.-H., & Zhao, P. (2026). Improved dimension dependence for bandit convex optimization with gradient variations. arXiv preprint arXiv:2602.04761

work page arXiv 2026
[21]

Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations

2020

[1] [1]

Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)

2010

[2] [2]

E., & Nocedal, J

Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311

2018

[3] [3]

Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM

2009

[4] [4]

Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072

2015

[5] [5]

D., Kalai, A

Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM

2005

[6] [6]

& Lan, G

Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23, 2341--2368

2013

[7] [7]

Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21

2008

[8] [8]

& Massart, P

Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338

2000

[9] [9]

& Zhou, Z

Liu, Z. & Zhou, Z. (2023). Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531

work page arXiv 2023

[10] [10]

D., Chen, D., & Arora, S

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075

2023

[11] [11]

& Spokoiny, V

Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566

2017

[12] [12]

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning

2012

[13] [13]

& Monro, S

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407

1951

[14] [14]

Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)

2013

[15] [15]

Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11

2017

[16] [16]

Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues

2024

[17] [17]

Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Ye, H., Chang, X., & Chen, X. (2025a). A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945

work page arXiv

[19] [19]

J., & Zhang, T

Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence

[20] [20]

Yu, H., Yan, Y.-H., & Zhao, P. (2026). Improved dimension dependence for bandit convex optimization with gradient variations. arXiv preprint arXiv:2602.04761

work page arXiv 2026

[21] [21]

Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations

2020