pith. sign in

arxiv: 2604.23613 · v1 · submitted 2026-04-26 · 🧮 math.OC

High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent

Pith reviewed 2026-05-08 05:45 UTC · model grok-4.3

classification 🧮 math.OC
keywords zeroth-order optimizationhigh-probability convergencegradient descentstochastic optimizationsmooth convex functionsblack-box optimizationquery complexity
0
0 comments X

The pith

Random zeroth-order gradient descent reaches ε-suboptimality with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries for smooth strongly convex functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For L-smooth and μ-strongly convex deterministic objectives in d dimensions, the classical two-query method reaches an ε-suboptimal solution with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) function queries. In the stochastic case under bounded noise, random zeroth-order SGD achieves the same guarantee with O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries without needing uniformly bounded stochastic gradients. These results supply high-confidence counterparts to standard in-expectation bounds and quantify the extra cost for reliable performance in black-box settings.

Core claim

For deterministic L-smooth and μ-strongly convex objectives of d-dimension, the classical two-query random zeroth-order method finds an ε-suboptimal solution with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) function queries. For stochastic objectives under a bounded-noise condition, random zeroth-order stochastic gradient descent achieves an ε-suboptimal solution with probability at least 1-δ using O(d log(1/ε)(log(1/ε) + log(1/δ))/ε) queries without assuming uniformly bounded stochastic gradients.

What carries the argument

The two-query random zeroth-order gradient estimator, which approximates the gradient via finite differences along a random unit vector and enables concentration analysis for high-probability bounds.

Load-bearing premise

The objective is L-smooth and μ-strongly convex in the deterministic case or satisfies a bounded-noise condition in the stochastic case.

What would settle it

A concrete L-smooth strongly convex function on which the two-query random zeroth-order method requires more than O((dL/μ)log(1/ε) + log(1/δ)) queries to reach ε-suboptimality with probability 1-δ would falsify the deterministic claim.

read the original abstract

Zeroth-order optimization aims to minimize an objective function using only function evaluations, and is therefore fundamental in black-box optimization, hyperparameter tuning, bandit learning, and adversarial machine learning. While classical zeroth-order methods are well understood in expectation, much less is known about their high-probability behavior, especially for smooth and strongly convex objectives. In this paper, we establish high-probability convergence guarantees for random zeroth-order gradient descent in both deterministic and stochastic settings. For deterministic $L$-smooth and $\mu$-strongly convex objectives of $d$-dimension, we show that the classical two-query random zeroth-order method finds an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{dL}{\mu}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \] function queries. Thus, compared with the standard in-expectation complexity, only an additive logarithmic dependence on the confidence parameter is needed. For stochastic objectives, under a bounded-noise condition and without assuming uniformly bounded stochastic gradients, we prove that random zeroth-order stochastic gradient descent achieves an $\varepsilon$-suboptimal solution with probability at least $1-\delta$ using \[ \mathcal{O}\left( \frac{ d\log(1/\varepsilon) \left(\log(1/\varepsilon)+\log(1/\delta)\right) }{\varepsilon} \right) \] queries. Our results provide high-confidence counterparts to classical expectation-based zeroth-order convergence guarantees and clarify the additional cost required to obtain reliable performance guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper establishes high-probability convergence guarantees for the classical two-query random zeroth-order gradient descent method. For deterministic L-smooth and μ-strongly convex objectives in d dimensions, it claims that O((dL/μ) log(1/ε) + log(1/δ)) function queries suffice to reach an ε-suboptimal point with probability at least 1-δ. For stochastic objectives under a bounded-noise assumption (without requiring uniformly bounded stochastic gradients), it claims that random ZO-SGD achieves the same guarantee with O(d log(1/ε) (log(1/ε) + log(1/δ)) / ε) queries. The results are positioned as high-probability analogues of classical in-expectation ZO rates.

Significance. If the stochastic high-probability bound holds under only bounded noise, the result would be significant because it relaxes a common assumption in ZO-SGD analyses and shows that only logarithmic overhead in 1/δ is needed beyond the expectation rate. The deterministic bound is a modest but clean improvement over standard analyses. The paper supplies explicit complexity expressions rather than asymptotic notation alone.

major comments (2)
  1. [Abstract / stochastic theorem] Abstract and the stochastic main result (presumably Theorem 4 or §4): the claimed O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) query bound under bounded noise alone is load-bearing for the paper's contribution, yet the variance of each two-query random-direction ZO estimator scales with the local gradient magnitude. Without uniform bounds on stochastic gradients, it is unclear whether the martingale increments satisfy the sub-exponential or bounded-difference conditions required for the concentration argument (e.g., Freedman or Bernstein inequality) used to convert the expectation bound into a high-probability statement. A concrete counter-example or additional moment assumption may be needed.
  2. [§3] §3 (deterministic analysis) and the transition to high probability: while the deterministic rate is standard, the proof sketch must explicitly show how the additive log(1/δ) term arises from a single application of a concentration inequality without inflating the leading dL/μ log(1/ε) term; if the analysis uses a union bound over iterations, the dependence on log(1/δ) should be verified to remain additive rather than multiplicative.
minor comments (2)
  1. [§2] Notation for the random direction u (uniform on the sphere) and the two-query estimator should be introduced once in §2 and used consistently; the current abstract uses “random zeroth-order” without reminding the reader of the exact estimator form.
  2. [Problem formulation] The bounded-noise assumption is stated only in the abstract; a precise mathematical formulation (e.g., E[|f(x,ξ) - f(x)|] ≤ σ or similar) should appear in the problem statement section before the theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about the concentration arguments in both the deterministic and stochastic settings. We address each major comment below and will revise the manuscript to improve clarity.

read point-by-point responses
  1. Referee: [Abstract / stochastic theorem] Abstract and the stochastic main result (presumably Theorem 4 or §4): the claimed O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) query bound under bounded noise alone is load-bearing for the paper's contribution, yet the variance of each two-query random-direction ZO estimator scales with the local gradient magnitude. Without uniform bounds on stochastic gradients, it is unclear whether the martingale increments satisfy the sub-exponential or bounded-difference conditions required for the concentration argument (e.g., Freedman or Bernstein inequality) used to convert the expectation bound into a high-probability statement. A concrete counter-example or additional moment assumption may be needed.

    Authors: We thank the referee for raising this technical point. Under the bounded-noise assumption, the stochastic perturbation in each function evaluation is bounded by a constant B independent of the gradient. Consequently, the additive noise term in the two-query ZO estimator is bounded by O(B/h), while the deterministic component scales with the current gradient norm. We apply Freedman's inequality to the resulting martingale, where the variance proxy at each step is proportional to the squared norm of the current gradient (plus the bounded noise term). Because the analysis already controls the sum of squared gradient norms via the expected progress (which is O(1/ε) in the convex case), the concentration holds with only an additive logarithmic factor in 1/δ. We will add a detailed derivation of this application of Freedman's inequality to the revised §4. revision: yes

  2. Referee: [§3] §3 (deterministic analysis) and the transition to high probability: while the deterministic rate is standard, the proof sketch must explicitly show how the additive log(1/δ) term arises from a single application of a concentration inequality without inflating the leading dL/μ log(1/ε) term; if the analysis uses a union bound over iterations, the dependence on log(1/δ) should be verified to remain additive rather than multiplicative.

    Authors: The deterministic analysis uses a union bound over the T = O((d L / μ) log(1/ε)) iterations. We set the per-iteration failure probability to δ/T, so that the total failure probability is at most δ. This produces an additive term log(T/δ) = log(1/δ) + log T. Since log T = O(log(d L / μ) + log log(1/ε)), the extra logarithmic factor is absorbed into the claimed additive O(log(1/δ)) term and does not inflate the leading (d L / μ) log(1/ε) factor. We will revise the proof sketch in §3 to state this calculation explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation chain

full rationale

The paper states high-probability query complexity bounds derived from standard assumptions (L-smoothness and μ-strong convexity for deterministic case; bounded noise without uniform gradient bounds for stochastic case). These assumptions are external to the claimed results and not defined in terms of the output complexities. No equations or steps reduce the final bounds to fitted parameters, self-referential definitions, or load-bearing self-citations by construction. The analysis invokes independent concentration inequalities for the high-probability statements, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions of smoothness and strong convexity plus bounded noise; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Objective function is L-smooth and μ-strongly convex
    Invoked for the deterministic high-probability bound.
  • domain assumption Stochastic noise is bounded
    Used for the stochastic setting without uniform gradient boundedness.

pith-pipeline@v0.9.0 · 5606 in / 1427 out tokens · 49423 ms · 2026-05-08T05:45:08.341843+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    Agarwal, A., Dekel, O., & Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt (pp.\ 28--40)

  2. [2]

    E., & Nocedal, J

    Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review , 60(2), 223--311

  3. [3]

    Conn, A., Scheinberg, K., & Vicente, L. (2009). Introduction to Derivative-Free Optimization . SIAM

  4. [4]

    Duchi, J., Jordan, M., & Wainwright, M. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory , 61, 2059--2072

  5. [5]

    D., Kalai, A

    Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp.\ 385--394).: SIAM

  6. [6]

    & Lan, G

    Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23, 2341--2368

  7. [7]

    Kakade, S. M. & Tewari, A. (2008). On the generalization ability of online strongly convex programming algorithms. Advances in neural information processing systems , 21

  8. [8]

    & Massart, P

    Laurent, B. & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics , 28(5), 1302--1338

  9. [9]

    & Zhou, Z

    Liu, Z. & Zhou, Z. (2023). Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531

  10. [10]

    D., Chen, D., & Arora, S

    Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems , 36, 53038--53075

  11. [11]

    & Spokoiny, V

    Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics , 17, 527--566

  12. [12]

    Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning

  13. [13]

    & Monro, S

    Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics , 22(3), 400--407

  14. [14]

    Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT) (pp.\ 3--24)

  15. [15]

    Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research , 18(52), 1--11

  16. [16]

    Wang, Y., Ye, H., Liu, Y., Dai, G., Tsang, I., & Wang, J. (2024). The advancement in stochastic zeroth-order optimization: Mechanism of accelerated convergence of gaussian direction on objectives with skewed hessian eigenvalues

  17. [17]

    Ye, H. (2026). Optimal high-probability regret for online convex optimization with two-point bandit feedback. arXiv preprint arXiv:2603.25029

  18. [18]

    Ye, H., Chang, X., & Chen, X. (2025a). A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945

  19. [19]

    J., & Zhang, T

    Ye, H., Huang, Z., Fang, C., Li, C. J., & Zhang, T. (2025b). Hessian-aware zeroth-order optimization. IEEE transactions on pattern analysis and machine intelligence

  20. [20]

    Yu, H., Yan, Y.-H., & Zhao, P. (2026). Improved dimension dependence for bandit convex optimization with gradient variations. arXiv preprint arXiv:2602.04761

  21. [21]

    Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. International Conference on Learning Representations