High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions
Pith reviewed 2026-06-29 16:04 UTC · model grok-4.3
The pith
A two-query Gaussian finite-difference method reaches an ε-suboptimal point with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries on strongly convex smooth functions when the smoothing radius satisfies an explicit fixed c
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The method finds an ε-suboptimal point with probability at least 1-δ using O((dL/μ)log(1/ε)+log(1/δ)) function queries under strong convexity, subject to an explicit finite-difference smoothing-radius condition that does not shrink with δ. High-probability guarantees also hold for smooth convex objectives under level-set and pathwise radius conditions and for the trajectory average on lower-bounded smooth non-convex objectives with O(LΔ0(d+log(1/δ))/ε) queries.
What carries the argument
Direct finite-horizon high-probability analysis that combines lower-tail bounds on adaptive sums of Gaussian directional projections with upper-tail bounds on accumulated finite-difference smoothing errors.
Load-bearing premise
The finite-difference smoothing radius must satisfy an explicit condition that keeps accumulated bias controlled over the entire finite horizon without shrinking as the failure probability δ decreases.
What would settle it
A concrete strongly convex instance and radius choice violating the stated condition where the observed failure probability exceeds δ after the claimed number of queries, or where the probability bound holds even when the radius condition is violated.
read the original abstract
Randomized zeroth-order methods are classically analyzed in expectation, but a black-box Markov conversion can give misleading high-probability guarantees, in particular by forcing the finite-difference smoothing radius to shrink with the confidence parameter. This paper gives a direct finite-horizon high-probability analysis of a two-query Gaussian finite-difference method for deterministic objectives with Lipschitz gradients. The method uses the classical two-point estimator together with the normalized stepsize \(\eta_t=1/(4L\norm{\bu_t}^2)\). We prove that it finds an \(\varepsilon\)-suboptimal point with probability at least \(1-\delta\) using \(\cO((dL/\mu)\log(1/\varepsilon)+\log(1/\delta))\) function queries under strong convexity, subject to an explicit finite-difference smoothing-radius condition. We also establish high-probability guarantees for smooth convex objectives under a level-set distance-to-solution radius condition and a pathwise smoothing-radius condition. For lower-bounded smooth non-convex objectives, the trajectory average is certified in stationarity with \(\cO(L\Delta_0(d+\log(1/\delta))/\varepsilon)\) function queries. The proofs combine lower-tail bounds for adaptive sums of Gaussian directional projections with upper-tail bounds for accumulated finite-difference smoothing errors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a direct finite-horizon high-probability analysis for a two-query Gaussian finite-difference zeroth-order method on deterministic smooth objectives, using the normalized step-size η_t = 1/(4L‖u_t‖²). It claims that, under an explicit finite-difference smoothing-radius condition independent of δ, the method finds an ε-suboptimal point with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries for strongly convex problems; analogous high-probability results hold for smooth convex problems (under level-set and pathwise radius conditions) and for non-convex problems (O(LΔ₀(d + log(1/δ))/ε) queries for average stationarity). The proofs combine lower-tail bounds on adaptive Gaussian projections with upper-tail control on accumulated smoothing bias.
Significance. If the radius conditions can be verified to hold under the given step-size rule, the work supplies a concrete improvement over expectation-only analyses and over Markov-conversion arguments that force the smoothing radius to shrink with δ. The explicit query complexities that retain only logarithmic dependence on 1/δ are potentially useful for high-confidence black-box optimization.
major comments (3)
- [Abstract, §1] Abstract and §1: The central 1-δ guarantee for the strongly convex case is stated to hold subject to an explicit finite-difference smoothing-radius condition that remains independent of δ. The normalized step-size η_t = 1/(4L‖u_t‖²) is introduced but no argument is given that this choice enforces the required radius bound pathwise for every realization; if the realized trajectory violates the radius condition, the upper-tail control on accumulated bias fails and the claimed probability bound does not follow.
- [§4] §4 (convex case): The high-probability claim additionally requires a level-set distance-to-solution radius condition together with a pathwise smoothing-radius condition. It is unclear from the stated assumptions whether these two radius conditions are simultaneously satisfiable for the same fixed r when the level-set radius itself depends on the unknown optimum; this joint requirement appears load-bearing for the convex guarantee.
- [§3] Proof sketch in §3: The combination of lower-tail bounds on adaptive Gaussian directional projections with upper-tail bounds on accumulated finite-difference errors is plausible, yet the manuscript does not display the explicit radius lower bound that must be maintained at every step to dominate the bias term by the concentration term. Without this explicit relation, it is impossible to confirm that the O(log(1/δ)) additive term suffices.
minor comments (2)
- Notation: the vector u_t is used both for the Gaussian direction and for the finite-difference perturbation; a clarifying sentence distinguishing the two roles would improve readability.
- The non-convex result is stated for the trajectory average; it would be helpful to indicate whether the same complexity holds for the last iterate under an additional assumption.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and indicate the corresponding revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract, §1] Abstract and §1: The central 1-δ guarantee for the strongly convex case is stated to hold subject to an explicit finite-difference smoothing-radius condition that remains independent of δ. The normalized step-size η_t = 1/(4L‖u_t‖²) is introduced but no argument is given that this choice enforces the required radius bound pathwise for every realization; if the realized trajectory violates the radius condition, the upper-tail control on accumulated bias fails and the claimed probability bound does not follow.
Authors: The stated guarantee is explicitly conditional on the smoothing-radius condition holding along the realized trajectory. The normalized step-size is introduced to guarantee descent and control the variance of the estimator, but we agree that the manuscript does not contain an argument verifying that this rule enforces the radius bound pathwise. In the revision we will add a short deterministic lemma showing that, when the initial point satisfies a mild bound on suboptimality, the chosen step-size keeps the trajectory inside the region where the radius condition holds for all t. revision: yes
-
Referee: [§4] §4 (convex case): The high-probability claim additionally requires a level-set distance-to-solution radius condition together with a pathwise smoothing-radius condition. It is unclear from the stated assumptions whether these two radius conditions are simultaneously satisfiable for the same fixed r when the level-set radius itself depends on the unknown optimum; this joint requirement appears load-bearing for the convex guarantee.
Authors: The two conditions are stated as joint assumptions on the problem instance. Because the level-set radius depends on the unknown optimum, a fixed r must be chosen small enough relative to that (unknown) quantity. We will revise the statement of the convex result to make explicit that r is permitted to depend on known quantities such as the initial suboptimality Δ₀ and the Lipschitz constant L, thereby rendering the joint requirement satisfiable by a single fixed r chosen from observable data. revision: partial
-
Referee: [§3] Proof sketch in §3: The combination of lower-tail bounds on adaptive Gaussian directional projections with upper-tail bounds on accumulated finite-difference errors is plausible, yet the manuscript does not display the explicit radius lower bound that must be maintained at every step to dominate the bias term by the concentration term. Without this explicit relation, it is impossible to confirm that the O(log(1/δ)) additive term suffices.
Authors: The full proof (Appendix) contains the precise inequality r ≥ C √(log(1/δ)) that ensures the accumulated bias is dominated by the lower-tail concentration term at each step. The §3 sketch omits this relation for brevity. We will expand the main-text sketch to include the displayed lower bound on r together with the resulting O(log(1/δ)) query overhead, making the dependence explicit. revision: yes
Circularity Check
No circularity; high-probability bounds derived from external tail inequalities under explicit radius assumptions.
full rationale
The paper's derivation establishes high-probability convergence for the two-query Gaussian finite-difference method by combining lower-tail bounds on adaptive Gaussian projections with upper-tail control on accumulated smoothing errors. These steps rely on standard external concentration inequalities rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The explicit smoothing-radius condition is stated as a prerequisite assumption (not derived internally), and the query complexity bounds follow directly from the finite-horizon analysis without reducing to quantities defined by the method's own outputs. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The objective function has L-Lipschitz continuous gradients.
- ad hoc to paper The finite-difference smoothing radius satisfies an explicit condition that remains independent of δ.
Forward citations
Cited by 1 Pith paper
-
High-Probability Last-Iterate Guarantees for Two-Point Gaussian Zeroth-Order Stochastic Gradient Descent
Direct high-probability last-iterate guarantee of Õ(d/T) for same-sample two-point Gaussian ZO-SGD under conditional exponential-moment noise when d ≥ 16 log(6T/δ).
Reference graph
Works this paper leans on
-
[1]
Convex optimization: Algorithms and complexity
S \'e bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8 0 (3--4): 0 231--357, 2015
2015
-
[2]
Regret analysis of stochastic and nonstochastic multi-armed bandit problems
S \'e bastien Bubeck and Nicol \`o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012
2012
-
[3]
A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization
Hanqin Cai, Yuchen Lou, Daniel Mckenzie, and Wotao Yin. A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1193--1203. PMLR, 2021. URL https://proceedings.mlr.press/v139/cai21d.html
2021
-
[4]
Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22 0 (1): 0 66--86, 2012
2012
-
[5]
Structured evolution with compact architectures for scalable policy optimization
Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning, pages 970--978. PMLR, 2018
2018
-
[6]
A.R. Conn, K. Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009
2009
-
[7]
Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43 0 (3): 0 919--948, 2018
2018
-
[8]
Duchi, Michael I
John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61 0 (5): 0 2788--2806, 2015
2015
-
[9]
On the almost sure convergence of the stochastic three points algorithm
Taha El Bakkali El Kadi and Omar Saadi. On the almost sure convergence of the stochastic three points algorithm. In International Conference on Learning Representations, 2025. arXiv:2501.13886
-
[10]
Stochastic first- and zeroth-order methods for nonconvex stochastic programming
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23 0 (4): 0 2341--2368, 2013
2013
-
[11]
Royer, Lu \'i s N
Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25 0 (3): 0 1515--1541, 2015
2015
-
[12]
Royer, Lu \'i s N
Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38 0 (3): 0 1579--1597, 2018
2018
-
[13]
Black-box adversarial attacks with limited queries and information
Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137--2146. PMLR, 2018
2018
-
[14]
Kolda, Robert Michael Lewis, and Virginia Torczon
Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45 0 (3): 0 385--482, 2003
2003
-
[15]
Distributionally constrained black-box stochastic gradient estimation and optimization
Henry Lam and Junhui Zhang. Distributionally constrained black-box stochastic gradient estimation and optimization. Operations Research, 73 0 (5): 0 2680--2694, 2024
2024
-
[16]
Laurent and P
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28 0 (5): 0 1302--1338, 2000
2000
-
[17]
Error bounds and convergence analysis of feasible descent methods: A general approach
Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46--47 0 (1): 0 157--178, 1993
1993
-
[18]
Fine-tuning language models with just forward passes
Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36: 0 53038--53075, 2023
2023
-
[19]
Nesterov and V
Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17 0 (2): 0 527--566, 2017
2017
-
[20]
Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Springer, New York, 2004
2004
-
[21]
Efficiency of coordinate descent methods on huge-scale optimization problems
Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22 0 (2): 0 341--362, 2012
2012
-
[22]
Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2 edition, 2006
2006
-
[23]
Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function
Peter Richt \'a rik and Martin Tak \'a c . Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 0 (1--2): 0 1--38, 2014
2014
-
[24]
Parallel coordinate descent methods for big data optimization
Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156 0 (1--2): 0 433--484, 2016
2016
-
[25]
Lindon Roberts and Cl \'e ment W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33 0 (4): 0 3057--3082, 2023
2023
-
[26]
An optimal algorithm for bandit and zero-order convex optimization with two-point feedback
Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18 0 (52): 0 1--11, 2017
2017
-
[27]
u ller, and Bernd G \
Sebastian U. Stich, Christian L. M \"u ller, and Bernd G \"a rtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23 0 (2): 0 1284--1309, 2013
2013
-
[28]
Inexact coordinate descent: Complexity and preconditioning
Rachael Tappenden, Peter Richt \'a rik, and Jacek Gondzio. Inexact coordinate descent: Complexity and preconditioning. Journal of Optimization Theory and Applications, 170 0 (1): 0 144--176, 2016
2016
-
[29]
Lu \'i s N. Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1 0 (1): 0 143--153, 2013
2013
-
[30]
Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback
Yining Wang. Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback. Operations Research, 73 0 (2): 0 819--828, 2023
2023
-
[31]
A unified zeroth-order optimization framework via oblivious randomized sketching
Haishan Ye, Xiangyu Chang, and Xi Chen. A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945, 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.