pith. sign in

arxiv: 2605.26547 · v2 · pith:DN5NG37Gnew · submitted 2026-05-26 · 🧮 math.OC

High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions

Pith reviewed 2026-06-29 16:04 UTC · model grok-4.3

classification 🧮 math.OC
keywords zeroth-order optimizationhigh-probability boundsgradient descentsmooth functionsstrong convexityfinite-difference smoothingrandomized algorithmsconvex optimization
0
0 comments X

The pith

A two-query Gaussian finite-difference method reaches an ε-suboptimal point with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries on strongly convex smooth functions when the smoothing radius satisfies an explicit fixed c

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper supplies a direct finite-horizon high-probability analysis for randomized zeroth-order gradient descent on deterministic smooth objectives instead of relying on expectation bounds converted via Markov inequalities. The two-point estimator with normalized stepsize 1/(4L ||u_t||^2) is shown to reach ε-suboptimality with probability 1-δ under strong convexity provided the finite-difference smoothing radius stays large enough to keep accumulated bias controlled over the whole trajectory. The same style of argument yields high-probability stationarity certificates for the trajectory average on lower-bounded smooth non-convex problems and level-set guarantees on smooth convex problems. The proofs rest on lower-tail bounds for sums of Gaussian directional projections paired with upper-tail control of smoothing errors.

Core claim

The method finds an ε-suboptimal point with probability at least 1-δ using O((dL/μ)log(1/ε)+log(1/δ)) function queries under strong convexity, subject to an explicit finite-difference smoothing-radius condition that does not shrink with δ. High-probability guarantees also hold for smooth convex objectives under level-set and pathwise radius conditions and for the trajectory average on lower-bounded smooth non-convex objectives with O(LΔ0(d+log(1/δ))/ε) queries.

What carries the argument

Direct finite-horizon high-probability analysis that combines lower-tail bounds on adaptive sums of Gaussian directional projections with upper-tail bounds on accumulated finite-difference smoothing errors.

Load-bearing premise

The finite-difference smoothing radius must satisfy an explicit condition that keeps accumulated bias controlled over the entire finite horizon without shrinking as the failure probability δ decreases.

What would settle it

A concrete strongly convex instance and radius choice violating the stated condition where the observed failure probability exceeds δ after the claimed number of queries, or where the probability bound holds even when the radius condition is violated.

read the original abstract

Randomized zeroth-order methods are classically analyzed in expectation, but a black-box Markov conversion can give misleading high-probability guarantees, in particular by forcing the finite-difference smoothing radius to shrink with the confidence parameter. This paper gives a direct finite-horizon high-probability analysis of a two-query Gaussian finite-difference method for deterministic objectives with Lipschitz gradients. The method uses the classical two-point estimator together with the normalized stepsize \(\eta_t=1/(4L\norm{\bu_t}^2)\). We prove that it finds an \(\varepsilon\)-suboptimal point with probability at least \(1-\delta\) using \(\cO((dL/\mu)\log(1/\varepsilon)+\log(1/\delta))\) function queries under strong convexity, subject to an explicit finite-difference smoothing-radius condition. We also establish high-probability guarantees for smooth convex objectives under a level-set distance-to-solution radius condition and a pathwise smoothing-radius condition. For lower-bounded smooth non-convex objectives, the trajectory average is certified in stationarity with \(\cO(L\Delta_0(d+\log(1/\delta))/\varepsilon)\) function queries. The proofs combine lower-tail bounds for adaptive sums of Gaussian directional projections with upper-tail bounds for accumulated finite-difference smoothing errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a direct finite-horizon high-probability analysis for a two-query Gaussian finite-difference zeroth-order method on deterministic smooth objectives, using the normalized step-size η_t = 1/(4L‖u_t‖²). It claims that, under an explicit finite-difference smoothing-radius condition independent of δ, the method finds an ε-suboptimal point with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries for strongly convex problems; analogous high-probability results hold for smooth convex problems (under level-set and pathwise radius conditions) and for non-convex problems (O(LΔ₀(d + log(1/δ))/ε) queries for average stationarity). The proofs combine lower-tail bounds on adaptive Gaussian projections with upper-tail control on accumulated smoothing bias.

Significance. If the radius conditions can be verified to hold under the given step-size rule, the work supplies a concrete improvement over expectation-only analyses and over Markov-conversion arguments that force the smoothing radius to shrink with δ. The explicit query complexities that retain only logarithmic dependence on 1/δ are potentially useful for high-confidence black-box optimization.

major comments (3)
  1. [Abstract, §1] Abstract and §1: The central 1-δ guarantee for the strongly convex case is stated to hold subject to an explicit finite-difference smoothing-radius condition that remains independent of δ. The normalized step-size η_t = 1/(4L‖u_t‖²) is introduced but no argument is given that this choice enforces the required radius bound pathwise for every realization; if the realized trajectory violates the radius condition, the upper-tail control on accumulated bias fails and the claimed probability bound does not follow.
  2. [§4] §4 (convex case): The high-probability claim additionally requires a level-set distance-to-solution radius condition together with a pathwise smoothing-radius condition. It is unclear from the stated assumptions whether these two radius conditions are simultaneously satisfiable for the same fixed r when the level-set radius itself depends on the unknown optimum; this joint requirement appears load-bearing for the convex guarantee.
  3. [§3] Proof sketch in §3: The combination of lower-tail bounds on adaptive Gaussian directional projections with upper-tail bounds on accumulated finite-difference errors is plausible, yet the manuscript does not display the explicit radius lower bound that must be maintained at every step to dominate the bias term by the concentration term. Without this explicit relation, it is impossible to confirm that the O(log(1/δ)) additive term suffices.
minor comments (2)
  1. Notation: the vector u_t is used both for the Gaussian direction and for the finite-difference perturbation; a clarifying sentence distinguishing the two roles would improve readability.
  2. The non-convex result is stated for the trajectory average; it would be helpful to indicate whether the same complexity holds for the last iterate under an additional assumption.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and indicate the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract, §1] Abstract and §1: The central 1-δ guarantee for the strongly convex case is stated to hold subject to an explicit finite-difference smoothing-radius condition that remains independent of δ. The normalized step-size η_t = 1/(4L‖u_t‖²) is introduced but no argument is given that this choice enforces the required radius bound pathwise for every realization; if the realized trajectory violates the radius condition, the upper-tail control on accumulated bias fails and the claimed probability bound does not follow.

    Authors: The stated guarantee is explicitly conditional on the smoothing-radius condition holding along the realized trajectory. The normalized step-size is introduced to guarantee descent and control the variance of the estimator, but we agree that the manuscript does not contain an argument verifying that this rule enforces the radius bound pathwise. In the revision we will add a short deterministic lemma showing that, when the initial point satisfies a mild bound on suboptimality, the chosen step-size keeps the trajectory inside the region where the radius condition holds for all t. revision: yes

  2. Referee: [§4] §4 (convex case): The high-probability claim additionally requires a level-set distance-to-solution radius condition together with a pathwise smoothing-radius condition. It is unclear from the stated assumptions whether these two radius conditions are simultaneously satisfiable for the same fixed r when the level-set radius itself depends on the unknown optimum; this joint requirement appears load-bearing for the convex guarantee.

    Authors: The two conditions are stated as joint assumptions on the problem instance. Because the level-set radius depends on the unknown optimum, a fixed r must be chosen small enough relative to that (unknown) quantity. We will revise the statement of the convex result to make explicit that r is permitted to depend on known quantities such as the initial suboptimality Δ₀ and the Lipschitz constant L, thereby rendering the joint requirement satisfiable by a single fixed r chosen from observable data. revision: partial

  3. Referee: [§3] Proof sketch in §3: The combination of lower-tail bounds on adaptive Gaussian directional projections with upper-tail bounds on accumulated finite-difference errors is plausible, yet the manuscript does not display the explicit radius lower bound that must be maintained at every step to dominate the bias term by the concentration term. Without this explicit relation, it is impossible to confirm that the O(log(1/δ)) additive term suffices.

    Authors: The full proof (Appendix) contains the precise inequality r ≥ C √(log(1/δ)) that ensures the accumulated bias is dominated by the lower-tail concentration term at each step. The §3 sketch omits this relation for brevity. We will expand the main-text sketch to include the displayed lower bound on r together with the resulting O(log(1/δ)) query overhead, making the dependence explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; high-probability bounds derived from external tail inequalities under explicit radius assumptions.

full rationale

The paper's derivation establishes high-probability convergence for the two-query Gaussian finite-difference method by combining lower-tail bounds on adaptive Gaussian projections with upper-tail control on accumulated smoothing errors. These steps rely on standard external concentration inequalities rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The explicit smoothing-radius condition is stated as a prerequisite assumption (not derived internally), and the query complexity bounds follow directly from the finite-horizon analysis without reducing to quantities defined by the method's own outputs. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard domain assumption that the objective has L-Lipschitz gradients and on the paper-specific requirement that the finite-difference radius satisfies trajectory-dependent bounds that do not degrade with δ.

axioms (2)
  • domain assumption The objective function has L-Lipschitz continuous gradients.
    Explicitly stated in the abstract as the setting for deterministic objectives.
  • ad hoc to paper The finite-difference smoothing radius satisfies an explicit condition that remains independent of δ.
    Required for the high-probability claim to avoid the shrinkage issue of Markov conversion.

pith-pipeline@v0.9.1-grok · 5751 in / 1423 out tokens · 33609 ms · 2026-06-29T16:04:54.693127+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. High-Probability Last-Iterate Guarantees for Two-Point Gaussian Zeroth-Order Stochastic Gradient Descent

    math.OC 2026-06 unverdicted novelty 7.0

    Direct high-probability last-iterate guarantee of Õ(d/T) for same-sample two-point Gaussian ZO-SGD under conditional exponential-moment noise when d ≥ 16 log(6T/δ).

Reference graph

Works this paper leans on

31 extracted references · 2 canonical work pages · cited by 1 Pith paper

  1. [1]

    Convex optimization: Algorithms and complexity

    S \'e bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8 0 (3--4): 0 231--357, 2015

  2. [2]

    Regret analysis of stochastic and nonstochastic multi-armed bandit problems

    S \'e bastien Bubeck and Nicol \`o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

  3. [3]

    A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization

    Hanqin Cai, Yuchen Lou, Daniel Mckenzie, and Wotao Yin. A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1193--1203. PMLR, 2021. URL https://proceedings.mlr.press/v139/cai21d.html

  4. [4]

    Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22 0 (1): 0 66--86, 2012

  5. [5]

    Structured evolution with compact architectures for scalable policy optimization

    Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning, pages 970--978. PMLR, 2018

  6. [6]

    A.R. Conn, K. Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009

  7. [7]

    Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43 0 (3): 0 919--948, 2018

  8. [8]

    Duchi, Michael I

    John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61 0 (5): 0 2788--2806, 2015

  9. [9]

    On the almost sure convergence of the stochastic three points algorithm

    Taha El Bakkali El Kadi and Omar Saadi. On the almost sure convergence of the stochastic three points algorithm. In International Conference on Learning Representations, 2025. arXiv:2501.13886

  10. [10]

    Stochastic first- and zeroth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23 0 (4): 0 2341--2368, 2013

  11. [11]

    Royer, Lu \'i s N

    Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25 0 (3): 0 1515--1541, 2015

  12. [12]

    Royer, Lu \'i s N

    Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38 0 (3): 0 1579--1597, 2018

  13. [13]

    Black-box adversarial attacks with limited queries and information

    Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137--2146. PMLR, 2018

  14. [14]

    Kolda, Robert Michael Lewis, and Virginia Torczon

    Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45 0 (3): 0 385--482, 2003

  15. [15]

    Distributionally constrained black-box stochastic gradient estimation and optimization

    Henry Lam and Junhui Zhang. Distributionally constrained black-box stochastic gradient estimation and optimization. Operations Research, 73 0 (5): 0 2680--2694, 2024

  16. [16]

    Laurent and P

    B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28 0 (5): 0 1302--1338, 2000

  17. [17]

    Error bounds and convergence analysis of feasible descent methods: A general approach

    Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46--47 0 (1): 0 157--178, 1993

  18. [18]

    Fine-tuning language models with just forward passes

    Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36: 0 53038--53075, 2023

  19. [19]

    Nesterov and V

    Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17 0 (2): 0 527--566, 2017

  20. [20]

    Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization

    Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Springer, New York, 2004

  21. [21]

    Efficiency of coordinate descent methods on huge-scale optimization problems

    Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22 0 (2): 0 341--362, 2012

  22. [22]

    Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2 edition, 2006

  23. [23]

    Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function

    Peter Richt \'a rik and Martin Tak \'a c . Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 0 (1--2): 0 1--38, 2014

  24. [24]

    Parallel coordinate descent methods for big data optimization

    Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156 0 (1--2): 0 433--484, 2016

  25. [25]

    Lindon Roberts and Cl \'e ment W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33 0 (4): 0 3057--3082, 2023

  26. [26]

    An optimal algorithm for bandit and zero-order convex optimization with two-point feedback

    Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18 0 (52): 0 1--11, 2017

  27. [27]

    u ller, and Bernd G \

    Sebastian U. Stich, Christian L. M \"u ller, and Bernd G \"a rtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23 0 (2): 0 1284--1309, 2013

  28. [28]

    Inexact coordinate descent: Complexity and preconditioning

    Rachael Tappenden, Peter Richt \'a rik, and Jacek Gondzio. Inexact coordinate descent: Complexity and preconditioning. Journal of Optimization Theory and Applications, 170 0 (1): 0 144--176, 2016

  29. [29]

    Lu \'i s N. Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1 0 (1): 0 143--153, 2013

  30. [30]

    Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback

    Yining Wang. Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback. Operations Research, 73 0 (2): 0 819--828, 2023

  31. [31]

    A unified zeroth-order optimization framework via oblivious randomized sketching

    Haishan Ye, Xiangyu Chang, and Xi Chen. A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945, 2025