High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions

Haishan Ye

arxiv: 2605.26547 · v2 · pith:DN5NG37Gnew · submitted 2026-05-26 · 🧮 math.OC

High-Probability Guarantees for Random Zeroth-Order Gradient Descent on Smooth Functions

Haishan Ye This is my paper

Pith reviewed 2026-06-29 16:04 UTC · model grok-4.3

classification 🧮 math.OC

keywords zeroth-order optimizationhigh-probability boundsgradient descentsmooth functionsstrong convexityfinite-difference smoothingrandomized algorithmsconvex optimization

0 comments

The pith

A two-query Gaussian finite-difference method reaches an ε-suboptimal point with probability at least 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries on strongly convex smooth functions when the smoothing radius satisfies an explicit fixed c

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper supplies a direct finite-horizon high-probability analysis for randomized zeroth-order gradient descent on deterministic smooth objectives instead of relying on expectation bounds converted via Markov inequalities. The two-point estimator with normalized stepsize 1/(4L ||u_t||^2) is shown to reach ε-suboptimality with probability 1-δ under strong convexity provided the finite-difference smoothing radius stays large enough to keep accumulated bias controlled over the whole trajectory. The same style of argument yields high-probability stationarity certificates for the trajectory average on lower-bounded smooth non-convex problems and level-set guarantees on smooth convex problems. The proofs rest on lower-tail bounds for sums of Gaussian directional projections paired with upper-tail control of smoothing errors.

Core claim

The method finds an ε-suboptimal point with probability at least 1-δ using O((dL/μ)log(1/ε)+log(1/δ)) function queries under strong convexity, subject to an explicit finite-difference smoothing-radius condition that does not shrink with δ. High-probability guarantees also hold for smooth convex objectives under level-set and pathwise radius conditions and for the trajectory average on lower-bounded smooth non-convex objectives with O(LΔ0(d+log(1/δ))/ε) queries.

What carries the argument

Direct finite-horizon high-probability analysis that combines lower-tail bounds on adaptive sums of Gaussian directional projections with upper-tail bounds on accumulated finite-difference smoothing errors.

Load-bearing premise

The finite-difference smoothing radius must satisfy an explicit condition that keeps accumulated bias controlled over the entire finite horizon without shrinking as the failure probability δ decreases.

What would settle it

A concrete strongly convex instance and radius choice violating the stated condition where the observed failure probability exceeds δ after the claimed number of queries, or where the probability bound holds even when the radius condition is violated.

read the original abstract

Randomized zeroth-order methods are classically analyzed in expectation, but a black-box Markov conversion can give misleading high-probability guarantees, in particular by forcing the finite-difference smoothing radius to shrink with the confidence parameter. This paper gives a direct finite-horizon high-probability analysis of a two-query Gaussian finite-difference method for deterministic objectives with Lipschitz gradients. The method uses the classical two-point estimator together with the normalized stepsize \(\eta_t=1/(4L\norm{\bu_t}^2)\). We prove that it finds an \(\varepsilon\)-suboptimal point with probability at least \(1-\delta\) using \(\cO((dL/\mu)\log(1/\varepsilon)+\log(1/\delta))\) function queries under strong convexity, subject to an explicit finite-difference smoothing-radius condition. We also establish high-probability guarantees for smooth convex objectives under a level-set distance-to-solution radius condition and a pathwise smoothing-radius condition. For lower-bounded smooth non-convex objectives, the trajectory average is certified in stationarity with \(\cO(L\Delta_0(d+\log(1/\delta))/\varepsilon)\) function queries. The proofs combine lower-tail bounds for adaptive sums of Gaussian directional projections with upper-tail bounds for accumulated finite-difference smoothing errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives direct high-probability bounds for zeroth-order descent by fixing the smoothing radius instead of shrinking it with delta.

read the letter

The main thing to know is that this paper develops a direct finite-horizon high-probability analysis for a randomized zeroth-order method on smooth functions. It uses a normalized step-size to keep the smoothing radius independent of delta, avoiding the Markov conversion that usually forces the radius to shrink.

This approach yields explicit high-probability query bounds for strongly convex, convex, and non-convex cases. The proofs rely on lower-tail bounds for Gaussian projections and upper-tail bounds for smoothing errors, which is a reasonable strategy for handling the adaptive nature of the steps. Making the radius condition explicit is helpful for understanding when the result applies.

The soft spot is the pathwise smoothing-radius condition. The normalized step-size does not enforce it, so if the radius is not chosen appropriately for the trajectory, the bias can accumulate and the 1-delta guarantee does not hold. This is a substantive limitation rather than a minor one, as it ties the result to an assumption that may require prior knowledge of the function scale. The abstract states the condition clearly, which is better than hiding it.

This work is for people in derivative-free optimization who want high-probability rather than expectation guarantees. A reader interested in black-box methods with reliability requirements would get value from the technical details.

It deserves serious peer review because the direct analysis is distinct from prior expectation-based work and the assumptions are laid out plainly.

Referee Report

3 major / 2 minor

Summary. The paper develops a direct finite-horizon high-probability analysis for a two-query Gaussian finite-difference zeroth-order method on deterministic smooth objectives, using the normalized step-size η_t = 1/(4L‖u_t‖²). It claims that, under an explicit finite-difference smoothing-radius condition independent of δ, the method finds an ε-suboptimal point with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries for strongly convex problems; analogous high-probability results hold for smooth convex problems (under level-set and pathwise radius conditions) and for non-convex problems (O(LΔ₀(d + log(1/δ))/ε) queries for average stationarity). The proofs combine lower-tail bounds on adaptive Gaussian projections with upper-tail control on accumulated smoothing bias.

Significance. If the radius conditions can be verified to hold under the given step-size rule, the work supplies a concrete improvement over expectation-only analyses and over Markov-conversion arguments that force the smoothing radius to shrink with δ. The explicit query complexities that retain only logarithmic dependence on 1/δ are potentially useful for high-confidence black-box optimization.

major comments (3)

[Abstract, §1] Abstract and §1: The central 1-δ guarantee for the strongly convex case is stated to hold subject to an explicit finite-difference smoothing-radius condition that remains independent of δ. The normalized step-size η_t = 1/(4L‖u_t‖²) is introduced but no argument is given that this choice enforces the required radius bound pathwise for every realization; if the realized trajectory violates the radius condition, the upper-tail control on accumulated bias fails and the claimed probability bound does not follow.
[§4] §4 (convex case): The high-probability claim additionally requires a level-set distance-to-solution radius condition together with a pathwise smoothing-radius condition. It is unclear from the stated assumptions whether these two radius conditions are simultaneously satisfiable for the same fixed r when the level-set radius itself depends on the unknown optimum; this joint requirement appears load-bearing for the convex guarantee.
[§3] Proof sketch in §3: The combination of lower-tail bounds on adaptive Gaussian directional projections with upper-tail bounds on accumulated finite-difference errors is plausible, yet the manuscript does not display the explicit radius lower bound that must be maintained at every step to dominate the bias term by the concentration term. Without this explicit relation, it is impossible to confirm that the O(log(1/δ)) additive term suffices.

minor comments (2)

Notation: the vector u_t is used both for the Gaussian direction and for the finite-difference perturbation; a clarifying sentence distinguishing the two roles would improve readability.
The non-convex result is stated for the trajectory average; it would be helpful to indicate whether the same complexity holds for the last iterate under an additional assumption.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and indicate the corresponding revisions to the manuscript.

read point-by-point responses

Referee: [Abstract, §1] Abstract and §1: The central 1-δ guarantee for the strongly convex case is stated to hold subject to an explicit finite-difference smoothing-radius condition that remains independent of δ. The normalized step-size η_t = 1/(4L‖u_t‖²) is introduced but no argument is given that this choice enforces the required radius bound pathwise for every realization; if the realized trajectory violates the radius condition, the upper-tail control on accumulated bias fails and the claimed probability bound does not follow.

Authors: The stated guarantee is explicitly conditional on the smoothing-radius condition holding along the realized trajectory. The normalized step-size is introduced to guarantee descent and control the variance of the estimator, but we agree that the manuscript does not contain an argument verifying that this rule enforces the radius bound pathwise. In the revision we will add a short deterministic lemma showing that, when the initial point satisfies a mild bound on suboptimality, the chosen step-size keeps the trajectory inside the region where the radius condition holds for all t. revision: yes
Referee: [§4] §4 (convex case): The high-probability claim additionally requires a level-set distance-to-solution radius condition together with a pathwise smoothing-radius condition. It is unclear from the stated assumptions whether these two radius conditions are simultaneously satisfiable for the same fixed r when the level-set radius itself depends on the unknown optimum; this joint requirement appears load-bearing for the convex guarantee.

Authors: The two conditions are stated as joint assumptions on the problem instance. Because the level-set radius depends on the unknown optimum, a fixed r must be chosen small enough relative to that (unknown) quantity. We will revise the statement of the convex result to make explicit that r is permitted to depend on known quantities such as the initial suboptimality Δ₀ and the Lipschitz constant L, thereby rendering the joint requirement satisfiable by a single fixed r chosen from observable data. revision: partial
Referee: [§3] Proof sketch in §3: The combination of lower-tail bounds on adaptive Gaussian directional projections with upper-tail bounds on accumulated finite-difference errors is plausible, yet the manuscript does not display the explicit radius lower bound that must be maintained at every step to dominate the bias term by the concentration term. Without this explicit relation, it is impossible to confirm that the O(log(1/δ)) additive term suffices.

Authors: The full proof (Appendix) contains the precise inequality r ≥ C √(log(1/δ)) that ensures the accumulated bias is dominated by the lower-tail concentration term at each step. The §3 sketch omits this relation for brevity. We will expand the main-text sketch to include the displayed lower bound on r together with the resulting O(log(1/δ)) query overhead, making the dependence explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; high-probability bounds derived from external tail inequalities under explicit radius assumptions.

full rationale

The paper's derivation establishes high-probability convergence for the two-query Gaussian finite-difference method by combining lower-tail bounds on adaptive Gaussian projections with upper-tail control on accumulated smoothing errors. These steps rely on standard external concentration inequalities rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The explicit smoothing-radius condition is stated as a prerequisite assumption (not derived internally), and the query complexity bounds follow directly from the finite-horizon analysis without reducing to quantities defined by the method's own outputs. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard domain assumption that the objective has L-Lipschitz gradients and on the paper-specific requirement that the finite-difference radius satisfies trajectory-dependent bounds that do not degrade with δ.

axioms (2)

domain assumption The objective function has L-Lipschitz continuous gradients.
Explicitly stated in the abstract as the setting for deterministic objectives.
ad hoc to paper The finite-difference smoothing radius satisfies an explicit condition that remains independent of δ.
Required for the high-probability claim to avoid the shrinkage issue of Markov conversion.

pith-pipeline@v0.9.1-grok · 5751 in / 1423 out tokens · 33609 ms · 2026-06-29T16:04:54.693127+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-Probability Last-Iterate Guarantees for Two-Point Gaussian Zeroth-Order Stochastic Gradient Descent
math.OC 2026-06 unverdicted novelty 7.0

Direct high-probability last-iterate guarantee of Õ(d/T) for same-sample two-point Gaussian ZO-SGD under conditional exponential-moment noise when d ≥ 16 log(6T/δ).

Reference graph

Works this paper leans on

31 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1]

Convex optimization: Algorithms and complexity

S \'e bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8 0 (3--4): 0 231--357, 2015

2015
[2]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems

S \'e bastien Bubeck and Nicol \`o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

2012
[3]

A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization

Hanqin Cai, Yuchen Lou, Daniel Mckenzie, and Wotao Yin. A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1193--1203. PMLR, 2021. URL https://proceedings.mlr.press/v139/cai21d.html

2021
[4]

Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22 0 (1): 0 66--86, 2012

2012
[5]

Structured evolution with compact architectures for scalable policy optimization

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning, pages 970--978. PMLR, 2018

2018
[6]

A.R. Conn, K. Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009

2009
[7]

Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43 0 (3): 0 919--948, 2018

2018
[8]

Duchi, Michael I

John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61 0 (5): 0 2788--2806, 2015

2015
[9]

On the almost sure convergence of the stochastic three points algorithm

Taha El Bakkali El Kadi and Omar Saadi. On the almost sure convergence of the stochastic three points algorithm. In International Conference on Learning Representations, 2025. arXiv:2501.13886

work page arXiv 2025
[10]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23 0 (4): 0 2341--2368, 2013

2013
[11]

Royer, Lu \'i s N

Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25 0 (3): 0 1515--1541, 2015

2015
[12]

Royer, Lu \'i s N

Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38 0 (3): 0 1579--1597, 2018

2018
[13]

Black-box adversarial attacks with limited queries and information

Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137--2146. PMLR, 2018

2018
[14]

Kolda, Robert Michael Lewis, and Virginia Torczon

Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45 0 (3): 0 385--482, 2003

2003
[15]

Distributionally constrained black-box stochastic gradient estimation and optimization

Henry Lam and Junhui Zhang. Distributionally constrained black-box stochastic gradient estimation and optimization. Operations Research, 73 0 (5): 0 2680--2694, 2024

2024
[16]

Laurent and P

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28 0 (5): 0 1302--1338, 2000

2000
[17]

Error bounds and convergence analysis of feasible descent methods: A general approach

Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46--47 0 (1): 0 157--178, 1993

1993
[18]

Fine-tuning language models with just forward passes

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36: 0 53038--53075, 2023

2023
[19]

Nesterov and V

Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17 0 (2): 0 527--566, 2017

2017
[20]

Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Springer, New York, 2004

2004
[21]

Efficiency of coordinate descent methods on huge-scale optimization problems

Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22 0 (2): 0 341--362, 2012

2012
[22]

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2 edition, 2006

2006
[23]

Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function

Peter Richt \'a rik and Martin Tak \'a c . Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 0 (1--2): 0 1--38, 2014

2014
[24]

Parallel coordinate descent methods for big data optimization

Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156 0 (1--2): 0 433--484, 2016

2016
[25]

Lindon Roberts and Cl \'e ment W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33 0 (4): 0 3057--3082, 2023

2023
[26]

An optimal algorithm for bandit and zero-order convex optimization with two-point feedback

Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18 0 (52): 0 1--11, 2017

2017
[27]

u ller, and Bernd G \

Sebastian U. Stich, Christian L. M \"u ller, and Bernd G \"a rtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23 0 (2): 0 1284--1309, 2013

2013
[28]

Inexact coordinate descent: Complexity and preconditioning

Rachael Tappenden, Peter Richt \'a rik, and Jacek Gondzio. Inexact coordinate descent: Complexity and preconditioning. Journal of Optimization Theory and Applications, 170 0 (1): 0 144--176, 2016

2016
[29]

Lu \'i s N. Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1 0 (1): 0 143--153, 2013

2013
[30]

Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback

Yining Wang. Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback. Operations Research, 73 0 (2): 0 819--828, 2023

2023
[31]

A unified zeroth-order optimization framework via oblivious randomized sketching

Haishan Ye, Xiangyu Chang, and Xi Chen. A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945, 2025

work page arXiv 2025

[1] [1]

Convex optimization: Algorithms and complexity

S \'e bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8 0 (3--4): 0 231--357, 2015

2015

[2] [2]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems

S \'e bastien Bubeck and Nicol \`o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

2012

[3] [3]

A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization

Hanqin Cai, Yuchen Lou, Daniel Mckenzie, and Wotao Yin. A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1193--1203. PMLR, 2021. URL https://proceedings.mlr.press/v139/cai21d.html

2021

[4] [4]

Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22 0 (1): 0 66--86, 2012

2012

[5] [5]

Structured evolution with compact architectures for scalable policy optimization

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning, pages 970--978. PMLR, 2018

2018

[6] [6]

A.R. Conn, K. Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009

2009

[7] [7]

Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43 0 (3): 0 919--948, 2018

2018

[8] [8]

Duchi, Michael I

John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61 0 (5): 0 2788--2806, 2015

2015

[9] [9]

On the almost sure convergence of the stochastic three points algorithm

Taha El Bakkali El Kadi and Omar Saadi. On the almost sure convergence of the stochastic three points algorithm. In International Conference on Learning Representations, 2025. arXiv:2501.13886

work page arXiv 2025

[10] [10]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23 0 (4): 0 2341--2368, 2013

2013

[11] [11]

Royer, Lu \'i s N

Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25 0 (3): 0 1515--1541, 2015

2015

[12] [12]

Royer, Lu \'i s N

Serge Gratton, Cl \'e ment W. Royer, Lu \'i s N. Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38 0 (3): 0 1579--1597, 2018

2018

[13] [13]

Black-box adversarial attacks with limited queries and information

Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2137--2146. PMLR, 2018

2018

[14] [14]

Kolda, Robert Michael Lewis, and Virginia Torczon

Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45 0 (3): 0 385--482, 2003

2003

[15] [15]

Distributionally constrained black-box stochastic gradient estimation and optimization

Henry Lam and Junhui Zhang. Distributionally constrained black-box stochastic gradient estimation and optimization. Operations Research, 73 0 (5): 0 2680--2694, 2024

2024

[16] [16]

Laurent and P

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28 0 (5): 0 1302--1338, 2000

2000

[17] [17]

Error bounds and convergence analysis of feasible descent methods: A general approach

Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46--47 0 (1): 0 157--178, 1993

1993

[18] [18]

Fine-tuning language models with just forward passes

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36: 0 53038--53075, 2023

2023

[19] [19]

Nesterov and V

Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17 0 (2): 0 527--566, 2017

2017

[20] [20]

Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Springer, New York, 2004

2004

[21] [21]

Efficiency of coordinate descent methods on huge-scale optimization problems

Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22 0 (2): 0 341--362, 2012

2012

[22] [22]

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2 edition, 2006

2006

[23] [23]

Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function

Peter Richt \'a rik and Martin Tak \'a c . Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 0 (1--2): 0 1--38, 2014

2014

[24] [24]

Parallel coordinate descent methods for big data optimization

Peter Richt \'a rik and Martin Tak \'a c . Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156 0 (1--2): 0 433--484, 2016

2016

[25] [25]

Lindon Roberts and Cl \'e ment W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33 0 (4): 0 3057--3082, 2023

2023

[26] [26]

An optimal algorithm for bandit and zero-order convex optimization with two-point feedback

Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18 0 (52): 0 1--11, 2017

2017

[27] [27]

u ller, and Bernd G \

Sebastian U. Stich, Christian L. M \"u ller, and Bernd G \"a rtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23 0 (2): 0 1284--1309, 2013

2013

[28] [28]

Inexact coordinate descent: Complexity and preconditioning

Rachael Tappenden, Peter Richt \'a rik, and Jacek Gondzio. Inexact coordinate descent: Complexity and preconditioning. Journal of Optimization Theory and Applications, 170 0 (1): 0 144--176, 2016

2016

[29] [29]

Lu \'i s N. Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1 0 (1): 0 143--153, 2013

2013

[30] [30]

Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback

Yining Wang. Technical note---on adaptivity in nonstationary stochastic optimization with bandit feedback. Operations Research, 73 0 (2): 0 819--828, 2023

2023

[31] [31]

A unified zeroth-order optimization framework via oblivious randomized sketching

Haishan Ye, Xiangyu Chang, and Xi Chen. A unified zeroth-order optimization framework via oblivious randomized sketching. arXiv preprint arXiv:2510.10945, 2025

work page arXiv 2025