pith. machine review for the scientific record.

arXiv:2604.25372 · v2 · submitted 2026-04-28 · 🧮 math.OC · cs.LG · cs.NA · cs.SY · eess.SY · math.NA


From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability


Pith reviewed 2026-05-07 15:45 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · cs.NA · cs.SY · eess.SY · math.NA
keywords zeroth-order optimization · first-order optimization · input-to-state stability · convergence rates · dynamical systems · perturbation analysis · dimension dependence · optimization algorithms

The pith

Zeroth-order methods match first-order convergence rates in expectation when their averaged dynamics satisfy input-to-state stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that zeroth-order optimization algorithms can avoid the usual extra dependence on problem dimension that appears in their convergence rates relative to first-order methods. By casting the averaged update of a ZO algorithm as the corresponding first-order update plus a perturbation whose size is controlled by design parameters, the authors apply input-to-state stability to prove that the ZO trajectory tracks the FO trajectory at the same decay rate and ends inside a ball around the FO fixed point. The radius of that ball shrinks with the perturbation bound and can be made arbitrarily small by choice of parameters. A reader cares because this removes the dimension-dependent slowdown that has historically made ZO methods impractical for high-dimensional problems while preserving the same iteration complexity as FO methods.

Core claim

Under several conditions, and in expectation, ZO methods suffer no extra dimension dependence in their convergence rates relative to their FO counterparts. The averaged iteration of a ZO algorithm is formulated as the averaged iteration of its FO counterpart plus bounded perturbations whose magnitudes depend only on design parameters. Input-to-state stability then implies that ZO methods follow the same decay rate as their FO counterparts and converge to a neighbourhood of the FO fixed point whose radius depends on the bound on the perturbation norm and can be made arbitrarily small.

What carries the argument

Input-to-state stability applied to the dynamical system formed by expressing the averaged zeroth-order iteration as the first-order iteration plus a bounded perturbation that depends only on algorithm parameters.
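The decomposition can be sketched numerically. The snippet below assumes an averaged two-point Gaussian-smoothing estimator on a diagonal quadratic — a stand-in that may differ from the paper's exact oracle — and measures the perturbation w = ĝ_ZO − ∇f directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, x, mu=1e-3, m=2000):
    """Averaged two-point Gaussian-smoothing estimator (an assumed
    stand-in for the paper's ZO oracle)."""
    U = rng.standard_normal((m, x.size))
    fd = np.array([(f(x + mu * u) - f(x - mu * u)) / (2 * mu) for u in U])
    return (fd[:, None] * U).mean(axis=0)

# Quadratic test problem: the exact (FO) gradient is A @ x.
d = 50
A = np.diag(np.linspace(1.0, 10.0, d))
f = lambda z: 0.5 * z @ A @ z
x = rng.standard_normal(d)

g_fo = A @ x
w = zo_gradient(f, x) - g_fo   # the ZO update is the FO update plus this term
rel = np.linalg.norm(w) / np.linalg.norm(g_fo)
print(f"relative perturbation: {rel:.3f}")
```

Increasing the number of sampled directions m (or shrinking the smoothing radius mu, for non-quadratic objectives) tightens the perturbation, which is exactly the knob the ISS argument turns.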

If this is right

  • ZO algorithms inherit the same linear or sublinear rate as their FO versions once the transient due to the perturbation dies out.
  • The final error ball around the FO solution can be driven below any tolerance by tightening the perturbation bound through parameter selection.
  • Convergence guarantees already proved for FO methods extend immediately to the corresponding ZO methods up to the controllable neighborhood.
  • The result holds in expectation over the random perturbations introduced by finite-difference or random-direction estimators.
  • The same stability argument applies to any pair of ZO and FO recursions whose difference can be bounded uniformly in the state.
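In ISS notation (symbols chosen here for illustration; the paper's precise statement may differ), these bullets amount to a bound of the form:

```latex
% Schematic ISS-style bound: \beta is class-\mathcal{KL}, \gamma is class-\mathcal{K},
% x^\star is the FO fixed point, and w_j is the bounded perturbation.
\| x_k^{\mathrm{ZO}} - x^\star \|
  \;\le\;
  \beta\bigl( \| x_0 - x^\star \|,\, k \bigr)
  \;+\;
  \gamma\Bigl( \sup_{0 \le j < k} \| w_j \| \Bigr)
```

The first term is the transient, decaying at the FO rate; the second is the residual ball, which shrinks as the perturbation bound is tightened through the design parameters.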

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be used to certify robustness of ZO methods against other structured disturbances such as adversarial noise or model mismatch.
  • Parameter tuning rules that minimize the perturbation bound might be derived directly from the ISS gain function rather than by trial and error.
  • The approach suggests that similar perturbation-plus-stability arguments could close gaps between other derivative-free methods and their gradient-based analogues.
  • Extending the analysis from deterministic to stochastic or non-smooth settings would require checking whether the perturbation bound remains uniform under those relaxations.

Load-bearing premise

The averaged trajectory of the zeroth-order algorithm can be written exactly as the first-order trajectory plus a perturbation whose magnitude is bounded by quantities that depend only on design parameters, and the resulting closed-loop system is input-to-state stable.

What would settle it

A high-dimensional quadratic test problem in which the observed convergence rate of a standard zeroth-order gradient estimator deviates from the first-order rate by more than the predicted perturbation radius after the transient has died out, with the deviation growing with dimension.
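A minimal sketch of such a test, under assumed settings (Gaussian two-point estimator, diagonal quadratic, illustrative step size and sample count):

```python
import numpy as np

rng = np.random.default_rng(1)

def gap_after_transient(d, steps=400, h=0.02, mu=1e-4, m=400):
    """Distance between GD and averaged ZO-GD iterates on the quadratic
    f(x) = 0.5 * sum(a * x**2) after the transient (illustrative settings)."""
    a = np.linspace(1.0, 10.0, d)          # spectrum of the diagonal Hessian
    f = lambda X: 0.5 * (X * X) @ a        # batched objective; rows of X are points
    x_fo = rng.standard_normal(d)
    x_zo = x_fo.copy()
    for _ in range(steps):
        x_fo = x_fo - h * a * x_fo                        # exact gradient step
        U = rng.standard_normal((m, d))                   # random directions
        fd = (f(x_zo + mu * U) - f(x_zo - mu * U)) / (2 * mu)
        x_zo = x_zo - h * (fd[:, None] * U).mean(axis=0)  # averaged ZO step
    return float(np.linalg.norm(x_zo - x_fo))

for d in (10, 50, 200):
    print(d, gap_after_transient(d))
```

If the ISS account holds, the printed gaps stay within a small ball that does not grow with d; systematic growth of the post-transient gap with dimension would cut against the paper's claim.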

Figures

Figures reproduced from arXiv: 2604.25372 by Amir Ali Farzin, Iman Shames, Philipp Braun.

Figure 1: Quadratic objective (n = 1000): effect of the step size h on GD (solid) and ZO-GD (dashed). Reducing h shrinks the perturbation and brings ZO-GD closer to GD.
Figure 2: Training loss for GD, HB, NAG and their ZO counterparts on a two-layer neural network.
Figure 3: Dimension scaling: training loss for GD (solid) and ZO-GD (dashed) across five network …
Figure 4: Parameter analysis: smoothing parameter.
Figure 5: Parameter analysis: number of sampled directions.
Figure 6: Parameter analysis: compensation for divergent cases.
Figure 7: Parameter sensitivity for ZO-GD on the neural network. Left: effect of …
Original abstract

While it is generally understood that zeroth-order (ZO) algorithms have an extra dependency on their number of iterations for any choice of parameters, compared to their first-order (FO) counterparts, in this work, we show that under several conditions, in expectation, ZO methods do not suffer from extra dimension dependencies in their convergence rates with respect to their FO counterparts. We look at optimisation algorithms from the dynamical systems perspective and analyse the conditions under which one can formulate the average of a ZO algorithm as the average of its FO counterpart with bounded perturbations with values dependent on design parameters. Then, using input-to-state stability properties, we show ZO methods follow the same decay rate as their FO counterparts and converge to a neighbourhood of the fixed point of FO methods, where its radius depends on the bound of the norm of the perturbations, which can be made arbitrarily small. The theoretical findings are illustrated via numerical examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that, under several conditions, zeroth-order (ZO) optimization algorithms do not incur extra dimension dependencies in their expected convergence rates relative to first-order (FO) counterparts. Viewing the algorithms through a dynamical-systems lens, the averaged ZO iteration is expressed as the corresponding FO iteration plus a perturbation whose norm is bounded solely by design parameters (e.g., smoothing radius). Input-to-state stability (ISS) of the averaged FO system is then invoked to conclude that ZO trajectories follow the same decay rate and converge to a neighborhood of the FO fixed point whose radius can be made arbitrarily small by tightening the perturbation bound. Numerical examples are provided to illustrate the findings.

Significance. If the stated conditions hold and the perturbation bound is shown to be dimension-independent, the result would be significant: it offers a principled way to close the well-known ZO-FO gap in iteration complexity for high-dimensional problems where gradients are unavailable. The dynamical-systems/ISS framework is a strength and could generalize beyond the algorithms treated. The work earns credit for attempting a parameter-dependent perturbation analysis rather than relying on ad-hoc bounds.

major comments (2)
  1. [Abstract] Abstract: the claim that ZO methods 'do not suffer from extra dimension dependencies' is load-bearing, yet the abstract supplies neither the explicit conditions nor the bias derivation showing that the perturbation norm depends only on design parameters and is free of residual d-dependence; without these steps the central assertion cannot be verified.
  2. [Dynamical-systems formulation] The weakest assumption (that the averaged ZO map equals the FO map plus a bounded perturbation) must be derived explicitly for each algorithm considered; the manuscript should supply the bias calculation (presumably in the dynamical-systems section) and confirm that the bound contains no hidden dimension factors.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it named the concrete ZO algorithms (e.g., ZO-GD, ZO-SGD) to which the ISS argument is applied.
  2. [Numerical examples] Numerical examples should report both the observed convergence rate and the size of the limiting neighborhood as functions of the smoothing radius, allowing direct comparison with the ISS-derived bound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract should more explicitly state the conditions and that the bias derivations need to be supplied in full detail in the dynamical-systems section. We will make these revisions to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that ZO methods 'do not suffer from extra dimension dependencies' is load-bearing, yet the abstract supplies neither the explicit conditions nor the bias derivation showing that the perturbation norm depends only on design parameters and is free of residual d-dependence; without these steps the central assertion cannot be verified.

    Authors: We acknowledge the point. While the abstract refers to 'several conditions' and 'bounded perturbations with values dependent on design parameters,' it does not enumerate them or outline the bias. In the revised version, we will modify the abstract to list the primary conditions (Lipschitz smoothness of the objective function, a fixed positive smoothing radius, and taking expectation with respect to the random directions in the ZO estimator) and indicate that the perturbation bound is obtained from the expectation of the ZO gradient estimator bias, which is independent of the dimension d. The explicit bias derivation will be included in the main text. revision: yes

  2. Referee: [Dynamical-systems formulation] The weakest assumption (that the averaged ZO map equals the FO map plus a bounded perturbation) must be derived explicitly for each algorithm considered; the manuscript should supply the bias calculation (presumably in the dynamical-systems section) and confirm that the bound contains no hidden dimension factors.

    Authors: We agree that this assumption is central and should be derived explicitly. The manuscript currently presents the averaged ZO iteration as the FO iteration plus a perturbation in the dynamical-systems formulation, but we will add a new subsection with the complete calculation of the bias for each algorithm (starting with the standard Gaussian smoothing ZO gradient). Using a Taylor expansion of the objective, we derive that the expected perturbation is bounded by a constant times the smoothing radius, where the constant depends on the gradient Lipschitz constant but not on the dimension. We will include the proof that there are no hidden d factors and state the precise bound. This will be done for all algorithms analyzed in the paper. revision: yes
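One standard route to a dimension-free bias bound, sketched here under ball smoothing with an L-Lipschitz gradient (whether this matches the authors' estimator and normalisation is an assumption, not something the rebuttal states):

```latex
% Smoothed objective over the unit ball B with smoothing radius \mu:
f_\mu(x) = \mathbb{E}_{v \sim \mathrm{Unif}(B)}\bigl[ f(x + \mu v) \bigr],
\qquad
\nabla f_\mu(x) - \nabla f(x)
  = \mathbb{E}_v\bigl[ \nabla f(x + \mu v) - \nabla f(x) \bigr],
% and, since \|\nabla f(x+\mu v) - \nabla f(x)\| \le L \mu \|v\| \le L \mu:
\|\nabla f_\mu(x) - \nabla f(x)\| \;\le\; L \mu .
```

The expected ZO step would then read E[x_{k+1} | x_k] = x_k − h∇f(x_k) − h w_k with ‖w_k‖ ≤ Lμ: the perturbation bound involves only the design parameter μ and the smoothness constant, not the dimension d.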

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation formulates the expected ZO iteration as the FO iteration plus a perturbation whose bound depends only on design parameters (smoothing radius etc.), then invokes standard external ISS theorems to conclude that the ZO trajectory tracks the FO fixed point within a radius controlled by that bound. No step equates a derived quantity to its own input by definition, renames a fitted parameter as a prediction, or relies on a self-citation whose content is itself unverified. The ISS property is an independent dynamical-systems fact; the bias bound is derived from the smoothing operator and does not presuppose the convergence claim. The chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that ZO updates equal FO updates plus bounded perturbations and that the closed-loop system satisfies input-to-state stability. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The expected ZO update equals the FO update plus a perturbation whose norm is bounded by a quantity that depends only on design parameters.
    Invoked to reduce ZO dynamics to a perturbed FO system.
  • domain assumption The resulting dynamical system satisfies input-to-state stability with respect to the perturbation input.
    Used to conclude that bounded perturbations produce bounded deviation from the FO trajectory.

pith-pipeline@v0.9.0 · 5479 in / 1399 out tokens · 46994 ms · 2026-05-07T15:45:59.807520+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

8 extracted references · 7 canonical work pages

  1. [FPB+25b] Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, Tyler Summers, and Iman Shames. Minimisation of submodular functions using Gaussian zeroth-order random oracles. arXiv preprint arXiv:2510.15257.
  2. [FPB+26] Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, Tyler Summers, and Iman Shames. Solving the offline and online min-max problem of non-smooth submodular-concave functions: a zeroth-order approach. arXiv preprint arXiv:2601.21243, 2026.
  3. [FPBS25a] Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, and Iman Shames. Minimisation of quasar-convex functions using random zeroth-order oracles. arXiv preprint arXiv:2505.02281.
  4. [FPBS26] Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, and Iman Shames. On the stability connection between discrete-time algorithms and their resolution ODEs: applications to min-max optimisation. arXiv preprint arXiv:2603.01430.
  5. [SHC+17] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
  6. [Son08] Eduardo D. Sontag. Input to state stability: basic concepts and results. In Nonlinear and Optimal Control Theory: Lectures given at the C.I.M.E. Summer School held in Cetraro, Italy, June 19–29, 2004, pages 163–220. Springer.
  7. [WBMR20] Zhongruo Wang, Krishnakumar Balasubramanian, Shiqian Ma, and Meisam Razaviyayn. Zeroth-order algorithms for nonconvex minimax problems with improved complexities. arXiv preprint arXiv:2001.07819.
  8. [YHF+18] Haishan Ye, Zhichao Huang, Cong Fang, Chris Junchi Li, and Tong Zhang. Hessian-aware zeroth-order optimization for black-box adversarial attack. arXiv preprint arXiv:1812.11377.