High-probability zeroth-order online convex optimisation beyond Euclidean geometry

Arya Akhavan; David Janz; El-Mahdi El-Mhamdi

arxiv: 2509.21484 · v3 · submitted 2025-09-25 · 💻 cs.LG · stat.ML

High-probability zeroth-order online convex optimisation beyond Euclidean geometry

David Janz , El-Mahdi El-Mhamdi , Arya Akhavan This is my paper

Pith reviewed 2026-05-18 13:54 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online convex optimizationzeroth-order optimizationhigh-probability regret boundsFTRLnon-Euclidean geometryLipschitz lossesfinite-difference estimatorsquadratic variation

0 comments

The pith

Random Lipschitz losses with convex means admit high-probability regret bounds in any norm for zeroth-order online convex optimisation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes unified high-probability regret bounds for online convex optimisation with ell_q-Lipschitz random losses whose expectations are convex. It pairs ell_p-regularised FTRL with randomised two-point finite-difference estimators sampled via cone measures on ell_r-spheres. The bounds apply for every p, q, r in [1, infinity] and are time-uniform because the analysis obtains all-moment bounds on the estimator in the dual FTRL norm. A sympathetic reader would care because the results recover and strengthen prior in-expectation rates to high probability while proving optimality for q up to 2 and identifying an intrinsic gap for larger q.

Core claim

For random Lipschitz losses whose mean is convex, the analysis driven by all-moment bounds for the gradient estimator in the dual FTRL norm yields time-uniform control of the quadratic variation, producing unified high-probability regret bounds that hold for arbitrary p, q, r. The algorithm is anytime and data-driven; in previously studied special cases the rates recover the known in-expectation guarantees while strengthening them to time-uniform high probability. Together with constant-probability lower bounds these results establish optimality for q in [1,2] under appropriate sampling geometry and expose a gap for q greater than 2 that appears intrinsic to the estimators themselves.

What carries the argument

All-moment bounds for the randomised two-point finite-difference gradient estimator in the dual FTRL norm, which deliver time-uniform quadratic variation control.

If this is right

The algorithm recovers known in-expectation regret rates in special cases while strengthening them to time-uniform high probability.
Optimality holds for q in [1,2] under appropriate sampling geometry when combined with constant-probability lower bounds.
A performance gap for q greater than 2 appears intrinsic to the cone-measure two-point estimators.
The method is anytime and data-driven for every choice of p, q, r in [1, infinity].

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The all-moment technique could extend to other zeroth-order estimators or to losses whose expectations are only approximately convex.
Alternative sampling distributions on the sphere might close the identified gap for q greater than 2.
Time-uniform high-probability control may prove useful in sequential settings where accumulated failure probability must remain small.

Load-bearing premise

The losses are random with convex mean and the gradient estimator admits all-moment bounds in the dual FTRL norm that enable time-uniform quadratic variation control.

What would settle it

An explicit sequence of random Lipschitz losses with convex mean together with a choice of p, q, r where the quadratic variation of the estimator exceeds the claimed all-moment bound and produces regret violating the high-probability guarantee would falsify the result.

read the original abstract

We study online convex optimisation with $\ell_q$-Lipschitz losses, $\ell_p$-regularised FTRL, and randomised two-point finite-difference gradient estimators based on cone-measure sampling from $\ell_r$-spheres. For random Lipschitz losses whose mean is convex, we prove unified high-probability regret bounds for all $p,q,r \in [1,\infty]$. The analysis is driven by all-moment bounds for the gradient estimator in the dual FTRL norm, yielding time-uniform control of the quadratic variation. The algorithm is anytime and data-driven; in the special cases previously studied, its rates recover the known in-expectation guarantees while strengthening them to time-uniform high probability. Together with constant-probability lower bounds, these results establish optimality for $q\in[1,2]$ under appropriate sampling geometry, and expose a gap for $q>2$ that appears intrinsic to the estimators themselves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper unifies high-probability time-uniform bounds for zeroth-order OCO across general norms but the all-moment control needs explicit verification when p, q, r differ.

read the letter

The main takeaway is that this paper gives time-uniform high-probability regret bounds for zeroth-order online convex optimization with q-norm Lipschitz losses, p-norm regularized FTRL, and two-point cone-measure estimators on r-spheres. It targets random losses whose mean is convex and claims the bounds hold for every p, q, r in [1, infinity]. The analysis rests on all-moment bounds for the estimator in the dual norm to control quadratic variation through martingale tools, and the method is anytime and data-driven. In special cases it recovers and strengthens earlier in-expectation rates. It also supplies constant-probability lower bounds that establish optimality for q in [1,2] under suitable sampling and flag an intrinsic gap for q larger than 2. That combination of unified upper bounds plus matching lower bounds is the clearest new piece. The work is technically grounded and engages the existing literature without obvious circularity or invented quantities. The soft spot is exactly the one the stress-test flags: when the Lipschitz geometry q, the sampling radius r, and the regularizer p are not the same, norm equivalences can insert factors that grow with dimension or with the gap between 1/r and 1/p. The abstract asserts the moment constants stay controlled for arbitrary mismatches, but without the explicit derivations or dependence on d it is impossible to confirm the bounds remain useful rather than vacuous. If the full proofs keep the constants independent of these mismatches, the result is stronger; if they hide d-dependent terms or extra assumptions, the “for all p,q,r” claim narrows. This paper is for researchers already working on high-probability online optimization or non-Euclidean bandit settings. A reader who cares about anytime guarantees and general-norm analyses will find concrete rates and lower bounds worth checking. The thinking is clear and the claims are specific enough to be checked, so the paper deserves a serious referee to inspect the moment bounds and the lower-bound arguments. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript studies zeroth-order online convex optimization with ℓ_q-Lipschitz losses, ℓ_p-regularized FTRL, and two-point finite-difference gradient estimators obtained by cone-measure sampling on ℓ_r-spheres. For random losses whose mean is convex, it claims unified high-probability regret bounds that hold for every triple (p, q, r) ∈ [1, ∞]^3. The analysis proceeds by establishing all-moment bounds on the estimator in the dual FTRL norm, which are then used to obtain time-uniform control of the quadratic variation process via a suitable martingale inequality. The resulting bounds are anytime and data-driven; they recover the known in-expectation rates in previously studied special cases while strengthening them to high probability, and they are shown to be optimal for q ∈ [1, 2] under matching sampling geometry while exposing an intrinsic gap for q > 2.

Significance. If the moment-bound and quadratic-variation arguments are correct, the work supplies the first unified high-probability theory for non-Euclidean zeroth-order online convex optimization. It strengthens existing in-expectation guarantees to time-uniform statements, identifies matching lower bounds for q ≤ 2, and isolates an estimator-dependent gap for q > 2. These results would be of clear interest to the online-learning and zeroth-order optimization communities.

major comments (2)

[analysis of moment bounds for the gradient estimator] The central high-probability claim for arbitrary p, q, r rests on the derivation of all-moment bounds E[‖G_t‖_*^k] ≤ C_k (independent of mismatches) for the two-point cone-measure estimator G_t measured in the dual norm induced by the ℓ_p-FTRL regularizer. When q (Lipschitz geometry) differs from r (sampling) and p (regularizer), the norm-equivalence constants that appear in the moment calculations can grow with dimension or with |1/r − 1/p|. The manuscript must explicitly verify that these constants remain bounded for general Lipschitz losses whose mean is convex; otherwise the time-uniform quadratic-variation control invoked later in the proof fails to hold uniformly.
[optimality and lower-bound section] The optimality statement for q ∈ [1, 2] and the claimed intrinsic gap for q > 2 are load-bearing conclusions. They rely on the same all-moment bounds together with constant-probability lower bounds. The lower-bound construction and its dependence on the choice of sampling radius r should be stated with explicit constants so that the gap can be verified independently of the upper-bound derivation.

minor comments (2)

[preliminaries] Notation for the dual norm ‖·‖_* and the precise definition of the cone-measure sampling distribution on the ℓ_r-sphere should be introduced once, early in the paper, rather than re-defined in each technical lemma.
[introduction] The statement that the algorithm is 'data-driven' and 'anytime' is repeated in the abstract and introduction; a single forward reference to the precise stopping-time argument would suffice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We respond to each major comment below and will revise the manuscript to address the points raised.

read point-by-point responses

Referee: [analysis of moment bounds for the gradient estimator] The central high-probability claim for arbitrary p, q, r rests on the derivation of all-moment bounds E[‖G_t‖_*^k] ≤ C_k (independent of mismatches) for the two-point cone-measure estimator G_t measured in the dual norm induced by the ℓ_p-FTRL regularizer. When q (Lipschitz geometry) differs from r (sampling) and p (regularizer), the norm-equivalence constants that appear in the moment calculations can grow with dimension or with |1/r − 1/p|. The manuscript must explicitly verify that these constants remain bounded for general Lipschitz losses whose mean is convex; otherwise the time-uniform quadratic-variation control invoked later in the proof fails to hold uniformly.

Authors: We thank the referee for this observation. The all-moment bounds appear in Lemma 3.2, where the cone-measure sampling on the ℓ_r-sphere is used to obtain rotational invariance that controls the k-th moments of the two-point estimator directly in the dual norm. While general norm equivalences between ℓ_p, ℓ_q and ℓ_r can involve dimension-dependent factors, the specific construction of the estimator (normalized directional differences along cone-measure samples) combined with the assumption that the mean loss is convex yields moment bounds whose constants depend only on p, q, r and k. These constants remain finite and independent of dimension and time for any fixed triple in [1, ∞]^3. We will add an explicit corollary after Lemma 3.2 that computes the norm-equivalence factors appearing in the proof and confirms they are bounded (in fact, at most polynomial in max(p, q, r)). This will make the subsequent quadratic-variation control fully rigorous and uniform. revision: yes
Referee: [optimality and lower-bound section] The optimality statement for q ∈ [1, 2] and the claimed intrinsic gap for q > 2 are load-bearing conclusions. They rely on the same all-moment bounds together with constant-probability lower bounds. The lower-bound construction and its dependence on the choice of sampling radius r should be stated with explicit constants so that the gap can be verified independently of the upper-bound derivation.

Authors: We agree that greater explicitness in the lower-bound section will strengthen the paper. Section 4 currently sketches a standard hard-instance construction for ℓ_q-Lipschitz losses, adapted to the two-point estimator with cone-measure sampling of radius r. The dependence on r enters through the variance term in the constant-probability lower bound. In the revision we will expand this section to state the lower bounds with fully explicit constants, including the precise multiplicative factors involving r. This will allow direct comparison with the upper bounds and confirm that the matching holds (up to universal factors) for q ∈ [1, 2] under suitable choice of r, while the gap for q > 2 is indeed estimator-intrinsic and independent of the upper-bound analysis. revision: yes

Circularity Check

0 steps flagged

Derivation relies on external all-moment bounds and standard FTRL analysis; no reduction to self-referential inputs

full rationale

The central high-probability regret bounds are obtained by deriving all-moment controls on the two-point cone-measure gradient estimator in the dual FTRL norm and then applying standard martingale inequalities for time-uniform quadratic variation. These moment bounds are presented as proven results for the given sampling geometry rather than fitted parameters or quantities defined in terms of the target regret. Special cases recover prior in-expectation rates without re-deriving them circularly. While self-citations to earlier FTRL or zeroth-order work appear, they are not load-bearing for the unified p,q,r claim; the argument remains self-contained against external benchmarks and does not reduce by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that losses are random with convex mean and that the randomized estimator admits all-moment bounds sufficient for quadratic variation control. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Losses are random with convex mean
Explicitly stated as the setting in which the high-probability bounds hold.
domain assumption Gradient estimator admits all-moment bounds in the dual FTRL norm
Described as driving the analysis and enabling time-uniform quadratic variation control.

pith-pipeline@v0.9.0 · 5690 in / 1512 out tokens · 53758 ms · 2026-05-18T13:54:09.181452+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

all-moment bounds for the gradient estimator in the dual FTRL norm, yielding time-uniform control of the quadratic variation
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unified high-probability regret bounds for all p,q,r ∈ [1,∞]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1]

Cited on pages 1, 4, 13. 13 H. Fokkema, D. Van der Hoeven, T. Lattimore and J. J. Mayo. Online newton method for bandit convex optimisation extended abstract. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1713–1714, 2024. Cited on page 8. W. Huang, M. Ye and B. Du. Learn from others and be yourself in heterogeneous federated learning. I...

work page arXiv 2024
[2]

We begin by extendingf to a1-Lipschitz function defined on all ofRd

For every 1-Lipschitz functionfon∂B d 1, allr>0, and allδ∈(0,1), P(|f(ζ)−Ef(ζ)|>r)≤361 exp{−0.003rd}.(11) Proof. We begin by extendingf to a1-Lipschitz function defined on all ofRd. This may be done by, for example, taking¯f(x) = inf y∈∂Bd 1 (f(y) +∥x−y∥2). We will writef for the extended function. Now letx′,S′be independent copies ofx,S. By the triangle ...

work page 2024
[3]

Samplesc∼µandζ◦uniformly from∂Bd 2

work page
[4]

Queries the function at two perturbed points: y=f c(x+hζ), y ′=f c(x−hζ), whereh>0

work page
[5]

As Proposition 5.2 relies on Lipschitz concentration on theℓ1-sphere, the analogous proposition forg◦depends on the concentration of a Lipschitz function on theℓ2-sphere

Forms the estimator g◦= d 2h(y−y′)ζ◦.(38) Similar to the justification for constructing gradient estimators based onℓ1-randomisation, the construction in(38) can be motivated via Stokes’ theorem and the fact thatg◦is an unbiased gradient estimator for the following surrogate function: f◦ h(x) =E[f(x+hU)], whereUis uniformly generated fromB d 2. As Proposi...

work page 2018
[6]

Proposition E.2.Suppose Assumption 3.2 holds

For every1-Lipschitz functionfon∂B d 2, allr>0, and allδ∈(0,1), P(|f(X)−Ef(X)|>r)≤exp{−cdr2},(39) wherec>0is a universal constant that is independent ofd. Proposition E.2.Suppose Assumption 3.2 holds. Fort∈[n]and j∈[m], letxt be the iterates ofFedZero, equipped with theℓ2-randomised gradient estimators and letg◦ j,t. Then, for allk≥2, the random variables...

work page 2020
[7]

Let ζ be a random variable that is uniformly generated from∂Bd

work page
[8]

IfΘ ⊂Rd is convex set, andf is a convex function thenfh is convex onΘandf h(x)≥f(x)for allx∈Θ

Then E [ d 2h(fc(x+hζ)−fc(x−hζ)) sign(ζ) ] =∇fh(x).(40) Moreover, for alld≥3andx∈Rd we have |fh(x)−f(x)|≤bq(d)Lh,(41) whereb q(d)is defined in (4). IfΘ ⊂Rd is convex set, andf is a convex function thenfh is convex onΘandf h(x)≥f(x)for allx∈Θ. Lemma F.2(Lemma 4 Akhavan et al. (2022)).Define the filtrationF= (Ft)t∈Nsuch thatF t =∪m j=1∪t k=1{ζj,k, xk+1}, an...

work page 2022

[1] [1]

Cited on pages 1, 4, 13. 13 H. Fokkema, D. Van der Hoeven, T. Lattimore and J. J. Mayo. Online newton method for bandit convex optimisation extended abstract. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1713–1714, 2024. Cited on page 8. W. Huang, M. Ye and B. Du. Learn from others and be yourself in heterogeneous federated learning. I...

work page arXiv 2024

[2] [2]

We begin by extendingf to a1-Lipschitz function defined on all ofRd

For every 1-Lipschitz functionfon∂B d 1, allr>0, and allδ∈(0,1), P(|f(ζ)−Ef(ζ)|>r)≤361 exp{−0.003rd}.(11) Proof. We begin by extendingf to a1-Lipschitz function defined on all ofRd. This may be done by, for example, taking¯f(x) = inf y∈∂Bd 1 (f(y) +∥x−y∥2). We will writef for the extended function. Now letx′,S′be independent copies ofx,S. By the triangle ...

work page 2024

[3] [3]

Samplesc∼µandζ◦uniformly from∂Bd 2

work page

[4] [4]

Queries the function at two perturbed points: y=f c(x+hζ), y ′=f c(x−hζ), whereh>0

work page

[5] [5]

As Proposition 5.2 relies on Lipschitz concentration on theℓ1-sphere, the analogous proposition forg◦depends on the concentration of a Lipschitz function on theℓ2-sphere

Forms the estimator g◦= d 2h(y−y′)ζ◦.(38) Similar to the justification for constructing gradient estimators based onℓ1-randomisation, the construction in(38) can be motivated via Stokes’ theorem and the fact thatg◦is an unbiased gradient estimator for the following surrogate function: f◦ h(x) =E[f(x+hU)], whereUis uniformly generated fromB d 2. As Proposi...

work page 2018

[6] [6]

Proposition E.2.Suppose Assumption 3.2 holds

For every1-Lipschitz functionfon∂B d 2, allr>0, and allδ∈(0,1), P(|f(X)−Ef(X)|>r)≤exp{−cdr2},(39) wherec>0is a universal constant that is independent ofd. Proposition E.2.Suppose Assumption 3.2 holds. Fort∈[n]and j∈[m], letxt be the iterates ofFedZero, equipped with theℓ2-randomised gradient estimators and letg◦ j,t. Then, for allk≥2, the random variables...

work page 2020

[7] [7]

Let ζ be a random variable that is uniformly generated from∂Bd

work page

[8] [8]

IfΘ ⊂Rd is convex set, andf is a convex function thenfh is convex onΘandf h(x)≥f(x)for allx∈Θ

Then E [ d 2h(fc(x+hζ)−fc(x−hζ)) sign(ζ) ] =∇fh(x).(40) Moreover, for alld≥3andx∈Rd we have |fh(x)−f(x)|≤bq(d)Lh,(41) whereb q(d)is defined in (4). IfΘ ⊂Rd is convex set, andf is a convex function thenfh is convex onΘandf h(x)≥f(x)for allx∈Θ. Lemma F.2(Lemma 4 Akhavan et al. (2022)).Define the filtrationF= (Ft)t∈Nsuch thatF t =∪m j=1∪t k=1{ζj,k, xk+1}, an...

work page 2022