High-probability zeroth-order online convex optimisation beyond Euclidean geometry
Pith reviewed 2026-05-18 13:54 UTC · model grok-4.3
The pith
Random Lipschitz losses with convex means admit high-probability regret bounds in any norm for zeroth-order online convex optimisation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For random Lipschitz losses whose mean is convex, the analysis driven by all-moment bounds for the gradient estimator in the dual FTRL norm yields time-uniform control of the quadratic variation, producing unified high-probability regret bounds that hold for arbitrary p, q, r. The algorithm is anytime and data-driven; in previously studied special cases the rates recover the known in-expectation guarantees while strengthening them to time-uniform high probability. Together with constant-probability lower bounds these results establish optimality for q in [1,2] under appropriate sampling geometry and expose a gap for q greater than 2 that appears intrinsic to the estimators themselves.
What carries the argument
All-moment bounds for the randomised two-point finite-difference gradient estimator in the dual FTRL norm, which deliver time-uniform quadratic variation control.
If this is right
- The algorithm recovers known in-expectation regret rates in special cases while strengthening them to time-uniform high probability.
- Optimality holds for q in [1,2] under appropriate sampling geometry when combined with constant-probability lower bounds.
- A performance gap for q greater than 2 appears intrinsic to the cone-measure two-point estimators.
- The method is anytime and data-driven for every choice of p, q, r in [1, infinity].
Where Pith is reading between the lines
- The all-moment technique could extend to other zeroth-order estimators or to losses whose expectations are only approximately convex.
- Alternative sampling distributions on the sphere might close the identified gap for q greater than 2.
- Time-uniform high-probability control may prove useful in sequential settings where accumulated failure probability must remain small.
Load-bearing premise
The losses are random with convex mean and the gradient estimator admits all-moment bounds in the dual FTRL norm that enable time-uniform quadratic variation control.
What would settle it
An explicit sequence of random Lipschitz losses with convex mean together with a choice of p, q, r where the quadratic variation of the estimator exceeds the claimed all-moment bound and produces regret violating the high-probability guarantee would falsify the result.
read the original abstract
We study online convex optimisation with $\ell_q$-Lipschitz losses, $\ell_p$-regularised FTRL, and randomised two-point finite-difference gradient estimators based on cone-measure sampling from $\ell_r$-spheres. For random Lipschitz losses whose mean is convex, we prove unified high-probability regret bounds for all $p,q,r \in [1,\infty]$. The analysis is driven by all-moment bounds for the gradient estimator in the dual FTRL norm, yielding time-uniform control of the quadratic variation. The algorithm is anytime and data-driven; in the special cases previously studied, its rates recover the known in-expectation guarantees while strengthening them to time-uniform high probability. Together with constant-probability lower bounds, these results establish optimality for $q\in[1,2]$ under appropriate sampling geometry, and expose a gap for $q>2$ that appears intrinsic to the estimators themselves.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies zeroth-order online convex optimization with ℓ_q-Lipschitz losses, ℓ_p-regularized FTRL, and two-point finite-difference gradient estimators obtained by cone-measure sampling on ℓ_r-spheres. For random losses whose mean is convex, it claims unified high-probability regret bounds that hold for every triple (p, q, r) ∈ [1, ∞]^3. The analysis proceeds by establishing all-moment bounds on the estimator in the dual FTRL norm, which are then used to obtain time-uniform control of the quadratic variation process via a suitable martingale inequality. The resulting bounds are anytime and data-driven; they recover the known in-expectation rates in previously studied special cases while strengthening them to high probability, and they are shown to be optimal for q ∈ [1, 2] under matching sampling geometry while exposing an intrinsic gap for q > 2.
Significance. If the moment-bound and quadratic-variation arguments are correct, the work supplies the first unified high-probability theory for non-Euclidean zeroth-order online convex optimization. It strengthens existing in-expectation guarantees to time-uniform statements, identifies matching lower bounds for q ≤ 2, and isolates an estimator-dependent gap for q > 2. These results would be of clear interest to the online-learning and zeroth-order optimization communities.
major comments (2)
- [analysis of moment bounds for the gradient estimator] The central high-probability claim for arbitrary p, q, r rests on the derivation of all-moment bounds E[‖G_t‖_*^k] ≤ C_k (independent of mismatches) for the two-point cone-measure estimator G_t measured in the dual norm induced by the ℓ_p-FTRL regularizer. When q (Lipschitz geometry) differs from r (sampling) and p (regularizer), the norm-equivalence constants that appear in the moment calculations can grow with dimension or with |1/r − 1/p|. The manuscript must explicitly verify that these constants remain bounded for general Lipschitz losses whose mean is convex; otherwise the time-uniform quadratic-variation control invoked later in the proof fails to hold uniformly.
- [optimality and lower-bound section] The optimality statement for q ∈ [1, 2] and the claimed intrinsic gap for q > 2 are load-bearing conclusions. They rely on the same all-moment bounds together with constant-probability lower bounds. The lower-bound construction and its dependence on the choice of sampling radius r should be stated with explicit constants so that the gap can be verified independently of the upper-bound derivation.
minor comments (2)
- [preliminaries] Notation for the dual norm ‖·‖_* and the precise definition of the cone-measure sampling distribution on the ℓ_r-sphere should be introduced once, early in the paper, rather than re-defined in each technical lemma.
- [introduction] The statement that the algorithm is 'data-driven' and 'anytime' is repeated in the abstract and introduction; a single forward reference to the precise stopping-time argument would suffice.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We respond to each major comment below and will revise the manuscript to address the points raised.
read point-by-point responses
-
Referee: [analysis of moment bounds for the gradient estimator] The central high-probability claim for arbitrary p, q, r rests on the derivation of all-moment bounds E[‖G_t‖_*^k] ≤ C_k (independent of mismatches) for the two-point cone-measure estimator G_t measured in the dual norm induced by the ℓ_p-FTRL regularizer. When q (Lipschitz geometry) differs from r (sampling) and p (regularizer), the norm-equivalence constants that appear in the moment calculations can grow with dimension or with |1/r − 1/p|. The manuscript must explicitly verify that these constants remain bounded for general Lipschitz losses whose mean is convex; otherwise the time-uniform quadratic-variation control invoked later in the proof fails to hold uniformly.
Authors: We thank the referee for this observation. The all-moment bounds appear in Lemma 3.2, where the cone-measure sampling on the ℓ_r-sphere is used to obtain rotational invariance that controls the k-th moments of the two-point estimator directly in the dual norm. While general norm equivalences between ℓ_p, ℓ_q and ℓ_r can involve dimension-dependent factors, the specific construction of the estimator (normalized directional differences along cone-measure samples) combined with the assumption that the mean loss is convex yields moment bounds whose constants depend only on p, q, r and k. These constants remain finite and independent of dimension and time for any fixed triple in [1, ∞]^3. We will add an explicit corollary after Lemma 3.2 that computes the norm-equivalence factors appearing in the proof and confirms they are bounded (in fact, at most polynomial in max(p, q, r)). This will make the subsequent quadratic-variation control fully rigorous and uniform. revision: yes
-
Referee: [optimality and lower-bound section] The optimality statement for q ∈ [1, 2] and the claimed intrinsic gap for q > 2 are load-bearing conclusions. They rely on the same all-moment bounds together with constant-probability lower bounds. The lower-bound construction and its dependence on the choice of sampling radius r should be stated with explicit constants so that the gap can be verified independently of the upper-bound derivation.
Authors: We agree that greater explicitness in the lower-bound section will strengthen the paper. Section 4 currently sketches a standard hard-instance construction for ℓ_q-Lipschitz losses, adapted to the two-point estimator with cone-measure sampling of radius r. The dependence on r enters through the variance term in the constant-probability lower bound. In the revision we will expand this section to state the lower bounds with fully explicit constants, including the precise multiplicative factors involving r. This will allow direct comparison with the upper bounds and confirm that the matching holds (up to universal factors) for q ∈ [1, 2] under suitable choice of r, while the gap for q > 2 is indeed estimator-intrinsic and independent of the upper-bound analysis. revision: yes
Circularity Check
Derivation relies on external all-moment bounds and standard FTRL analysis; no reduction to self-referential inputs
full rationale
The central high-probability regret bounds are obtained by deriving all-moment controls on the two-point cone-measure gradient estimator in the dual FTRL norm and then applying standard martingale inequalities for time-uniform quadratic variation. These moment bounds are presented as proven results for the given sampling geometry rather than fitted parameters or quantities defined in terms of the target regret. Special cases recover prior in-expectation rates without re-deriving them circularly. While self-citations to earlier FTRL or zeroth-order work appear, they are not load-bearing for the unified p,q,r claim; the argument remains self-contained against external benchmarks and does not reduce by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Losses are random with convex mean
- domain assumption Gradient estimator admits all-moment bounds in the dual FTRL norm
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
all-moment bounds for the gradient estimator in the dual FTRL norm, yielding time-uniform control of the quadratic variation
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
unified high-probability regret bounds for all p,q,r ∈ [1,∞]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Cited on pages 1, 4, 13. 13 H. Fokkema, D. Van der Hoeven, T. Lattimore and J. J. Mayo. Online newton method for bandit convex optimisation extended abstract. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1713–1714, 2024. Cited on page 8. W. Huang, M. Ye and B. Du. Learn from others and be yourself in heterogeneous federated learning. I...
-
[2]
We begin by extendingf to a1-Lipschitz function defined on all ofRd
For every 1-Lipschitz functionfon∂B d 1, allr>0, and allδ∈(0,1), P(|f(ζ)−Ef(ζ)|>r)≤361 exp{−0.003rd}.(11) Proof. We begin by extendingf to a1-Lipschitz function defined on all ofRd. This may be done by, for example, taking¯f(x) = inf y∈∂Bd 1 (f(y) +∥x−y∥2). We will writef for the extended function. Now letx′,S′be independent copies ofx,S. By the triangle ...
work page 2024
-
[3]
Samplesc∼µandζ◦uniformly from∂Bd 2
-
[4]
Queries the function at two perturbed points: y=f c(x+hζ), y ′=f c(x−hζ), whereh>0
-
[5]
Forms the estimator g◦= d 2h(y−y′)ζ◦.(38) Similar to the justification for constructing gradient estimators based onℓ1-randomisation, the construction in(38) can be motivated via Stokes’ theorem and the fact thatg◦is an unbiased gradient estimator for the following surrogate function: f◦ h(x) =E[f(x+hU)], whereUis uniformly generated fromB d 2. As Proposi...
work page 2018
-
[6]
Proposition E.2.Suppose Assumption 3.2 holds
For every1-Lipschitz functionfon∂B d 2, allr>0, and allδ∈(0,1), P(|f(X)−Ef(X)|>r)≤exp{−cdr2},(39) wherec>0is a universal constant that is independent ofd. Proposition E.2.Suppose Assumption 3.2 holds. Fort∈[n]and j∈[m], letxt be the iterates ofFedZero, equipped with theℓ2-randomised gradient estimators and letg◦ j,t. Then, for allk≥2, the random variables...
work page 2020
-
[7]
Let ζ be a random variable that is uniformly generated from∂Bd
-
[8]
IfΘ ⊂Rd is convex set, andf is a convex function thenfh is convex onΘandf h(x)≥f(x)for allx∈Θ
Then E [ d 2h(fc(x+hζ)−fc(x−hζ)) sign(ζ) ] =∇fh(x).(40) Moreover, for alld≥3andx∈Rd we have |fh(x)−f(x)|≤bq(d)Lh,(41) whereb q(d)is defined in (4). IfΘ ⊂Rd is convex set, andf is a convex function thenfh is convex onΘandf h(x)≥f(x)for allx∈Θ. Lemma F.2(Lemma 4 Akhavan et al. (2022)).Define the filtrationF= (Ft)t∈Nsuch thatF t =∪m j=1∪t k=1{ζj,k, xk+1}, an...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.