Recognition: 2 theorem links
· Lean Theorem
Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3
The pith
Two-point bandit feedback achieves the first optimal high-probability regret bound O(d (log T + log(1/δ))/μ) for μ-strongly convex losses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper derives the first high-probability regret bound of O(d(log T + log(1/δ))/μ) for online convex optimization with two-point bandit feedback on μ-strongly convex losses. This bound is shown to be minimax optimal with respect to both the time horizon T and the dimension d, resolving the challenge posed by the heavy-tailed distribution of the bandit gradient estimators.
What carries the argument
The two-point gradient estimator together with a new concentration argument that controls its heavy tails under only the standard assumptions of μ-strong convexity and bounded domain.
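The machinery can be made concrete with a minimal sketch of the standard two-point (sphere-smoothing) gradient estimator. This is the textbook construction, not code from the paper; the query radius, test function, and sample count are illustration choices only.

```python
import numpy as np

def two_point_gradient(f, x, delta, rng):
    """One draw of the two-point estimator: sample u uniformly on the
    unit sphere and return d/(2*delta) * (f(x+delta*u) - f(x-delta*u)) * u."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform direction on the sphere
    return d / (2 * delta) * (f(x + delta * u) - f(x - delta * u)) * u

# Sanity check on a strongly convex quadratic f(x) = ||x||^2, whose true
# gradient is 2x; averaging many draws should approach [2, -4, 1].
rng = np.random.default_rng(0)
f = lambda z: float(z @ z)
x = np.array([1.0, -2.0, 0.5])
est = np.mean([two_point_gradient(f, x, 1e-4, rng) for _ in range(20000)], axis=0)
print(est)
```

A single draw can have norm on the order of d times the gradient scale, which is exactly the heavy-tail behavior the concentration argument has to control.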
If this is right
- The high-probability regret scales linearly in dimension d and only logarithmically in T, matching the information-theoretic lower bound.
- No extra assumptions such as bounded gradients or sub-Gaussian noise are required.
- The same bound applies in fully adversarial settings as long as each loss remains μ-strongly convex.
- Reliable algorithms become available for online decision problems where only two-point evaluations are feasible.
Where Pith is reading between the lines
- The same tail-control technique could be reused for other estimators whose variance grows with the query distance.
- Two-point feedback now matches full-information feedback for high-probability analysis on strongly convex problems.
- The result suggests that dimension dependence cannot be removed even with stronger feedback when high-probability guarantees are demanded.
Load-bearing premise
The new concentration argument must control the heavy-tailed distribution of the two-point gradient estimator without post-hoc exclusions or assumptions beyond μ-strong convexity and bounded domain.
What would settle it
A sequence of μ-strongly convex losses on a bounded domain for which the two-point estimator produces regret exceeding O(d log T / μ) with probability larger than δ.
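A toy harness one could use to probe for such a counterexample is sketched below: projected online gradient descent driven by the two-point estimator on μ-strongly convex quadratics over a bounded ball. This is a generic baseline under assumed parameters (T, d, μ, R, the 1/(μt) step size), not the paper's algorithm.

```python
import numpy as np

def run_two_point_oco(T=2000, d=5, mu=1.0, R=1.0, seed=0):
    """Regret of projected bandit gradient descent with the two-point
    estimator on losses f_t(x) = (mu/2)||x - theta_t||^2 over the R-ball."""
    rng = np.random.default_rng(seed)
    center = np.full(d, 0.5 / np.sqrt(d))              # targets cluster here
    thetas = center + rng.uniform(-0.2, 0.2, size=(T, d))
    x = np.zeros(d)
    realized = 0.0
    for t in range(1, T + 1):
        theta = thetas[t - 1]
        f = lambda z: 0.5 * mu * np.sum((z - theta) ** 2)
        realized += f(x)                               # loss at the played point
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        delta = min(0.1, 1.0 / t)                      # shrinking query radius
        g = d / (2 * delta) * (f(x + delta * u) - f(x - delta * u)) * u
        x = x - g / (mu * t)                           # strongly convex step size
        nrm = np.linalg.norm(x)
        if nrm > R:                                    # project onto the R-ball
            x *= R / nrm
    x_star = thetas.mean(axis=0)                       # best fixed action in hindsight
    best = 0.5 * mu * np.sum((x_star - thetas) ** 2)
    return realized - best

regret = run_two_point_oco()
print(f"regret over T=2000 rounds: {regret:.1f}")
```

Repeating this over many seeds and tracking the upper tail of the regret distribution is the empirical analogue of the test described above: a family of losses whose tail regret grows faster than d log T / μ would challenge the claim.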
read the original abstract
We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well-known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions still remained open as highlighted by \citet{agarwal2010optimal}. The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge and provide the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript considers online convex optimization with two-point bandit feedback for a sequence of adversarially chosen μ-strongly convex losses over a bounded domain. It claims to resolve the open problem of obtaining tight high-probability regret bounds by deriving the first such bound of O(d(log T + log(1/δ))/μ), which is minimax optimal in both T and d. The key technical step is a new concentration argument that controls the heavy-tailed two-point gradient estimator [f(x+δu)−f(x−δu)]/(2δ)·u without additional assumptions beyond strong convexity.
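For orientation, the unbiasedness identity behind estimators of this form is the classical sphere-smoothing lemma (a textbook fact, not a result of the paper under review); note the dimension factor d, which is sometimes absorbed into the estimator's normalization and is precisely what inflates the tails:

```latex
\hat f_\delta(x) = \mathbb{E}_{v \sim \mathrm{Unif}(\mathbb{B}^{d})}\big[f(x + \delta v)\big],
\qquad
\nabla \hat f_\delta(x)
  = \frac{d}{2\delta}\,
    \mathbb{E}_{u \sim \mathrm{Unif}(\mathbb{S}^{d-1})}
    \big[\big(f(x+\delta u) - f(x-\delta u)\big)\,u\big].
```

Each single draw is unbiased for the smoothed gradient, but its norm scales with d/(2δ) times a function difference, which is the heavy-tail source the report highlights.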
Significance. If the concentration step is valid, the result closes the gap left by Agarwal et al. (2010) and supplies the first high-probability bound that is simultaneously optimal in horizon and dimension for the two-point setting. The derivation appears parameter-free once the estimator is controlled, which would be a notable technical advance for bandit convex optimization.
major comments (2)
- §4, Theorem 3.1 and the proof of Lemma 4.3: the claimed high-probability bound rests on a new tail bound for the two-point estimator that must hold with only μ-strong convexity and bounded domain. The manuscript must explicitly state the moment or tail condition used and show that no path-dependent truncation or extra Lipschitz constant is introduced; otherwise the O(1/μ) dependence cannot be guaranteed under the stated hypotheses.
- §5.1, Eq. (18): the variance proxy for the estimator appears to be bounded by a quantity that still depends on the diameter of the domain and the strong-convexity parameter; confirm that this does not re-introduce a hidden dependence that would prevent the final regret from being O(d log T / μ).
minor comments (2)
- Abstract: the phrase 'still remained open' should read 'still remains open'.
- §3, Notation: the random direction u is introduced in §3 but its distribution (uniform on the sphere?) is not restated in the main theorem statement; add a one-line reminder for readability.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We appreciate the recognition of the significance of our result in closing the gap for high-probability bounds in two-point bandit OCO. Below, we address each major comment point by point. We will revise the manuscript to incorporate the requested clarifications.
read point-by-point responses
- Referee: §4, Theorem 3.1 and the proof of Lemma 4.3: the claimed high-probability bound rests on a new tail bound for the two-point estimator that must hold with only μ-strong convexity and bounded domain. The manuscript must explicitly state the moment or tail condition used and show that no path-dependent truncation or extra Lipschitz constant is introduced; otherwise the O(1/μ) dependence cannot be guaranteed under the stated hypotheses.
Authors: We thank the referee for pointing this out. In the proof of Lemma 4.3, the tail bound for the two-point estimator [f(x+δu)−f(x−δu)]/(2δ)·u is derived directly from the μ-strong convexity and the boundedness of the domain, without requiring additional Lipschitz assumptions or path-dependent truncation. Specifically, strong convexity implies that the estimator has sub-exponential tails with parameters depending only on μ and the diameter D, leading to the concentration that yields the O(1/μ) factor. We will explicitly state the moment condition (sub-exponential with variance proxy O(D²/μ)) in the revised manuscript and add a remark clarifying the absence of truncation or extra constants. This ensures the bound holds under the stated hypotheses. (Revision: yes.)
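For context, sub-exponential concentration of the kind the response appeals to is typically deployed through a Bernstein-type inequality. A generic independent-increment version is sketched below with hypothetical parameters ν (variance proxy) and b (tail scale); the paper's actual lemma is presumably a martingale variant and may differ in constants.

```latex
% X_1,\dots,X_T independent, mean-zero, sub-exponential with parameters (\nu, b):
\Pr\!\left[\Big|\textstyle\sum_{t=1}^{T} X_t\Big| \ge \varepsilon\right]
  \le 2\exp\!\left(-\tfrac{1}{2}\min\!\left(\frac{\varepsilon^{2}}{T\nu^{2}},\,
      \frac{\varepsilon}{b}\right)\right),
\quad\text{so with probability } 1-\delta:\;
\Big|\sum_{t=1}^{T} X_t\Big|
  \le \nu\sqrt{2T\log(2/\delta)} + 2b\log(2/\delta).
```

In the strongly convex regret analysis the √T-type term is cancelled by the 1/(μt) step sizes, leaving the additive log(1/δ) that appears in the claimed bound.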
- Referee: §5.1, Eq. (18): the variance proxy for the estimator appears to be bounded by a quantity that still depends on the diameter of the domain and the strong-convexity parameter; confirm that this does not re-introduce a hidden dependence that would prevent the final regret from being O(d log T / μ).
Authors: We confirm that while the variance proxy in Eq. (18) depends on the domain diameter D and μ, this dependence is absorbed into the O(1/μ) term in the final regret bound without introducing additional factors that degrade the optimality. The regret analysis in Section 5 integrates this proxy such that the overall bound remains O(d (log T + log(1/δ)) / μ), as the D dependence is constant and does not affect the scaling in T and d. We will add a detailed calculation in the revision showing how the proxy leads to the claimed bound without hidden dependencies. (Revision: yes.)
Circularity Check
No circularity; new concentration argument is independent of inputs
full rationale
The paper derives a high-probability regret bound O(d(log T + log(1/δ))/μ) for μ-strongly convex losses under two-point bandit feedback by introducing a concentration inequality that controls the heavy tails of the estimator [f(x+δu)−f(x−δu)]/(2δ)·u. This step is not self-definitional, does not rename a fitted parameter as a prediction, and does not rely on load-bearing self-citations; the cited open problem (Agarwal et al. 2010) is external and the claimed minimax optimality references a matching lower bound outside the present derivation. All load-bearing steps remain self-contained against the stated assumptions of strong convexity and bounded domain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: loss functions are μ-strongly convex (and hence convex)
- domain assumption: two-point bandit feedback yields the value of each loss at two query points per round
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We resolve this open challenge and provide the first high-probability regret bound of O(d(log T + log(1/δ))/μ) for μ-strongly convex losses."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "The primary challenge lies in the heavy-tailed nature of bandit gradient estimators."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- High-Probability Guarantees for Random Zeroth-Order (Stochastic) Gradient Descent
Random zeroth-order gradient descent reaches ε-suboptimal solutions with probability 1-δ using O((dL/μ)log(1/ε) + log(1/δ)) queries deterministically and O(d log(1/ε)(log(1/ε)+log(1/δ))/ε) queries under bounded stocha...
discussion (0)