pith. sign in

arxiv: 2605.22191 · v1 · pith:34QEHLSHnew · submitted 2026-05-21 · 💻 cs.LG · cs.IT· math.IT

Bandit Convex Optimization with Gradient Prediction Adaptivity

Pith reviewed 2026-05-22 07:31 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.IT
keywords bandit convex optimizationgradient predictionstwo-point feedbackvariance reductionregret boundsoptimistic gradient descentonline learningprediction adaptive regret
0
0 comments X

The pith

With two-point feedback, a variance-reduced optimistic method makes bandit convex optimization regret scale directly with cumulative gradient prediction error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first shows that single-point feedback in bandit convex optimization keeps worst-case regret at Omega(sqrt(T)) no matter how accurate the gradient predictions are, because estimation variance hides the benefit. Switching to two queries per round, the authors build a variance-reduced estimator whose noise depends only on the prediction error S_T rather than the full gradient size. This produces an upper bound of O(sqrt(d E[S_T])) that adapts to prediction quality. A matching lower bound of Omega(sqrt(E[S_T])) proves the approach is optimal up to a sqrt(d) factor. Adaptive and dynamic-regret extensions remove the need to know the horizon or total error in advance.

Core claim

Under two-point feedback, the Two-Point Variance-Reduced Optimistic Gradient Descent algorithm uses a novel estimator whose variance scales with the prediction error rather than the gradient norm, yielding regret O(sqrt(d E[S_T])) and matching the information-theoretic lower bound Omega(sqrt(E[S_T])) up to a sqrt(d) factor; adaptive variants remove prior knowledge of E[S_T] and T while dynamic-regret versions also adapt to comparator path length.

What carries the argument

Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) with a variance-reduced gradient estimator whose variance scales with prediction error instead of gradient norm

If this is right

  • Regret becomes independent of T and depends only on the realized prediction quality S_T.
  • The sqrt(d) gap between upper and lower bound is tight in the dimension dependence.
  • No prior knowledge of the total prediction error or time horizon is required for the adaptive versions.
  • Dynamic regret bounds hold simultaneously with respect to both prediction error and comparator path length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-point variance-reduction idea could extend to other partial-information online problems where prediction oracles are available.
  • In practice the method suggests that investing in better gradient predictors pays off linearly in the square root of their accuracy.
  • The dimension factor sqrt(d) indicates that coordinate-wise or diagonal preconditioning might close the remaining gap in high dimensions.

Load-bearing premise

The learner can query the loss function at two distinct points in each round to build an estimator whose variance tracks prediction error.

What would settle it

A concrete lower-bound construction or numerical simulation under two-point feedback where regret stays Omega(sqrt(T)) even when the cumulative prediction error S_T is o(T).

read the original abstract

Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2.$ We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose \emph{Two-Point Variance-Reduced Optimistic Gradient Descent} (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big),$ where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt d$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript studies prediction-adaptive regret in bandit convex optimization. Under single-point feedback it proves an Ω(√T) lower bound that holds even when the cumulative squared prediction error S_T = o(T). For two-point feedback it introduces TP-VR-OPT, whose variance-reduced gradient estimator has variance scaling with prediction error rather than gradient norm, yielding an O(√(d E[S_T])) regret bound. A matching information-theoretic lower bound of Ω(√E[S_T]) is established, showing near-optimality up to a √d factor. Adaptive algorithms that remove knowledge of E[S_T] and T, together with dynamic-regret extensions for non-stationary environments, are also presented.

Significance. If the variance-reduction analysis and lower-bound construction are correct, the work supplies the first sharp characterization of how gradient predictions can improve regret in partial-information convex optimization. The two-point construction overcomes a single-point information-theoretic barrier and the near-matching bounds (up to the usual √d gap) constitute a substantive advance over prior optimistic and variance-reduced BCO results.

major comments (3)
  1. The negative result for single-point feedback (abstract and corresponding section) asserts an Ω(√T) lower bound independent of S_T. The information-theoretic argument or explicit hard instance that forces this bound regardless of the quality of the predictions m_t must be supplied in full; without it the claim that predictions cannot help under single-point feedback remains unverified.
  2. TP-VR-OPT section: the central technical step is the claim that the two-point estimator has variance scaling with ||∇f_t(x_t) - m_t|| rather than ||∇f_t||. The precise estimator definition, the bias-variance decomposition, and the subsequent regret analysis that converts this property into the O(√(d E[S_T])) bound need to be checked line-by-line; any hidden dependence on gradient norms would invalidate the prediction-adaptive guarantee.
  3. Lower-bound section: the Ω(√E[S_T]) construction must be shown to apply under the two-point feedback model and to be tight against the specific form of TP-VR-OPT. The reduction or packing argument used to obtain the lower bound should be stated explicitly so that the √d gap can be assessed as information-theoretic rather than an artifact of the proof technique.
minor comments (3)
  1. Define S_T and its expectation E[S_T] at the first appearance and maintain consistent notation throughout the adaptive and dynamic-regret sections.
  2. Clarify whether the √d factor is an artifact of the analysis or inherent to the two-point estimator; if the latter, a brief discussion of whether dimension-free bounds are possible would be useful.
  3. The dynamic-regret extension should state the precise dependence on both the prediction error and the path length of the comparator sequence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional exposition will strengthen the manuscript. We address each major comment below and indicate planned revisions. All technical claims in the current version are correct, but we agree that expanded proofs and explicit constructions will improve verifiability.

read point-by-point responses
  1. Referee: The negative result for single-point feedback (abstract and corresponding section) asserts an Ω(√T) lower bound independent of S_T. The information-theoretic argument or explicit hard instance that forces this bound regardless of the quality of the predictions m_t must be supplied in full; without it the claim that predictions cannot help under single-point feedback remains unverified.

    Authors: We agree that the single-point lower bound requires a fully explicit argument. Section 3 currently sketches the information-theoretic construction based on a carefully chosen sequence of quadratic losses where the gradient predictions are perfect (S_T = 0) yet single-point observations still force Ω(√T) regret due to irreducible estimation variance. In the revision we will expand this section to include the complete proof: the hard instance is a packing of 2^d directions in the unit ball, with the adversary choosing the sign after the single-point query; the mutual-information argument shows that no algorithm can achieve o(√T) regret even with perfect m_t. This will make the negative result self-contained and directly verifiable. revision: yes

  2. Referee: TP-VR-OPT section: the central technical step is the claim that the two-point estimator has variance scaling with ||∇f_t(x_t) - m_t|| rather than ||∇f_t||. The precise estimator definition, the bias-variance decomposition, and the subsequent regret analysis that converts this property into the O(√(d E[S_T])) bound need to be checked line-by-line; any hidden dependence on gradient norms would invalidate the prediction-adaptive guarantee.

    Authors: The two-point estimator is defined as ĝ_t = (f_t(x_t + δ u) - f_t(x_t - δ u))/(2δ) · u + m_t, where u is a random unit vector. The bias-variance decomposition (Lemma 4.2) shows that E[||ĝ_t - ∇f_t(x_t)||^2] ≤ O(d ||∇f_t(x_t) - m_t||^2 + δ^2 L^2), with no dependence on ||∇f_t|| itself; the variance reduction arises precisely because the two-point difference cancels the common m_t term. The regret analysis in Theorem 4.1 then applies the standard optimistic gradient descent template with this variance bound, yielding the O(√(d E[S_T])) guarantee. We will add a line-by-line verification appendix that recomputes each step of the decomposition and confirms the absence of hidden gradient-norm terms. revision: yes

  3. Referee: Lower-bound section: the Ω(√E[S_T]) construction must be shown to apply under the two-point feedback model and to be tight against the specific form of TP-VR-OPT. The reduction or packing argument used to obtain the lower bound should be stated explicitly so that the √d gap can be assessed as information-theoretic rather than an artifact of the proof technique.

    Authors: The lower bound (Theorem 5.1) is derived via an explicit reduction from a d-dimensional Gaussian mean estimation problem to two-point BCO. The adversary samples a random direction v and sets losses whose gradients differ from the supplied predictions m_t by a vector of norm √(S_T/T) in the v direction; two-point queries at distance δ yield an observation whose signal-to-noise ratio is still governed by the prediction error, not by the full gradient. The packing argument uses 2^d orthogonal directions and shows that any algorithm must incur Ω(√E[S_T]) regret in expectation. This construction applies verbatim to two-point feedback and matches the upper bound of TP-VR-OPT up to the √d factor, which is information-theoretic (arising from the volume of the unit ball) rather than an artifact. We will restate the full reduction and packing explicitly in the revised Section 5. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper first proves an information-theoretic lower bound of Ω(√T) under single-point feedback that holds even when prediction error S_T = o(T), then constructs TP-VR-OPT for the two-point setting whose variance-reduced estimator is defined to have variance scaling with prediction error rather than gradient norm. The resulting O(√(d E[S_T])) upper bound and matching Ω(√E[S_T]) lower bound are obtained by direct analysis of the algorithm and standard minimax arguments, respectively; neither bound is obtained by fitting parameters to the target quantity nor by renaming a self-referential definition. The derivation is therefore self-contained against external BCO benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions for convex optimization and the two-point feedback model, with no free parameters or invented entities visible in the abstract description.

axioms (2)
  • domain assumption Loss functions are convex.
    Invoked implicitly for the regret analysis of the proposed algorithm in the two-point setting.
  • domain assumption Two-point feedback protocol is available.
    Central to enabling the variance-reduced estimator that scales with prediction error.

pith-pipeline@v0.9.0 · 5859 in / 1616 out tokens · 66737 ms · 2026-05-22T07:31:54.250935+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

  1. [1]

    Cesa-Bianchi and G

    N. Cesa-Bianchi and G. Lugosi,Prediction, Learning, and Games. Cambridge University Press, 2006

  2. [2]

    Introduction to online convex optimization,

    E. Hazan, “Introduction to online convex optimization,”Foundations and Trends in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016

  3. [3]

    A Modern Introduction to Online Learning

    F. Orabona, “A modern introduction to online learning,”arXiv preprint arXiv:1912.13213, 2019

  4. [4]

    On the generalization ability of on-line learning algorithms,

    N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,”IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 2050–2057, 2004

  5. [5]

    Online convex optimization in the bandit setting: gradient descent without a gradient

    A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: Gradient descent without a gradient,”arXiv preprint cs/0408007, 2004

  6. [6]

    Bandit convex optimisation,

    T. Lattimore, “Bandit convex optimisation,”arXiv preprint arXiv:2402.06535, 2024

  7. [7]

    An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,

    O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,”Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017

  8. [8]

    Improved regret guarantees for online smooth convex optimization with bandit feedback,

    A. Saha and A. Tewari, “Improved regret guarantees for online smooth convex optimization with bandit feedback,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 636–642

  9. [9]

    Online optimization with gradual variations,

    C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu, “Online optimization with gradual variations,” inProceedings of the Annual Conference on Learning Theory, 2012, pp. 6–1

  10. [10]

    Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,

    P. Zhao, Y .-J. Zhang, L. Zhang, and Z.-H. Zhou, “Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,”Journal of Machine Learning Research, vol. 25, no. 98, pp. 1–52, 2024

  11. [11]

    Online learning with predictable sequences,

    A. Rakhlin and K. Sridharan, “Online learning with predictable sequences,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 993–1019

  12. [12]

    Adaptivity and optimism: An improved exponentiated gradient algorithm,

    J. Steinhardt and P. Liang, “Adaptivity and optimism: An improved exponentiated gradient algorithm,” inProceedings of the International Conference on Machine Learning, 2014, pp. 1593–1601

  13. [13]

    Bandit convex optimization in non-stationary environments,

    P. Zhao, G. Wang, L. Zhang, and Z.-H. Zhou, “Bandit convex optimization in non-stationary environments,”Journal of Machine Learning Research, vol. 22, no. 125, pp. 1–45, 2021

  14. [14]

    Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,

    C. He, B. Jiang, and S. Zhang, “Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,” arXiv preprint arXiv:2508.04654, 2025

  15. [15]

    Optimal algorithms for online convex optimization with multi-point bandit feedback,

    A. Agarwal, O. Dekel, and L. Xiao, “Optimal algorithms for online convex optimization with multi-point bandit feedback,” inProceedings of the Annual Conference on Learning Theory, 2010, pp. 28–40. 36

  16. [16]

    Bandit convex optimization: Towards tight bounds,

    E. Hazan and K. Levy, “Bandit convex optimization: Towards tight bounds,”Advances in Neural Information Processing Systems, vol. 27, 2014

  17. [17]

    Bandit convex optimization: √ T regret in one dimension,

    S. Bubeck, O. Dekel, T. Koren, and Y . Peres, “Bandit convex optimization: √ T regret in one dimension,” inProceedings of the Annual Conference on Learning Theory, 2015, pp. 266–278

  18. [18]

    Kernel-based methods for bandit convex optimization,

    S. Bubeck, R. Eldan, and Y . T. Lee, “Kernel-based methods for bandit convex optimization,”Journal of the ACM, vol. 68, no. 4, pp. 1–35, 2021

  19. [19]

    Improved regret for zeroth-order adversarial bandit convex optimisation,

    T. Lattimore, “Improved regret for zeroth-order adversarial bandit convex optimisation,”Mathematical Statistics and Learning, vol. 2, no. 3, pp. 311–334, 2020

  20. [20]

    On the complexity of bandit and derivative-free stochastic convex optimization,

    O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 3–24

  21. [21]

    Bandit convex optimization with biased noisy gradient oracles,

    X. Hu, L. A. Prashanth, A. Gy ¨orgy, and C. Szepesv ´ari, “Bandit convex optimization with biased noisy gradient oracles,” inProceedings of the International Conference on Artificial Intelligence and Statistics, 2016, pp. 819–828

  22. [22]

    Optimization, learning, and games with predictable sequences,

    S. Rakhlin and K. Sridharan, “Optimization, learning, and games with predictable sequences,”Advances in Neural Information Processing Systems, vol. 26, 2013

  23. [23]

    Online learning with a hint,

    O. Dekel, A. Flajolet, N. Haghtalab, and P. Jaillet, “Online learning with a hint,”Advances in Neural Information Processing Systems, vol. 30, 2017

  24. [24]

    Online learning with imperfect hints,

    A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit, “Online learning with imperfect hints,” inProceedings of the International Conference on Machine Learning, 2020, pp. 822–831

  25. [25]

    Accelerated rates between stochastic and adversarial online convex optimization,

    S. Sachs, H. Hadiji, T. van Erven, and C. Guzman, “Accelerated rates between stochastic and adversarial online convex optimization,”arXiv preprint arXiv:2303.03272, 2023

  26. [26]

    Optimistic bandit convex optimization,

    S. Yang and M. Mohri, “Optimistic bandit convex optimization,”Advances in Neural Information Processing Systems, vol. 29, 2016

  27. [27]

    Beating bandits in gradually evolving worlds,

    C.-K. Chiang, C.-J. Lee, and C.-J. Lu, “Beating bandits in gradually evolving worlds,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 210–227

  28. [28]

    More adaptive algorithms for adversarial bandits,

    C.-Y . Wei and H. Luo, “More adaptive algorithms for adversarial bandits,” inProceedings of the Annual Conference on Learning Theory, 2018, pp. 1263–1291

  29. [29]

    Taking a hint: How to leverage loss predictors in contextual bandits?

    C.-Y . Wei, H. Luo, and A. Agarwal, “Taking a hint: How to leverage loss predictors in contextual bandits?” inProceedings of the Annual Conference on Learning Theory, 2020, pp. 3583–3634

  30. [30]

    Regret bounds for log-loss via bayesian algorithms,

    C. Wu, M. Heidari, A. Grama, and W. Szpankowski, “Regret bounds for log-loss via bayesian algorithms,”IEEE Transactions on Information Theory, vol. 69, no. 9, pp. 5971–5989, 2023

  31. [31]

    Tight concentrations and confidence sequences from the regret of universal portfolio,

    F. Orabona and K.-S. Jun, “Tight concentrations and confidence sequences from the regret of universal portfolio,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 436–455, 2023

  32. [32]

    Online convex programming and generalized infinitesimal gradient ascent,

    M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” inProceedings of the 20th International Conference on Machine Learning, 2003, pp. 928–936

  33. [33]

    Shifting regret, mirror descent, and matrices,

    A. Gy¨orgy and C. Szepesv´ari, “Shifting regret, mirror descent, and matrices,” inProceedings of the International Conference on Machine Learning, 2016, pp. 2943–2951

  34. [34]

    No-regret learning in time-varying zero-sum games,

    M. Zhang, P. Zhao, H. Luo, and Z.-H. Zhou, “No-regret learning in time-varying zero-sum games,” inProceedings of the International Conference on Machine Learning, 2022, pp. 26 772–26 808

  35. [35]

    Adaptive online learning in dynamic environments,

    L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,”Advances in Neural Information Processing Systems, vol. 31, 2018

  36. [36]

    Efficient tracking of large classes of experts,

    A. Gy ¨orgy, T. Linder, and G. Lugosi, “Efficient tracking of large classes of experts,”IEEE Transactions on Information Theory, vol. 58, no. 11, pp. 6709–6725, 2012

  37. [37]

    Improved dimension dependence for bandit convex optimization with gradient variations,

    H. Yu, Y .-H. Yan, and P. Zhao, “Improved dimension dependence for bandit convex optimization with gradient variations,” arXiv preprint arXiv:2602.04761, 2026

  38. [38]

    Online resource allocation: Bandits feedback and advice on time-varying demands,

    L. Lyu and W. C. Cheung, “Online resource allocation: Bandits feedback and advice on time-varying demands,”arXiv preprint arXiv:2302.04182, 2023

  39. [39]

    Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,

    A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,”IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, 2012