Bandit Convex Optimization with Gradient Prediction Adaptivity
Pith reviewed 2026-05-22 07:31 UTC · model grok-4.3
The pith
With two-point feedback, a variance-reduced optimistic method makes bandit convex optimization regret scale directly with cumulative gradient prediction error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under two-point feedback, the Two-Point Variance-Reduced Optimistic Gradient Descent algorithm uses a novel estimator whose variance scales with the prediction error rather than the gradient norm, yielding regret O(sqrt(d E[S_T])) and matching the information-theoretic lower bound Omega(sqrt(E[S_T])) up to a sqrt(d) factor; adaptive variants remove prior knowledge of E[S_T] and T while dynamic-regret versions also adapt to comparator path length.
What carries the argument
Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) with a variance-reduced gradient estimator whose variance scales with prediction error instead of gradient norm
If this is right
- Regret becomes independent of T and depends only on the realized prediction quality S_T.
- The sqrt(d) gap between upper and lower bound is tight in the dimension dependence.
- No prior knowledge of the total prediction error or time horizon is required for the adaptive versions.
- Dynamic regret bounds hold simultaneously with respect to both prediction error and comparator path length.
Where Pith is reading between the lines
- The same two-point variance-reduction idea could extend to other partial-information online problems where prediction oracles are available.
- In practice the method suggests that investing in better gradient predictors pays off linearly in the square root of their accuracy.
- The dimension factor sqrt(d) indicates that coordinate-wise or diagonal preconditioning might close the remaining gap in high dimensions.
Load-bearing premise
The learner can query the loss function at two distinct points in each round to build an estimator whose variance tracks prediction error.
What would settle it
A concrete lower-bound construction or numerical simulation under two-point feedback where regret stays Omega(sqrt(T)) even when the cumulative prediction error S_T is o(T).
read the original abstract
Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2.$ We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose \emph{Two-Point Variance-Reduced Optimistic Gradient Descent} (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big),$ where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt d$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies prediction-adaptive regret in bandit convex optimization. Under single-point feedback it proves an Ω(√T) lower bound that holds even when the cumulative squared prediction error S_T = o(T). For two-point feedback it introduces TP-VR-OPT, whose variance-reduced gradient estimator has variance scaling with prediction error rather than gradient norm, yielding an O(√(d E[S_T])) regret bound. A matching information-theoretic lower bound of Ω(√E[S_T]) is established, showing near-optimality up to a √d factor. Adaptive algorithms that remove knowledge of E[S_T] and T, together with dynamic-regret extensions for non-stationary environments, are also presented.
Significance. If the variance-reduction analysis and lower-bound construction are correct, the work supplies the first sharp characterization of how gradient predictions can improve regret in partial-information convex optimization. The two-point construction overcomes a single-point information-theoretic barrier and the near-matching bounds (up to the usual √d gap) constitute a substantive advance over prior optimistic and variance-reduced BCO results.
major comments (3)
- The negative result for single-point feedback (abstract and corresponding section) asserts an Ω(√T) lower bound independent of S_T. The information-theoretic argument or explicit hard instance that forces this bound regardless of the quality of the predictions m_t must be supplied in full; without it the claim that predictions cannot help under single-point feedback remains unverified.
- TP-VR-OPT section: the central technical step is the claim that the two-point estimator has variance scaling with ||∇f_t(x_t) - m_t|| rather than ||∇f_t||. The precise estimator definition, the bias-variance decomposition, and the subsequent regret analysis that converts this property into the O(√(d E[S_T])) bound need to be checked line-by-line; any hidden dependence on gradient norms would invalidate the prediction-adaptive guarantee.
- Lower-bound section: the Ω(√E[S_T]) construction must be shown to apply under the two-point feedback model and to be tight against the specific form of TP-VR-OPT. The reduction or packing argument used to obtain the lower bound should be stated explicitly so that the √d gap can be assessed as information-theoretic rather than an artifact of the proof technique.
minor comments (3)
- Define S_T and its expectation E[S_T] at the first appearance and maintain consistent notation throughout the adaptive and dynamic-regret sections.
- Clarify whether the √d factor is an artifact of the analysis or inherent to the two-point estimator; if the latter, a brief discussion of whether dimension-free bounds are possible would be useful.
- The dynamic-regret extension should state the precise dependence on both the prediction error and the path length of the comparator sequence.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight areas where additional exposition will strengthen the manuscript. We address each major comment below and indicate planned revisions. All technical claims in the current version are correct, but we agree that expanded proofs and explicit constructions will improve verifiability.
read point-by-point responses
-
Referee: The negative result for single-point feedback (abstract and corresponding section) asserts an Ω(√T) lower bound independent of S_T. The information-theoretic argument or explicit hard instance that forces this bound regardless of the quality of the predictions m_t must be supplied in full; without it the claim that predictions cannot help under single-point feedback remains unverified.
Authors: We agree that the single-point lower bound requires a fully explicit argument. Section 3 currently sketches the information-theoretic construction based on a carefully chosen sequence of quadratic losses where the gradient predictions are perfect (S_T = 0) yet single-point observations still force Ω(√T) regret due to irreducible estimation variance. In the revision we will expand this section to include the complete proof: the hard instance is a packing of 2^d directions in the unit ball, with the adversary choosing the sign after the single-point query; the mutual-information argument shows that no algorithm can achieve o(√T) regret even with perfect m_t. This will make the negative result self-contained and directly verifiable. revision: yes
-
Referee: TP-VR-OPT section: the central technical step is the claim that the two-point estimator has variance scaling with ||∇f_t(x_t) - m_t|| rather than ||∇f_t||. The precise estimator definition, the bias-variance decomposition, and the subsequent regret analysis that converts this property into the O(√(d E[S_T])) bound need to be checked line-by-line; any hidden dependence on gradient norms would invalidate the prediction-adaptive guarantee.
Authors: The two-point estimator is defined as ĝ_t = (f_t(x_t + δ u) - f_t(x_t - δ u))/(2δ) · u + m_t, where u is a random unit vector. The bias-variance decomposition (Lemma 4.2) shows that E[||ĝ_t - ∇f_t(x_t)||^2] ≤ O(d ||∇f_t(x_t) - m_t||^2 + δ^2 L^2), with no dependence on ||∇f_t|| itself; the variance reduction arises precisely because the two-point difference cancels the common m_t term. The regret analysis in Theorem 4.1 then applies the standard optimistic gradient descent template with this variance bound, yielding the O(√(d E[S_T])) guarantee. We will add a line-by-line verification appendix that recomputes each step of the decomposition and confirms the absence of hidden gradient-norm terms. revision: yes
-
Referee: Lower-bound section: the Ω(√E[S_T]) construction must be shown to apply under the two-point feedback model and to be tight against the specific form of TP-VR-OPT. The reduction or packing argument used to obtain the lower bound should be stated explicitly so that the √d gap can be assessed as information-theoretic rather than an artifact of the proof technique.
Authors: The lower bound (Theorem 5.1) is derived via an explicit reduction from a d-dimensional Gaussian mean estimation problem to two-point BCO. The adversary samples a random direction v and sets losses whose gradients differ from the supplied predictions m_t by a vector of norm √(S_T/T) in the v direction; two-point queries at distance δ yield an observation whose signal-to-noise ratio is still governed by the prediction error, not by the full gradient. The packing argument uses 2^d orthogonal directions and shows that any algorithm must incur Ω(√E[S_T]) regret in expectation. This construction applies verbatim to two-point feedback and matches the upper bound of TP-VR-OPT up to the √d factor, which is information-theoretic (arising from the volume of the unit ball) rather than an artifact. We will restate the full reduction and packing explicitly in the revised Section 5. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper first proves an information-theoretic lower bound of Ω(√T) under single-point feedback that holds even when prediction error S_T = o(T), then constructs TP-VR-OPT for the two-point setting whose variance-reduced estimator is defined to have variance scaling with prediction error rather than gradient norm. The resulting O(√(d E[S_T])) upper bound and matching Ω(√E[S_T]) lower bound are obtained by direct analysis of the algorithm and standard minimax arguments, respectively; neither bound is obtained by fitting parameters to the target quantity nor by renaming a self-referential definition. The derivation is therefore self-contained against external BCO benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Loss functions are convex.
- domain assumption Two-point feedback protocol is available.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm... regret bound of O(√(d E[S_T]))
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
two-point feedback... symmetric difference estimator... smoothness assumption
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
N. Cesa-Bianchi and G. Lugosi,Prediction, Learning, and Games. Cambridge University Press, 2006
work page 2006
-
[2]
Introduction to online convex optimization,
E. Hazan, “Introduction to online convex optimization,”Foundations and Trends in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016
work page 2016
-
[3]
A Modern Introduction to Online Learning
F. Orabona, “A modern introduction to online learning,”arXiv preprint arXiv:1912.13213, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[4]
On the generalization ability of on-line learning algorithms,
N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,”IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 2050–2057, 2004
work page 2050
-
[5]
Online convex optimization in the bandit setting: gradient descent without a gradient
A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: Gradient descent without a gradient,”arXiv preprint cs/0408007, 2004
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[6]
T. Lattimore, “Bandit convex optimisation,”arXiv preprint arXiv:2402.06535, 2024
-
[7]
An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,
O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,”Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017
work page 2017
-
[8]
Improved regret guarantees for online smooth convex optimization with bandit feedback,
A. Saha and A. Tewari, “Improved regret guarantees for online smooth convex optimization with bandit feedback,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 636–642
work page 2011
-
[9]
Online optimization with gradual variations,
C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu, “Online optimization with gradual variations,” inProceedings of the Annual Conference on Learning Theory, 2012, pp. 6–1
work page 2012
-
[10]
Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,
P. Zhao, Y .-J. Zhang, L. Zhang, and Z.-H. Zhou, “Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,”Journal of Machine Learning Research, vol. 25, no. 98, pp. 1–52, 2024
work page 2024
-
[11]
Online learning with predictable sequences,
A. Rakhlin and K. Sridharan, “Online learning with predictable sequences,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 993–1019
work page 2013
-
[12]
Adaptivity and optimism: An improved exponentiated gradient algorithm,
J. Steinhardt and P. Liang, “Adaptivity and optimism: An improved exponentiated gradient algorithm,” inProceedings of the International Conference on Machine Learning, 2014, pp. 1593–1601
work page 2014
-
[13]
Bandit convex optimization in non-stationary environments,
P. Zhao, G. Wang, L. Zhang, and Z.-H. Zhou, “Bandit convex optimization in non-stationary environments,”Journal of Machine Learning Research, vol. 22, no. 125, pp. 1–45, 2021
work page 2021
-
[14]
Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,
C. He, B. Jiang, and S. Zhang, “Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,” arXiv preprint arXiv:2508.04654, 2025
work page internal anchor Pith review arXiv 2025
-
[15]
Optimal algorithms for online convex optimization with multi-point bandit feedback,
A. Agarwal, O. Dekel, and L. Xiao, “Optimal algorithms for online convex optimization with multi-point bandit feedback,” inProceedings of the Annual Conference on Learning Theory, 2010, pp. 28–40. 36
work page 2010
-
[16]
Bandit convex optimization: Towards tight bounds,
E. Hazan and K. Levy, “Bandit convex optimization: Towards tight bounds,”Advances in Neural Information Processing Systems, vol. 27, 2014
work page 2014
-
[17]
Bandit convex optimization: √ T regret in one dimension,
S. Bubeck, O. Dekel, T. Koren, and Y . Peres, “Bandit convex optimization: √ T regret in one dimension,” inProceedings of the Annual Conference on Learning Theory, 2015, pp. 266–278
work page 2015
-
[18]
Kernel-based methods for bandit convex optimization,
S. Bubeck, R. Eldan, and Y . T. Lee, “Kernel-based methods for bandit convex optimization,”Journal of the ACM, vol. 68, no. 4, pp. 1–35, 2021
work page 2021
-
[19]
Improved regret for zeroth-order adversarial bandit convex optimisation,
T. Lattimore, “Improved regret for zeroth-order adversarial bandit convex optimisation,”Mathematical Statistics and Learning, vol. 2, no. 3, pp. 311–334, 2020
work page 2020
-
[20]
On the complexity of bandit and derivative-free stochastic convex optimization,
O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 3–24
work page 2013
-
[21]
Bandit convex optimization with biased noisy gradient oracles,
X. Hu, L. A. Prashanth, A. Gy ¨orgy, and C. Szepesv ´ari, “Bandit convex optimization with biased noisy gradient oracles,” inProceedings of the International Conference on Artificial Intelligence and Statistics, 2016, pp. 819–828
work page 2016
-
[22]
Optimization, learning, and games with predictable sequences,
S. Rakhlin and K. Sridharan, “Optimization, learning, and games with predictable sequences,”Advances in Neural Information Processing Systems, vol. 26, 2013
work page 2013
-
[23]
O. Dekel, A. Flajolet, N. Haghtalab, and P. Jaillet, “Online learning with a hint,”Advances in Neural Information Processing Systems, vol. 30, 2017
work page 2017
-
[24]
Online learning with imperfect hints,
A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit, “Online learning with imperfect hints,” inProceedings of the International Conference on Machine Learning, 2020, pp. 822–831
work page 2020
-
[25]
Accelerated rates between stochastic and adversarial online convex optimization,
S. Sachs, H. Hadiji, T. van Erven, and C. Guzman, “Accelerated rates between stochastic and adversarial online convex optimization,”arXiv preprint arXiv:2303.03272, 2023
-
[26]
Optimistic bandit convex optimization,
S. Yang and M. Mohri, “Optimistic bandit convex optimization,”Advances in Neural Information Processing Systems, vol. 29, 2016
work page 2016
-
[27]
Beating bandits in gradually evolving worlds,
C.-K. Chiang, C.-J. Lee, and C.-J. Lu, “Beating bandits in gradually evolving worlds,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 210–227
work page 2013
-
[28]
More adaptive algorithms for adversarial bandits,
C.-Y . Wei and H. Luo, “More adaptive algorithms for adversarial bandits,” inProceedings of the Annual Conference on Learning Theory, 2018, pp. 1263–1291
work page 2018
-
[29]
Taking a hint: How to leverage loss predictors in contextual bandits?
C.-Y . Wei, H. Luo, and A. Agarwal, “Taking a hint: How to leverage loss predictors in contextual bandits?” inProceedings of the Annual Conference on Learning Theory, 2020, pp. 3583–3634
work page 2020
-
[30]
Regret bounds for log-loss via bayesian algorithms,
C. Wu, M. Heidari, A. Grama, and W. Szpankowski, “Regret bounds for log-loss via bayesian algorithms,”IEEE Transactions on Information Theory, vol. 69, no. 9, pp. 5971–5989, 2023
work page 2023
-
[31]
Tight concentrations and confidence sequences from the regret of universal portfolio,
F. Orabona and K.-S. Jun, “Tight concentrations and confidence sequences from the regret of universal portfolio,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 436–455, 2023
work page 2023
-
[32]
Online convex programming and generalized infinitesimal gradient ascent,
M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” inProceedings of the 20th International Conference on Machine Learning, 2003, pp. 928–936
work page 2003
-
[33]
Shifting regret, mirror descent, and matrices,
A. Gy¨orgy and C. Szepesv´ari, “Shifting regret, mirror descent, and matrices,” inProceedings of the International Conference on Machine Learning, 2016, pp. 2943–2951
work page 2016
-
[34]
No-regret learning in time-varying zero-sum games,
M. Zhang, P. Zhao, H. Luo, and Z.-H. Zhou, “No-regret learning in time-varying zero-sum games,” inProceedings of the International Conference on Machine Learning, 2022, pp. 26 772–26 808
work page 2022
-
[35]
Adaptive online learning in dynamic environments,
L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,”Advances in Neural Information Processing Systems, vol. 31, 2018
work page 2018
-
[36]
Efficient tracking of large classes of experts,
A. Gy ¨orgy, T. Linder, and G. Lugosi, “Efficient tracking of large classes of experts,”IEEE Transactions on Information Theory, vol. 58, no. 11, pp. 6709–6725, 2012
work page 2012
-
[37]
Improved dimension dependence for bandit convex optimization with gradient variations,
H. Yu, Y .-H. Yan, and P. Zhao, “Improved dimension dependence for bandit convex optimization with gradient variations,” arXiv preprint arXiv:2602.04761, 2026
-
[38]
Online resource allocation: Bandits feedback and advice on time-varying demands,
L. Lyu and W. C. Cheung, “Online resource allocation: Bandits feedback and advice on time-varying demands,”arXiv preprint arXiv:2302.04182, 2023
-
[39]
Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,
A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,”IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, 2012
work page 2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.