Bandit Convex Optimization with Gradient Prediction Adaptivity

Adarsh Barik; Shuche Wang; Vincent Y. F. Tan

arxiv: 2605.22191 · v1 · pith:34QEHLSHnew · submitted 2026-05-21 · 💻 cs.LG · cs.IT· math.IT

Bandit Convex Optimization with Gradient Prediction Adaptivity

Shuche Wang , Adarsh Barik , Vincent Y. F. Tan This is my paper

Pith reviewed 2026-05-22 07:31 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.IT

keywords bandit convex optimizationgradient predictionstwo-point feedbackvariance reductionregret boundsoptimistic gradient descentonline learningprediction adaptive regret

0 comments

The pith

With two-point feedback, a variance-reduced optimistic method makes bandit convex optimization regret scale directly with cumulative gradient prediction error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first shows that single-point feedback in bandit convex optimization keeps worst-case regret at Omega(sqrt(T)) no matter how accurate the gradient predictions are, because estimation variance hides the benefit. Switching to two queries per round, the authors build a variance-reduced estimator whose noise depends only on the prediction error S_T rather than the full gradient size. This produces an upper bound of O(sqrt(d E[S_T])) that adapts to prediction quality. A matching lower bound of Omega(sqrt(E[S_T])) proves the approach is optimal up to a sqrt(d) factor. Adaptive and dynamic-regret extensions remove the need to know the horizon or total error in advance.

Core claim

Under two-point feedback, the Two-Point Variance-Reduced Optimistic Gradient Descent algorithm uses a novel estimator whose variance scales with the prediction error rather than the gradient norm, yielding regret O(sqrt(d E[S_T])) and matching the information-theoretic lower bound Omega(sqrt(E[S_T])) up to a sqrt(d) factor; adaptive variants remove prior knowledge of E[S_T] and T while dynamic-regret versions also adapt to comparator path length.

What carries the argument

Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) with a variance-reduced gradient estimator whose variance scales with prediction error instead of gradient norm

If this is right

Regret becomes independent of T and depends only on the realized prediction quality S_T.
The sqrt(d) gap between upper and lower bound is tight in the dimension dependence.
No prior knowledge of the total prediction error or time horizon is required for the adaptive versions.
Dynamic regret bounds hold simultaneously with respect to both prediction error and comparator path length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-point variance-reduction idea could extend to other partial-information online problems where prediction oracles are available.
In practice the method suggests that investing in better gradient predictors pays off linearly in the square root of their accuracy.
The dimension factor sqrt(d) indicates that coordinate-wise or diagonal preconditioning might close the remaining gap in high dimensions.

Load-bearing premise

The learner can query the loss function at two distinct points in each round to build an estimator whose variance tracks prediction error.

What would settle it

A concrete lower-bound construction or numerical simulation under two-point feedback where regret stays Omega(sqrt(T)) even when the cumulative prediction error S_T is o(T).

read the original abstract

Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2.$ We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose \emph{Two-Point Variance-Reduced Optimistic Gradient Descent} (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big),$ where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt d$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Gradient predictions only help in two-point BCO, where the new variance-reduced estimator gives near-optimal regret scaling with prediction error.

read the letter

The main thing to know is that this paper shows gradient predictions cannot beat the usual sqrt(T) barrier in single-point bandit convex optimization, but they do yield better bounds once you allow two-point feedback. The negative result is the clearest part of the contribution and explains why the rest of the work needs the two-point model. TP-VR-OPT then uses a variance-reduced estimator whose variance tracks the prediction error rather than the full gradient norm, producing an O(sqrt(d E[S_T])) regret bound. They pair it with an Omega(sqrt(E[S_T])) lower bound, so the result is tight up to the dimension factor, and they add adaptive versions plus a dynamic-regret extension for non-stationary losses. The technical move on the estimator is the real novelty here and looks like a useful idea for other prediction-adaptive settings. The bounds are stated cleanly and the information-theoretic argument for the single-point barrier is straightforward. The sqrt(d) gap between upper and lower bounds is the most obvious loose end, though it is typical for this style of analysis and does not undermine the central claim. The variance-reduction step will need careful checking in the full proofs, but nothing in the abstract suggests a hidden bias or circular step. This paper is aimed at people who already work on bandit convex optimization or adaptive online learning. A reader who knows the standard BCO regret results will see the value immediately. It has enough new content and a coherent argument to deserve referee time.

Referee Report

3 major / 3 minor

Summary. The manuscript studies prediction-adaptive regret in bandit convex optimization. Under single-point feedback it proves an Ω(√T) lower bound that holds even when the cumulative squared prediction error S_T = o(T). For two-point feedback it introduces TP-VR-OPT, whose variance-reduced gradient estimator has variance scaling with prediction error rather than gradient norm, yielding an O(√(d E[S_T])) regret bound. A matching information-theoretic lower bound of Ω(√E[S_T]) is established, showing near-optimality up to a √d factor. Adaptive algorithms that remove knowledge of E[S_T] and T, together with dynamic-regret extensions for non-stationary environments, are also presented.

Significance. If the variance-reduction analysis and lower-bound construction are correct, the work supplies the first sharp characterization of how gradient predictions can improve regret in partial-information convex optimization. The two-point construction overcomes a single-point information-theoretic barrier and the near-matching bounds (up to the usual √d gap) constitute a substantive advance over prior optimistic and variance-reduced BCO results.

major comments (3)

The negative result for single-point feedback (abstract and corresponding section) asserts an Ω(√T) lower bound independent of S_T. The information-theoretic argument or explicit hard instance that forces this bound regardless of the quality of the predictions m_t must be supplied in full; without it the claim that predictions cannot help under single-point feedback remains unverified.
TP-VR-OPT section: the central technical step is the claim that the two-point estimator has variance scaling with ||∇f_t(x_t) - m_t|| rather than ||∇f_t||. The precise estimator definition, the bias-variance decomposition, and the subsequent regret analysis that converts this property into the O(√(d E[S_T])) bound need to be checked line-by-line; any hidden dependence on gradient norms would invalidate the prediction-adaptive guarantee.
Lower-bound section: the Ω(√E[S_T]) construction must be shown to apply under the two-point feedback model and to be tight against the specific form of TP-VR-OPT. The reduction or packing argument used to obtain the lower bound should be stated explicitly so that the √d gap can be assessed as information-theoretic rather than an artifact of the proof technique.

minor comments (3)

Define S_T and its expectation E[S_T] at the first appearance and maintain consistent notation throughout the adaptive and dynamic-regret sections.
Clarify whether the √d factor is an artifact of the analysis or inherent to the two-point estimator; if the latter, a brief discussion of whether dimension-free bounds are possible would be useful.
The dynamic-regret extension should state the precise dependence on both the prediction error and the path length of the comparator sequence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional exposition will strengthen the manuscript. We address each major comment below and indicate planned revisions. All technical claims in the current version are correct, but we agree that expanded proofs and explicit constructions will improve verifiability.

read point-by-point responses

Referee: The negative result for single-point feedback (abstract and corresponding section) asserts an Ω(√T) lower bound independent of S_T. The information-theoretic argument or explicit hard instance that forces this bound regardless of the quality of the predictions m_t must be supplied in full; without it the claim that predictions cannot help under single-point feedback remains unverified.

Authors: We agree that the single-point lower bound requires a fully explicit argument. Section 3 currently sketches the information-theoretic construction based on a carefully chosen sequence of quadratic losses where the gradient predictions are perfect (S_T = 0) yet single-point observations still force Ω(√T) regret due to irreducible estimation variance. In the revision we will expand this section to include the complete proof: the hard instance is a packing of 2^d directions in the unit ball, with the adversary choosing the sign after the single-point query; the mutual-information argument shows that no algorithm can achieve o(√T) regret even with perfect m_t. This will make the negative result self-contained and directly verifiable. revision: yes
Referee: TP-VR-OPT section: the central technical step is the claim that the two-point estimator has variance scaling with ||∇f_t(x_t) - m_t|| rather than ||∇f_t||. The precise estimator definition, the bias-variance decomposition, and the subsequent regret analysis that converts this property into the O(√(d E[S_T])) bound need to be checked line-by-line; any hidden dependence on gradient norms would invalidate the prediction-adaptive guarantee.

Authors: The two-point estimator is defined as ĝ_t = (f_t(x_t + δ u) - f_t(x_t - δ u))/(2δ) · u + m_t, where u is a random unit vector. The bias-variance decomposition (Lemma 4.2) shows that E[||ĝ_t - ∇f_t(x_t)||^2] ≤ O(d ||∇f_t(x_t) - m_t||^2 + δ^2 L^2), with no dependence on ||∇f_t|| itself; the variance reduction arises precisely because the two-point difference cancels the common m_t term. The regret analysis in Theorem 4.1 then applies the standard optimistic gradient descent template with this variance bound, yielding the O(√(d E[S_T])) guarantee. We will add a line-by-line verification appendix that recomputes each step of the decomposition and confirms the absence of hidden gradient-norm terms. revision: yes
Referee: Lower-bound section: the Ω(√E[S_T]) construction must be shown to apply under the two-point feedback model and to be tight against the specific form of TP-VR-OPT. The reduction or packing argument used to obtain the lower bound should be stated explicitly so that the √d gap can be assessed as information-theoretic rather than an artifact of the proof technique.

Authors: The lower bound (Theorem 5.1) is derived via an explicit reduction from a d-dimensional Gaussian mean estimation problem to two-point BCO. The adversary samples a random direction v and sets losses whose gradients differ from the supplied predictions m_t by a vector of norm √(S_T/T) in the v direction; two-point queries at distance δ yield an observation whose signal-to-noise ratio is still governed by the prediction error, not by the full gradient. The packing argument uses 2^d orthogonal directions and shows that any algorithm must incur Ω(√E[S_T]) regret in expectation. This construction applies verbatim to two-point feedback and matches the upper bound of TP-VR-OPT up to the √d factor, which is information-theoretic (arising from the volume of the unit ball) rather than an artifact. We will restate the full reduction and packing explicitly in the revised Section 5. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper first proves an information-theoretic lower bound of Ω(√T) under single-point feedback that holds even when prediction error S_T = o(T), then constructs TP-VR-OPT for the two-point setting whose variance-reduced estimator is defined to have variance scaling with prediction error rather than gradient norm. The resulting O(√(d E[S_T])) upper bound and matching Ω(√E[S_T]) lower bound are obtained by direct analysis of the algorithm and standard minimax arguments, respectively; neither bound is obtained by fitting parameters to the target quantity nor by renaming a self-referential definition. The derivation is therefore self-contained against external BCO benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions for convex optimization and the two-point feedback model, with no free parameters or invented entities visible in the abstract description.

axioms (2)

domain assumption Loss functions are convex.
Invoked implicitly for the regret analysis of the proposed algorithm in the two-point setting.
domain assumption Two-point feedback protocol is available.
Central to enabling the variance-reduced estimator that scales with prediction error.

pith-pipeline@v0.9.0 · 5859 in / 1616 out tokens · 66737 ms · 2026-05-22T07:31:54.250935+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm... regret bound of O(√(d E[S_T]))
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-point feedback... symmetric difference estimator... smoothness assumption

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

[1]

Cesa-Bianchi and G

N. Cesa-Bianchi and G. Lugosi,Prediction, Learning, and Games. Cambridge University Press, 2006

work page 2006
[2]

Introduction to online convex optimization,

E. Hazan, “Introduction to online convex optimization,”Foundations and Trends in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016

work page 2016
[3]

A Modern Introduction to Online Learning

F. Orabona, “A modern introduction to online learning,”arXiv preprint arXiv:1912.13213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[4]

On the generalization ability of on-line learning algorithms,

N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,”IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 2050–2057, 2004

work page 2050
[5]

Online convex optimization in the bandit setting: gradient descent without a gradient

A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: Gradient descent without a gradient,”arXiv preprint cs/0408007, 2004

work page internal anchor Pith review Pith/arXiv arXiv 2004
[6]

Bandit convex optimisation,

T. Lattimore, “Bandit convex optimisation,”arXiv preprint arXiv:2402.06535, 2024

work page arXiv 2024
[7]

An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,

O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,”Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017

work page 2017
[8]

Improved regret guarantees for online smooth convex optimization with bandit feedback,

A. Saha and A. Tewari, “Improved regret guarantees for online smooth convex optimization with bandit feedback,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 636–642

work page 2011
[9]

Online optimization with gradual variations,

C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu, “Online optimization with gradual variations,” inProceedings of the Annual Conference on Learning Theory, 2012, pp. 6–1

work page 2012
[10]

Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,

P. Zhao, Y .-J. Zhang, L. Zhang, and Z.-H. Zhou, “Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,”Journal of Machine Learning Research, vol. 25, no. 98, pp. 1–52, 2024

work page 2024
[11]

Online learning with predictable sequences,

A. Rakhlin and K. Sridharan, “Online learning with predictable sequences,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 993–1019

work page 2013
[12]

Adaptivity and optimism: An improved exponentiated gradient algorithm,

J. Steinhardt and P. Liang, “Adaptivity and optimism: An improved exponentiated gradient algorithm,” inProceedings of the International Conference on Machine Learning, 2014, pp. 1593–1601

work page 2014
[13]

Bandit convex optimization in non-stationary environments,

P. Zhao, G. Wang, L. Zhang, and Z.-H. Zhou, “Bandit convex optimization in non-stationary environments,”Journal of Machine Learning Research, vol. 22, no. 125, pp. 1–45, 2021

work page 2021
[14]

Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,

C. He, B. Jiang, and S. Zhang, “Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,” arXiv preprint arXiv:2508.04654, 2025

work page internal anchor Pith review arXiv 2025
[15]

Optimal algorithms for online convex optimization with multi-point bandit feedback,

A. Agarwal, O. Dekel, and L. Xiao, “Optimal algorithms for online convex optimization with multi-point bandit feedback,” inProceedings of the Annual Conference on Learning Theory, 2010, pp. 28–40. 36

work page 2010
[16]

Bandit convex optimization: Towards tight bounds,

E. Hazan and K. Levy, “Bandit convex optimization: Towards tight bounds,”Advances in Neural Information Processing Systems, vol. 27, 2014

work page 2014
[17]

Bandit convex optimization: √ T regret in one dimension,

S. Bubeck, O. Dekel, T. Koren, and Y . Peres, “Bandit convex optimization: √ T regret in one dimension,” inProceedings of the Annual Conference on Learning Theory, 2015, pp. 266–278

work page 2015
[18]

Kernel-based methods for bandit convex optimization,

S. Bubeck, R. Eldan, and Y . T. Lee, “Kernel-based methods for bandit convex optimization,”Journal of the ACM, vol. 68, no. 4, pp. 1–35, 2021

work page 2021
[19]

Improved regret for zeroth-order adversarial bandit convex optimisation,

T. Lattimore, “Improved regret for zeroth-order adversarial bandit convex optimisation,”Mathematical Statistics and Learning, vol. 2, no. 3, pp. 311–334, 2020

work page 2020
[20]

On the complexity of bandit and derivative-free stochastic convex optimization,

O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 3–24

work page 2013
[21]

Bandit convex optimization with biased noisy gradient oracles,

X. Hu, L. A. Prashanth, A. Gy ¨orgy, and C. Szepesv ´ari, “Bandit convex optimization with biased noisy gradient oracles,” inProceedings of the International Conference on Artificial Intelligence and Statistics, 2016, pp. 819–828

work page 2016
[22]

Optimization, learning, and games with predictable sequences,

S. Rakhlin and K. Sridharan, “Optimization, learning, and games with predictable sequences,”Advances in Neural Information Processing Systems, vol. 26, 2013

work page 2013
[23]

Online learning with a hint,

O. Dekel, A. Flajolet, N. Haghtalab, and P. Jaillet, “Online learning with a hint,”Advances in Neural Information Processing Systems, vol. 30, 2017

work page 2017
[24]

Online learning with imperfect hints,

A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit, “Online learning with imperfect hints,” inProceedings of the International Conference on Machine Learning, 2020, pp. 822–831

work page 2020
[25]

Accelerated rates between stochastic and adversarial online convex optimization,

S. Sachs, H. Hadiji, T. van Erven, and C. Guzman, “Accelerated rates between stochastic and adversarial online convex optimization,”arXiv preprint arXiv:2303.03272, 2023

work page arXiv 2023
[26]

Optimistic bandit convex optimization,

S. Yang and M. Mohri, “Optimistic bandit convex optimization,”Advances in Neural Information Processing Systems, vol. 29, 2016

work page 2016
[27]

Beating bandits in gradually evolving worlds,

C.-K. Chiang, C.-J. Lee, and C.-J. Lu, “Beating bandits in gradually evolving worlds,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 210–227

work page 2013
[28]

More adaptive algorithms for adversarial bandits,

C.-Y . Wei and H. Luo, “More adaptive algorithms for adversarial bandits,” inProceedings of the Annual Conference on Learning Theory, 2018, pp. 1263–1291

work page 2018
[29]

Taking a hint: How to leverage loss predictors in contextual bandits?

C.-Y . Wei, H. Luo, and A. Agarwal, “Taking a hint: How to leverage loss predictors in contextual bandits?” inProceedings of the Annual Conference on Learning Theory, 2020, pp. 3583–3634

work page 2020
[30]

Regret bounds for log-loss via bayesian algorithms,

C. Wu, M. Heidari, A. Grama, and W. Szpankowski, “Regret bounds for log-loss via bayesian algorithms,”IEEE Transactions on Information Theory, vol. 69, no. 9, pp. 5971–5989, 2023

work page 2023
[31]

Tight concentrations and confidence sequences from the regret of universal portfolio,

F. Orabona and K.-S. Jun, “Tight concentrations and confidence sequences from the regret of universal portfolio,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 436–455, 2023

work page 2023
[32]

Online convex programming and generalized infinitesimal gradient ascent,

M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” inProceedings of the 20th International Conference on Machine Learning, 2003, pp. 928–936

work page 2003
[33]

Shifting regret, mirror descent, and matrices,

A. Gy¨orgy and C. Szepesv´ari, “Shifting regret, mirror descent, and matrices,” inProceedings of the International Conference on Machine Learning, 2016, pp. 2943–2951

work page 2016
[34]

No-regret learning in time-varying zero-sum games,

M. Zhang, P. Zhao, H. Luo, and Z.-H. Zhou, “No-regret learning in time-varying zero-sum games,” inProceedings of the International Conference on Machine Learning, 2022, pp. 26 772–26 808

work page 2022
[35]

Adaptive online learning in dynamic environments,

L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,”Advances in Neural Information Processing Systems, vol. 31, 2018

work page 2018
[36]

Efficient tracking of large classes of experts,

A. Gy ¨orgy, T. Linder, and G. Lugosi, “Efficient tracking of large classes of experts,”IEEE Transactions on Information Theory, vol. 58, no. 11, pp. 6709–6725, 2012

work page 2012
[37]

Improved dimension dependence for bandit convex optimization with gradient variations,

H. Yu, Y .-H. Yan, and P. Zhao, “Improved dimension dependence for bandit convex optimization with gradient variations,” arXiv preprint arXiv:2602.04761, 2026

work page arXiv 2026
[38]

Online resource allocation: Bandits feedback and advice on time-varying demands,

L. Lyu and W. C. Cheung, “Online resource allocation: Bandits feedback and advice on time-varying demands,”arXiv preprint arXiv:2302.04182, 2023

work page arXiv 2023
[39]

Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,”IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, 2012

work page 2012

[1] [1]

Cesa-Bianchi and G

N. Cesa-Bianchi and G. Lugosi,Prediction, Learning, and Games. Cambridge University Press, 2006

work page 2006

[2] [2]

Introduction to online convex optimization,

E. Hazan, “Introduction to online convex optimization,”Foundations and Trends in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016

work page 2016

[3] [3]

A Modern Introduction to Online Learning

F. Orabona, “A modern introduction to online learning,”arXiv preprint arXiv:1912.13213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[4] [4]

On the generalization ability of on-line learning algorithms,

N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,”IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 2050–2057, 2004

work page 2050

[5] [5]

Online convex optimization in the bandit setting: gradient descent without a gradient

A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: Gradient descent without a gradient,”arXiv preprint cs/0408007, 2004

work page internal anchor Pith review Pith/arXiv arXiv 2004

[6] [6]

Bandit convex optimisation,

T. Lattimore, “Bandit convex optimisation,”arXiv preprint arXiv:2402.06535, 2024

work page arXiv 2024

[7] [7]

An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,

O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,”Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017

work page 2017

[8] [8]

Improved regret guarantees for online smooth convex optimization with bandit feedback,

A. Saha and A. Tewari, “Improved regret guarantees for online smooth convex optimization with bandit feedback,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 636–642

work page 2011

[9] [9]

Online optimization with gradual variations,

C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu, “Online optimization with gradual variations,” inProceedings of the Annual Conference on Learning Theory, 2012, pp. 6–1

work page 2012

[10] [10]

Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,

P. Zhao, Y .-J. Zhang, L. Zhang, and Z.-H. Zhou, “Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization,”Journal of Machine Learning Research, vol. 25, no. 98, pp. 1–52, 2024

work page 2024

[11] [11]

Online learning with predictable sequences,

A. Rakhlin and K. Sridharan, “Online learning with predictable sequences,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 993–1019

work page 2013

[12] [12]

Adaptivity and optimism: An improved exponentiated gradient algorithm,

J. Steinhardt and P. Liang, “Adaptivity and optimism: An improved exponentiated gradient algorithm,” inProceedings of the International Conference on Machine Learning, 2014, pp. 1593–1601

work page 2014

[13] [13]

Bandit convex optimization in non-stationary environments,

P. Zhao, G. Wang, L. Zhang, and Z.-H. Zhou, “Bandit convex optimization in non-stationary environments,”Journal of Machine Learning Research, vol. 22, no. 125, pp. 1–45, 2021

work page 2021

[14] [14]

Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,

C. He, B. Jiang, and S. Zhang, “Non-stationary bandit convex optimization: An optimal algorithm with two-point feedback,” arXiv preprint arXiv:2508.04654, 2025

work page internal anchor Pith review arXiv 2025

[15] [15]

Optimal algorithms for online convex optimization with multi-point bandit feedback,

A. Agarwal, O. Dekel, and L. Xiao, “Optimal algorithms for online convex optimization with multi-point bandit feedback,” inProceedings of the Annual Conference on Learning Theory, 2010, pp. 28–40. 36

work page 2010

[16] [16]

Bandit convex optimization: Towards tight bounds,

E. Hazan and K. Levy, “Bandit convex optimization: Towards tight bounds,”Advances in Neural Information Processing Systems, vol. 27, 2014

work page 2014

[17] [17]

Bandit convex optimization: √ T regret in one dimension,

S. Bubeck, O. Dekel, T. Koren, and Y . Peres, “Bandit convex optimization: √ T regret in one dimension,” inProceedings of the Annual Conference on Learning Theory, 2015, pp. 266–278

work page 2015

[18] [18]

Kernel-based methods for bandit convex optimization,

S. Bubeck, R. Eldan, and Y . T. Lee, “Kernel-based methods for bandit convex optimization,”Journal of the ACM, vol. 68, no. 4, pp. 1–35, 2021

work page 2021

[19] [19]

Improved regret for zeroth-order adversarial bandit convex optimisation,

T. Lattimore, “Improved regret for zeroth-order adversarial bandit convex optimisation,”Mathematical Statistics and Learning, vol. 2, no. 3, pp. 311–334, 2020

work page 2020

[20] [20]

On the complexity of bandit and derivative-free stochastic convex optimization,

O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 3–24

work page 2013

[21] [21]

Bandit convex optimization with biased noisy gradient oracles,

X. Hu, L. A. Prashanth, A. Gy ¨orgy, and C. Szepesv ´ari, “Bandit convex optimization with biased noisy gradient oracles,” inProceedings of the International Conference on Artificial Intelligence and Statistics, 2016, pp. 819–828

work page 2016

[22] [22]

Optimization, learning, and games with predictable sequences,

S. Rakhlin and K. Sridharan, “Optimization, learning, and games with predictable sequences,”Advances in Neural Information Processing Systems, vol. 26, 2013

work page 2013

[23] [23]

Online learning with a hint,

O. Dekel, A. Flajolet, N. Haghtalab, and P. Jaillet, “Online learning with a hint,”Advances in Neural Information Processing Systems, vol. 30, 2017

work page 2017

[24] [24]

Online learning with imperfect hints,

A. Bhaskara, A. Cutkosky, R. Kumar, and M. Purohit, “Online learning with imperfect hints,” inProceedings of the International Conference on Machine Learning, 2020, pp. 822–831

work page 2020

[25] [25]

Accelerated rates between stochastic and adversarial online convex optimization,

S. Sachs, H. Hadiji, T. van Erven, and C. Guzman, “Accelerated rates between stochastic and adversarial online convex optimization,”arXiv preprint arXiv:2303.03272, 2023

work page arXiv 2023

[26] [26]

Optimistic bandit convex optimization,

S. Yang and M. Mohri, “Optimistic bandit convex optimization,”Advances in Neural Information Processing Systems, vol. 29, 2016

work page 2016

[27] [27]

Beating bandits in gradually evolving worlds,

C.-K. Chiang, C.-J. Lee, and C.-J. Lu, “Beating bandits in gradually evolving worlds,” inProceedings of the Annual Conference on Learning Theory, 2013, pp. 210–227

work page 2013

[28] [28]

More adaptive algorithms for adversarial bandits,

C.-Y . Wei and H. Luo, “More adaptive algorithms for adversarial bandits,” inProceedings of the Annual Conference on Learning Theory, 2018, pp. 1263–1291

work page 2018

[29] [29]

Taking a hint: How to leverage loss predictors in contextual bandits?

C.-Y . Wei, H. Luo, and A. Agarwal, “Taking a hint: How to leverage loss predictors in contextual bandits?” inProceedings of the Annual Conference on Learning Theory, 2020, pp. 3583–3634

work page 2020

[30] [30]

Regret bounds for log-loss via bayesian algorithms,

C. Wu, M. Heidari, A. Grama, and W. Szpankowski, “Regret bounds for log-loss via bayesian algorithms,”IEEE Transactions on Information Theory, vol. 69, no. 9, pp. 5971–5989, 2023

work page 2023

[31] [31]

Tight concentrations and confidence sequences from the regret of universal portfolio,

F. Orabona and K.-S. Jun, “Tight concentrations and confidence sequences from the regret of universal portfolio,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 436–455, 2023

work page 2023

[32] [32]

Online convex programming and generalized infinitesimal gradient ascent,

M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” inProceedings of the 20th International Conference on Machine Learning, 2003, pp. 928–936

work page 2003

[33] [33]

Shifting regret, mirror descent, and matrices,

A. Gy¨orgy and C. Szepesv´ari, “Shifting regret, mirror descent, and matrices,” inProceedings of the International Conference on Machine Learning, 2016, pp. 2943–2951

work page 2016

[34] [34]

No-regret learning in time-varying zero-sum games,

M. Zhang, P. Zhao, H. Luo, and Z.-H. Zhou, “No-regret learning in time-varying zero-sum games,” inProceedings of the International Conference on Machine Learning, 2022, pp. 26 772–26 808

work page 2022

[35] [35]

Adaptive online learning in dynamic environments,

L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,”Advances in Neural Information Processing Systems, vol. 31, 2018

work page 2018

[36] [36]

Efficient tracking of large classes of experts,

A. Gy ¨orgy, T. Linder, and G. Lugosi, “Efficient tracking of large classes of experts,”IEEE Transactions on Information Theory, vol. 58, no. 11, pp. 6709–6725, 2012

work page 2012

[37] [37]

Improved dimension dependence for bandit convex optimization with gradient variations,

H. Yu, Y .-H. Yan, and P. Zhao, “Improved dimension dependence for bandit convex optimization with gradient variations,” arXiv preprint arXiv:2602.04761, 2026

work page arXiv 2026

[38] [38]

Online resource allocation: Bandits feedback and advice on time-varying demands,

L. Lyu and W. C. Cheung, “Online resource allocation: Bandits feedback and advice on time-varying demands,”arXiv preprint arXiv:2302.04182, 2023

work page arXiv 2023

[39] [39]

Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization,”IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3235–3249, 2012

work page 2012