Non-Stationary Bandit Convex Optimization: An Optimal Algorithm with Two-Point Feedback
read the original abstract
This paper studies bandit convex optimization in non-stationary environments with two-point feedback, using dynamic regret as the performance measure. We propose an algorithm based on bandit mirror descent that extends naturally to non-Euclidean settings. Let $T$ be the total number of iterations and $\mathcal{P}_{T,p}$ the path variation with respect to the $\ell_p$-norm. In Euclidean space, our algorithm matches the optimal regret bound $\mathcal{O}(\sqrt{dT(1+\mathcal{P}_{T,2})})$, improving upon \citet{zhao2021bandit} by a factor of $\mathcal{O}(\sqrt{d})$. Beyond Euclidean settings, our algorithm achieves an upper bound of $\mathcal{O}(\sqrt{d\log(d)T\log(T)(1 + \mathcal{P}_{T,1})})$ on the simplex, which is nearly optimal up to log factors. For the cross-polytope, the bound reduces to $\mathcal{O}(\sqrt{d\log(d)T(1+\mathcal{P}_{T,p})})$ for some $p = 1 + 1/\log(d)$.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
Bandit Convex Optimization with Gradient Prediction Adaptivity
TP-VR-OPT achieves O(√(d E[S_T])) prediction-adaptive regret in two-point bandit convex optimization, with a matching Ω(√E[S_T]) lower bound up to √d, while single-point feedback cannot benefit from predictions.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.