Tight Bounds for $\gamma$-Regret via the Decision-Estimation Coefficient

Alexander Rakhlin; Margalit Glasgow

arxiv: 2303.03327 · v2 · pith:H7TYZ6JVnew · submitted 2023-03-06 · 💻 cs.LG · stat.ML

Tight Bounds for γ-Regret via the Decision-Estimation Coefficient

Margalit Glasgow , Alexander Rakhlin This is my paper

classification 💻 cs.LG stat.ML

keywords gammaregretmathcalalgorithmclassbanditboundbounds

0 comments

read the original abstract

In this work, we give a statistical characterization of the $\gamma$-regret for arbitrary structured bandit problems, the regret which arises when comparing against a benchmark that is $\gamma$ times the optimal solution. The $\gamma$-regret emerges in structured bandit problems over a function class $\mathcal{F}$ where finding an exact optimum of $f \in \mathcal{F}$ is intractable. Our characterization is given in terms of the $\gamma$-DEC, a statistical complexity parameter for the class $\mathcal{F}$, which is a modification of the constrained Decision-Estimation Coefficient (DEC) of Foster et al., 2023 (and closely related to the original offset DEC of Foster et al., 2021). Our lower bound shows that the $\gamma$-DEC is a fundamental limit for any model class $\mathcal{F}$: for any algorithm, there exists some $f \in \mathcal{F}$ for which the $\gamma$-regret of that algorithm scales (nearly) with the $\gamma$-DEC of $\mathcal{F}$. We provide an upper bound showing that there exists an algorithm attaining a nearly matching $\gamma$-regret. Due to significant challenges in applying the prior results on the DEC to the $\gamma$-regret case, both our lower and upper bounds require novel techniques and a new algorithm.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Sample Complexity of Multiclass and Sparse Contextual Bandits
cs.LG 2026-05 unverdicted novelty 8.0

Algorithms and matching lower bounds for s-sparse contextual bandits yield Õ((s/ε² + |A|/ε) log |Π|/δ) samples to output an ε-optimal policy.