pith. sign in

arxiv: 2010.03104 · v1 · pith:FG5JC3IAnew · submitted 2020-10-07 · 💻 cs.LG · math.ST· stat.ML· stat.TH

Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective

classification 💻 cs.LG math.STstat.MLstat.TH
keywords instance-dependentlearningbanditscomplexitycontextualreinforcementalgorithmsregret
0
0 comments X
read the original abstract

In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems. Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Sample Complexity of Multiclass and Sparse Contextual Bandits

    cs.LG 2026-05 unverdicted novelty 8.0

    Algorithms and matching lower bounds for s-sparse contextual bandits yield Õ((s/ε² + |A|/ε) log |Π|/δ) samples to output an ε-optimal policy.