On Lower Bounds for Regret in Reinforcement Learning

Benjamin Van Roy; Ian Osband

arxiv: 1608.02732 · v1 · pith:6B3P3IF2new · submitted 2016-08-09 · 📊 stat.ML · cs.LG



keywords: lower, learning, reinforcement, bound, bounds, regret, bartlett, clarify


abstract

This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning. In particular, this paper:

- Reproduces a lower bound on regret for reinforcement learning, similar to the result of Theorem 5 in the journal UCRL2 paper (Jaksch et al., 2010).
- Clarifies that the proposed proof of Theorem 6 in the REGAL paper (Bartlett and Tewari, 2009) does not hold using the standard techniques without further work. We suggest that this result should instead be considered a conjecture, as it has no rigorous proof.
- Suggests that the conjectured lower bound given by Bartlett and Tewari (2009) is incorrect and that, in fact, it is possible to improve the scaling of the upper bound to match the weaker lower bounds presented in this paper.

We hope that this note serves to clarify existing results in the field of reinforcement learning and provides interesting motivation for future work.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Behavior-Consistent Deep Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

QED sets state-dependent temperature proportional to double-critic disagreement to bound pairwise KL divergence between Boltzmann policies, cutting cross-run divergence by two orders of magnitude on 18 continuous-cont...
Behavior-Consistent Deep Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.