pith. sign in

arxiv: 1709.05380 · v4 · pith:7YZKHVF4new · submitted 2017-09-15 · 💻 cs.AI · cs.LG· math.OC· stat.ML

The Uncertainty Bellman Equation and Exploration

classification 💻 cs.AI cs.LGmath.OCstat.ML
keywords bellmanequationtime-stepsuncertaintyboundconnectsconsiderexpected
0
0 comments X
read the original abstract

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimistic Proximal Policy Optimization

    cs.LG 2019-06 unverdicted novelty 4.0

    OPPO augments PPO with optimistic policy evaluation driven by return uncertainty estimates and shows improved results over prior methods on a tabular sparse-reward task.