The Uncertainty Bellman Equation and Exploration

Brendan O'Donoghue; Ian Osband; Remi Munos; Volodymyr Mnih

arxiv: 1709.05380 · v4 · pith:7YZKHVF4new · submitted 2017-09-15 · 💻 cs.AI · cs.LG · math.OC · stat.ML

classification 💻 cs.AI · cs.LG · math.OC · stat.ML

keywords bellman · equation · time-steps · uncertainty · bound · connects · consider · expected

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
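The recursion described in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: it assumes a finite-horizon, undiscounted setting with a known transition table `P`, a fixed policy `pi`, and a given table of local uncertainties `nu`, and the function name `ube_fixed_horizon` is hypothetical. The exact form of the UBE in the paper may differ. The example also compares propagating variance with propagating standard deviation, which is the contrast the abstract draws with count-based bonuses.

```python
import math


def ube_fixed_horizon(P, pi, nu, horizon):
    """Backward recursion (assumed form, for illustration only):

        u_h(s, a) = nu(s, a) + sum_{s', a'} P[s][a][s'] * pi[s'][a'] * u_{h+1}(s', a')

    with u_{H+1} = 0. P[s][a][s'] is a transition probability, pi[s][a] a
    policy probability, and nu[s][a] a local uncertainty term.
    """
    n_states = len(P)
    n_actions = len(P[0])
    u = [[0.0] * n_actions for _ in range(n_states)]  # terminal condition
    for _ in range(horizon):
        # Expected next-step uncertainty under the policy, per next state.
        nxt = [
            sum(pi[s2][a2] * u[s2][a2] for a2 in range(n_actions))
            for s2 in range(n_states)
        ]
        u = [
            [
                nu[s][a] + sum(P[s][a][s2] * nxt[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return u


# Toy problem: one state, one action, self-loop, horizon 4.
P = [[[1.0]]]
pi = [[1.0]]
local_variance = [[0.25]]  # local standard deviation 0.5
local_std = [[math.sqrt(0.25)]]

# Propagate variance, then take the square root at the end.
variance_bound = ube_fixed_horizon(P, pi, local_variance, 4)[0][0]
bonus_from_variance = math.sqrt(variance_bound)

# Propagate standard deviation directly through the same recursion.
bonus_from_std = ube_fixed_horizon(P, pi, local_std, 4)[0][0]

print(variance_bound, bonus_from_variance, bonus_from_std)  # 1.0 1.0 2.0
```

In this toy case the bonus obtained from the propagated variance grows like the square root of the horizon (1.0 for four steps), while compounding standard deviations grows linearly (2.0), which is one way to read the abstract's claim that the variance-based bound can be much tighter.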

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimistic Proximal Policy Optimization
cs.LG · 2019-06 · unverdicted · novelty 4.0

OPPO augments PPO with optimistic policy evaluation driven by return uncertainty estimates and shows improved results over prior methods on a tabular sparse-reward task.