Recall Traces: Backtracking Models for Efficient Reinforcement Learning

Anirudh Goyal; Hugo Larochelle; Philemon Brakel; Sergey Levine; Soumye Singhal; Timothy Lillicrap; William Fedus; Yoshua Bengio

arxiv: 1804.00379 · v2 · pith:FT4KAS4Knew · submitted 2018-04-02 · 💻 cs.LG · stat.ML

Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, Yoshua Bengio

classification 💻 cs.LG stat.ML

keywords state high backtracking model states traces value action

In many environments, only a tiny subset of all states yields high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states of trajectories that terminate at a given high-reward state. We can train a model which, starting from a high-value state (or one that is estimated to have high value), predicts and samples the (state, action) tuples that may have led to that high-value state. These traces of (state, action) pairs, which we refer to as Recall Traces, are sampled from the backtracking model starting from a high-value state. They are informative because they terminate in good states, and hence we can use them to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories that lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
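To make the idea in the abstract concrete, here is a minimal tabular sketch in Python. It is not the authors' implementation. It assumes a hypothetical six-state chain with one terminal high-reward state, replaces the learned backtracking model with an exact lookup table of predecessors, and replaces policy improvement with simple majority-vote imitation. All names (`step`, `fit_backtracking_model`, `sample_recall_trace`, `imitate`) are illustrative.

```python
import random
from collections import Counter, defaultdict

N_STATES = 6          # hypothetical chain: states 0..5
GOAL = N_STATES - 1   # the single high-reward, terminal state (assumption)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain dynamics; moves are clipped to [0, GOAL]."""
    delta = 1 if action == RIGHT else -1
    return max(0, min(GOAL, state + delta))


def fit_backtracking_model():
    """Map each state to the (previous_state, action) pairs that reach it.

    A stand-in for a learned model: the table is built by enumerating
    every transition out of the non-terminal states.
    """
    backward = defaultdict(list)
    for s in range(GOAL):  # GOAL is terminal, so it has no outgoing moves
        for a in (LEFT, RIGHT):
            backward[step(s, a)].append((s, a))
    return backward


def sample_recall_trace(backward, start, horizon, rng):
    """Walk backward from `start`, then return the pairs in forward order."""
    state, trace = start, []
    for _ in range(horizon):
        prev_state, action = rng.choice(backward[state])
        trace.append((prev_state, action))
        state = prev_state
    trace.reverse()
    return trace


def imitate(traces):
    """Majority-vote action per state over all sampled traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        for s, a in trace:
            counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


rng = random.Random(0)
backward = fit_backtracking_model()
traces = [sample_recall_trace(backward, GOAL, 4, rng) for _ in range(20)]
policy = imitate(traces)
```

Every sampled trace, replayed forward through `step`, ends in `GOAL`, which is the property that makes such traces useful as imitation targets. In this toy version the backward sampler is uniform over predecessors; the abstract instead describes a model that samples from an approximate posterior over trajectories leading to large rewards.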

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning the Arrow of Time
cs.LG 2019-07 unverdicted novelty 7.0

Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.