Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

Jian Peng; Qiang Liu; Xi Chen; Yang Liu; Yuanyi Zhong; Yunan Luo

arxiv: 1905.13420 · v1 · pith:ZFDRAIU4new · submitted 2019-05-31 · 💻 cs.LG · stat.ML

Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, Jian Peng

classification 💻 cs.LG stat.ML

keywords: episodic, learning, reward, reinforcement, trajectory, algorithm, algorithms, assignment

Recent advances in deep reinforcement learning algorithms have shown great potential and success in solving many challenging real-world problems, including the game of Go and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training at each time step. In the real world, however, designing such a reward function is non-trivial, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment that uses deep neural networks to learn to decompose the episodic return back onto each time step of the trajectory. With this learned reward signal, learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted to learn the importance and dependency of states in the trajectory, thereby providing high-quality and interpretable learned reward signals. We performed extensive experiments on a set of MuJoCo continuous locomotion control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
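
To make the idea of decomposing an episodic return concrete, here is a minimal sketch in Python. It is not the paper's method: a linear per-step reward model fitted by least squares stands in for the Transformer, and the feature dimension, the synthetic data, and the names `fit_linear_reward` and `decompose` are all hypothetical. It shows only the core constraint, namely that the learned per-step rewards of a trajectory should sum to its observed episodic return.

```python
import numpy as np


def fit_linear_reward(trajectories, returns):
    """Fit w so that sum_t (x_t . w) approximates each episodic return.

    Assumption: a linear reward model replaces the paper's neural network.
    Because the model is linear, summing the per-step features of a
    trajectory turns it into a single row of an ordinary regression problem.
    """
    summed_features = np.stack([traj.sum(axis=0) for traj in trajectories])
    targets = np.asarray(returns, dtype=float)
    w, *_ = np.linalg.lstsq(summed_features, targets, rcond=None)
    return w


def decompose(trajectory, w):
    """Return one learned reward per time step of the trajectory."""
    return trajectory @ w


# Synthetic data (hypothetical): 50 trajectories of varying length,
# each step described by 3 state-action features.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
trajectories = [rng.normal(size=(int(rng.integers(5, 15)), 3)) for _ in range(50)]

# Only the episodic return is observed, never the per-step reward.
returns = [float((traj @ true_w).sum()) for traj in trajectories]

w = fit_linear_reward(trajectories, returns)
step_rewards = decompose(trajectories[0], w)
```

The per-step values in `step_rewards` could then replace the single end-of-episode signal when training a policy. A sequence model such as the Transformer would, unlike this linear stand-in, let the reward assigned to one step depend on other steps in the trajectory.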

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decision Transformer: Reinforcement Learning via Sequence Modeling
cs.LG 2021-06 accept novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.