Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

Byron Boots; J. Andrew Bagnell; Wen Sun

arxiv: 1805.11240 · v1 · pith:BFM3LT55new · submitted 2018-05-29 · 💻 cs.LG · stat.ML

keywords: horizon, oracle, learning, planning, imitation, reinforcement, achieve, baselines

In this paper, we propose to combine imitation learning and reinforcement learning via reward shaping with an oracle. We study the effect of a near-optimal cost-to-go oracle on the planning horizon and demonstrate that the oracle shortens the learner's planning horizon as a function of its accuracy: a globally optimal oracle can shorten the planning horizon to one, leading to a one-step greedy Markov Decision Process that is much easier to optimize, while an oracle that is far from optimal requires planning over a longer horizon to achieve near-optimal performance. This insight bridges the gap between imitation learning and reinforcement learning and interpolates between them. Motivated by these insights, we propose Truncated HORizon Policy Search (THOR), a method that searches for policies that maximize the total reshaped reward over a finite planning horizon when the oracle is sub-optimal. We experimentally demonstrate that a gradient-based implementation of THOR can achieve superior performance compared to RL baselines and IL baselines even when the oracle is sub-optimal.
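As a rough illustration of the reward-shaping idea in the abstract, the sketch below uses the standard potential-based form r + gamma * V_hat(s') - V_hat(s), with the oracle's value estimate V_hat as the potential, and sums the reshaped reward over a truncated horizon k. This is an assumption about the general technique, not the paper's implementation: the function names, the trajectory, and the oracle values are all hypothetical, and the abstract itself speaks of a cost-to-go oracle rather than a reward-side value estimate.

```python
def shaped_reward(r, v_s, v_next, gamma):
    """Potential-based shaping: r + gamma * V_hat(s') - V_hat(s)."""
    return r + gamma * v_next - v_s


def truncated_shaped_return(rewards, values, gamma, k):
    """Discounted sum of shaped rewards over the first k steps.

    rewards[t] is the reward at step t; values[t] is the (hypothetical)
    oracle estimate V_hat(s_t), so values needs at least k + 1 entries.
    """
    total = 0.0
    for t in range(k):
        total += gamma ** t * shaped_reward(
            rewards[t], values[t], values[t + 1], gamma
        )
    return total


# Made-up 3-step trajectory and oracle estimates, for illustration only.
rewards = [1.0, 0.0, 2.0]
values = [2.5, 1.8, 2.0, 0.0]
gamma = 0.9
k = 2

lhs = truncated_shaped_return(rewards, values, gamma, k)

# The shaped sum telescopes to:
# k-step discounted reward + gamma^k * V_hat(s_k) - V_hat(s_0).
rhs = (
    sum(gamma ** t * rewards[t] for t in range(k))
    + gamma ** k * values[k]
    - values[0]
)
```

The telescoped form shows why the horizon can shrink as the oracle improves: the truncated objective equals the k-step return plus the oracle's estimate at the cutoff state, up to a term that depends only on the start state. With k = 1 and an exact oracle, maximizing it reduces to a one-step greedy choice.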

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leveraging Experience in Lazy Search
cs.RO 2019-07 unverdicted novelty 6.0

Uses imitation learning from oracles to train an edge-evaluation policy for lazy graph search, outperforming heuristics on 2D and 7D motion planning problems when test instances are similar to training.