pith. machine review for the scientific record.

arxiv: 1903.04193 · v2 · submitted 2019-03-11 · 💻 cs.LG · cs.AI

Recognition: unknown

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

classification 💻 cs.LG cs.AI
keywords: critics · bdpi · off-policy · actor · algorithms · model-free · actor-critic · bootstrapped
abstract

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi.
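The abstract's central mechanism is an actor that slowly imitates the average greedy policy of several off-policy critics. A minimal tabular sketch of that update step is below; the sizes, the learning rate `actor_lr`, and the tabular critics are illustrative assumptions (the paper itself uses neural networks over continuous states), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes, chosen only for illustration.
n_states, n_actions, n_critics = 5, 3, 4
actor_lr = 0.05  # how slowly the actor imitates the critics

# Several independent off-policy critics (tabular Q-functions here for brevity).
critics = [rng.normal(size=(n_states, n_actions)) for _ in range(n_critics)]

# The actor is a stochastic policy over discrete actions, initially uniform.
actor = np.full((n_states, n_actions), 1.0 / n_actions)

def greedy(q):
    """One-hot greedy policy derived from a single critic's Q-values."""
    g = np.zeros_like(q)
    g[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return g

# Actor update: move a small step toward the critics' average greedy policy.
# Averaging over bootstrapped critics yields state-specific exploration,
# which the paper compares to Thompson sampling.
target = np.mean([greedy(q) for q in critics], axis=0)
actor = (1 - actor_lr) * actor + actor_lr * target
```

Because the update mixes two probability distributions, the actor remains a valid policy after every step, and the critics can be trained purely off-policy from a replay buffer, independently of the actor.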

This paper has not been read by Pith yet.

discussion (0)
