Self-improving reactive agents based on reinforcement learning, planning and teaching
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2019 (2)
verdicts: UNVERDICTED (2)
representative citing papers
Adding a hindsight factor that integrates historic temporal differences into the Q-learning loss reduces overestimation and yields higher average scores than DQN, DDQN and dueling networks on ATARI games after 10 million frames.
citing papers explorer
- Experience Replay Optimization
ERO alternates updates between an agent policy maximizing cumulative reward and a replay policy selecting useful experiences, with experiments showing improved performance on continuous control tasks.
- In Hindsight: A Smooth Reward for Steady Exploration
Adding a hindsight factor that integrates historic temporal differences into the Q-learning loss reduces overestimation and yields higher average scores than DQN, DDQN and dueling networks on ATARI games after 10 million frames.