Distributional Reinforcement Learning for Efficient Exploration

Borislav Mavrin; Hengshuai Yao; Kaiwen Wu; Linglong Kong; Shangtong Zhang; Yaoliang Yu

arxiv: 1905.06125 · v1 · pith:IBNWYLZPnew · submitted 2019-05-13 · 💻 cs.LG · cs.AI · stat.ML



keywords: exploration, games, qr-dqn, algorithm, distribution, distributional, efficient, intrinsic



In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving a 483% average gain in cumulative rewards over QR-DQN across 49 games, with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN.
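
The two components named in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the schedule `scale * sqrt(log(t) / t)`, the bonus defined as the spread of the quantiles at or above the median, and the function names `decay_coefficient`, `upper_quantile_spread`, and `select_action`. It is one plausible reading of the idea, not the authors' implementation.

```python
import math
from statistics import mean, median


def decay_coefficient(step, scale=50.0):
    """Assumed schedule: scale * sqrt(log(t) / t), which shrinks toward 0 for large t."""
    t = max(step, 2)  # avoid log(1) = 0 and division by zero
    return scale * math.sqrt(math.log(t) / t)


def upper_quantile_spread(quantiles):
    """Assumed bonus: root of the squared deviations of the upper quantiles from the median."""
    med = median(quantiles)
    upper = [q for q in quantiles if q >= med]
    return math.sqrt(sum((q - med) ** 2 for q in upper) / (2 * len(quantiles)))


def select_action(quantiles_per_action, step, scale=50.0):
    """Pick the action maximizing mean value plus the decayed upper-quantile bonus."""
    c_t = decay_coefficient(step, scale)
    scores = [mean(qs) + c_t * upper_quantile_spread(qs) for qs in quantiles_per_action]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical quantile estimates for two actions (not data from the paper).
certain = [1.0, 1.0, 1.0, 1.0]    # higher mean, no upper-tail spread
uncertain = [0.0, 0.5, 1.0, 2.0]  # lower mean, wide upper tail
early_choice = select_action([certain, uncertain], step=2)
late_choice = select_action([certain, uncertain], step=10**9)
```

Under these assumptions, the bonus dominates early in training, so the action with the wider upper tail is chosen. Once the coefficient has decayed, the choice falls back to the action with the higher mean.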



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.