UCB Exploration via Q-Ensembles
classification
💻 cs.LG
stat.ML
keywords
exploration, learning, setting, adapt, algorithms, atari, bandit, benchmark
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
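To make the idea in the abstract concrete, here is a minimal sketch of UCB-style action selection over an ensemble of Q-value estimates. The abstract does not give the exact rule, so this assumes a common form: score each action by the ensemble mean plus a multiple of the ensemble standard deviation, then act greedily on that score. The function name `ucb_action`, the weight `lam`, and the toy Q-values are hypothetical and only for illustration.

```python
from statistics import mean, pstdev


def ucb_action(q_values, lam=0.1):
    """Pick an action by an upper-confidence score over an ensemble.

    q_values: list of K lists; q_values[k][a] is ensemble member k's
              Q-value estimate for action a in the current state.
    lam:      exploration weight (assumed hyperparameter); lam=0 reduces
              to acting greedily on the ensemble mean.
    """
    num_actions = len(q_values[0])
    scores = []
    for a in range(num_actions):
        per_member = [q[a] for q in q_values]
        # Mean is the value estimate; spread across members is used as
        # an uncertainty proxy (assumption: population std deviation).
        scores.append(mean(per_member) + lam * pstdev(per_member))
    return max(range(num_actions), key=scores.__getitem__)


# Toy ensemble of three members over two actions (made-up numbers).
ensemble_q = [
    [1.0, 0.5],
    [1.0, 1.5],
    [1.0, 0.1],
]

greedy = ucb_action(ensemble_q, lam=0.0)      # mean only
optimistic = ucb_action(ensemble_q, lam=1.0)  # mean + disagreement bonus
```

In this toy case the members agree on action 0 but disagree on action 1, so the bonus term shifts the choice toward the action whose value is less certain. In a deep RL setting the lists would be replaced by the outputs of several Q-network heads.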
Forward citations
Cited by 1 Pith paper
- Distributional Off-Policy Evaluation with Deep Quantile Process Regression
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.