UCB Exploration via Q-Ensembles

Richard Y. Chen , Szymon Sidor , Pieter Abbeel , John Schulman

classification cs.LG stat.ML

keywords exploration, learning, setting, adapt, algorithms, atari, bandit, benchmark

We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
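
As a rough illustration of the UCB idea described in the abstract, the sketch below picks the action whose ensemble mean plus a scaled ensemble spread is largest. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ucb_action`, the coefficient `lam`, the nested-list layout of the Q-values, and the use of the population standard deviation as the uncertainty bonus are all choices made here for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def ucb_action(q_ensemble: Sequence[Sequence[float]], lam: float = 0.1) -> int:
    """Return the action index maximizing ensemble mean + lam * ensemble std.

    q_ensemble[k][a] is the k-th ensemble member's Q-value estimate for
    action a in the current state (hypothetical layout for this sketch).
    """
    if not q_ensemble:
        raise ValueError("need at least one ensemble member")
    num_actions = len(q_ensemble[0])
    scores: List[float] = []
    for a in range(num_actions):
        # Collect every member's estimate for this action.
        estimates = [member[a] for member in q_ensemble]
        # Optimistic score: average estimate plus a disagreement bonus.
        scores.append(mean(estimates) + lam * pstdev(estimates))
    # Ties resolve to the lowest action index.
    return max(range(num_actions), key=scores.__getitem__)


# Three ensemble members, two actions. The members agree on action 0
# and disagree on action 1, whose mean estimate is lower.
q_values = [
    [1.0, 0.2],
    [1.0, 1.4],
    [1.0, 0.8],
]
greedy = ucb_action(q_values, lam=0.0)      # no bonus: pure mean
optimistic = ucb_action(q_values, lam=1.0)  # bonus rewards disagreement
```

With `lam=0.0` the rule reduces to acting greedily on the ensemble mean; raising `lam` shifts preference toward actions on which the ensemble members disagree, which is the sense in which the bonus drives exploration.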

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distributional Off-Policy Evaluation with Deep Quantile Process Regression
stat.ML 2026-04 unverdicted novelty 6.0

DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.