UCB Exploration via Q-Ensembles

John Schulman; Pieter Abbeel; Richard Y. Chen; Szymon Sidor

Not yet reviewed by Pith; the record is open.

This paper has not been read by Pith yet. Machine review is queued; the pith claim, tier, and objections will appear here once it completes.

arXiv:1706.01502v3 · pith:JL5OSIW4 · submitted 2017-06-05 · cs.LG, stat.ML

Richard Y. Chen, Szymon Sidor, Pieter Abbeel, John Schulman

classification cs.LG stat.ML

keywords exploration, learning, setting, adapt, algorithms, atari, bandit, benchmark

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
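
The abstract describes UCB-style action selection over an ensemble of Q-functions but gives no formula on this page. As a rough illustration only, the sketch below assumes the upper-confidence bound is the ensemble mean plus a scaled ensemble standard deviation. The function name `ucb_action`, the coefficient `lam`, and the nested-list layout of `q_values` are hypothetical choices for this example, not taken from the record.

```python
from statistics import mean, pstdev


def ucb_action(q_values, lam=0.1):
    """Return the action index with the highest upper-confidence score.

    q_values: list of K lists, one per ensemble member, each holding one
        Q-estimate per action for the current state (assumed layout).
    lam: exploration coefficient scaling the uncertainty bonus
        (assumed hyperparameter).
    """
    num_actions = len(q_values[0])
    scores = []
    for a in range(num_actions):
        # Collect every ensemble member's estimate for action a.
        estimates = [member[a] for member in q_values]
        # Assumed bound: ensemble mean plus lam times the population std.
        scores.append(mean(estimates) + lam * pstdev(estimates))
    # Pick the best-scoring action; ties resolve to the lowest index.
    return max(range(num_actions), key=scores.__getitem__)


# Two ensemble members, two actions. The members agree on action 0 and
# disagree on action 1, so only action 1 earns an exploration bonus.
q = [[1.0, 0.0], [1.0, 1.6]]
greedy = ucb_action(q, lam=0.0)       # no bonus: compare means only
optimistic = ucb_action(q, lam=1.0)   # bonus favors the uncertain action
```

With `lam=0.0` the rule reduces to acting greedily on the ensemble mean, while a larger `lam` shifts preference toward actions on which the ensemble members disagree.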

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Ensemble Aggregation for Actor-Critics
cs.LG 2025-07 unverdicted novelty 7.0

AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.
Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning
cs.LG 2026-06 unverdicted novelty 6.0

A quantile-of-means ensemble method achieves minimax optimal variance-dependent regret bounds for finite-horizon MDPs without count-based uncertainty estimates.
Distributional Off-Policy Evaluation with Deep Quantile Process Regression
stat.ML 2026-04 unverdicted novelty 6.0

DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG 2026-02 unverdicted novelty 6.0

PEPO uses pessimistic ensembling of DPO policies on data subsets to achieve single-policy concentrability sample bounds and avoid over-optimization in tabular settings.
Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making
cs.LG 2025-12 unverdicted novelty 6.0

An adaptive RL-MPC framework uses RL to inform MPPI sampling and aggregates MPPI samples for value estimation, delivering up to 72% higher success rates and 2.1x faster convergence on tasks like race driving and Lunar...
DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning
cs.RO 2026-06 unverdicted novelty 5.0

DF-ExpEnse improves sample efficiency in finetuning diffusion-based robotic policies by filtering diffusion-generated actions with critic ensembles and enabling fleet-level collaboration.
Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
cs.LG 2026-02 unverdicted novelty 5.0

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution o...