arxiv: 1706.01502 · v3 · pith:JL5OSIW4new · submitted 2017-06-05 · 💻 cs.LG · stat.ML

UCB Exploration via Q-Ensembles

classification 💻 cs.LG stat.ML
keywords exploration, learning, setting, adapt, algorithms, atari, bandit, benchmark
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
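
To make the idea concrete, here is a minimal sketch of UCB-style action selection from an ensemble of Q-estimates. The abstract does not spell out the bonus, so the form used here is an assumption: the ensemble mean plus a multiple of the ensemble standard deviation. The function name `ucb_action`, the coefficient `lam`, and the toy values are hypothetical, chosen for illustration only.

```python
from statistics import mean, pstdev


def ucb_action(q_values, lam=0.1):
    """Pick an action from ensemble Q-estimates for a single state.

    q_values: list of K lists, one per ensemble member, each holding one
              Q-estimate per action.
    lam:      exploration coefficient (hypothetical name); 0 recovers the
              greedy action with respect to the ensemble mean.
    """
    num_actions = len(q_values[0])
    scores = []
    for a in range(num_actions):
        # Collect every member's estimate for action a.
        estimates = [member[a] for member in q_values]
        # Assumed bonus: disagreement between members, measured as the
        # population standard deviation, stands in for uncertainty.
        scores.append(mean(estimates) + lam * pstdev(estimates))
    # Index of the highest-scoring action (first one wins ties).
    return max(range(num_actions), key=scores.__getitem__)


# Toy ensemble: three members, three actions. The members agree on
# actions 0 and 2 but disagree strongly on action 1.
q = [
    [1.0, 0.5, 0.2],
    [1.0, 1.5, 0.2],
    [1.0, 0.1, 0.2],
]

greedy = ucb_action(q, lam=0.0)      # mean only: picks action 0
optimistic = ucb_action(q, lam=2.0)  # large bonus: picks action 1
```

With `lam=0.0` the rule follows the ensemble mean; with a larger `lam` it prefers the action the members disagree on most, which is the optimism-under-uncertainty behavior that UCB methods from the bandit setting rely on.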

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Ensemble Aggregation for Actor-Critics

    cs.LG 2025-07 unverdicted novelty 7.0

    AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.

  2. Distributional Off-Policy Evaluation with Deep Quantile Process Regression

    stat.ML 2026-04 unverdicted novelty 6.0

    DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.

  3. Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

    cs.LG 2026-02 unverdicted novelty 6.0

    PEPO uses pessimistic ensembling of DPO policies on data subsets to achieve single-policy concentrability sample bounds and avoid over-optimization in tabular settings.

  4. Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making

    cs.LG 2025-12 unverdicted novelty 6.0

    An adaptive RL-MPC framework uses RL to inform MPPI sampling and aggregates MPPI samples for value estimation, delivering up to 72% higher success rates and 2.1x faster convergence on tasks like race driving and Lunar...

  5. Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

    cs.LG 2026-02 unverdicted novelty 5.0

    PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution o...