pith. machine review for the scientific record.

arXiv: 1802.09127 · v1 · submitted 2018-02-26 · stat.ML · cs.LG

Recognition: unknown

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

classification: stat.ML, cs.LG
keywords: sampling, thompson, approximate, bayesian, posterior, deep, learning, methods

Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to posterior samples of the model. At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical. Thus, it is attractive to consider approximate Bayesian neural networks in a Thompson Sampling framework. To understand the impact of using an approximate posterior on Thompson Sampling, we benchmark well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems. We found that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario. In particular, we highlight the challenge of adapting slowly converging uncertainty estimates to the online setting.
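As an illustration of the point that Thompson Sampling only needs posterior samples, here is a minimal sketch on a much simpler problem than the paper's benchmarks: a non-contextual Bernoulli bandit with an exact Beta posterior. It is not one of the approximate Bayesian neural network methods compared in the paper; the function name `thompson_sampling`, the arm means, and the round count are all hypothetical choices made for the example.

```python
import random


def thompson_sampling(true_means, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling; return (pull counts, total reward)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior (uniform) on each arm's success probability.
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    pulls = [0] * n_arms
    total_reward = 0
    for _ in range(n_rounds):
        # Draw one posterior sample per arm, then act greedily on the samples.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulated environment: Bernoulli reward with the arm's true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: successes go to alpha, failures to beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward


# Hypothetical three-armed problem; the last arm is the best one.
pulls, total_reward = thompson_sampling([0.2, 0.5, 0.8], n_rounds=2000)
print(pulls, total_reward)
```

In this toy setting the posterior is exact and cheap to update after every round. The setting the abstract describes replaces the Beta draw with a sample from an approximate posterior over neural network weights, which is where slowly converging uncertainty estimates become a concern.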

This paper has not been read by Pith yet.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives ...

  2. Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking

    cs.RO · 2026-04 · unverdicted · novelty 7.0

    A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.