Thompson Sampling is Asymptotically Optimal in General Environments
1 Pith paper cites this work. Polarity classification is still being indexed.
abstract
We discuss a variant of Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges to the optimal value in mean and (2) given a recoverability assumption, regret is sublinear.
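To make the sampling idea in the abstract concrete, here is a minimal sketch of Thompson sampling over a small finite class of candidate environments. It is a toy simplification, not the paper's algorithm: the environments are assumed to be two-armed Bernoulli bandits (Markov and fully observable, unlike the general setting), the agent resamples at every step, and the names `ENVS`, `thompson_sampling`, and `normalize` are hypothetical.

```python
import math
import random

# Hypothetical environment class: each candidate is a list of Bernoulli arm means.
ENVS = [
    [0.9, 0.1],
    [0.1, 0.9],
    [0.5, 0.5],
]


def normalize(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    s = sum(w)
    return [x / s for x in w]


def thompson_sampling(envs, true_index, steps, seed=0):
    """Run the toy agent; return the final posterior and the total reward."""
    rng = random.Random(seed)
    log_w = [0.0] * len(envs)  # uniform prior over the class, in log space
    total_reward = 0
    for _ in range(steps):
        posterior = normalize(log_w)
        # Sample one candidate environment from the current posterior.
        sampled = rng.choices(range(len(envs)), weights=posterior)[0]
        # Act optimally for the sampled environment (its best arm).
        arm = max(range(len(envs[sampled])), key=lambda a: envs[sampled][a])
        # The reward is generated by the true environment, not the sampled one.
        reward = 1 if rng.random() < envs[true_index][arm] else 0
        total_reward += reward
        # Bayes update: add each candidate's log-likelihood of the observation.
        for i, env in enumerate(envs):
            p = env[arm] if reward == 1 else 1.0 - env[arm]
            log_w[i] += math.log(p)
    return normalize(log_w), total_reward


posterior, total_reward = thompson_sampling(ENVS, true_index=1, steps=200)
print(posterior, total_reward)
```

In this toy class every arm pull helps distinguish the candidates, so the posterior concentrates quickly on the true one. The general setting of the abstract does not guarantee this, which is why the sublinear-regret claim there depends on a recoverability assumption.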
fields
stat.ML (1)
years
2024 (1)
verdicts
UNVERDICTED (1)
representative citing papers
- Thompson Sampling for Infinite-Horizon Discounted Decision Processes
Extends Thompson sampling analysis to Borel MDPs via a three-term regret decomposition and shows exponential convergence of residual regret to zero under extended assumptions.