Thompson Sampling is Asymptotically Optimal in General Environments
classification
💻 cs.LG
cs.AI, stat.ML
keywords
environments, sampling, thompson, asymptotically, general, optimal, value, assumption
We discuss a variant of Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption, regret is sublinear.
Forward citations
Cited by 1 Pith paper
- Thompson Sampling for Infinite-Horizon Discounted Decision Processes
Extends Thompson sampling analysis to Borel MDPs via a three-term regret decomposition and shows exponential convergence of residual regret to zero under extended assumptions.