pith. sign in

arxiv: 1202.2277 · v2 · pith:6NU55XKTnew · submitted 2012-02-10 · 🧮 math.ST · math.PR· stat.TH

Finite-time Regret Bound of a Bandit Algorithm for the Semi-bounded Support Model

classification 🧮 math.ST math.PRstat.TH
keywords boundregretasymptoticbanditfinitefinite-timeformmodel
0
0 comments X
read the original abstract

In this paper we consider stochastic multiarmed bandit problems. Recently a policy, DMED, is proposed and proved to achieve the asymptotic bound for the model that each reward distribution is supported in a known bounded interval, e.g. [0,1]. However, the derived regret bound is described in an asymptotic form and the performance in finite time has been unknown. We inspect this policy and derive a finite-time regret bound by refining large deviation probabilities to a simple finite form. Further, this observation reveals that the assumption on the lower-boundedness of the support is not essential and can be replaced with a weaker one, the existence of the moment generating function.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.