
arXiv: 1704.06431 · v1 · submitted 2017-04-21 · math.ST · stat.TH

Faster Rates for Policy Learning

keywords: regret, policy, decay, either, faster, optimal, value, estimation

This article improves the existing proven rates of regret decay in optimal policy estimation. We give a margin-free result showing that the regret decay for estimating a within-class optimal policy is second-order for empirical risk minimizers over Donsker classes, with regret decaying at a faster rate than the standard error of an efficient estimator of the value of an optimal policy. We also give a result from the classification literature that shows that faster regret decay is possible via plug-in estimation provided a margin condition holds. Four examples are considered. In these examples, the regret is expressed in terms of either the mean value or the median value; the number of possible actions is either two or finitely many; and the sampling scheme is either independent and identically distributed or sequential, where the latter represents a contextual bandit sampling scheme.
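To make the notion of regret for an empirical risk minimizer concrete, the following is a minimal sketch and not the paper's method. It assumes a toy randomized experiment with two actions, a known propensity of 0.5, and a hypothetical policy class of threshold rules `x > t`. The names `ipw_value`, `erm_threshold_policy`, `simulate`, and `true_regret` are illustrative, and the inverse-probability-weighted value estimate is one common choice, swapped in here for simplicity.

```python
import random

def ipw_value(policy, data, propensity=0.5):
    """Inverse-probability-weighted estimate of the mean outcome under `policy`.

    Assumes every action was assigned with the same known probability.
    """
    total = 0.0
    for x, a, y in data:
        if policy(x) == a:
            total += y / propensity
    return total / len(data)

def erm_threshold_policy(data, thresholds):
    """Pick the threshold whose rule 'treat iff x > t' maximizes the IPW value."""
    best_t, best_v = None, float("-inf")
    for t in thresholds:
        v = ipw_value(lambda x, t=t: int(x > t), data)
        if v > best_v:
            best_t, best_v = t, v
    return best_t, best_v

def simulate(n, seed=0):
    """Toy i.i.d. data: x ~ Uniform(-1, 1), action is a fair coin flip,
    and the mean outcome is x under action 1 and 0 under action 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        a = rng.randint(0, 1)
        y = (x if a == 1 else 0.0) + rng.gauss(0, 0.1)
        data.append((x, a, y))
    return data

def true_regret(t):
    """In this toy model the value of threshold t is (1 - t**2) / 4,
    so the regret relative to the best threshold (t = 0) is t**2 / 4."""
    return t * t / 4

thresholds = [i / 10 for i in range(-10, 11)]
data = simulate(5000)
t_hat, v_hat = erm_threshold_policy(data, thresholds)
regret = true_regret(t_hat)
```

Here `regret` is the gap between the value of the best threshold rule in the class and the value of the selected rule, which is the quantity whose rate of decay the abstract discusses. Rerunning `simulate` with larger `n` gives a rough, informal feel for how quickly that gap shrinks in this toy setting.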
