
arXiv: 1809.03084 · v3 · submitted 2018-09-10 · cs.LG · cs.AI · cs.IR · stat.ME · stat.ML

Efficient Counterfactual Learning from Bandit Feedback

Keywords: estimators, bandit, advertisement, counterfactual, data, efficient, feedback, improve

What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance among a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design at a major advertising company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence than a state-of-the-art benchmark.
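To make "offline estimators for the expected reward from a counterfactual policy" concrete, here is a minimal sketch of the standard inverse propensity weighting (IPW) estimator, a common example of the standard estimators that such work compares against. It is not the paper's efficient estimator. The log format, the function name `ipw_estimate`, and the simulated click rates are hypothetical assumptions for illustration. Each logged reward is reweighted by the ratio of the counterfactual policy's probability of the logged action to the logging policy's probability. This gives an unbiased estimate of the counterfactual policy's expected reward when the logging probabilities are known and positive for every action the counterfactual policy might take.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical log format: (context, logged action, logging policy's
# probability of that action, observed reward).
LogRecord = Tuple[float, int, float, float]


def ipw_estimate(logs: List[LogRecord],
                 target_prob: Callable[[float, int], float]) -> float:
    """IPW estimate of a counterfactual (target) policy's expected reward.

    target_prob(context, action) returns the probability that the target
    policy chooses `action` in `context`.
    """
    if not logs:
        raise ValueError("need at least one logged record")
    total = 0.0
    for context, action, logging_prob, reward in logs:
        if logging_prob <= 0.0:
            raise ValueError("logging probabilities must be positive")
        # Importance weight: how much more (or less) likely the target
        # policy is to take the logged action than the logging policy was.
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
    # Average of reweighted rewards over all logged rounds.
    return total / len(logs)


if __name__ == "__main__":
    rng = random.Random(0)
    logs: List[LogRecord] = []
    for _ in range(100_000):
        x = rng.random()                   # hypothetical scalar context (unused by rewards)
        a = rng.randrange(2)               # uniform logging policy: P(a) = 0.5
        p_click = 0.8 if a == 1 else 0.2   # hypothetical click rates per ad
        r = 1.0 if rng.random() < p_click else 0.0
        logs.append((x, a, 0.5, r))

    def always_ad_one(context: float, action: int) -> float:
        # Counterfactual policy that always shows ad 1.
        return 1.0 if action == 1 else 0.0

    print(ipw_estimate(logs, always_ad_one))
```

In this simulation the true value of always showing ad 1 is 0.8, and the printed estimate should land close to it. The 1/p weights are the reason IPW can have high variance when logging probabilities are small, which is the cost that lower-variance estimators aim to reduce.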
