Efficient Counterfactual Learning from Bandit Feedback

Kohei Yata; Shota Yasui; Yusuke Narita

arxiv: 1809.03084 · v3 · pith:PRMHRQDJnew · submitted 2018-09-10 · 💻 cs.LG · cs.AI · cs.IR · stat.ME · stat.ML


Yusuke Narita, Shota Yasui, Kohei Yata

classification 💻 cs.LG cs.AI cs.IR stat.ME stat.ML

keywords estimators bandit advertisement counterfactual data efficient feedback improve



What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
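To make the setting concrete, here is a minimal sketch of off-policy evaluation from logged bandit data. It uses inverse propensity weighting (IPW), a commonly used baseline estimator; the abstract does not say which standard estimators it compares against, so the choice of IPW is an assumption, and this is not the estimator the paper proposes. The function name `ipw_value`, the log tuple layout, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import random


def ipw_value(logs, target_policy):
    """Inverse-propensity-weighted estimate of a policy's expected reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
        logging_prob is the probability with which the logging policy
        chose `action` in `context` (assumed known and positive).
    target_policy: function (context, action) -> probability of `action`
        under the counterfactual policy being evaluated.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the counterfactual policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Synthetic example (illustrative assumption, not data from the paper):
# two actions, a uniform logging policy, Bernoulli rewards.
rng = random.Random(0)
true_click_rate = {0: 0.2, 1: 0.6}
logs = []
for _ in range(20000):
    action = rng.choice([0, 1])  # logging policy: each action w.p. 0.5
    reward = 1.0 if rng.random() < true_click_rate[action] else 0.0
    logs.append((None, action, reward, 0.5))


def always_action_one(context, action):
    # Counterfactual policy: always choose action 1.
    return 1.0 if action == 1 else 0.0


estimate = ipw_value(logs, always_action_one)
print(f"IPW estimate: {estimate:.3f} (true value in this simulation: 0.6)")
```

In this simulation the estimate is unbiased for the true value of 0.6, but it varies from sample to sample. The abstract's claim concerns exactly this kind of variance: its estimators are said to have the lowest variance within a wide class of estimators.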
