Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Alekh Agarwal; Miroslav Dudik; Yu-Xiang Wang

arxiv: 1612.01205 · v2 · pith:BLELLBMWnew · submitted 2016-12-04 · stat.ML · cs.LG

Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik

classification: stat.ML, cs.LG

keywords: contextual, model, bandits, bound, consistent, access, agnostic, data

We study the off-policy evaluation problem---estimating the value of a target policy using data collected by another policy---under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
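The abstract names three estimators without giving formulas, so the following is only a minimal sketch of how they are commonly written, not the authors' code. IPS and DR appear in their standard textbook forms. The `switch` function is an assumed reading of the SWITCH idea: use importance weighting where the weight is at most a threshold, and fall back on the reward model for actions whose weight exceeds it. All names here (`ips`, `dr`, `switch`, `tau`, `r_hat`, and the discrete context and action encoding) are hypothetical choices for illustration.

```python
from typing import Callable, Sequence, Tuple

# Assumed encoding: contexts and actions are small integers.
Policy = Callable[[int, int], float]       # (context, action) -> probability
RewardModel = Callable[[int, int], float]  # (context, action) -> predicted reward
Log = Sequence[Tuple[int, int, float]]     # (context, logged action, observed reward)


def ips(log: Log, pi: Policy, mu: Policy) -> float:
    """Inverse propensity scoring: reweight each logged reward by pi/mu."""
    return sum(r * pi(x, a) / mu(x, a) for x, a, r in log) / len(log)


def dr(log: Log, pi: Policy, mu: Policy, r_hat: RewardModel,
       actions: Sequence[int]) -> float:
    """Doubly robust: model prediction plus an importance-weighted residual."""
    total = 0.0
    for x, a, r in log:
        baseline = sum(pi(x, b) * r_hat(x, b) for b in actions)
        total += baseline + pi(x, a) / mu(x, a) * (r - r_hat(x, a))
    return total / len(log)


def switch(log: Log, pi: Policy, mu: Policy, r_hat: RewardModel,
           actions: Sequence[int], tau: float) -> float:
    """Assumed SWITCH-style estimator with hypothetical threshold `tau`.

    Logged samples with weight <= tau contribute through IPS; actions with
    weight > tau contribute through the reward model instead.
    """
    total = 0.0
    for x, a, r in log:
        w = pi(x, a) / mu(x, a)
        if w <= tau:
            total += w * r
        total += sum(pi(x, b) * r_hat(x, b)
                     for b in actions if pi(x, b) / mu(x, b) > tau)
    return total / len(log)


if __name__ == "__main__":
    # Toy data: one context, two actions, uniform logging policy.
    log = [(0, 0, 0.0), (0, 1, 1.0), (0, 1, 1.0), (0, 0, 0.0)]
    mu = lambda x, a: 0.5
    pi = lambda x, a: 1.0 if a == 1 else 0.0   # target always plays action 1
    r_hat = lambda x, a: 0.5                   # deliberately crude reward model
    print(ips(log, pi, mu))
    print(dr(log, pi, mu, r_hat, [0, 1]))
    print(switch(log, pi, mu, r_hat, [0, 1], tau=1.0))
```

In this sketch, setting `tau` to infinity recovers IPS and setting it to zero leaves only the model-based term for every action with positive weight. Intermediate values trade the variance of large importance weights against the bias of the reward model, which is the bias-variance tradeoff the abstract refers to.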

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation
cs.LG 2025-07 unverdicted novelty 5.0

Greedy linear models without exploration consistently achieve top-tier performance in over 90% of offline dataset evaluations for linear bandit recommenders, with hyperparameter tuning favoring minimal exploration and...