Doubly Robust Policy Evaluation and Learning
Abstract
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
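The estimator the abstract alludes to can be written in a few lines. The sketch below is illustrative rather than the authors' code: it assumes a deterministic target policy `pi`, a reward model `r_hat`, and logged propensities of the past policy (all hypothetical names). The estimate is unbiased whenever either the reward model or the propensities are correct, which is the doubly robust property the abstract refers to.

```python
def doubly_robust_value(contexts, actions, rewards, propensities, pi, r_hat):
    """Doubly robust (DR) estimate of the value of a target policy `pi`
    from logged contextual-bandit data.

    contexts:     logged contexts x_i
    actions:      logged actions a_i
    rewards:      observed rewards r_i
    propensities: probability the past (logging) policy assigned to a_i
                  given x_i (known or estimated)
    pi:           target policy, pi(x) -> action (deterministic here)
    r_hat:        reward model, r_hat(x, a) -> predicted reward
    """
    n = len(rewards)
    total = 0.0
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        # Direct-method term: predicted reward of the action pi would take.
        dm = r_hat(x, pi(x))
        # Importance-weighted residual: corrects the reward model's error,
        # nonzero only on samples where the logged action matches pi's choice.
        correction = (r - r_hat(x, a)) / p if a == pi(x) else 0.0
        total += dm + correction
    return total / n
```

If the reward model is exact, the residual term vanishes in expectation and the estimate reduces to the direct method; if the propensities are exact, the correction term removes the reward model's bias. This mirrors the "either/or" guarantee stated in the abstract.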
Forward citations
Cited by 9 Pith papers
- Smooth Multi-Policy Causal Effect Estimation in Longitudinal Settings
  PEQ-Net jointly estimates multiple longitudinal treatment policies via a shared policy encoder and kernel mean embeddings to constrain second-order bias after LTMLE correction.
- SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
  SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance by up to 45.6% and cutting TFLOPs by up to 22x on 25 Minari tasks.
- The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
  An identification theorem shows that a randomized experiment and a simulator together recover causal model values from confounded logs, with the logs used only afterward to reduce estimation error.
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
  Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
- ALM-MTA: Front-Door Causal Multi-Touch Attribution Method for Creator-Ecosystem Optimization
  ALM-MTA uses front-door causal inference with an adversarially trained mediator and contrastive learning to improve multi-touch attribution, reporting gains in DAU, creator activity, exposure efficiency, AUUC, and upl...
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
  RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.
- CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs
  CoFi-PGMA derives a unified counterfactual policy gradient objective based on marginal contribution to correct filtered feedback for both routing and collaborative multi-agent LLM training.