Doubly Robust Policy Evaluation and Learning
Abstract
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
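The estimator the abstract alludes to can be written in a few lines. The sketch below is illustrative rather than the authors' code: it assumes a deterministic target policy `pi`, a reward model `r_hat`, and logged propensities of the past policy (all hypothetical names). The estimate is unbiased whenever either the reward model or the propensities are correct, which is the doubly robust property the abstract refers to.

```python
def doubly_robust_value(contexts, actions, rewards, propensities, pi, r_hat):
    """Doubly robust (DR) estimate of the value of a target policy `pi`
    from logged contextual-bandit data.

    contexts:     logged contexts x_i
    actions:      logged actions a_i
    rewards:      observed rewards r_i
    propensities: probability the past (logging) policy assigned to a_i
                  given x_i (known or estimated)
    pi:           target policy, pi(x) -> action (deterministic here)
    r_hat:        reward model, r_hat(x, a) -> predicted reward
    """
    n = len(rewards)
    total = 0.0
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        # Direct-method term: predicted reward of the action pi would take.
        dm = r_hat(x, pi(x))
        # Importance-weighted residual: corrects the reward model's error,
        # nonzero only on samples where the logged action matches pi's choice.
        correction = (r - r_hat(x, a)) / p if a == pi(x) else 0.0
        total += dm + correction
    return total / n
```

If the reward model is exact, the residual term vanishes in expectation and the estimate reduces to the direct method; if the propensities are exact, the correction term removes the reward model's bias. This mirrors the "either/or" guarantee stated in the abstract.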
Forward citations
Cited by 9 Pith papers
- Smooth Multi-Policy Causal Effect Estimation in Longitudinal Settings
  PEQ-Net jointly estimates multiple longitudinal treatment policies via a shared policy encoder and kernel mean embeddings to constrain second-order bias after LTMLE correction.
- SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
  SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance by up to 45.6% and cutting TFLOPs by up to 22x on 25 Minari tasks.
- The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
  An identification theorem shows that a randomized experiment and a simulator together recover causal model values from confounded logs, with the logs used only afterward to reduce estimation error.
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
  Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
- ALM-MTA: Front-Door Causal Multi-Touch Attribution Method for Creator-Ecosystem Optimization
  ALM-MTA uses front-door causal inference with an adversarially trained mediator and contrastive learning to improve multi-touch attribution, reporting gains in DAU, creator activity, exposure efficiency, AUUC, and upl...
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
  RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.
- CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs
  CoFi-PGMA derives a unified counterfactual policy gradient objective based on marginal contribution to correct filtered feedback for both routing and collaborative multi-agent LLM training.