Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Arthur Guez; Jean-Baptiste Lespiau; Lars Buesing; Nicolas Heess; Sebastien Racaniere; Theophane Weber; Yori Zwols

arxiv: 1811.06272 · v1 · pith:GPLEA2RInew · submitted 2018-11-15 · 💻 cs.LG · stat.ML

keywords experience · data · policy · search · algorithms · cf-gps · counterfactual · evaluation

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradient can be interpreted as counterfactual methods.
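
The counterfactual evaluation idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm or environment: the structural equation `step`, both policies, and the return function are hypothetical choices made only for illustration. The sketch assumes a fully observed, one-dimensional state whose transition is invertible in the noise, so the noise behind a logged episode can be recovered exactly. In a partially observed setting such as the POMDPs the abstract mentions, the noise would generally have to be inferred approximately.

```python
import random


def step(state, action, noise):
    # Hypothetical structural equation: the next state is a deterministic
    # function of the current state, the action, and exogenous noise.
    return state + action + noise


def rollout(policy, initial_state, noises):
    # Run a policy through the model with a fixed noise sequence.
    states, actions = [initial_state], []
    for noise in noises:
        action = policy(states[-1])
        actions.append(action)
        states.append(step(states[-1], action, noise))
    return states, actions


def infer_noise(states, actions):
    # Recover the noise from logged transitions. This is exact only
    # because the assumed structural equation is invertible in the noise.
    return [nxt - cur - act
            for cur, nxt, act in zip(states, states[1:], actions)]


def episode_return(states):
    # Hypothetical objective: stay close to the origin.
    return -sum(abs(s) for s in states[1:])


def behavior_policy(state):
    # Policy that generated the logged data: always move right.
    return 1


def target_policy(state):
    # Policy to evaluate counterfactually: move toward the origin.
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0


# 1. Log one real episode; its noise is hidden from the evaluator.
rng = random.Random(0)
true_noise = [rng.choice([-1, 0, 1]) for _ in range(5)]
logged_states, logged_actions = rollout(behavior_policy, 0, true_noise)

# 2. Infer the noise that explains the logged episode.
inferred_noise = infer_noise(logged_states, logged_actions)

# 3. Replay the same noise under actions that were not actually taken.
cf_states, cf_actions = rollout(target_policy, 0, inferred_noise)

print("logged return:        ", episode_return(logged_states))
print("counterfactual return:", episode_return(cf_states))
```

Because the replay reuses the noise inferred from the logged episode instead of sampling fresh noise, the counterfactual trajectory answers what would have happened in that specific episode under the target policy, which is the distinction the abstract draws against simulating experience de novo.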

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Robust Counterfactual Inference in Markov Decision Processes
cs.AI · 2025-02 · unverdicted · novelty 7.0

Non-parametric closed-form bounds on counterfactual MDP transitions across compatible causal models, supporting robust policy optimization under interval uncertainty.

Bayesian Inverse Transition Learning: Learning Dynamics From Near-Optimal Trajectories
cs.LG · 2024-11 · unverdicted · novelty 6.0

A Bayesian method uses near-optimality constraints from expert trajectories to estimate transition dynamics in offline model-based reinforcement learning.

Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark
cs.LG · 2026-05 · unverdicted · novelty 5.0

MTG-Causal-RL is a new benchmark for causal RL using Magic: The Gathering with an explicit SCM, five archetypes, and CGFA-PPO agent showing competitive win rates plus diagnostic metrics.