Safe and Efficient Off-Policy Reinforcement Learning

R\'emi Munos , Tom Stepleton , Anna Harutyunyan , Marc G. Bellemare

Authors on Pith no claims yet

classification 💻 cs.LG cs.AIstat.ML

keywords off-policylambdaalgorithmalgorithmsbehaviourcollectedcontrolderive

read the original abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace($\lambda$), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q($\lambda$), which was an open problem since 1989. We illustrate the benefits of Retrace($\lambda$) on a standard suite of Atari 2600 games.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
cs.LG 2026-05 unverdicted novelty 7.0

Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Beyond Distribution Sharpening: The Importance of Task Rewards
cs.LG 2026-04 unverdicted novelty 5.0

Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.