pith. machine review for the scientific record.

arxiv: 1611.02247 · v3 · submitted 2016-11-07 · 💻 cs.LG

Recognition: unknown

Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords methods, off-policy, policy, q-prop, gradient, control, deep, sample
0 comments
read the original abstract

Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of these methods. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample-efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE) and improves stability over deep deterministic policy gradient (DDPG), which are state-of-the-art on-policy and off-policy methods respectively, on OpenAI Gym's MuJoCo continuous control environments.
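The abstract describes the mechanism only at a high level, so here is a minimal, hypothetical NumPy sketch of the adaptive control-variate weighting it refers to (conservative vs. aggressive Q-Prop). The function names, the 1e-8 guard, and the toy data are illustrative assumptions, not the authors' implementation; in Q-Prop the control variate Ā_w(s, a) is the first-order Taylor expansion of the off-policy critic around the policy's mean action, and the analytic off-policy gradient term that compensates for subtracting it is omitted here.

```python
# Minimal sketch (assumptions, not the paper's code) of Q-Prop's
# control-variate adaptation, given per-sample estimates of:
#   adv_mc     -- Monte Carlo / GAE advantages  A_hat(s, a)
#   adv_taylor -- the critic's first-order Taylor term
#                 A_bar_w(s, a) = grad_a Q_w(s, a)|_{a = mu(s)} . (a - mu(s))
import numpy as np

def qprop_eta(adv_mc, adv_taylor, conservative=True):
    """Weight eta applied to the critic-based control variate.

    Conservative Q-Prop switches the control variate fully on (eta = 1)
    only when it is positively correlated with the Monte Carlo advantage,
    so it can only reduce variance; aggressive Q-Prop uses the
    variance-minimizing ratio Cov(A_hat, A_bar) / Var(A_bar).
    """
    c = np.cov(adv_mc, adv_taylor)       # 2x2 sample covariance matrix
    cov, var = c[0, 1], c[1, 1]
    if conservative:
        return 1.0 if cov > 0 else 0.0
    return float(cov / (var + 1e-8))     # guard against a degenerate critic

def qprop_residual_advantage(adv_mc, adv_taylor, eta):
    """Signal multiplying grad log pi in the on-policy gradient term.

    The expectation removed by subtracting eta * A_bar is restored
    analytically through the off-policy term involving grad_a Q_w at the
    policy mean action (not shown in this sketch).
    """
    return adv_mc - eta * adv_taylor

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    adv_taylor = rng.normal(size=512)
    adv_mc = adv_taylor + 0.5 * rng.normal(size=512)   # positively correlated
    eta = qprop_eta(adv_mc, adv_taylor, conservative=True)
    residual = qprop_residual_advantage(adv_mc, adv_taylor, eta)
    print(f"eta={eta:.2f}  Var(A_hat)={adv_mc.var():.3f}  "
          f"Var(residual)={residual.var():.3f}")       # residual variance drops
```

When the critic correlates well with the sampled advantages, the residual used in the score-function term has lower variance than the raw Monte Carlo estimate, which is the sample-efficiency gain the abstract claims.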

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  2. Soft Actor-Critic Algorithms and Applications

    cs.LG · 2018-12 · unverdicted · novelty 7.0

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  3. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    cs.LG · 2018-01 · accept · novelty 7.0

    Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

  4. Input-Side Variance Suppression under Non-Normal Transient Amplification in Continuous-Control Reinforcement Learning

    eess.SY · 2026-04 · unverdicted · novelty 6.0

    Non-normal transient amplification is an important contributor to closed-loop variance in RL, and input-side suppression can reduce downstream covariance without altering peak gain.

  5. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  6. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    cs.LG · 2020-05 · unverdicted · novelty 2.0

    Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.