hub

Equivalence Between Policy Gradients and Soft Q-Learning

Schulman, J · 2017 · cs.LG · arXiv 1704.06440

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

open full Pith review browse 14 citing papers arXiv PDF

abstract

Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between $Q$-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) $Q$-learning is exactly equivalent to a policy gradient method. We also point out a connection between $Q$-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of $Q$-learning and policy gradients, and we find them to perform as well as (or slightly better than) the standard variants on the Atari benchmark. We also show that the equivalence holds in practical settings by constructing a $Q$-learning method that closely matches the learning dynamics of A3C without using a target network or $\epsilon$-greedy exploration schedule.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

cs.LG · 2026-02-02 · unverdicted · novelty 7.0

Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Planning in entropy-regularized Markov decision processes and games

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.

Soft Actor-Critic Algorithms and Applications

cs.LG · 2018-12-13 · unverdicted · novelty 7.0

SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

cs.LG · 2018-01-04 · accept · novelty 7.0

Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

cs.LG · 2025-10-09 · unverdicted · novelty 6.0

SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.

Guidance for twisted particle filter: a continuous-time perspective

stat.CO · 2024-09-04 · unverdicted · novelty 6.0

The Twisted-Path Particle Filter parameterizes twisting functions via neural networks and optimizes them against a path-measure KL divergence to improve continuous-time particle filtering.

Soft $Q(\lambda)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Soft Q(λ) unifies an n-step formulation of soft Q-learning with a novel Soft Tree Backup operator into an online off-policy eligibility trace framework for learning entropy-regularized value functions.

D2 Actor Critic: Diffusion Actor Meets Distributional Critic

cs.LG · 2025-10-03 · unverdicted · novelty 5.0

D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

cs.LG · 2026-04-16 · unverdicted · novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.

Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

cs.LG · 2025-12-17

citing papers explorer

Showing 14 of 14 citing papers.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation cs.LG · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum cs.LG · 2026-02-02 · unverdicted · none · ref 48 · internal anchor
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 58
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Interpreting Reinforcement Learning Agents with Susceptibilities cs.LG · 2026-05-08 · unverdicted · none · ref 46
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
Planning in entropy-regularized Markov decision processes and games cs.LG · 2026-04-21 · unverdicted · none · ref 23
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
Soft Actor-Critic Algorithms and Applications cs.LG · 2018-12-13 · unverdicted · none · ref 11
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor cs.LG · 2018-01-04 · accept · none · ref 26
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training cs.LG · 2025-10-09 · unverdicted · none · ref 14 · internal anchor
SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.
Guidance for twisted particle filter: a continuous-time perspective stat.CO · 2024-09-04 · unverdicted · none · ref 56 · internal anchor
The Twisted-Path Particle Filter parameterizes twisting functions via neural networks and optimizes them against a path-measure KL divergence to improve continuous-time particle filtering.
Soft $Q(\lambda)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces cs.LG · 2026-04-15 · unverdicted · none · ref 3
Soft Q(λ) unifies an n-step formulation of soft Q-learning with a novel Soft Tree Backup operator into an online off-policy eligibility trace framework for learning entropy-regularized value functions.
D2 Actor Critic: Diffusion Actor Meets Distributional Critic cs.LG · 2025-10-03 · unverdicted · none · ref 29 · internal anchor
D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning cs.LG · 2026-04-16 · unverdicted · none · ref 22
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
Targeted Exploration via Unified Entropy Control for Reinforcement Learning cs.AI · 2026-04-16 · unverdicted · none · ref 11
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction cs.LG · 2025-12-17 · unreviewed · ref 38 · internal anchor

Equivalence Between Policy Gradients and Soft Q-Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer