RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
hub
Equivalence Between Policy Gradients and Soft Q-Learning
14 Pith papers cite this work. Polarity classification is still indexing.
abstract
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between $Q$-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) $Q$-learning is exactly equivalent to a policy gradient method. We also point out a connection between $Q$-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of $Q$-learning and policy gradients, and we find them to perform as well as (or slightly better than) the standard variants on the Atari benchmark. We also show that the equivalence holds in practical settings by constructing a $Q$-learning method that closely matches the learning dynamics of A3C without using a target network or $\epsilon$-greedy exploration schedule.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.
The Twisted-Path Particle Filter parameterizes twisting functions via neural networks and optimizes them against a path-measure KL divergence to improve continuous-time particle filtering.
Soft Q(λ) unifies an n-step formulation of soft Q-learning with a novel Soft Tree Backup operator into an online off-policy eligibility trace framework for learning entropy-regularized value functions.
D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.