A unified view of entropy-regularized Markov decision processes

2017 · cs.LG · arXiv 1705.07798

13 Pith papers cite this work. Polarity classification is still indexing.

abstract

We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup.
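
The abstract's remark that the regularized dual "closely resembles the Bellman optimality equations" can be illustrated with a small tabular sketch. The code below is not the paper's algorithm: it assumes a discounted setting (the paper works with average reward), a uniform reference policy, and a hypothetical function name `soft_value_iteration`. It only shows the general idea that entropy regularization replaces the hard maximum in the Bellman backup with a log-mean-exp "soft" maximum.

```python
import math


def soft_value_iteration(P, r, eta=1.0, gamma=0.9, iters=500):
    """Soft (entropy-regularized) value iteration on a tabular MDP.

    P[s][a][t]: probability of moving from state s to state t under action a.
    r[s][a]:    immediate reward.
    eta:        inverse regularization strength (larger eta = weaker regularization).
    Assumptions (not from the paper): discounted rewards and a uniform
    reference policy over actions.
    """
    n_states, n_actions = len(r), len(r[0])
    V = [0.0] * n_states

    def q_values(V):
        # Q(s, a) = r(s, a) + gamma * sum_t P(t | s, a) V(t)
        return [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
                 for a in range(n_actions)] for s in range(n_states)]

    for _ in range(iters):
        Q = q_values(V)
        V_new = []
        for s in range(n_states):
            # Soft maximum: (1/eta) * log( mean_a exp(eta * Q(s, a)) ),
            # shifted by the row maximum for numerical stability.
            m = max(Q[s])
            mean_exp = sum(math.exp(eta * (q - m)) for q in Q[s]) / n_actions
            V_new.append(m + math.log(mean_exp) / eta)
        V = V_new

    # The matching policy is a softmax of eta * Q over actions.
    Q = q_values(V)
    pi = []
    for s in range(n_states):
        m = max(Q[s])
        w = [math.exp(eta * (q - m)) for q in Q[s]]
        z = sum(w)
        pi.append([x / z for x in w])
    return V, pi
```

As `eta` grows, the soft maximum approaches the ordinary maximum, so the returned values approach those of standard value iteration; under the assumptions above the per-state gap is at most log(A) / (eta * (1 - gamma)) for A actions. For small `eta` the policy stays close to the uniform reference policy.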

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Generative Modeling by Value-Driven Transport

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

A control-theoretic linear program yields value-driven transport policies for generative modeling with straight paths and simulation-free training.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Sharp Spectral Thresholds for Logit Fixed Points

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

For finite-dimensional affine logit systems the sharp dimension-free stability threshold is β‖ΠWΠ‖_{T→T}<2, extending the certified regime beyond classical conservative bounds.

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

cs.LG · 2025-05-30 · unverdicted · novelty 7.0

Derives PAC-type upper bounds and matching lower bounds on sample complexity for value and policy learning under recursive entropic risk measures, with exponential dependence on |β|/(1-γ).

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and recovering generalizable rewards.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Planning in entropy-regularized Markov decision processes and games

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

SmoothCruiser achieves Õ(1/ε^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.

Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration

stat.ML · 2025-12-30 · unverdicted · novelty 6.0

Stationary reweighting of soft fitted Q-iteration yields finite-sample local linear convergence to the projected fixed point under approximate realizability and controlled weighting error, even without Bellman completeness.

Entropic Regularization of Markov Decision Processes

cs.LG · 2019-07-06 · unverdicted · novelty 6.0

Using alpha-divergences for entropic regularization in MDPs unifies actor-critic architectures via closed-form policy improvement and provides asymptotic analysis on standard RL problems.

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(√(T γ_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.

A note on convergence of Wasserstein policy optimization

cs.LG · 2026-05-21 · unverdicted · novelty 4.0

The note claims linear convergence of WPO in entropy-regularized MDPs by combining mean-field gradient flow analysis with a local log-Sobolev inequality under a regularity assumption.

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

cs.LG · 2026-04-19

citing papers explorer

Showing 13 of 13 citing papers.

Generative Modeling by Value-Driven Transport · cs.LG · 2026-05-21 · unverdicted · none · ref 38 · internal anchor
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation · cs.LG · 2026-05-18 · unverdicted · none · ref 131 · internal anchor
Sharp Spectral Thresholds for Logit Fixed Points · cs.LG · 2026-05-15 · unverdicted · none · ref 12 · internal anchor
Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model · cs.LG · 2025-05-30 · unverdicted · none · ref 38 · internal anchor
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates · cs.LG · 2026-05-10 · unverdicted · none · ref 49
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability · cs.LG · 2026-05-09 · unverdicted · none · ref 28
Planning in entropy-regularized Markov decision processes and games · cs.LG · 2026-04-21 · unverdicted · none · ref 20
Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration · stat.ML · 2025-12-30 · unverdicted · none · ref 11 · internal anchor
Entropic Regularization of Markov Decision Processes · cs.LG · 2019-07-06 · unverdicted · none · ref 10 · internal anchor
A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets · cs.LG · 2026-05-09 · unverdicted · none · ref 44
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles · cs.LG · 2026-05-08 · unverdicted · none · ref 110
A note on convergence of Wasserstein policy optimization · cs.LG · 2026-05-21 · unverdicted · none · ref 100 · internal anchor
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models · cs.LG · 2026-04-19 · unreviewed · ref 29
