A unified view of entropy-regularized Markov decision processes

13 Pith papers cite this work. Polarity classification is still indexing.

abstract

We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup.
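The abstract's remark that the dual problem closely resembles the Bellman optimality equations can be illustrated with a small sketch. This is not the paper's algorithm. It assumes one common "soft" form of the average-reward Bellman equation, in which the max over actions is replaced by a log-mean-exp under a uniform reference policy, and it solves that equation by relative value iteration. The function name `soft_relative_value_iteration`, the parameter `eta`, the choice of state 0 as reference state, and the toy MDP are all hypothetical.

```python
import math


def soft_relative_value_iteration(P, r, eta=1.0, iters=200):
    """Sketch of entropy-regularized relative value iteration.

    P[x][a][y] is the probability of moving from state x to state y
    under action a; r[x][a] is the immediate reward.
    Assumption: uniform reference policy, so the max over actions is
    replaced by (1/eta) * log(mean_a exp(eta * q[a])).
    Returns (rho, V): an average-reward estimate and relative values.
    """
    n_states = len(P)
    V = [0.0] * n_states
    rho = 0.0
    for _ in range(iters):
        new_V = []
        for x in range(n_states):
            n_actions = len(P[x])
            # One-step lookahead value of each action in state x.
            q = [
                r[x][a] + sum(P[x][a][y] * V[y] for y in range(n_states))
                for a in range(n_actions)
            ]
            # Numerically stable log-mean-exp (the "soft max").
            m = max(q)
            mean_exp = sum(math.exp(eta * (qa - m)) for qa in q) / n_actions
            new_V.append(m + math.log(mean_exp) / eta)
        # Use state 0 as the reference state: its backed-up value is the
        # current gain estimate, and subtracting it keeps V bounded.
        rho = new_V[0]
        V = [v - rho for v in new_V]
    return rho, V


# Toy single-state MDP with two actions paying rewards 0 and 1.
P_toy = [[[1.0], [1.0]]]
r_toy = [[0.0, 1.0]]
rho_greedy, _ = soft_relative_value_iteration(P_toy, r_toy, eta=50.0)
rho_uniform, _ = soft_relative_value_iteration(P_toy, r_toy, eta=0.01)
```

Under these assumptions, a large `eta` (weak regularization) drives the soft value toward the best action's reward, while a small `eta` (strong regularization) drives it toward the average reward under the uniform reference policy. Relative value iteration itself may fail to converge on periodic chains, so the sketch is only meant for small, well-mixing examples.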

citation-role summary

roles: method 1

citation-polarity summary

polarities: use method 1

representative citing papers

Generative Modeling by Value-Driven Transport

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

A control-theoretic linear program yields value-driven transport policies for generative modeling with straight paths and simulation-free training.

Sharp Spectral Thresholds for Logit Fixed Points

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

For finite-dimensional affine logit systems the sharp dimension-free stability threshold is β‖ΠWΠ‖_{T→T}<2, extending the certified regime beyond classical conservative bounds.

Entropic Regularization of Markov Decision Processes

cs.LG · 2019-07-06 · unverdicted · novelty 6.0

Using alpha-divergences for entropic regularization in MDPs unifies actor-critic architectures via closed-form policy improvement and provides asymptotic analysis on standard RL problems.

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.

A note on convergence of Wasserstein policy optimization

cs.LG · 2026-05-21 · unverdicted · novelty 4.0

The note claims linear convergence of WPO in entropy-regularized MDPs by combining mean-field gradient flow analysis with a local log-Sobolev inequality under a regularity assumption.
