Recognition: unknown
Maximum a Posteriori Policy Optimisation
Read the original abstract
We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings while achieving similar or better final performance.
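The abstract only names the relative-entropy, coordinate-ascent objective; as a rough illustration of that structure, the sketch below shows the non-parametric E-step reweighting commonly associated with MPO. The function name, the fixed temperature `eta`, and the sample count are illustrative assumptions, not the paper's implementation (which, for instance, solves for the temperature via a dual problem rather than fixing it by hand).

```python
import numpy as np

def mpo_e_step_weights(q_values, eta):
    """Sketch of MPO's non-parametric E-step.

    Given Q-value estimates for actions sampled from the current policy at
    one state, the KL-regularised E-step reweights those samples in
    proportion to exp(Q / eta); the M-step (not shown) then fits the
    parametric policy to the reweighted samples under a KL trust-region
    constraint. `eta` is a temperature; in the paper it is obtained by
    solving a convex dual problem, here it is just a fixed constant.
    """
    logits = q_values / eta
    logits -= logits.max()      # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()          # normalised per-sample weights

# Hypothetical usage: 16 actions sampled from the current policy at one state.
q_samples = np.random.randn(16)
weights = mpo_e_step_weights(q_samples, eta=1.0)
# M-step sketch: maximise sum_i weights[i] * log pi_theta(a_i | s)
# subject to a KL constraint on how far the new policy may move.
```

This is only meant to make the "coordinate ascent on a relative entropy objective" phrase concrete; the full algorithm alternates this E-step with a constrained M-step over the policy parameters.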
This paper has not been read by Pith yet.
Forward citations
Cited by 9 Pith papers
- Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
  Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...
- Mastering Diverse Domains through World Models
  DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
- A Generalist Agent
  Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
- Soft Actor-Critic Algorithms and Applications
  SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
- Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
  LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
- An adaptive variance estimator for relative sparsity
  A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.
- Beyond Importance Sampling: Rejection-Gated Policy Optimization
  RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
- Behavior Regularized Offline Reinforcement Learning
  Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
- Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
  Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
Discussion (0)