pith. machine review for the scientific record.

arxiv: 1806.06920 · v1 · submitted 2018-06-14 · 💻 cs.LG · cs.AI · cs.IT · cs.RO · math.IT · stat.ML

Recognition: unknown

Maximum a Posteriori Policy Optimisation

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.AI · cs.IT · cs.RO · math.IT · stat.ML
keywords existing · learning · maximum · methods · optimisation · policy · reinforcement · achieving
original abstract

We introduce a new algorithm for reinforcement learning called Maximum a posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings while achieving similar or better final performance.
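The "coordinate ascent on a relative entropy objective" in the abstract alternates an E-step, which reweights actions sampled from the current policy by their exponentiated Q-values under a KL constraint, with an M-step that fits the policy to those weights by weighted maximum likelihood. A minimal sketch of the non-parametric E-step weighting, with hypothetical names (`mpo_e_step_weights`, `eta`) chosen for illustration and the temperature `eta` treated as given rather than solved from the KL bound:

```python
import numpy as np

def mpo_e_step_weights(q_values, eta):
    """Non-parametric E-step sketch: weight sampled actions by exp(Q / eta).

    q_values : Q(s, a_j) for actions a_j sampled from the current policy
               at a single state.
    eta      : temperature arising from the relative-entropy (KL)
               constraint on the E-step; here assumed fixed.

    Returns normalised weights that serve as targets for the M-step,
    which fits the policy by weighted maximum likelihood.
    """
    z = q_values / eta
    z = z - z.max()          # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()       # normalise to a distribution over samples

# Toy usage: higher-Q actions receive exponentially larger weight.
weights = mpo_e_step_weights(np.array([1.0, 2.0, 3.0]), eta=1.0)
```

Lowering `eta` sharpens the weights toward the best-scoring actions, while raising it keeps the reweighted distribution close to the current policy; in the full algorithm `eta` is obtained by solving a dual problem for the KL bound rather than fixed by hand.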

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

    cs.LG 2026-05 unverdicted novelty 7.0

    Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...

  2. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  3. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  4. Soft Actor-Critic Algorithms and Applications

    cs.LG 2018-12 unverdicted novelty 7.0

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  5. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  6. An adaptive variance estimator for relative sparsity

    stat.ME 2026-05 unverdicted novelty 6.0

    A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.

  7. Beyond Importance Sampling: Rejection-Gated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

  8. Behavior Regularized Offline Reinforcement Learning

    cs.LG 2019-11 unverdicted novelty 6.0

    Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

  9. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.