arxiv: 1806.06920 · v1 · pith:UAXZQNFHnew · submitted 2018-06-14 · 💻 cs.LG · cs.AI · cs.IT · cs.RO · math.IT · stat.ML

Maximum a Posteriori Policy Optimisation

classification 💻 cs.LG · cs.AI · cs.IT · cs.RO · math.IT · stat.ML
keywords existing · learning · maximum · methods · optimisation · policy · reinforcement · achieving

We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative entropy objective. We show that several existing methods can be related directly to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings while achieving similar or better final performance.

This paper has not been read by Pith yet.


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Not all uncertainty is alike: volatility, stochasticity, and exploration

    cs.AI 2026-05 unverdicted novelty 7.0

    Volatility promotes exploration and stochasticity suppresses it in Gaussian state-space bandits, shown by extending Gittins indices and deriving the CAUSE exploration bonus via control-as-inference.

  2. Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

    cs.LG 2026-05 unverdicted novelty 7.0

    RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

  3. Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

    cs.LG 2026-05 unverdicted novelty 7.0

    Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...

  4. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  5. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  6. Soft Actor-Critic Algorithms and Applications

    cs.LG 2018-12 unverdicted novelty 7.0

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  7. Dynamic Plasma Shape Control with Arbitrary Sensor Subsets

    cs.RO 2026-05 unverdicted novelty 6.0

    Reinforcement learning agent trained in DIII-D tokamak simulator achieves 2.01 cm mean shape error on held-out data, tracks dynamic targets, and remains functional under 30% random sensor dropout with direct transfer ...

  8. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  9. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.

  10. An adaptive variance estimator for relative sparsity

    stat.ME 2026-05 unverdicted novelty 6.0

    A new adaptive variance estimator for relative sparsity coefficients is introduced that fully utilizes the prior asymptotic normality theorem and incorporates variable selection effects.

  11. Beyond Importance Sampling: Rejection-Gated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

  12. Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

    cs.LG 2025-09 unverdicted novelty 6.0

    A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent ...

  13. Is Conditional Generative Modeling all you need for Decision-Making?

    cs.LG 2022-11 unverdicted novelty 6.0

    Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

  14. Solving math word problems with process- and outcome-based feedback

    cs.LG 2022-11 unverdicted novelty 6.0

    On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.

  15. Behavior Regularized Offline Reinforcement Learning

    cs.LG 2019-11 unverdicted novelty 6.0

    Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

  16. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

    cs.LG 2019-06 unverdicted novelty 6.0

    Develops Way Off-Policy batch RL algorithms with pre-trained model priors, KL-control, and dropout uncertainty estimates to learn implicit rewards from offline human dialog data, reporting live deployment gains over p...

  17. Disentangled Skill Embeddings for Reinforcement Learning

    cs.LG 2019-06 unverdicted novelty 6.0

    Disentangled Skill Embeddings (DSE) is a variational inference framework for multi-task RL using shared parameters and task-specific latent embeddings for generalization to unseen conditions and as skills in hierarchical RL.

  18. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.

  19. D2 Actor Critic: Diffusion Actor Meets Distributional Critic

    cs.LG 2025-10 unverdicted novelty 5.0

    D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.

  20. Failure Modes of Maximum Entropy RLHF

    cs.LG 2025-09 unverdicted novelty 5.0

    Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.