pith. machine review for the scientific record.

arXiv: 1901.10031 · v2 · pith:KZ3SOVG2new · submitted 2019-01-28 · 💻 cs.LG · cs.AI · stat.ML

Lyapunov-based Safe Policy Optimization for Continuous Control

classification 💻 cs.LG · cs.AI · stat.ML
keywords policy, algorithms, optimization, safe, action, agent, constrained, continuous

We study continuous-action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient, as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found at the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.
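To make the action-projection idea in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes a single linearized constraint at the current state of the form grad · (a − baseline_action) ≤ slack, for which the Euclidean projection has a closed form. The names `project_action`, `baseline_action`, `grad`, and `slack` are hypothetical. How the gradient and slack would be obtained from a learned Lyapunov function is not shown.

```python
import numpy as np


def project_action(action, baseline_action, grad, slack):
    """Euclidean projection of `action` onto the half-space
    {a : grad . (a - baseline_action) <= slack}.

    All argument names are illustrative assumptions, not the paper's API.
    """
    action = np.asarray(action, dtype=float)
    baseline_action = np.asarray(baseline_action, dtype=float)
    grad = np.asarray(grad, dtype=float)

    # Amount by which the proposed action violates the linear constraint.
    violation = grad @ (action - baseline_action) - slack
    norm_sq = grad @ grad

    # Already feasible, or the constraint does not depend on the action
    # (zero gradient): leave the proposed action untouched.
    if violation <= 0.0 or norm_sq == 0.0:
        return action

    # Move the minimum distance along `grad` needed to reach the boundary.
    return action - (violation / norm_sq) * grad


# Usage: an action proposed by an unconstrained policy is corrected only
# when it violates the constraint.
proposed = np.array([1.0, 1.0])
safe = project_action(proposed, baseline_action=np.zeros(2),
                      grad=np.array([1.0, 0.0]), slack=0.5)
```

Because the correction is a simple differentiable expression of the proposed action, a layer like this could in principle sit at the end of a policy network, which is one reading of the "end-to-end PG training pipeline" remark in the abstract.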

This paper has not been read by Pith yet.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODACER) for Safe Reinforcement Learning in Optimal Control

    eess.SY · 2026-01 · unverdicted · novelty 7.0

    SODACER uses fast and slow buffers with adaptive clustering for experience replay in safe RL, integrated with CBFs and the Sophia optimizer, to achieve faster convergence and safety on nonlinear systems such as HPV transmission.

  2. Safe-Support Q-Learning: Learning without Unsafe Exploration

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    Safe-Support Q-Learning trains Q-functions and policies in reinforcement learning without ever visiting unsafe states by constraining the behavior policy to a safe set and using KL-regularized Bellman targets in a two...

  3. Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.