Lyapunov-based Safe Policy Optimization for Continuous Control

Aleksandra Faust; Edgar Duenez-Guzman; Mohammad Ghavamzadeh; Ofir Nachum; Yinlam Chow

arxiv 1901.10031 v2 pith:KZ3SOVG2 submitted 2019-01-28 cs.LG cs.AI stat.ML

classification cs.LG cs.AI stat.ML

keywords policy, algorithms, optimization, safe, action, agent, constrained, continuous

abstract

We study continuous-action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data-efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found at the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.
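
The action-projection idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single linearized constraint at the current state, of the form `grad . (a - baseline) <= eps`, in which case the Euclidean projection of a proposed action onto the feasible half-space has a closed form. The names `project_action`, `proposed`, `baseline`, `grad`, and `eps` are hypothetical and chosen for illustration only; this is not the authors' implementation, and the paper's actual constraints may differ in form.

```python
from typing import List, Sequence


def project_action(
    proposed: Sequence[float],
    baseline: Sequence[float],
    grad: Sequence[float],
    eps: float,
) -> List[float]:
    """Project `proposed` onto the half-space {a : grad . (a - baseline) <= eps}.

    Assumption (not from the source): a single linear constraint per state,
    where `grad` stands in for the gradient of a constraint critic with
    respect to the action and `baseline` is a reference (known-safe) action.
    """
    # Amount by which the proposed action violates the linear constraint.
    violation = sum(
        g * (p - b) for g, p, b in zip(grad, proposed, baseline)
    ) - eps
    grad_sq = sum(g * g for g in grad)

    # Already feasible, or the constraint does not depend on the action:
    # return the proposed action unchanged.
    if violation <= 0.0 or grad_sq == 0.0:
        return list(proposed)

    # Closed-form Euclidean projection: move along -grad just far enough
    # to land on the constraint boundary.
    step = violation / grad_sq
    return [p - step * g for p, g in zip(proposed, grad)]


if __name__ == "__main__":
    # The proposed action violates the constraint and is pulled back.
    print(project_action([1.0, 0.0], [0.0, 0.0], [1.0, 0.0], 0.5))
    # A feasible action passes through untouched.
    print(project_action([0.2, 0.3], [0.0, 0.0], [1.0, 0.0], 0.5))
```

Because the projection is a simple differentiable expression wherever the constraint is active, a layer of this kind could in principle be placed after a policy network's output, which is one way to read the abstract's remark about integration into an end-to-end training pipeline.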

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODACER) for Safe Reinforcement Learning in Optimal Control
eess.SY 2026-01 unverdicted novelty 7.0

SODACER uses fast and slow buffers with adaptive clustering for experience replay in safe RL, integrated with CBFs and Sophia optimizer to achieve faster convergence and safety on nonlinear systems like HPV transmission.

Robust Shielding for Safe Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

A sound and optimal shielding method for robust MDPs ensures LTL safety under worst-case transitions and combines with PAC sampling to produce minimally restrictive shields for learned models.

Iteratively Learning Muscle Memory for Legged Robots to Master Adaptive and High Precision Locomotion
cs.RO 2025-07 unverdicted novelty 6.0

Integrates iterative learning control with a torque library to enable high-precision adaptive locomotion on bipedal and quadrupedal robots, reducing tracking errors by up to 85% and achieving over 30x faster control rates.

Learning-based Model Predictive Control for Safe Exploration and Reinforcement Learning
eess.SY 2019-06 unverdicted novelty 6.0

Develops a learning-based MPC algorithm that uses confidence intervals on trajectories and terminal set constraints to guarantee safety throughout RL exploration and training.

Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control
cs.AI 2026-06 unverdicted novelty 5.0

Proposes a hierarchical MARL framework that enforces safety via a constraint manifold at the low level, with theoretical guarantees and stationary dynamics for stable training and generalization.

Safe-Support Q-Learning: Learning without Unsafe Exploration
cs.LG 2026-04 unverdicted novelty 5.0

Safe-Support Q-Learning trains Q-functions and policies in reinforcement learning without ever visiting unsafe states by constraining the behavior policy to a safe set and using KL-regularized Bellman targets in a two...

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
cs.AI 2026-04 unverdicted novelty 5.0

PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions
eess.SY 2025-08 unverdicted novelty 2.0

A literature review of safe RL using Lyapunov and barrier functions that identifies a shift to model-free methods since 2017, well-defined open problems per approach class, and high-dimensional scalability as the main...