Safe Exploration in Continuous Action Spaces

Gal Dalal , Krishnamurthy Dvijotham , Matej Vecerik , Todd Hester , Cosmin Paduraru , Yuval Tassa

Authors on Pith no claims yet

classification 💻 cs.AI

keywords actionconstraintslearningneverpolicyactionsaddressagent

read the original abstract

We address the problem of deploying a reinforcement learning (RL) agent on a physical system such as a datacenter cooling unit or robot, where critical constraints must never be violated. We show how to exploit the typically smooth dynamics of these systems and enable RL algorithms to never violate constraints during learning. Our technique is to directly add to the policy a safety layer that analytically solves an action correction formulation per each state. The novelty of obtaining an elegant closed-form solution is attained due to a linearized model, learned on past trajectories consisting of arbitrary actions. This is to mimic the real-world circumstances where data logs were generated with a behavior policy that is implausible to describe mathematically; such cases render the known safety-aware off-policy methods inapplicable. We demonstrate the efficacy of our approach on new representative physics-based environments, and prevail where reward shaping fails by maintaining zero constraint violations.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
cs.RO 2026-04 unverdicted novelty 7.0

A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing
cs.LG 2026-05 unverdicted novelty 6.0

DD-SRad is a new RL constraint technique that adapts per-actuator radii dynamically to achieve zero violations and unconstrained-level task performance on heterogeneous robotic joints.
CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
cs.RO 2026-04 unverdicted novelty 6.0

CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.