Safe exploration in Markov decision processes

Teodor Mihai Moldovan, Pieter Abbeel · 2012 · cs.LG · arXiv 1205.4810

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

In environments with uncertain dynamics, exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. The essence of ergodicity is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows for exploration algorithms that operate by simply favoring states that have rarely been visited before. For most physical systems this assumption is impractical, as the systems would break before any reasonable exploration has taken place; that is, most physical systems do not satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method is able to explore better than classical exploration methods.
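
To make the ergodicity notion in the abstract concrete, here is a minimal Python sketch that checks whether every state of a small, fully known MDP can be reached from every other state under some policy. It is an illustration only, not the paper's algorithm: the transition table layout `P[s][a][t]`, the helper names `reachable_from`, `is_ergodic`, and `make_ring`, and the toy three-state examples are all assumptions introduced here. It also assumes the dynamics are known exactly, whereas the paper addresses uncertain dynamics.

```python
def reachable_from(P, start):
    """Return the set of states reachable from `start` under some policy.

    P[s][a][t] is the (assumed known) probability of moving from state s
    to state t when taking action a. A state t counts as a successor of s
    if at least one action gives the transition positive probability.
    """
    seen = {start}
    stack = [start]
    while stack:
        s = stack.pop()
        for action_row in P[s]:
            for t, prob in enumerate(action_row):
                if prob > 0 and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return seen


def is_ergodic(P):
    """True if every state is reachable from every other state."""
    n_states = len(P)
    return all(len(reachable_from(P, s)) == n_states for s in range(n_states))


def make_ring(n_states):
    """Toy MDP: action 0 moves to the next state on a ring, action 1 stays put."""
    P = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for s in range(n_states):
        P[s][0][(s + 1) % n_states] = 1.0
        P[s][1][s] = 1.0
    return P


# A ring of three states: every state can reach every other one.
ring = make_ring(3)

# Same ring, but state 2 is a "broken" absorbing state with no way back.
broken = make_ring(3)
broken[2][0] = [0.0, 0.0, 1.0]

print(is_ergodic(ring))
print(is_ergodic(broken))
print(reachable_from(broken, 2))
```

In the second toy MDP, an exploration rule that simply favors rarely visited states would eventually enter state 2 and never recover, which is the failure mode the abstract attributes to physical systems.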

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Learning the Arrow of Time

cs.LG · 2019-07-02 · unverdicted · novelty 7.0

Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.

Concrete Problems in AI Safety

cs.AI · 2016-06-21 · accept · novelty 7.0

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

Sampling-Based Safe Reinforcement Learning

cs.LG · 2026-05-19 · conditional · novelty 6.0

SBSRL approximates worst-case safety optimization over uncertain dynamics via finite sampling, adds epistemic-uncertainty-constrained exploration, and supplies high-probability safety guarantees plus finite-time sample-complexity bounds for near-optimal policies.

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

cs.AI · 2026-04-14 · unverdicted · novelty 5.0

PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

citing papers explorer

Showing 4 of 4 citing papers.

Learning the Arrow of Time · cs.LG · 2019-07-02 · unverdicted · none · ref 4 · internal anchor
Concrete Problems in AI Safety · cs.AI · 2016-06-21 · accept · none · ref 105
Sampling-Based Safe Reinforcement Learning · cs.LG · 2026-05-19 · conditional · none · ref 27 · internal anchor
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production · cs.AI · 2026-04-14 · unverdicted · none · ref 59
