Learning over Forward-Invariant Policy Classes: Reinforcement Learning without Safety Concerns

Chieh Tsai; Hossein Rastgoftar; Muhammad Junayed Hasan Zahed; Salim Hariri

arxiv: 2604.07875 · v1 · submitted 2026-04-09 · 📡 eess.SY · cs.SY

Pith reviewed 2026-05-10 17:26 UTC · model grok-4.3

keywords: safe reinforcement learning, forward invariance, action space design, policy class, quadcopter control, nonlinear systems

The pith

Reinforcement learning can optimize performance by restricting to actions that preserve forward invariance of a safe set by construction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a safe RL framework that embeds safety into the action representation rather than using penalties or runtime shields. It constructs a finite admissible action set where each discrete action is a stabilizing feedback law that maintains forward invariance of a prescribed safe state set. Consequently the RL agent optimizes policies only over this safe-by-construction class. Simulation on a quadcopter hover task under disturbance shows improved performance and switching efficiency while every evaluated policy remains safety-preserving. The formulation decouples safety assurance from performance optimization.
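To make the idea concrete, here is a minimal sketch on a toy problem rather than the paper's quadcopter model. It assumes a scalar plant x⁺ = x + dt·(−k·x + d) with bounded disturbance |d| ≤ d_max and safe set |x| ≤ r; every name and number below is hypothetical. Under these assumptions a gain k is admissible when d_max/r ≤ k ≤ 1/dt, because then |x⁺| ≤ (1 − k·dt)·r + dt·d_max ≤ r.

```python
import random

DT = 0.05      # sample time (assumed)
R = 1.0        # safe set is |x| <= R (assumed)
D_MAX = 0.5    # disturbance bound |d| <= D_MAX (assumed)


def is_admissible(k: float) -> bool:
    """One-step invariance test for x+ = (1 - k*DT)*x + DT*d on |x| <= R."""
    return D_MAX / R <= k <= 1.0 / DT


def build_action_set(candidates):
    """Keep only gains whose feedback law u = -k*x preserves the safe set."""
    return [k for k in candidates if is_admissible(k)]


def step(x, k, d):
    """Closed-loop update of the toy plant under gain k and disturbance d."""
    return x + DT * (-k * x + d)


def rollout(actions, policy, x0, steps, rng):
    """Simulate a policy that picks an index into the admissible gain list."""
    xs = [x0]
    for _ in range(steps):
        k = actions[policy(xs[-1], rng)]
        d = rng.uniform(-D_MAX, D_MAX)
        xs.append(step(xs[-1], k, d))
    return xs


# Candidate gains 0.1 (too weak) and 40.0 (too aggressive for DT) are rejected.
ACTIONS = build_action_set([0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 40.0])
```

Any policy, including a uniformly random one such as `lambda x, rng: rng.randrange(len(ACTIONS))`, can then be passed to `rollout` without a shield or penalty term, since the filtering happened when `ACTIONS` was built.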

Core claim

By casting the control problem as an MDP whose action space consists of a finite set of forward-invariance-preserving feedback laws, the resulting policy class is guaranteed to be safe without additional interventions, as validated on the quadcopter hover-regulation problem where all learned policies preserved the safe set.

What carries the argument

The finite admissible action set in which each discrete action corresponds to a stabilizing feedback law that preserves forward invariance of the safe state set.

If this is right

Every policy the RL agent considers or evaluates remains safety-preserving by construction.
Safety assurance is separated from the performance optimization step in the learning loop.
Closed-loop performance and switching efficiency improve on the quadcopter hover-regulation task under disturbance while safety is maintained.
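The separation described in these points can be sketched with a plain tabular Q-learning loop. This is an illustration under stated assumptions, not the paper's algorithm: the scalar plant, the gain list `GAINS`, the state binning, and the reward are all hypothetical. The point of the sketch is that the learner contains no safety logic at all; it only picks among gains that were already checked against the condition d_max/r ≤ k ≤ 1/dt.

```python
import random

DT, R, D_MAX = 0.05, 1.0, 0.5          # assumed toy-plant constants
GAINS = [0.5, 2.0, 10.0]               # each satisfies D_MAX/R <= k <= 1/DT
N_BINS = 10                            # number of discrete states over [-R, R]


def bin_of(x):
    """Map x in [-R, R] to a discrete state index."""
    return min(N_BINS - 1, int((x + R) / (2 * R) * N_BINS))


def train(episodes=200, horizon=100, alpha=0.1, gamma=0.95, eps=0.2, seed=0):
    """Epsilon-greedy Q-learning over the pre-verified gain set."""
    rng = random.Random(seed)
    q = [[0.0] * len(GAINS) for _ in range(N_BINS)]
    worst = 0.0  # largest |x| seen, exploration steps included
    for _ in range(episodes):
        x = rng.uniform(-R, R)
        for _ in range(horizon):
            s = bin_of(x)
            if rng.random() < eps:
                a = rng.randrange(len(GAINS))
            else:
                a = max(range(len(GAINS)), key=lambda i: q[s][i])
            d = rng.uniform(-D_MAX, D_MAX)
            x_next = x + DT * (-GAINS[a] * x + d)
            # Hypothetical reward: tracking error plus a control-effort term.
            reward = -x_next ** 2 - 0.01 * (GAINS[a] * x) ** 2
            s_next = bin_of(x_next)
            q[s][a] += alpha * (reward + gamma * max(q[s_next]) - q[s][a])
            worst = max(worst, abs(x_next))
            x = x_next
    return q, worst
```

Calling `q, worst = train()` returns the learned table together with the largest excursion observed during training; in this toy setting `worst` stays within `R` even during random exploration, which is the behaviour the bullet points describe.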

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same action-space construction could be applied to other nonlinear systems for which forward-invariant sets and stabilizing laws can be identified.
Real-world deployment risk may decrease because unsafe exploration is structurally prevented rather than corrected after the fact.
Extensions could examine how to generate the admissible action set automatically when analytic stabilizing laws are unavailable.

Load-bearing premise

A finite admissible action set can be constructed in which each discrete action corresponds to a stabilizing feedback law that preserves forward invariance of a prescribed safe state set.
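One standard way to state this premise precisely, offered here as an editorial assumption rather than the paper's own formulation, is a Nagumo-type tangency condition. The symbols f, κ_a, h, and 𝒲 are illustrative and not taken from the paper.

```latex
\begin{align}
  \dot{x} &= f\bigl(x, \kappa_a(x), w\bigr),
    \qquad w \in \mathcal{W}, \quad a \in \mathcal{A} = \{1, \dots, N\}, \\
  \mathcal{S} &= \{\, x : h(x) \ge 0 \,\}, \\
  \nabla h(x)^{\top} f\bigl(x, \kappa_a(x), w\bigr) &\ge 0
    \quad \text{for all } x \in \partial\mathcal{S},\;
    w \in \mathcal{W},\; a \in \mathcal{A}.
\end{align}
```

Under the usual regularity assumptions (a locally Lipschitz closed loop and a nonvanishing gradient of h on the boundary), the third line keeps trajectories inside 𝒮 for each fixed action. Because the same set 𝒮 appears for every a, the property would carry over to piecewise-constant switching among actions, which is what a policy over the finite set produces.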

What would settle it

An experiment showing that a policy selected from the admissible action set violates the boundaries of the safe state set under the modeled dynamics.
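Such an experiment could be organised as a search for counterexamples. The sketch below assumes a toy scalar plant x⁺ = x + dt·(−k·x + d) with safe set |x| ≤ r and disturbance bound |d| ≤ d_max; the function names and parameters are hypothetical. Because this one-step map is affine in x and d, the largest |x⁺| over the box is attained at a corner, so checking the four corners is enough for this particular model.

```python
from itertools import product


def one_step(x, k, d, dt):
    """Closed-loop update of the assumed toy plant."""
    return x + dt * (-k * x + d)


def find_violations(gains, r, d_max, dt, tol=1e-12):
    """Return (gain, x, d, x_next) tuples where a corner of the box
    [-r, r] x [-d_max, d_max] is mapped outside |x| <= r."""
    violations = []
    for k in gains:
        for x, d in product((-r, r), (-d_max, d_max)):
            x_next = one_step(x, k, d, dt)
            if abs(x_next) > r + tol:
                violations.append((k, x, d, x_next))
    return violations
```

An empty list for every gain in the action set is consistent with the safety claim for this toy model; a single returned tuple is a concrete counterexample of the kind described above. For a nonlinear plant such as the quadcopter, corner checking would not suffice and a different verification argument would be needed.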

Figures

Figures reproduced from arXiv: 2604.07875 by Chieh Tsai, Hossein Rastgoftar, Muhammad Junayed Hasan Zahed, Salim Hariri.

**Figure 2.** Representative rollout of the translational gains selected by the trained …

**Figure 4.** Physical evaluation of inertial position tracking. The quadcopter …

**Figure 5.** Physical evaluation of the Euler angles (ϕ, θ, ψ). Attitude excursions remain bounded during the maneuver and decay toward small values as the vehicle approaches steady hover.

**Figure 6.** Physical evaluation of the control inputs. The thrust second-derivative …

**Figure 7.** Physical evaluation of the per-step reward. The reward improves …

Original abstract

This paper proposes a safe reinforcement learning (RL) framework based on forward-invariance-induced action-space design. The control problem is cast as a Markov decision process, but instead of relying on runtime shielding or penalty-based constraints, safety is embedded directly into the action representation. Specifically, we construct a finite admissible action set in which each discrete action corresponds to a stabilizing feedback law that preserves forward invariance of a prescribed safe state set. Consequently, the RL agent optimizes policies over a safe-by-construction policy class. We validate the framework on a quadcopter hover-regulation problem under disturbance. Simulation results show that the learned policy improves closed-loop performance and switching efficiency, while all evaluated policies remain safety-preserving. The proposed formulation decouples safety assurance from performance optimization and provides a promising foundation for safe learning in nonlinear systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper embeds safety into RL by restricting the action space to a finite set of forward-invariant stabilizing controllers, which works cleanly in their quadcopter example.

The core move here is to define the RL action space as a small collection of feedback laws, each one already guaranteed to keep the state inside a prescribed safe set via forward invariance. Once that set is fixed, the agent just optimizes performance over it and never needs runtime shielding or penalties. That separation is the real contribution, and it shows up clearly in the quadcopter hover task under disturbance: the learned policies improve tracking and switching efficiency while every tested trajectory stays safe by construction. The simulations are straightforward and the results line up with the claim. The main limitation is that the abstract gives almost no information on how the admissible action set is actually built or how the invariance proofs are carried out. It is not obvious whether the construction is systematic for general nonlinear systems or whether it stays hand-tuned for the quadcopter. If the full paper only shows the one example without broader theorems or more cases, the generality claim will need work. This is worth sending to referees who know both control theory and RL. They can check the invariance arguments and see how far the method extends beyond the reported simulation. A reader already working on safe learning will find the idea useful even if the current evidence is still narrow.

Referee Report

2 major / 1 minor

Summary. The paper proposes a safe RL framework that embeds safety by construction into a finite discrete action space, where each action is a stabilizing feedback law preserving forward invariance of a prescribed safe set. The RL agent optimizes performance over this safe policy class without runtime shielding or penalties. The approach is demonstrated via simulation on a quadcopter hover-regulation task under disturbance, with reported improvements in closed-loop performance and switching efficiency while all evaluated policies remain safety-preserving.

Significance. If the action-set construction and invariance proofs hold, the framework would meaningfully advance safe RL by decoupling safety assurance from learning, enabling standard RL algorithms on nonlinear systems like quadcopters without additional interventions. This could reduce the complexity of safe learning in control applications, provided the method generalizes beyond the reported example.

major comments (2)

[Method and Validation] The central claim that safety is guaranteed by construction depends on the existence of a finite admissible action set in which each discrete action is a stabilizing feedback law preserving forward invariance. The manuscript provides no explicit construction of this set for the quadcopter, nor any proof or verification of the invariance property under the considered disturbances (see abstract and validation description).
[Results] The simulation results claim that 'all evaluated policies remain safety-preserving,' but without the specific feedback laws, the admissible set definition, or invariance analysis in the quadcopter example, it is impossible to assess whether the safety guarantee actually holds or if the reported outcomes are merely empirical.

minor comments (1)

[Abstract] The abstract refers to 'switching efficiency' as an improved metric but does not define or report how it is quantified or measured in the simulations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The comments highlight important aspects of clarity in the presentation of the safety guarantees. We address each major comment below and will revise the manuscript accordingly to strengthen the exposition without altering the core contributions.

Referee: [Method and Validation] The central claim that safety is guaranteed by construction depends on the existence of a finite admissible action set in which each discrete action is a stabilizing feedback law preserving forward invariance. The manuscript provides no explicit construction of this set for the quadcopter, nor any proof or verification of the invariance property under the considered disturbances (see abstract and validation description).

Authors: We agree that the quadcopter example would benefit from greater specificity. Section 3 of the manuscript presents the general procedure for constructing the finite admissible action set: given a safe set defined via a control barrier function or Lyapunov level set, each action is chosen as a feedback law (e.g., linear or nonlinear) whose closed-loop vector field renders the safe set forward invariant. For the quadcopter hover task, the safe set is the set of states with position and velocity errors bounded by prescribed thresholds, and the discrete actions are stabilizing controllers synthesized around the hover equilibrium that keep trajectories inside this set for bounded disturbances. In the revised manuscript we will add an explicit subsection (or appendix) that (i) states the precise safe-set parameters, (ii) gives the feedback gains or control law expressions used for each action, and (iii) sketches the invariance verification (via a Lyapunov or barrier-function argument) under the disturbance model employed in the simulations.
Revision: yes
Referee: [Results] The simulation results claim that 'all evaluated policies remain safety-preserving,' but without the specific feedback laws, the admissible set definition, or invariance analysis in the quadcopter example, it is impossible to assess whether the safety guarantee actually holds or if the reported outcomes are merely empirical.

Authors: The safety claim rests on the theoretical property that every policy composed of actions from the admissible set inherits forward invariance; the simulation results are therefore not purely empirical. Nevertheless, we recognize that readers cannot verify this without the concrete construction. The revision will supply the missing details listed in the response to the first comment, allowing independent assessment that the reported trajectories remain inside the safe set precisely because each constituent action is invariance-preserving. No change to the numerical results themselves is required.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core construction defines a finite admissible action set by mapping each discrete action to a stabilizing feedback law that preserves forward invariance of a prescribed safe set, drawing directly from standard control-theoretic results on forward invariance rather than from the RL objective or any fitted parameters. The subsequent RL optimization simply selects among these pre-defined safe actions to improve performance; no equation or claim equates a learned prediction back to the safety construction by definition, and no load-bearing step reduces to a self-citation chain, ansatz smuggled via prior work, or renaming of an empirical pattern. The decoupling of safety (by action-space design) from performance (by RL) is therefore logically independent and externally grounded, consistent with the absence of any self-referential reduction in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the ability to pre-design safe actions using forward invariance principles from control theory.

axioms (1)

domain assumption: Existence of a finite set of stabilizing feedback laws that preserve forward invariance of the safe set.
The method requires constructing such laws for the given system.

