Safe Approximate Dynamic Programming Via Kernelized Lipschitz Estimation

Ankush Chakrabarty; Devesh K. Jha; Gregery T. Buzzard; Kyriakos Vamvoudakis; Yebin Wang

arxiv: 1907.02151 · v1 · pith:CWFVY6BAnew · submitted 2019-07-03 · 📡 eess.SY · cs.LG· cs.SY· math.DS· math.OC

Safe Approximate Dynamic Programming Via Kernelized Lipschitz Estimation

Ankush Chakrabarty , Devesh K. Jha , Gregery T. Buzzard , Yebin Wang , Kyriakos Vamvoudakis This is my paper

Pith reviewed 2026-05-25 09:30 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SYmath.DSmath.OC

keywords approximate dynamic programmingkernelized Lipschitz estimationsemidefinite programmingsafe reinforcement learningexponential stabilityadmissible policiesuncertain systemsdiscrete-time dynamics

0 comments

The pith

Kernelized Lipschitz estimation and semidefinite programming compute admissible initial control policies for approximate dynamic programming that stabilize uncertain discrete-time systems with high probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to obtain safe initial policies for reinforcement learning in systems with unknown discrete-time dynamics. It combines kernelized Lipschitz estimation to bound system changes with semidefinite programming to find policies that are admissible with high probability. These policies support safe starting points for learning, enforce constraints, and ensure the closed-loop system reaches exponential stability. A reader would care because finding safe starting controllers is a key barrier in applying reinforcement learning to real uncertain systems.

Core claim

We employ kernelized Lipschitz estimation and semidefinite programming for computing admissible initial control policies with provably high probability. Such admissible controllers enable safe initialization and constraint enforcement while providing exponential stability of the equilibrium of the closed-loop system.

What carries the argument

kernelized Lipschitz estimation paired with semidefinite programming to generate admissible initial policies

If this is right

Admissible initial policies allow safe initialization of reinforcement learning algorithms.
Constraint enforcement is possible during the initial phase of learning.
The closed-loop system achieves exponential stability around the equilibrium.
Policies come with probabilistic guarantees of admissibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on benchmark control problems to measure the probability of success in practice.
It might extend to systems where the kernel choice affects the tightness of the bounds.
Similar ideas could apply to continuous-time systems if the estimation is adapted accordingly.

Load-bearing premise

The kernelized Lipschitz estimator gives bounds that are tight enough for the semidefinite program to find a policy stabilizing the actual unknown system.

What would settle it

Running the computed policy on the true system and observing that the state diverges or constraints are violated at a rate exceeding the claimed probability.

Figures

Figures reproduced from arXiv: 1907.02151 by Ankush Chakrabarty, Devesh K. Jha, Gregery T. Buzzard, Kyriakos Vamvoudakis, Yebin Wang.

**Figure 3.** Figure 3: Illustration of constrained-input policy-iteration based ADP with safe [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

read the original abstract

We develop a method for obtaining safe initial policies for reinforcement learning via approximate dynamic programming (ADP) techniques for uncertain systems evolving with discrete-time dynamics. We employ kernelized Lipschitz estimation and semidefinite programming for computing admissible initial control policies with provably high probability. Such admissible controllers enable safe initialization and constraint enforcement while providing exponential stability of the equilibrium of the closed-loop system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper combines kernelized Lipschitz estimation with SDP to generate safe initial policies for ADP on uncertain discrete-time systems, but the exponential stability claim depends on the estimator bounds being tight enough for the true dynamics.

read the letter

The main point is a method for safe initial policies in ADP for uncertain discrete-time dynamics. They use kernelized Lipschitz estimation and semidefinite programming to compute admissible controllers that work with high probability, giving exponential stability and constraint enforcement. What stands out as new is the specific integration of these tools to produce initial policies that are safe to start ADP from. It addresses the practical problem of finding admissible starting points for learning-based control on systems with uncertainty. The abstract frames it as enabling safe initialization while providing the stability guarantee. The approach seems reasonable on the surface for targeting that gap. It builds on known techniques in a way that could be useful for practitioners who need to initialize their ADP algorithms safely. The potential issue is whether the Lipschitz bounds from the kernel method are tight enough. The SDP uses those bounds to certify stability and constraints, but if the true system deviates more than the bounds allow, the policy may not stabilize the actual plant even if it satisfies the estimated conditions. This is the load-bearing assumption, and without seeing the error analysis or examples, it's difficult to tell how robust it is. The stress-test note highlights this, and from the abstract it does seem like the key step that needs careful justification. This is aimed at control theorists and RL researchers working on safe learning for dynamical systems. A reader in that area could find the method worth examining if the proofs hold up under scrutiny. I would send it for peer review to get the details checked, as it engages honestly with a real problem in the field.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a method for computing safe initial control policies for approximate dynamic programming on uncertain discrete-time systems. It combines kernelized Lipschitz estimation of the dynamics with semidefinite programming to produce admissible controllers that, with high probability, enforce state and input constraints while guaranteeing exponential stability of the closed-loop equilibrium.

Significance. If the high-probability stability and feasibility claims hold with explicit, non-vacuous bounds, the approach would address a practical barrier in safe RL initialization by supplying certifiably admissible starting policies rather than relying on post-hoc verification. The combination of nonparametric estimation with SDP-based Lyapunov synthesis is a natural direction for uncertain control systems.

major comments (2)

[main stability theorem (likely §4 or §5)] The central claim (abstract and main stability theorem) requires that the kernelized Lipschitz estimator produces bounds tight enough that any SDP-feasible policy on the estimated model remains exponentially stabilizing and constraint-satisfying on the true unknown dynamics. The manuscript must show explicitly how the estimation error radius is absorbed into the Lyapunov decrease margin; otherwise the high-probability guarantee does not transfer.
[probability bound derivation] The probability statement is invoked directly for both stability and constraint satisfaction. The derivation should clarify whether the failure probability is over the kernel estimator alone or also accounts for the SDP solver returning a policy whose margin is insufficient once the true Lipschitz constant is substituted.

minor comments (2)

[preliminaries / method section] Notation for the kernelized estimator (e.g., the definition of the Lipschitz bound function) should be introduced earlier and used consistently when stating the SDP constraints.
[abstract] The abstract states 'provably high probability' without quantifying the probability or the sample complexity; a brief statement of the dependence on number of samples would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments help us strengthen the presentation of the stability and probability guarantees. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [main stability theorem (likely §4 or §5)] The central claim (abstract and main stability theorem) requires that the kernelized Lipschitz estimator produces bounds tight enough that any SDP-feasible policy on the estimated model remains exponentially stabilizing and constraint-satisfying on the true unknown dynamics. The manuscript must show explicitly how the estimation error radius is absorbed into the Lyapunov decrease margin; otherwise the high-probability guarantee does not transfer.

Authors: We agree that the absorption step should be stated more explicitly. In the proof of Theorem 5.1, the kernelized Lipschitz radius enters the robust LMI as an additive perturbation on the system matrices; this perturbation is subtracted from the nominal Lyapunov decrease margin before the SDP is solved, ensuring the closed-loop decrease remains negative for all realizations inside the high-probability ball. We will insert a dedicated remark immediately after the theorem statement that isolates this margin adjustment and references the relevant lines in the proof. revision: yes
Referee: [probability bound derivation] The probability statement is invoked directly for both stability and constraint satisfaction. The derivation should clarify whether the failure probability is over the kernel estimator alone or also accounts for the SDP solver returning a policy whose margin is insufficient once the true Lipschitz constant is substituted.

Authors: The probability bound applies exclusively to the kernelized Lipschitz estimator (via the concentration result in Lemma 3.2). The SDP is a deterministic feasibility problem once the estimated Lipschitz constant and uncertainty radius are fixed; any feasible solution of the SDP is guaranteed to satisfy the robust decrease and constraint conditions for all true dynamics inside that radius. Consequently, the overall failure probability equals the estimator failure probability. We will add one clarifying sentence to the statement of Theorem 5.1 and to the paragraph following Lemma 3.2. revision: yes

Circularity Check

0 steps flagged

No circularity: method constructs policies from independent kernel estimates and SDP

full rationale

The paper's chain starts from data-driven kernelized Lipschitz bounds on unknown discrete-time dynamics, feeds those bounds into an SDP that enforces a Lyapunov decrease and input constraints, and outputs a policy with high-probability stability guarantees on the true system. None of the target claims (exponential stability, constraint satisfaction) are presupposed in the estimator definition or recovered by fitting a parameter whose value is defined by the desired outcome. No load-bearing self-citation reduces the central guarantee to prior work by the same authors; the probabilistic certificate is derived from the SDP feasibility margin and concentration properties of the kernel estimator. The derivation is therefore self-contained and does not collapse to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on unstated modeling assumptions about the discrete-time dynamics and the validity of the kernel Lipschitz estimator that cannot be audited from the given text.

pith-pipeline@v0.9.0 · 5609 in / 1163 out tokens · 28413 ms · 2026-05-25T09:30:23.093512+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

[1]

Autonomy and machine intelligence in complex systems: A tutorial,

K. Vamvoudakis, P. Antsaklis, W. Dixon, J. Hespanha, F. Lewis, H. Modares, and B. Kiumarsi, “Autonomy and machine intelligence in complex systems: A tutorial,” in American Control Conference (ACC), 2015, July 2015, pp. 5062–5079

work page 2015
[2]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (2nd Edition). MIT press Cambridge, 2018, vol. 1

work page 2018
[3]

Vrabie, K

D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles , ser. IET control engineering series. Institution of Engineering and Technology, 2013

work page 2013
[4]

Mastering the game of go with deep neural networks and tree search,

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V . Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016. 13

work page 2016
[5]

Mastering the game of go without human knowledge,

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al. , “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017

work page 2017
[6]

The optimistic exploration value function,

M. Gregor and J. Spalek, “The optimistic exploration value function,” in 2015 IEEE 19th International Conference on Intelligent Engineering Systems (INES), Sep. 2015, pp. 119–123

work page 2015
[7]

R-max-a general polynomial time algorithm for near-optimal reinforcement learning,

R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002

work page 2002
[8]

High conﬁdence policy improvement,

P. Thomas, G. Theocharous, and M. Ghavamzadeh, “High conﬁdence policy improvement,” in International Conference on Machine Learning, 2015, pp. 2380–2388

work page 2015
[9]

Data-driven anytime algo- rithms for motion planning with safety guarantees,

D. K. Jha, M. Zhu, Y . Wang, and A. Ray, “Data-driven anytime algo- rithms for motion planning with safety guarantees,” in 2016 American Control Conference (ACC). IEEE, 2016, pp. 5716–5721

work page 2016
[10]

End-to-end training of deep visuomotor policies,

S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research , vol. 17, no. 1, pp. 1334–1373, 2016

work page 2016
[11]

Bench- marking deep reinforcement learning for continuous control,

Y . Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Bench- marking deep reinforcement learning for continuous control,” in Inter- national Conference on Machine Learning , 2016, pp. 1329–1338

work page 2016
[12]

Constrained policy optimization,

J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 22–31

work page 2017
[13]

A lyapunov-based approach to safe reinforcement learning,

Y . Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems , 2018, pp. 8092–8101

work page 2018
[14]

A comprehensive survey on safe reinforce- ment learning,

J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,” Journal of Machine Learning Research , vol. 16, no. 1, pp. 1437–1480, 2015

work page 2015
[15]

Online semi- parametric learning for inverse dynamics modeling,

D. Romeres, M. Zorzi, R. Camoriano, and A. Chiuso, “Online semi- parametric learning for inverse dynamics modeling,” in Proc. of the IEEE Conf. Dec. and Ctrl , 2016, pp. 2945–2950

work page 2016
[17]

Semiparametrical Gaussian Processes Learning of Forward Dynamical Models for Navigating in a Circular Maze

[Online]. Available: http://arxiv.org/abs/1809.04993

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,

F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,” Proc. of the IEEE Conf. Decision and Control , pp. 4661–4666, 2016

work page 2016
[19]

Zeilinger

L. Hewing and M. N. Zeilinger, “Cautious model predictive control using gaussian process regression,” CoRR, vol. abs/1705.10702, 2017. [Online]. Available: http://arxiv.org/abs/1705.10702

work page arXiv 2017
[20]

Optimal co-design of nonlinear control systems based on a modiﬁed policy iteration method,

Y . Jiang, Y . Wang, S. A. Bortoff, and Z.-P. Jiang, “Optimal co-design of nonlinear control systems based on a modiﬁed policy iteration method,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26, no. 2, pp. 409–414, 2015

work page 2015
[21]

Data-driven control of nonlinear systems: An on-line direct approach,

M. Tanaskovic, L. Fagiano, C. Novara, and M. Morari, “Data-driven control of nonlinear systems: An on-line direct approach,” Automatica, vol. 75, pp. 1–10, 2017

work page 2017
[22]

Direct data-driven control of constrained systems,

D. Piga, S. Formentin, and A. Bemporad, “Direct data-driven control of constrained systems,” IEEE Transactions on Control Systems Technol- ogy, vol. 26, no. 4, pp. 1422–1429, 2018

work page 2018
[23]

Optimal and autonomous control using reinforcement learning: A survey,

B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE transactions on neural networks and learning systems , vol. 29, no. 6, pp. 2042–2062, 2018

work page 2042
[24]

Automatic crosswind ﬂight of tethered wings for airborne wind energy: a direct data-driven approach,

L. Fagiano and C. Novara, “Automatic crosswind ﬂight of tethered wings for airborne wind energy: a direct data-driven approach,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 4927–4932, 2014

work page 2014
[25]

Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences

A. Chakrabarty, V . C. Dinh, M. J. Corless, A. E. Rundell, S. H. Zak, G. T. Buzzard et al. , “Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences.” IEEE Trans. Automat. Contr., vol. 62, no. 1, pp. 135–148, 2017

work page 2017
[26]

Conservative decision-making and inference in uncer- tain dynamical systems,

J.-P. Calliess, “Conservative decision-making and inference in uncer- tain dynamical systems,” Ph.D. dissertation, PhD thesis, University of Oxford, 2014

work page 2014
[27]

Learning-based nonlinear model predictive control,

D. Limon, J. Calliess, and J. M. Maciejowski, “Learning-based nonlinear model predictive control,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 7769– 7776, 2017

work page 2017
[28]

State and unknown input observers for nonlinear systems with bounded exogenous inputs,

A. Chakrabarty, M. J. Corless, G. T. Buzzard, S. H. ˙Zak, and A. E. Rundell, “State and unknown input observers for nonlinear systems with bounded exogenous inputs,” IEEE Transactions on Automatic Control , vol. 62, no. 11, pp. 5497–5510, 2017

work page 2017
[29]

Observer-based output feedback control design for systems with incrementally conic nonlinearities,

X. Xu, B. Ac ¸ıkmes ¸e, M. Corless, and H. Sartipizadeh, “Observer-based output feedback control design for systems with incrementally conic nonlinearities,” in Proc. of the American Control Conference (ACC) , 2018, pp. 1364–1369

work page 2018
[30]

Estimation of the Lipschitz constant of a function,

G. Wood and B. Zhang, “Estimation of the Lipschitz constant of a function,” Journal of Global Optimization , vol. 8, no. 1, pp. 91–103, 1996

work page 1996
[31]

On the convergence of an algorithm for ﬁnding a global extremum,

R. G. Strongin, “On the convergence of an algorithm for ﬁnding a global extremum,” Engineering Cybernetics, vol. 11, pp. 549–555, 1973

work page 1973
[32]

On using estimates of Lipschitz constants in global optimization,

P. Hansen, B. Jaumard, and S.-H. Lu, “On using estimates of Lipschitz constants in global optimization,” Journal of Optimization Theory and Applications, vol. 75, no. 1, pp. 195–200, 1992

work page 1992
[33]

Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,

F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012

work page 2012
[34]

Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,

K. G. Vamvoudakis, H. Modares, B. Kiumarsi, and F. L. Lewis, “Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,”IEEE Control Systems Magazine, vol. 37, no. 1, pp. 33–52, Feb 2017

work page 2017
[35]

Sparse bayesian learning and the relevance vector machine,

M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” Journal of machine learning research , vol. 1, no. Jun, pp. 211–244, 2001

work page 2001
[36]

Preference learning with gaussian pro- cesses,

W. Chu and Z. Ghahramani, “Preference learning with gaussian pro- cesses,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 137–144

work page 2005
[37]

Lipschitz optimisation for Lipschitz interpolation,

J.-P. Calliess, “Lipschitz optimisation for Lipschitz interpolation,” in American Control Conference (ACC), 2017 , 2017, pp. 3141–3146

work page 2017
[38]

A plug-in approach to support estimation,

A. Cuevas and R. Fraiman, “A plug-in approach to support estimation,” The Annals of Statistics , vol. 25, no. 6, pp. 2300–2312, 1997

work page 1997
[39]

H. K. Khalil, Nonlinear control. Pearson New York, 2015

work page 2015
[40]

Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,

D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. on Neural Networks and Learning Systems , vol. 25, no. 3, pp. 621–634, 2014

work page 2014
[41]

S. Boyd, L. El Ghaoui, E. Feron, and V . Balakrishnan, Linear matrix inequalities in system and control theory . SIAM, 1994, vol. 15

work page 1994
[42]

On the exponential stability of discrete-time systems with applications in observer design,

V . C. Aitken and H. M. Schwartz, “On the exponential stability of discrete-time systems with applications in observer design,” IEEE Trans- actions on Automatic Control , vol. 39, no. 9, pp. 1959–1962, 1994

work page 1959
[43]

Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,

M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005

work page 2005
[44]

Optimal control for discrete-time systems with actuator saturation,

Q. Lin, Q. Wei, and B. Zhao, “Optimal control for discrete-time systems with actuator saturation,” Optimal Control Applications and Methods , vol. 38, no. 6, pp. 1071–1080, 2017

work page 2017
[45]

Neural-network-based near-optimal control for a class of discrete-time afﬁne nonlinear systems with control constraints,

H. Zhang, Y . Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time afﬁne nonlinear systems with control constraints,” IEEE Transactions on Neural Networks , vol. 20, no. 9, pp. 1490–1503, 2009

work page 2009
[46]

Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,

H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1513–1525, 2013

work page 2013
[47]

Asymptoti- cally stable adaptiveoptimal control algorithm with saturating actuators and relaxed persistence of excitation,

K. G. Vamvoudakis, M. F. Miranda, and J. P. Hespanha, “Asymptoti- cally stable adaptiveoptimal control algorithm with saturating actuators and relaxed persistence of excitation,” IEEE Transactions on Neural Networks and Learning Systems , vol. 27, no. 11, pp. 2386–2398, Nov 2016

work page 2016
[48]

Q-learning,

C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning , vol. 8, no. 3, pp. 279–292, May 1992

work page 1992
[49]

Asynchronous stochastic approximation and q- learning,

J. N. Tsitsiklis, “Asynchronous stochastic approximation and q- learning,” Machine Learning, vol. 16, no. 3, pp. 185–202, Sep 1994

work page 1994
[50]

Q-learning and Pontryagin’s minimum princi- ple,

P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum princi- ple,” in Proc. of the 48th IEEE Conference on Decision and Control (CDC). IEEE, 2009, pp. 3598–3605

work page 2009
[51]

Q-learning for continuous-time linear systems: A model-free inﬁnite horizon optimal control approach,

K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free inﬁnite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14 – 20, 2017

work page 2017

[1] [1]

Autonomy and machine intelligence in complex systems: A tutorial,

K. Vamvoudakis, P. Antsaklis, W. Dixon, J. Hespanha, F. Lewis, H. Modares, and B. Kiumarsi, “Autonomy and machine intelligence in complex systems: A tutorial,” in American Control Conference (ACC), 2015, July 2015, pp. 5062–5079

work page 2015

[2] [2]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (2nd Edition). MIT press Cambridge, 2018, vol. 1

work page 2018

[3] [3]

Vrabie, K

D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles , ser. IET control engineering series. Institution of Engineering and Technology, 2013

work page 2013

[4] [4]

Mastering the game of go with deep neural networks and tree search,

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V . Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016. 13

work page 2016

[5] [5]

Mastering the game of go without human knowledge,

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al. , “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017

work page 2017

[6] [6]

The optimistic exploration value function,

M. Gregor and J. Spalek, “The optimistic exploration value function,” in 2015 IEEE 19th International Conference on Intelligent Engineering Systems (INES), Sep. 2015, pp. 119–123

work page 2015

[7] [7]

R-max-a general polynomial time algorithm for near-optimal reinforcement learning,

R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002

work page 2002

[8] [8]

High conﬁdence policy improvement,

P. Thomas, G. Theocharous, and M. Ghavamzadeh, “High conﬁdence policy improvement,” in International Conference on Machine Learning, 2015, pp. 2380–2388

work page 2015

[9] [9]

Data-driven anytime algo- rithms for motion planning with safety guarantees,

D. K. Jha, M. Zhu, Y . Wang, and A. Ray, “Data-driven anytime algo- rithms for motion planning with safety guarantees,” in 2016 American Control Conference (ACC). IEEE, 2016, pp. 5716–5721

work page 2016

[10] [10]

End-to-end training of deep visuomotor policies,

S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research , vol. 17, no. 1, pp. 1334–1373, 2016

work page 2016

[11] [11]

Bench- marking deep reinforcement learning for continuous control,

Y . Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Bench- marking deep reinforcement learning for continuous control,” in Inter- national Conference on Machine Learning , 2016, pp. 1329–1338

work page 2016

[12] [12]

Constrained policy optimization,

J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 22–31

work page 2017

[13] [13]

A lyapunov-based approach to safe reinforcement learning,

Y . Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems , 2018, pp. 8092–8101

work page 2018

[14] [14]

A comprehensive survey on safe reinforce- ment learning,

J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,” Journal of Machine Learning Research , vol. 16, no. 1, pp. 1437–1480, 2015

work page 2015

[15] [15]

Online semi- parametric learning for inverse dynamics modeling,

D. Romeres, M. Zorzi, R. Camoriano, and A. Chiuso, “Online semi- parametric learning for inverse dynamics modeling,” in Proc. of the IEEE Conf. Dec. and Ctrl , 2016, pp. 2945–2950

work page 2016

[16] [17]

Semiparametrical Gaussian Processes Learning of Forward Dynamical Models for Navigating in a Circular Maze

[Online]. Available: http://arxiv.org/abs/1809.04993

work page internal anchor Pith review Pith/arXiv arXiv

[17] [18]

Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,

F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,” Proc. of the IEEE Conf. Decision and Control , pp. 4661–4666, 2016

work page 2016

[18] [19]

Zeilinger

L. Hewing and M. N. Zeilinger, “Cautious model predictive control using gaussian process regression,” CoRR, vol. abs/1705.10702, 2017. [Online]. Available: http://arxiv.org/abs/1705.10702

work page arXiv 2017

[19] [20]

Optimal co-design of nonlinear control systems based on a modiﬁed policy iteration method,

Y . Jiang, Y . Wang, S. A. Bortoff, and Z.-P. Jiang, “Optimal co-design of nonlinear control systems based on a modiﬁed policy iteration method,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26, no. 2, pp. 409–414, 2015

work page 2015

[20] [21]

Data-driven control of nonlinear systems: An on-line direct approach,

M. Tanaskovic, L. Fagiano, C. Novara, and M. Morari, “Data-driven control of nonlinear systems: An on-line direct approach,” Automatica, vol. 75, pp. 1–10, 2017

work page 2017

[21] [22]

Direct data-driven control of constrained systems,

D. Piga, S. Formentin, and A. Bemporad, “Direct data-driven control of constrained systems,” IEEE Transactions on Control Systems Technol- ogy, vol. 26, no. 4, pp. 1422–1429, 2018

work page 2018

[22] [23]

Optimal and autonomous control using reinforcement learning: A survey,

B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE transactions on neural networks and learning systems , vol. 29, no. 6, pp. 2042–2062, 2018

work page 2042

[23] [24]

Automatic crosswind ﬂight of tethered wings for airborne wind energy: a direct data-driven approach,

L. Fagiano and C. Novara, “Automatic crosswind ﬂight of tethered wings for airborne wind energy: a direct data-driven approach,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 4927–4932, 2014

work page 2014

[24] [25]

Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences

A. Chakrabarty, V . C. Dinh, M. J. Corless, A. E. Rundell, S. H. Zak, G. T. Buzzard et al. , “Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences.” IEEE Trans. Automat. Contr., vol. 62, no. 1, pp. 135–148, 2017

work page 2017

[25] [26]

Conservative decision-making and inference in uncer- tain dynamical systems,

J.-P. Calliess, “Conservative decision-making and inference in uncer- tain dynamical systems,” Ph.D. dissertation, PhD thesis, University of Oxford, 2014

work page 2014

[26] [27]

Learning-based nonlinear model predictive control,

D. Limon, J. Calliess, and J. M. Maciejowski, “Learning-based nonlinear model predictive control,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 7769– 7776, 2017

work page 2017

[27] [28]

State and unknown input observers for nonlinear systems with bounded exogenous inputs,

A. Chakrabarty, M. J. Corless, G. T. Buzzard, S. H. ˙Zak, and A. E. Rundell, “State and unknown input observers for nonlinear systems with bounded exogenous inputs,” IEEE Transactions on Automatic Control , vol. 62, no. 11, pp. 5497–5510, 2017

work page 2017

[28] [29]

Observer-based output feedback control design for systems with incrementally conic nonlinearities,

X. Xu, B. Ac ¸ıkmes ¸e, M. Corless, and H. Sartipizadeh, “Observer-based output feedback control design for systems with incrementally conic nonlinearities,” in Proc. of the American Control Conference (ACC) , 2018, pp. 1364–1369

work page 2018

[29] [30]

Estimation of the Lipschitz constant of a function,

G. Wood and B. Zhang, “Estimation of the Lipschitz constant of a function,” Journal of Global Optimization , vol. 8, no. 1, pp. 91–103, 1996

work page 1996

[30] [31]

On the convergence of an algorithm for ﬁnding a global extremum,

R. G. Strongin, “On the convergence of an algorithm for ﬁnding a global extremum,” Engineering Cybernetics, vol. 11, pp. 549–555, 1973

work page 1973

[31] [32]

On using estimates of Lipschitz constants in global optimization,

P. Hansen, B. Jaumard, and S.-H. Lu, “On using estimates of Lipschitz constants in global optimization,” Journal of Optimization Theory and Applications, vol. 75, no. 1, pp. 195–200, 1992

work page 1992

[32] [33]

Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,

F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012

work page 2012

[33] [34]

Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,

K. G. Vamvoudakis, H. Modares, B. Kiumarsi, and F. L. Lewis, “Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,”IEEE Control Systems Magazine, vol. 37, no. 1, pp. 33–52, Feb 2017

work page 2017

[34] [35]

Sparse bayesian learning and the relevance vector machine,

M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” Journal of machine learning research , vol. 1, no. Jun, pp. 211–244, 2001

work page 2001

[35] [36]

Preference learning with gaussian pro- cesses,

W. Chu and Z. Ghahramani, “Preference learning with gaussian pro- cesses,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 137–144

work page 2005

[36] [37]

Lipschitz optimisation for Lipschitz interpolation,

J.-P. Calliess, “Lipschitz optimisation for Lipschitz interpolation,” in American Control Conference (ACC), 2017 , 2017, pp. 3141–3146

work page 2017

[37] [38]

A plug-in approach to support estimation,

A. Cuevas and R. Fraiman, “A plug-in approach to support estimation,” The Annals of Statistics , vol. 25, no. 6, pp. 2300–2312, 1997

work page 1997

[38] [39]

H. K. Khalil, Nonlinear control. Pearson New York, 2015

work page 2015

[39] [40]

Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,

D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. on Neural Networks and Learning Systems , vol. 25, no. 3, pp. 621–634, 2014

work page 2014

[40] [41]

S. Boyd, L. El Ghaoui, E. Feron, and V . Balakrishnan, Linear matrix inequalities in system and control theory . SIAM, 1994, vol. 15

work page 1994

[41] [42]

On the exponential stability of discrete-time systems with applications in observer design,

V . C. Aitken and H. M. Schwartz, “On the exponential stability of discrete-time systems with applications in observer design,” IEEE Trans- actions on Automatic Control , vol. 39, no. 9, pp. 1959–1962, 1994

work page 1959

[42] [43]

Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,

M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005

work page 2005

[43] [44]

Optimal control for discrete-time systems with actuator saturation,

Q. Lin, Q. Wei, and B. Zhao, “Optimal control for discrete-time systems with actuator saturation,” Optimal Control Applications and Methods , vol. 38, no. 6, pp. 1071–1080, 2017

work page 2017

[44] [45]

Neural-network-based near-optimal control for a class of discrete-time afﬁne nonlinear systems with control constraints,

H. Zhang, Y . Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time afﬁne nonlinear systems with control constraints,” IEEE Transactions on Neural Networks , vol. 20, no. 9, pp. 1490–1503, 2009

work page 2009

[45] [46]

Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,

H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1513–1525, 2013

work page 2013

[46] [47]

Asymptoti- cally stable adaptiveoptimal control algorithm with saturating actuators and relaxed persistence of excitation,

K. G. Vamvoudakis, M. F. Miranda, and J. P. Hespanha, “Asymptoti- cally stable adaptiveoptimal control algorithm with saturating actuators and relaxed persistence of excitation,” IEEE Transactions on Neural Networks and Learning Systems , vol. 27, no. 11, pp. 2386–2398, Nov 2016

work page 2016

[47] [48]

Q-learning,

C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning , vol. 8, no. 3, pp. 279–292, May 1992

work page 1992

[48] [49]

Asynchronous stochastic approximation and q- learning,

J. N. Tsitsiklis, “Asynchronous stochastic approximation and q- learning,” Machine Learning, vol. 16, no. 3, pp. 185–202, Sep 1994

work page 1994

[49] [50]

Q-learning and Pontryagin’s minimum princi- ple,

P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum princi- ple,” in Proc. of the 48th IEEE Conference on Decision and Control (CDC). IEEE, 2009, pp. 3598–3605

work page 2009

[50] [51]

Q-learning for continuous-time linear systems: A model-free inﬁnite horizon optimal control approach,

K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free inﬁnite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14 – 20, 2017

work page 2017