pith. sign in

arxiv: 1907.02151 · v1 · pith:CWFVY6BAnew · submitted 2019-07-03 · 📡 eess.SY · cs.LG· cs.SY· math.DS· math.OC

Safe Approximate Dynamic Programming Via Kernelized Lipschitz Estimation

Pith reviewed 2026-05-25 09:30 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SYmath.DSmath.OC
keywords approximate dynamic programmingkernelized Lipschitz estimationsemidefinite programmingsafe reinforcement learningexponential stabilityadmissible policiesuncertain systemsdiscrete-time dynamics
0
0 comments X

The pith

Kernelized Lipschitz estimation and semidefinite programming compute admissible initial control policies for approximate dynamic programming that stabilize uncertain discrete-time systems with high probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to obtain safe initial policies for reinforcement learning in systems with unknown discrete-time dynamics. It combines kernelized Lipschitz estimation to bound system changes with semidefinite programming to find policies that are admissible with high probability. These policies support safe starting points for learning, enforce constraints, and ensure the closed-loop system reaches exponential stability. A reader would care because finding safe starting controllers is a key barrier in applying reinforcement learning to real uncertain systems.

Core claim

We employ kernelized Lipschitz estimation and semidefinite programming for computing admissible initial control policies with provably high probability. Such admissible controllers enable safe initialization and constraint enforcement while providing exponential stability of the equilibrium of the closed-loop system.

What carries the argument

kernelized Lipschitz estimation paired with semidefinite programming to generate admissible initial policies

If this is right

  • Admissible initial policies allow safe initialization of reinforcement learning algorithms.
  • Constraint enforcement is possible during the initial phase of learning.
  • The closed-loop system achieves exponential stability around the equilibrium.
  • Policies come with probabilistic guarantees of admissibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on benchmark control problems to measure the probability of success in practice.
  • It might extend to systems where the kernel choice affects the tightness of the bounds.
  • Similar ideas could apply to continuous-time systems if the estimation is adapted accordingly.

Load-bearing premise

The kernelized Lipschitz estimator gives bounds that are tight enough for the semidefinite program to find a policy stabilizing the actual unknown system.

What would settle it

Running the computed policy on the true system and observing that the state diverges or constraints are violated at a rate exceeding the claimed probability.

Figures

Figures reproduced from arXiv: 1907.02151 by Ankush Chakrabarty, Devesh K. Jha, Gregery T. Buzzard, Kyriakos Vamvoudakis, Yebin Wang.

Figure 2
Figure 2. Figure 2: Illustration of constrained-input value-iteration based ADP with safe [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of constrained-input policy-iteration based ADP with safe [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
read the original abstract

We develop a method for obtaining safe initial policies for reinforcement learning via approximate dynamic programming (ADP) techniques for uncertain systems evolving with discrete-time dynamics. We employ kernelized Lipschitz estimation and semidefinite programming for computing admissible initial control policies with provably high probability. Such admissible controllers enable safe initialization and constraint enforcement while providing exponential stability of the equilibrium of the closed-loop system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a method for computing safe initial control policies for approximate dynamic programming on uncertain discrete-time systems. It combines kernelized Lipschitz estimation of the dynamics with semidefinite programming to produce admissible controllers that, with high probability, enforce state and input constraints while guaranteeing exponential stability of the closed-loop equilibrium.

Significance. If the high-probability stability and feasibility claims hold with explicit, non-vacuous bounds, the approach would address a practical barrier in safe RL initialization by supplying certifiably admissible starting policies rather than relying on post-hoc verification. The combination of nonparametric estimation with SDP-based Lyapunov synthesis is a natural direction for uncertain control systems.

major comments (2)
  1. [main stability theorem (likely §4 or §5)] The central claim (abstract and main stability theorem) requires that the kernelized Lipschitz estimator produces bounds tight enough that any SDP-feasible policy on the estimated model remains exponentially stabilizing and constraint-satisfying on the true unknown dynamics. The manuscript must show explicitly how the estimation error radius is absorbed into the Lyapunov decrease margin; otherwise the high-probability guarantee does not transfer.
  2. [probability bound derivation] The probability statement is invoked directly for both stability and constraint satisfaction. The derivation should clarify whether the failure probability is over the kernel estimator alone or also accounts for the SDP solver returning a policy whose margin is insufficient once the true Lipschitz constant is substituted.
minor comments (2)
  1. [preliminaries / method section] Notation for the kernelized estimator (e.g., the definition of the Lipschitz bound function) should be introduced earlier and used consistently when stating the SDP constraints.
  2. [abstract] The abstract states 'provably high probability' without quantifying the probability or the sample complexity; a brief statement of the dependence on number of samples would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments help us strengthen the presentation of the stability and probability guarantees. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [main stability theorem (likely §4 or §5)] The central claim (abstract and main stability theorem) requires that the kernelized Lipschitz estimator produces bounds tight enough that any SDP-feasible policy on the estimated model remains exponentially stabilizing and constraint-satisfying on the true unknown dynamics. The manuscript must show explicitly how the estimation error radius is absorbed into the Lyapunov decrease margin; otherwise the high-probability guarantee does not transfer.

    Authors: We agree that the absorption step should be stated more explicitly. In the proof of Theorem 5.1, the kernelized Lipschitz radius enters the robust LMI as an additive perturbation on the system matrices; this perturbation is subtracted from the nominal Lyapunov decrease margin before the SDP is solved, ensuring the closed-loop decrease remains negative for all realizations inside the high-probability ball. We will insert a dedicated remark immediately after the theorem statement that isolates this margin adjustment and references the relevant lines in the proof. revision: yes

  2. Referee: [probability bound derivation] The probability statement is invoked directly for both stability and constraint satisfaction. The derivation should clarify whether the failure probability is over the kernel estimator alone or also accounts for the SDP solver returning a policy whose margin is insufficient once the true Lipschitz constant is substituted.

    Authors: The probability bound applies exclusively to the kernelized Lipschitz estimator (via the concentration result in Lemma 3.2). The SDP is a deterministic feasibility problem once the estimated Lipschitz constant and uncertainty radius are fixed; any feasible solution of the SDP is guaranteed to satisfy the robust decrease and constraint conditions for all true dynamics inside that radius. Consequently, the overall failure probability equals the estimator failure probability. We will add one clarifying sentence to the statement of Theorem 5.1 and to the paragraph following Lemma 3.2. revision: yes

Circularity Check

0 steps flagged

No circularity: method constructs policies from independent kernel estimates and SDP

full rationale

The paper's chain starts from data-driven kernelized Lipschitz bounds on unknown discrete-time dynamics, feeds those bounds into an SDP that enforces a Lyapunov decrease and input constraints, and outputs a policy with high-probability stability guarantees on the true system. None of the target claims (exponential stability, constraint satisfaction) are presupposed in the estimator definition or recovered by fitting a parameter whose value is defined by the desired outcome. No load-bearing self-citation reduces the central guarantee to prior work by the same authors; the probabilistic certificate is derived from the SDP feasibility margin and concentration properties of the kernel estimator. The derivation is therefore self-contained and does not collapse to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on unstated modeling assumptions about the discrete-time dynamics and the validity of the kernel Lipschitz estimator that cannot be audited from the given text.

pith-pipeline@v0.9.0 · 5609 in / 1163 out tokens · 28413 ms · 2026-05-25T09:30:23.093512+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. [1]

    Autonomy and machine intelligence in complex systems: A tutorial,

    K. Vamvoudakis, P. Antsaklis, W. Dixon, J. Hespanha, F. Lewis, H. Modares, and B. Kiumarsi, “Autonomy and machine intelligence in complex systems: A tutorial,” in American Control Conference (ACC), 2015, July 2015, pp. 5062–5079

  2. [2]

    R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (2nd Edition). MIT press Cambridge, 2018, vol. 1

  3. [3]

    Vrabie, K

    D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles , ser. IET control engineering series. Institution of Engineering and Technology, 2013

  4. [4]

    Mastering the game of go with deep neural networks and tree search,

    D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V . Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016. 13

  5. [5]

    Mastering the game of go without human knowledge,

    D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al. , “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017

  6. [6]

    The optimistic exploration value function,

    M. Gregor and J. Spalek, “The optimistic exploration value function,” in 2015 IEEE 19th International Conference on Intelligent Engineering Systems (INES), Sep. 2015, pp. 119–123

  7. [7]

    R-max-a general polynomial time algorithm for near-optimal reinforcement learning,

    R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002

  8. [8]

    High confidence policy improvement,

    P. Thomas, G. Theocharous, and M. Ghavamzadeh, “High confidence policy improvement,” in International Conference on Machine Learning, 2015, pp. 2380–2388

  9. [9]

    Data-driven anytime algo- rithms for motion planning with safety guarantees,

    D. K. Jha, M. Zhu, Y . Wang, and A. Ray, “Data-driven anytime algo- rithms for motion planning with safety guarantees,” in 2016 American Control Conference (ACC). IEEE, 2016, pp. 5716–5721

  10. [10]

    End-to-end training of deep visuomotor policies,

    S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research , vol. 17, no. 1, pp. 1334–1373, 2016

  11. [11]

    Bench- marking deep reinforcement learning for continuous control,

    Y . Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Bench- marking deep reinforcement learning for continuous control,” in Inter- national Conference on Machine Learning , 2016, pp. 1329–1338

  12. [12]

    Constrained policy optimization,

    J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 22–31

  13. [13]

    A lyapunov-based approach to safe reinforcement learning,

    Y . Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems , 2018, pp. 8092–8101

  14. [14]

    A comprehensive survey on safe reinforce- ment learning,

    J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,” Journal of Machine Learning Research , vol. 16, no. 1, pp. 1437–1480, 2015

  15. [15]

    Online semi- parametric learning for inverse dynamics modeling,

    D. Romeres, M. Zorzi, R. Camoriano, and A. Chiuso, “Online semi- parametric learning for inverse dynamics modeling,” in Proc. of the IEEE Conf. Dec. and Ctrl , 2016, pp. 2945–2950

  16. [17]
  17. [18]

    Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,

    F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes,” Proc. of the IEEE Conf. Decision and Control , pp. 4661–4666, 2016

  18. [19]

    Zeilinger

    L. Hewing and M. N. Zeilinger, “Cautious model predictive control using gaussian process regression,” CoRR, vol. abs/1705.10702, 2017. [Online]. Available: http://arxiv.org/abs/1705.10702

  19. [20]

    Optimal co-design of nonlinear control systems based on a modified policy iteration method,

    Y . Jiang, Y . Wang, S. A. Bortoff, and Z.-P. Jiang, “Optimal co-design of nonlinear control systems based on a modified policy iteration method,” IEEE Transactions on Neural Networks and Learning Systems , vol. 26, no. 2, pp. 409–414, 2015

  20. [21]

    Data-driven control of nonlinear systems: An on-line direct approach,

    M. Tanaskovic, L. Fagiano, C. Novara, and M. Morari, “Data-driven control of nonlinear systems: An on-line direct approach,” Automatica, vol. 75, pp. 1–10, 2017

  21. [22]

    Direct data-driven control of constrained systems,

    D. Piga, S. Formentin, and A. Bemporad, “Direct data-driven control of constrained systems,” IEEE Transactions on Control Systems Technol- ogy, vol. 26, no. 4, pp. 1422–1429, 2018

  22. [23]

    Optimal and autonomous control using reinforcement learning: A survey,

    B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE transactions on neural networks and learning systems , vol. 29, no. 6, pp. 2042–2062, 2018

  23. [24]

    Automatic crosswind flight of tethered wings for airborne wind energy: a direct data-driven approach,

    L. Fagiano and C. Novara, “Automatic crosswind flight of tethered wings for airborne wind energy: a direct data-driven approach,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 4927–4932, 2014

  24. [25]

    Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences

    A. Chakrabarty, V . C. Dinh, M. J. Corless, A. E. Rundell, S. H. Zak, G. T. Buzzard et al. , “Support vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences.” IEEE Trans. Automat. Contr., vol. 62, no. 1, pp. 135–148, 2017

  25. [26]

    Conservative decision-making and inference in uncer- tain dynamical systems,

    J.-P. Calliess, “Conservative decision-making and inference in uncer- tain dynamical systems,” Ph.D. dissertation, PhD thesis, University of Oxford, 2014

  26. [27]

    Learning-based nonlinear model predictive control,

    D. Limon, J. Calliess, and J. M. Maciejowski, “Learning-based nonlinear model predictive control,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 7769– 7776, 2017

  27. [28]

    State and unknown input observers for nonlinear systems with bounded exogenous inputs,

    A. Chakrabarty, M. J. Corless, G. T. Buzzard, S. H. ˙Zak, and A. E. Rundell, “State and unknown input observers for nonlinear systems with bounded exogenous inputs,” IEEE Transactions on Automatic Control , vol. 62, no. 11, pp. 5497–5510, 2017

  28. [29]

    Observer-based output feedback control design for systems with incrementally conic nonlinearities,

    X. Xu, B. Ac ¸ıkmes ¸e, M. Corless, and H. Sartipizadeh, “Observer-based output feedback control design for systems with incrementally conic nonlinearities,” in Proc. of the American Control Conference (ACC) , 2018, pp. 1364–1369

  29. [30]

    Estimation of the Lipschitz constant of a function,

    G. Wood and B. Zhang, “Estimation of the Lipschitz constant of a function,” Journal of Global Optimization , vol. 8, no. 1, pp. 91–103, 1996

  30. [31]

    On the convergence of an algorithm for finding a global extremum,

    R. G. Strongin, “On the convergence of an algorithm for finding a global extremum,” Engineering Cybernetics, vol. 11, pp. 549–555, 1973

  31. [32]

    On using estimates of Lipschitz constants in global optimization,

    P. Hansen, B. Jaumard, and S.-H. Lu, “On using estimates of Lipschitz constants in global optimization,” Journal of Optimization Theory and Applications, vol. 75, no. 1, pp. 195–200, 1992

  32. [33]

    Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,

    F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012

  33. [34]

    Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,

    K. G. Vamvoudakis, H. Modares, B. Kiumarsi, and F. L. Lewis, “Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,”IEEE Control Systems Magazine, vol. 37, no. 1, pp. 33–52, Feb 2017

  34. [35]

    Sparse bayesian learning and the relevance vector machine,

    M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” Journal of machine learning research , vol. 1, no. Jun, pp. 211–244, 2001

  35. [36]

    Preference learning with gaussian pro- cesses,

    W. Chu and Z. Ghahramani, “Preference learning with gaussian pro- cesses,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 137–144

  36. [37]

    Lipschitz optimisation for Lipschitz interpolation,

    J.-P. Calliess, “Lipschitz optimisation for Lipschitz interpolation,” in American Control Conference (ACC), 2017 , 2017, pp. 3141–3146

  37. [38]

    A plug-in approach to support estimation,

    A. Cuevas and R. Fraiman, “A plug-in approach to support estimation,” The Annals of Statistics , vol. 25, no. 6, pp. 2300–2312, 1997

  38. [39]

    H. K. Khalil, Nonlinear control. Pearson New York, 2015

  39. [40]

    Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,

    D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. on Neural Networks and Learning Systems , vol. 25, no. 3, pp. 621–634, 2014

  40. [41]

    S. Boyd, L. El Ghaoui, E. Feron, and V . Balakrishnan, Linear matrix inequalities in system and control theory . SIAM, 1994, vol. 15

  41. [42]

    On the exponential stability of discrete-time systems with applications in observer design,

    V . C. Aitken and H. M. Schwartz, “On the exponential stability of discrete-time systems with applications in observer design,” IEEE Trans- actions on Automatic Control , vol. 39, no. 9, pp. 1959–1962, 1994

  42. [43]

    Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,

    M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005

  43. [44]

    Optimal control for discrete-time systems with actuator saturation,

    Q. Lin, Q. Wei, and B. Zhao, “Optimal control for discrete-time systems with actuator saturation,” Optimal Control Applications and Methods , vol. 38, no. 6, pp. 1071–1080, 2017

  44. [45]

    Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,

    H. Zhang, Y . Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Transactions on Neural Networks , vol. 20, no. 9, pp. 1490–1503, 2009

  45. [46]

    Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,

    H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1513–1525, 2013

  46. [47]

    Asymptoti- cally stable adaptiveoptimal control algorithm with saturating actuators and relaxed persistence of excitation,

    K. G. Vamvoudakis, M. F. Miranda, and J. P. Hespanha, “Asymptoti- cally stable adaptiveoptimal control algorithm with saturating actuators and relaxed persistence of excitation,” IEEE Transactions on Neural Networks and Learning Systems , vol. 27, no. 11, pp. 2386–2398, Nov 2016

  47. [48]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning , vol. 8, no. 3, pp. 279–292, May 1992

  48. [49]

    Asynchronous stochastic approximation and q- learning,

    J. N. Tsitsiklis, “Asynchronous stochastic approximation and q- learning,” Machine Learning, vol. 16, no. 3, pp. 185–202, Sep 1994

  49. [50]

    Q-learning and Pontryagin’s minimum princi- ple,

    P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum princi- ple,” in Proc. of the 48th IEEE Conference on Decision and Control (CDC). IEEE, 2009, pp. 3598–3605

  50. [51]

    Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,

    K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14 – 20, 2017