pith. sign in

arxiv: 1906.11369 · v1 · pith:OZNZHZRTnew · submitted 2019-06-26 · 📡 eess.SY · cs.LG· cs.SY· math.DS· math.OC

Approximate Dynamic Programming For Linear Systems with State and Input Constraints

Pith reviewed 2026-05-25 15:05 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SYmath.DSmath.OC
keywords approximate dynamic programminginvariant setsconstraint satisfactionlinear systemsreinforcement learningoptimal controllinear quadratic regulatorpolicy iteration
0
0 comments X

The pith

Invariant sets let approximate dynamic programming update policies for linear systems without ever violating state or input constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that policy updates inside an approximate dynamic programming loop can be made to preserve a precomputed invariant set, so every subsequent state and input stays inside the allowed limits for all future time. At the same time the sequence of policies is shown to approach the optimal linear quadratic regulator solution as the approximation error goes to zero. A data-driven algorithm is given that implements these updates from measured trajectories. A sympathetic reader would care because reinforcement learning in continuous spaces has lacked any general guarantee against unsafe behavior during learning, which has blocked its use wherever constraint violations carry real costs.

Core claim

By embedding invariant sets into the policy-update step of approximate dynamic programming, the closed-loop trajectory of a linear system is forced to remain inside the given state and input constraint set at every instant, while the value function and policy still converge asymptotically to those of the unconstrained linear quadratic regulator.

What carries the argument

Invariant sets, regions of the state space that remain feasible forever under the closed-loop dynamics and that are preserved by each ADP policy update.

If this is right

  • Every updated policy satisfies the original constraints at every future time.
  • The data-driven algorithm produces a sequence of feasible controllers without requiring an explicit model at runtime.
  • As the number of updates grows, the performance approaches that of the optimal unconstrained LQR controller.
  • The same invariance argument applies whenever the sets can be identified offline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same invariance argument could be checked on any system where polytopic or ellipsoidal invariant sets are already known.
  • One could test whether the data-driven version still works when the invariant set is only approximated rather than computed exactly.
  • The approach supplies a concrete way to blend offline set computation with online learning, which may connect to other safe-control architectures that pre-compute feasible regions.

Load-bearing premise

Invariant sets for the constrained linear system exist and can be computed so that every ADP policy update keeps the system inside them indefinitely.

What would settle it

A numerical run on a double-integrator plant with box constraints in which any state or input exceeds its limit at any time step during repeated policy updates, or in which the closed-loop cost stops approaching the known LQR optimum.

Figures

Figures reproduced from arXiv: 1906.11369 by Ankush Chakrabarty, Claus Danielson, Rien Quirynen, Weinan Gao.

Figure 1
Figure 1. Figure 1: Results of constrained ADP for 2-state dynamic system: [A] Sequence of invariant sets [PITH_FULL_IMAGE:figures/full_fig_p014_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Results of constrained ADP for 5-state, 2-input dynamic system: [A] State evolution [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
read the original abstract

Enforcing state and input constraints during reinforcement learning (RL) in continuous state spaces is an open but crucial problem which remains a roadblock to using RL in safety-critical applications. This paper leverages invariant sets to update control policies within an approximate dynamic programming (ADP) framework that guarantees constraint satisfaction for all time and converges to the optimal policy (in a linear quadratic regulator sense) asymptotically. An algorithm for implementing the proposed constrained ADP approach in a data-driven manner is provided. The potential of this formalism is demonstrated via numerical examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a constrained approximate dynamic programming (ADP) framework for linear systems that incorporates positively invariant sets to enforce state and input constraints at every time step during policy iteration. The approach is claimed to guarantee constraint satisfaction for all time while ensuring asymptotic convergence of the policy to the optimal infinite-horizon LQR solution; a data-driven implementation via sampled trajectories is provided along with numerical examples.

Significance. If the invariance preservation property holds under the data-driven updates, the result would meaningfully advance safe RL for constrained linear systems by combining set-theoretic invariance with ADP, offering a route to constraint satisfaction without relying on post-hoc projection or barrier functions. The data-driven algorithm and explicit convergence claim to LQR are positive features.

major comments (2)
  1. [§3 and Algorithm 1] §3 (Policy Update via Invariant Sets) and Algorithm 1: the data-driven least-squares / temporal-difference update for the feedback gain is performed without any explicit constraint or projection onto the (non-convex) set of matrices that preserve positive invariance of the chosen set X. Consequently the closed-loop map (A+BK)X may exit X after the first improvement step, undermining the “constraint satisfaction for all time” guarantee.
  2. [Theorem 2] Theorem 2 (Asymptotic Convergence): the proof sketch assumes that every iterate remains inside the invariance-preserving policy class, but the data-driven step described in §4 does not enforce this; the convergence claim therefore rests on an unverified invariance preservation property under approximation error.
minor comments (2)
  1. [§2] Notation for the invariant set X and the associated polyhedral representation is introduced without a dedicated preliminary subsection; readers must infer the exact invariance condition from later equations.
  2. [§5] Numerical examples section reports trajectories but omits quantitative metrics (e.g., maximum constraint violation over 100 Monte-Carlo runs or distance to LQR gain) that would allow direct verification of the claimed guarantees.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the manuscript. We address each major comment below and agree that revisions are needed to strengthen the invariance guarantees in the data-driven setting.

read point-by-point responses
  1. Referee: [§3 and Algorithm 1] §3 (Policy Update via Invariant Sets) and Algorithm 1: the data-driven least-squares / temporal-difference update for the feedback gain is performed without any explicit constraint or projection onto the (non-convex) set of matrices that preserve positive invariance of the chosen set X. Consequently the closed-loop map (A+BK)X may exit X after the first improvement step, undermining the “constraint satisfaction for all time” guarantee.

    Authors: We agree that the data-driven least-squares update in §4 and Algorithm 1 lacks an explicit projection or constraint to ensure the updated gain K preserves positive invariance of X. The current formulation does not guarantee that (A+BK)X remains inside X after each improvement step. We will revise the manuscript to incorporate a projection step onto the set of invariance-preserving gains (or provide a proof that the update preserves invariance under the stated assumptions) and update Algorithm 1 and the text in §3 and §4 accordingly. revision: yes

  2. Referee: [Theorem 2] Theorem 2 (Asymptotic Convergence): the proof sketch assumes that every iterate remains inside the invariance-preserving policy class, but the data-driven step described in §4 does not enforce this; the convergence claim therefore rests on an unverified invariance preservation property under approximation error.

    Authors: The referee is correct that the proof of Theorem 2 assumes all iterates remain in the invariance-preserving policy class, yet the data-driven implementation does not enforce this. We will revise the manuscript to either modify the update rule (via projection or additional constraints) to ensure the assumption holds or adjust the theorem and proof to rigorously account for cases where invariance is preserved only approximately. The revised version will include these changes. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation chain is self-contained against external LQR benchmark

full rationale

The provided abstract and context describe an ADP procedure that augments policy updates with invariant-set constraints to enforce safety while targeting the standard infinite-horizon LQR optimum asymptotically. No equations, fitted-parameter predictions, or self-citation chains are exhibited that would reduce any claimed result to its own inputs by construction. The convergence claim is to an externally defined LQR solution rather than a quantity defined inside the algorithm, and the data-driven implementation is presented as a practical realization rather than a tautological renaming. This is the normal case of a non-circular paper whose central guarantee rests on independent set-invariance theory and classical LQR optimality.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger constructed from abstract only; full paper may contain additional fitted parameters or assumptions in the policy update rules or invariant set computations.

axioms (2)
  • domain assumption The system is linear (time-invariant).
    Title and abstract explicitly restrict the setting to linear systems.
  • domain assumption Invariant sets for the state and input constraints can be found or approximated.
    The method relies on leveraging such sets to enforce perpetual constraint satisfaction.

pith-pipeline@v0.9.0 · 5626 in / 1268 out tokens · 34530 ms · 2026-05-25T15:05:25.586132+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Reinforcement learning and adaptive dynamic programming for feedback control,

    F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine , vol. 9, no. 3, pp. 32–50, 2009

  2. [2]

    Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,

    F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,”IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012

  3. [3]

    D. P. Bertsekas, Dynamic Programming and Optimal Control , 2nd ed. Athena Scientific, 2000

  4. [4]

    Optimal and Autonomous Control Using Reinforcement Learning: A Survey,

    B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and Autonomous Control Using Reinforcement Learning: A Survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042–2062, 2018

  5. [5]

    An iterative technique for the computation of the steady state gains for the discrete optimal regulator,

    G. Hewer, “An iterative technique for the computation of the steady state gains for the discrete optimal regulator,” IEEE Transactions on Automatic Control , vol. 16, no. 4, pp. 382–384, aug 1971. [Online]. Available: http://ieeexplore.ieee.org/document/1099755/

  6. [6]

    On an iterative technique for riccati equation computations,

    D. L. Kleinman, “On an iterative technique for riccati equation computations,” IEEE Trans. on Automatic Control, vol. 13, no. 1, pp. 114–115, 1968

  7. [7]

    R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2011

  8. [8]

    Adaptive linear quadratic control using policy iteration,

    S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proc. of the American Control Conference, vol. 3. Citeseer, 1994, pp. 3475–3475

  9. [9]

    Q-learning,

    C. J. Watkins and P. Dayan, “Q-learning,” Machine learning , vol. 8, no. 3-4, pp. 279–292, 1992

  10. [10]

    Neural networks for control and system identification,

    P. J. Werbos, “Neural networks for control and system identification,” in Proc. of the 28th IEEE Conf. Dec. and Control . IEEE, 1989, pp. 260–265. 15

  11. [11]

    Adaptive optimal control for continuous-time linear systems based on policy iteration,

    D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009

  12. [12]

    Reinforcement learning and distributed local model synthesis,

    T. Landelius, “Reinforcement learning and distributed local model synthesis,” Ph.D. disserta- tion, Link¨ oping University Electronic Press, 1997

  13. [13]

    Model-free Q-learning designs for linear discrete-time zero-sum games with application to H∞ control,

    A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free Q-learning designs for linear discrete-time zero-sum games with application to H∞ control,” Automatica, vol. 43, no. 3, pp. 473–481, 2007

  14. [14]

    Approximate optimal trajectory tracking for continuous-time nonlinear systems,

    R. Kamalapurkar, H. Dinh, S. Bhasin, and W. E. Dixon, “Approximate optimal trajectory tracking for continuous-time nonlinear systems,” Automatica, vol. 51, pp. 40–48, 2015

  15. [15]

    Adaptive dynamic programming and adaptive optimal output reg- ulation of linear systems,

    W. Gao and Z.-P. Jiang, “Adaptive dynamic programming and adaptive optimal output reg- ulation of linear systems,” IEEE Transactions on Automatic Control , vol. 61, no. 12, pp. 4164–4169, 2016

  16. [16]

    Data-driven adaptive dynamic programming for continuous- time fully cooperative games with partially constrained inputs,

    Q. Zhang, D. Zhao, and Y. Zhu, “Data-driven adaptive dynamic programming for continuous- time fully cooperative games with partially constrained inputs,” Neurocomputing, vol. 238, pp. 377–386, 2017

  17. [17]

    Discrete-time nonzero-sum games for multiplayer using policy-iteration-based adaptive dynamic programming algorithms,

    H. Zhang, H. Jiang, C. Luo, and G. Xiao, “Discrete-time nonzero-sum games for multiplayer using policy-iteration-based adaptive dynamic programming algorithms,” IEEE transactions on cybernetics, vol. 47, no. 10, pp. 3331–3340, 2017

  18. [18]

    Discrete-time nonlinear hjb solution using approximate dynamic programming: Convergence proof,

    A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear hjb solution using approximate dynamic programming: Convergence proof,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) , vol. 38, no. 4, pp. 943–949, 2008

  19. [19]

    Revisiting Approximate Dynamic Programming and its Convergence,

    A. Heydari, “Revisiting Approximate Dynamic Programming and its Convergence,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2733–2743, 2014

  20. [20]

    Robust constrained model predictive control using linear matrix inequalities,

    M. V. Kothare, V. Balakrishnan, and M. Morari, “Robust constrained model predictive control using linear matrix inequalities,” Automatica, vol. 32, no. 10, pp. 1361–1379, 1996

  21. [21]

    Sup- port vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences

    A. Chakrabarty, V. C. Dinh, M. J. Corless, A. E. Rundell, S. H. Zak, G. T. Buzzardet al., “Sup- port vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences.” IEEE Trans. Automat. Contr. , vol. 62, no. 1, pp. 135–148, 2017

  22. [22]

    Automated driving: Safe motion planning using positively invariant sets,

    K. Berntorp, A. Weiss, C. Danielson, I. V. Kolmanovsky, and S. Di Cairano, “Automated driving: Safe motion planning using positively invariant sets,” in Proc. of the 20th Int. Conf. on Intelligent Transportation Sys. (ITSC) . IEEE, 2017, pp. 1–6

  23. [23]

    Safe learning of regions of attraction for uncertain, nonlinear systems with gaussian processes,

    F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with gaussian processes,” Proc. of the IEEE Conf. Decision and Control, pp. 4661–4666, 2016

  24. [24]

    Safe reinforcement learning: Learning with supervision using a constraint-admissible set,

    Z. Li, U. Kalabi´ c, and T. Chu, “Safe reinforcement learning: Learning with supervision using a constraint-admissible set,” in Proc. of the American Control Conference (ACC) . IEEE, 2018, pp. 6390–6395. 16

  25. [25]

    Direct data-driven control of constrained systems,

    D. Piga, S. Formentin, and A. Bemporad, “Direct data-driven control of constrained systems,” IEEE Transactions on Control Systems Technology , vol. 26, no. 4, pp. 1422–1429, 2018

  26. [26]

    Ljung, System identification: Theory for the User

    L. Ljung, System identification: Theory for the User . Upper Saddle River, N.J.: Prentice Hall, 1999

  27. [27]

    Semidefinite optimization,

    M. J. Todd, “Semidefinite optimization,” Acta Numerica, vol. 10, p. 515560, 2001

  28. [28]

    Semidefinite programming,

    L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Review, vol. 38, no. 1, pp. 49–95, 1996

  29. [29]

    Alternating direction augmented lagrangian methods for semidefinite programming,

    Z. Wen, D. Goldfarb, and W. Yin, “Alternating direction augmented lagrangian methods for semidefinite programming,” Mathematical Programming Computation , vol. 2, no. 3, pp. 203–230, Dec 2010

  30. [30]

    Fast ADMM for semidefinite programs with chordal sparsity,

    Y. Zheng, G. Fantuzzi, A. Papachristodoulou, P. Goulart, and A. Wynn, “Fast ADMM for semidefinite programs with chordal sparsity,” in 2017 American Control Conference (ACC) , May 2017, pp. 3335–3340. 17