Approximate Dynamic Programming For Linear Systems with State and Input Constraints
Pith reviewed 2026-05-25 15:05 UTC · model grok-4.3
The pith
Invariant sets let approximate dynamic programming update policies for linear systems without ever violating state or input constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By embedding invariant sets into the policy-update step of approximate dynamic programming, the closed-loop trajectory of a linear system is forced to remain inside the given state and input constraint set at every instant, while the value function and policy still converge asymptotically to those of the unconstrained linear quadratic regulator.
What carries the argument
Invariant sets, regions of the state space that remain feasible forever under the closed-loop dynamics and that are preserved by each ADP policy update.
If this is right
- Every updated policy satisfies the original constraints at every future time.
- The data-driven algorithm produces a sequence of feasible controllers without requiring an explicit model at runtime.
- As the number of updates grows, the performance approaches that of the optimal unconstrained LQR controller.
- The same invariance argument applies whenever the sets can be identified offline.
Where Pith is reading between the lines
- The same invariance argument could be checked on any system where polytopic or ellipsoidal invariant sets are already known.
- One could test whether the data-driven version still works when the invariant set is only approximated rather than computed exactly.
- The approach supplies a concrete way to blend offline set computation with online learning, which may connect to other safe-control architectures that pre-compute feasible regions.
Load-bearing premise
Invariant sets for the constrained linear system exist and can be computed so that every ADP policy update keeps the system inside them indefinitely.
What would settle it
A numerical run on a double-integrator plant with box constraints in which any state or input exceeds its limit at any time step during repeated policy updates, or in which the closed-loop cost stops approaching the known LQR optimum.
Figures
read the original abstract
Enforcing state and input constraints during reinforcement learning (RL) in continuous state spaces is an open but crucial problem which remains a roadblock to using RL in safety-critical applications. This paper leverages invariant sets to update control policies within an approximate dynamic programming (ADP) framework that guarantees constraint satisfaction for all time and converges to the optimal policy (in a linear quadratic regulator sense) asymptotically. An algorithm for implementing the proposed constrained ADP approach in a data-driven manner is provided. The potential of this formalism is demonstrated via numerical examples.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a constrained approximate dynamic programming (ADP) framework for linear systems that incorporates positively invariant sets to enforce state and input constraints at every time step during policy iteration. The approach is claimed to guarantee constraint satisfaction for all time while ensuring asymptotic convergence of the policy to the optimal infinite-horizon LQR solution; a data-driven implementation via sampled trajectories is provided along with numerical examples.
Significance. If the invariance preservation property holds under the data-driven updates, the result would meaningfully advance safe RL for constrained linear systems by combining set-theoretic invariance with ADP, offering a route to constraint satisfaction without relying on post-hoc projection or barrier functions. The data-driven algorithm and explicit convergence claim to LQR are positive features.
major comments (2)
- [§3 and Algorithm 1] §3 (Policy Update via Invariant Sets) and Algorithm 1: the data-driven least-squares / temporal-difference update for the feedback gain is performed without any explicit constraint or projection onto the (non-convex) set of matrices that preserve positive invariance of the chosen set X. Consequently the closed-loop map (A+BK)X may exit X after the first improvement step, undermining the “constraint satisfaction for all time” guarantee.
- [Theorem 2] Theorem 2 (Asymptotic Convergence): the proof sketch assumes that every iterate remains inside the invariance-preserving policy class, but the data-driven step described in §4 does not enforce this; the convergence claim therefore rests on an unverified invariance preservation property under approximation error.
minor comments (2)
- [§2] Notation for the invariant set X and the associated polyhedral representation is introduced without a dedicated preliminary subsection; readers must infer the exact invariance condition from later equations.
- [§5] Numerical examples section reports trajectories but omits quantitative metrics (e.g., maximum constraint violation over 100 Monte-Carlo runs or distance to LQR gain) that would allow direct verification of the claimed guarantees.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on the manuscript. We address each major comment below and agree that revisions are needed to strengthen the invariance guarantees in the data-driven setting.
read point-by-point responses
-
Referee: [§3 and Algorithm 1] §3 (Policy Update via Invariant Sets) and Algorithm 1: the data-driven least-squares / temporal-difference update for the feedback gain is performed without any explicit constraint or projection onto the (non-convex) set of matrices that preserve positive invariance of the chosen set X. Consequently the closed-loop map (A+BK)X may exit X after the first improvement step, undermining the “constraint satisfaction for all time” guarantee.
Authors: We agree that the data-driven least-squares update in §4 and Algorithm 1 lacks an explicit projection or constraint to ensure the updated gain K preserves positive invariance of X. The current formulation does not guarantee that (A+BK)X remains inside X after each improvement step. We will revise the manuscript to incorporate a projection step onto the set of invariance-preserving gains (or provide a proof that the update preserves invariance under the stated assumptions) and update Algorithm 1 and the text in §3 and §4 accordingly. revision: yes
-
Referee: [Theorem 2] Theorem 2 (Asymptotic Convergence): the proof sketch assumes that every iterate remains inside the invariance-preserving policy class, but the data-driven step described in §4 does not enforce this; the convergence claim therefore rests on an unverified invariance preservation property under approximation error.
Authors: The referee is correct that the proof of Theorem 2 assumes all iterates remain in the invariance-preserving policy class, yet the data-driven implementation does not enforce this. We will revise the manuscript to either modify the update rule (via projection or additional constraints) to ensure the assumption holds or adjust the theorem and proof to rigorously account for cases where invariance is preserved only approximately. The revised version will include these changes. revision: yes
Circularity Check
No circularity; derivation chain is self-contained against external LQR benchmark
full rationale
The provided abstract and context describe an ADP procedure that augments policy updates with invariant-set constraints to enforce safety while targeting the standard infinite-horizon LQR optimum asymptotically. No equations, fitted-parameter predictions, or self-citation chains are exhibited that would reduce any claimed result to its own inputs by construction. The convergence claim is to an externally defined LQR solution rather than a quantity defined inside the algorithm, and the data-driven implementation is presented as a practical realization rather than a tautological renaming. This is the normal case of a non-circular paper whose central guarantee rests on independent set-invariance theory and classical LQR optimality.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The system is linear (time-invariant).
- domain assumption Invariant sets for the state and input constraints can be found or approximated.
Reference graph
Works this paper leans on
-
[1]
Reinforcement learning and adaptive dynamic programming for feedback control,
F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits and Systems Magazine , vol. 9, no. 3, pp. 32–50, 2009
work page 2009
-
[2]
F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,”IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012
work page 2012
-
[3]
D. P. Bertsekas, Dynamic Programming and Optimal Control , 2nd ed. Athena Scientific, 2000
work page 2000
-
[4]
Optimal and Autonomous Control Using Reinforcement Learning: A Survey,
B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and Autonomous Control Using Reinforcement Learning: A Survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042–2062, 2018
work page 2042
-
[5]
G. Hewer, “An iterative technique for the computation of the steady state gains for the discrete optimal regulator,” IEEE Transactions on Automatic Control , vol. 16, no. 4, pp. 382–384, aug 1971. [Online]. Available: http://ieeexplore.ieee.org/document/1099755/
-
[6]
On an iterative technique for riccati equation computations,
D. L. Kleinman, “On an iterative technique for riccati equation computations,” IEEE Trans. on Automatic Control, vol. 13, no. 1, pp. 114–115, 1968
work page 1968
-
[7]
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2011
work page 2011
-
[8]
Adaptive linear quadratic control using policy iteration,
S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proc. of the American Control Conference, vol. 3. Citeseer, 1994, pp. 3475–3475
work page 1994
-
[9]
C. J. Watkins and P. Dayan, “Q-learning,” Machine learning , vol. 8, no. 3-4, pp. 279–292, 1992
work page 1992
-
[10]
Neural networks for control and system identification,
P. J. Werbos, “Neural networks for control and system identification,” in Proc. of the 28th IEEE Conf. Dec. and Control . IEEE, 1989, pp. 260–265. 15
work page 1989
-
[11]
Adaptive optimal control for continuous-time linear systems based on policy iteration,
D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009
work page 2009
-
[12]
Reinforcement learning and distributed local model synthesis,
T. Landelius, “Reinforcement learning and distributed local model synthesis,” Ph.D. disserta- tion, Link¨ oping University Electronic Press, 1997
work page 1997
-
[13]
A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free Q-learning designs for linear discrete-time zero-sum games with application to H∞ control,” Automatica, vol. 43, no. 3, pp. 473–481, 2007
work page 2007
-
[14]
Approximate optimal trajectory tracking for continuous-time nonlinear systems,
R. Kamalapurkar, H. Dinh, S. Bhasin, and W. E. Dixon, “Approximate optimal trajectory tracking for continuous-time nonlinear systems,” Automatica, vol. 51, pp. 40–48, 2015
work page 2015
-
[15]
Adaptive dynamic programming and adaptive optimal output reg- ulation of linear systems,
W. Gao and Z.-P. Jiang, “Adaptive dynamic programming and adaptive optimal output reg- ulation of linear systems,” IEEE Transactions on Automatic Control , vol. 61, no. 12, pp. 4164–4169, 2016
work page 2016
-
[16]
Q. Zhang, D. Zhao, and Y. Zhu, “Data-driven adaptive dynamic programming for continuous- time fully cooperative games with partially constrained inputs,” Neurocomputing, vol. 238, pp. 377–386, 2017
work page 2017
-
[17]
H. Zhang, H. Jiang, C. Luo, and G. Xiao, “Discrete-time nonzero-sum games for multiplayer using policy-iteration-based adaptive dynamic programming algorithms,” IEEE transactions on cybernetics, vol. 47, no. 10, pp. 3331–3340, 2017
work page 2017
-
[18]
Discrete-time nonlinear hjb solution using approximate dynamic programming: Convergence proof,
A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear hjb solution using approximate dynamic programming: Convergence proof,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) , vol. 38, no. 4, pp. 943–949, 2008
work page 2008
-
[19]
Revisiting Approximate Dynamic Programming and its Convergence,
A. Heydari, “Revisiting Approximate Dynamic Programming and its Convergence,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2733–2743, 2014
work page 2014
-
[20]
Robust constrained model predictive control using linear matrix inequalities,
M. V. Kothare, V. Balakrishnan, and M. Morari, “Robust constrained model predictive control using linear matrix inequalities,” Automatica, vol. 32, no. 10, pp. 1361–1379, 1996
work page 1996
-
[21]
A. Chakrabarty, V. C. Dinh, M. J. Corless, A. E. Rundell, S. H. Zak, G. T. Buzzardet al., “Sup- port vector machine informed explicit nonlinear model predictive control using low-discrepancy sequences.” IEEE Trans. Automat. Contr. , vol. 62, no. 1, pp. 135–148, 2017
work page 2017
-
[22]
Automated driving: Safe motion planning using positively invariant sets,
K. Berntorp, A. Weiss, C. Danielson, I. V. Kolmanovsky, and S. Di Cairano, “Automated driving: Safe motion planning using positively invariant sets,” in Proc. of the 20th Int. Conf. on Intelligent Transportation Sys. (ITSC) . IEEE, 2017, pp. 1–6
work page 2017
-
[23]
Safe learning of regions of attraction for uncertain, nonlinear systems with gaussian processes,
F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with gaussian processes,” Proc. of the IEEE Conf. Decision and Control, pp. 4661–4666, 2016
work page 2016
-
[24]
Safe reinforcement learning: Learning with supervision using a constraint-admissible set,
Z. Li, U. Kalabi´ c, and T. Chu, “Safe reinforcement learning: Learning with supervision using a constraint-admissible set,” in Proc. of the American Control Conference (ACC) . IEEE, 2018, pp. 6390–6395. 16
work page 2018
-
[25]
Direct data-driven control of constrained systems,
D. Piga, S. Formentin, and A. Bemporad, “Direct data-driven control of constrained systems,” IEEE Transactions on Control Systems Technology , vol. 26, no. 4, pp. 1422–1429, 2018
work page 2018
-
[26]
Ljung, System identification: Theory for the User
L. Ljung, System identification: Theory for the User . Upper Saddle River, N.J.: Prentice Hall, 1999
work page 1999
-
[27]
M. J. Todd, “Semidefinite optimization,” Acta Numerica, vol. 10, p. 515560, 2001
work page 2001
-
[28]
L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Review, vol. 38, no. 1, pp. 49–95, 1996
work page 1996
-
[29]
Alternating direction augmented lagrangian methods for semidefinite programming,
Z. Wen, D. Goldfarb, and W. Yin, “Alternating direction augmented lagrangian methods for semidefinite programming,” Mathematical Programming Computation , vol. 2, no. 3, pp. 203–230, Dec 2010
work page 2010
-
[30]
Fast ADMM for semidefinite programs with chordal sparsity,
Y. Zheng, G. Fantuzzi, A. Papachristodoulou, P. Goulart, and A. Wynn, “Fast ADMM for semidefinite programs with chordal sparsity,” in 2017 American Control Conference (ACC) , May 2017, pp. 3335–3340. 17
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.