arxiv: 2403.05367 · v3 · submitted 2024-03-08 · 📡 eess.SY · cs.SY

Stability-Certified On-Policy Data-Driven LQR via Recursive Learning and Policy Gradient

Pith reviewed 2026-05-24 02:41 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords data-driven LQR · on-policy learning · recursive least squares · policy gradient · Lyapunov stability · timescale separation · adaptive control

The pith

Relearn LQR combines recursive estimation and policy gradients to solve data-driven LQR while proving stability of the full closed-loop scheme.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an on-policy method for learning the optimal LQR controller when the linear dynamics are unknown. It interleaves recursive least squares updates for the system parameters with gradient steps on the feedback gain, all while the control is applied to the real plant. The central step is to represent the combined estimation, optimization, and plant evolution as one feedback interconnection of nonlinear systems. Lyapunov analysis then exploits averaging and timescale separation to establish that the trajectories remain bounded and converge to the optimal gain. This supplies formal certificates that prior data-driven LQR schemes lacked.
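
The interleaving described above can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm: it assumes a noise-free plant, the standard LQR policy-gradient expression for the convention u = Kx, and hypothetical values for the plant matrices (`A_true`, `B_true`), the forgetting factor `lam`, the step size `gamma`, and a warm-up period `warmup` before the gain starts moving. The warm-up and the stability guard on the estimated closed loop are conveniences of this sketch and are not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov


def lqr_gradient(K, A, B, Q, R):
    """Gradient of J(K) = trace(P_K) for u = K x, unit initial-state covariance."""
    A_cl = A + B @ K
    # P solves P = A_cl' P A_cl + Q + K' R K
    P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    # S solves S = A_cl S A_cl' + I
    S = solve_discrete_lyapunov(A_cl, np.eye(A.shape[0]))
    return 2.0 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ S


def run_sketch(steps=5000, gamma=1e-3, lam=0.99, warmup=50, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical plant, chosen Schur stable so that K = 0 is stabilizing.
    A_true = np.array([[0.9, 0.2], [0.0, 0.8]])
    B_true = np.array([[0.0], [1.0]])
    n, m = B_true.shape
    Q, R = np.eye(n), np.eye(m)

    theta = np.zeros((n + m, n))   # estimate of [A B]^T
    cov = 1e3 * np.eye(n + m)      # RLS covariance
    K = np.zeros((m, n))
    x = np.zeros(n)

    for t in range(steps):
        u = K @ x + rng.normal(size=m)            # on-policy input plus dither
        x_next = A_true @ x + B_true @ u          # real plant step
        phi = np.concatenate([x, u])              # regressor

        # Recursive least squares with exponential forgetting.
        gain = cov @ phi / (lam + phi @ cov @ phi)
        theta = theta + np.outer(gain, x_next - theta.T @ phi)
        cov = (cov - np.outer(gain, phi @ cov)) / lam
        cov = 0.5 * (cov + cov.T)                 # keep covariance symmetric

        # Gradient step on the gain, using the current model estimate.
        A_hat, B_hat = theta[:n].T, theta[n:].T
        rho = np.max(np.abs(np.linalg.eigvals(A_hat + B_hat @ K)))
        if t >= warmup and rho < 1.0:             # sketch-only safeguards
            K = K - gamma * lqr_gradient(K, A_hat, B_hat, Q, R)
        x = x_next

    # Model-based optimum, used only to check the result.
    P_star = solve_discrete_are(A_true, B_true, Q, R)
    K_star = -np.linalg.solve(R + B_true.T @ P_star @ B_true,
                              B_true.T @ P_star @ A_true)
    theta_true = np.hstack([A_true, B_true]).T
    return K, K_star, theta, theta_true
```

Calling `run_sketch()` returns the learned gain, the Riccati-based optimal gain, and the estimated and true parameter matrices, so the two pairs can be compared directly. The small step size relative to the estimator's convergence speed is a crude stand-in for the rate separation that the analysis relies on.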

Core claim

The Relearn LQR procedure integrates a recursive least squares estimator with a direct policy-gradient search. By casting the overall learning-control loop as a feedback-interconnected nonlinear dynamical system and invoking averaging together with timescale separation, a Lyapunov function is constructed that certifies asymptotic stability of the equilibrium consisting of the true parameters and the optimal LQR gain.
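
For orientation, the interconnection can be written schematically as follows. This is a hedged sketch: the symbols $K_t$, $\theta_t$, $w_t$, $G$, and $\gamma$ appear in the figure excerpts on this page, but the exact update equations of the paper are not reproduced here, so the RLS map and the reading of $G$ as the cost gradient with respect to the gain are assumptions.

```latex
% Schematic only; RLS(.) and the interpretation of G are assumptions.
\begin{align*}
  x_{t+1} &= A_\star x_t + B_\star u_t, \qquad u_t = K_t x_t + w_t
    && \text{(plant driven on-policy, with dither } w_t\text{)} \\
  \theta_{t+1} &= \mathrm{RLS}\bigl(\theta_t, x_t, u_t, x_{t+1}\bigr)
    && \text{(recursive estimate of } \theta_\star = (A_\star, B_\star)\text{)} \\
  K_{t+1} &= K_t - \gamma\, G(K_t, \theta_t)
    && \text{(gradient step evaluated at the current estimate)} \\
  G(K_\star, \theta_\star) &= 0
    && \text{(equilibrium: true parameters and optimal gain)}
\end{align*}
```

Read this way, the stability claim concerns the pair $(\theta_t, K_t)$ approaching $(\theta_\star, K_\star)$ while $x_t$ stays bounded, with $\gamma$ small enough that the gain moves slowly relative to the plant and the estimator.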

What carries the argument

The feedback-interconnected nonlinear dynamical system formed by the plant, recursive least-squares estimator, and policy-gradient update, analyzed via averaging and timescale separation.

If this is right

  • The scheme converges to the optimal LQR gain while the plant state remains bounded throughout adaptation.
  • In the reported simulations, the scheme handles both constant and slowly drifting plant parameters.
  • The same Lyapunov-plus-averaging argument applies to any on-policy combination of recursive estimation and gradient-based policy search that meets the rate-separation condition.
  • The method is designed to run on the actual plant while learning; the paper illustrates this with numerical simulations of an aircraft control problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modeling trick could be tried on other adaptive controllers whose updates are slower than the plant dynamics.
  • Relaxing the linear-plant assumption while keeping the timescale separation might yield stability results for data-driven nonlinear control.
  • The persistence-of-excitation requirement points to a practical test: inject sufficiently rich probing signals only until the estimator converges, then switch to pure regulation.

Load-bearing premise

The combined learning and control process can be represented as a nonlinear feedback interconnection to which averaging and timescale separation apply, which in turn requires sufficient separation of the adaptation rates and persistence of excitation.

What would settle it

A concrete linear system with known persistence of excitation where the adaptation rates are separated yet the closed-loop trajectories diverge or the gain fails to converge to the optimal LQR solution.

Figures

Figures reproduced from arXiv: 2403.05367 by Giuseppe Notarstefano, Guido Carnevale, Ivano Notarnicola, Lorenzo Sforni.

Figure 1: Schematic representation of the stability-certified on-policy LQR.
Figure 2: Representation of the concurrent learning and optimization scheme.
Figure 3: Block diagram describing system (30). In this reformulation, the effect of the exogenous/dithering signal $w_t$ has been embedded in the time dependency of $h$, $g$, and $f$. Finally, by using the definitions of $h$, $g$, and $f$ (cf. (31)) and the fact that $G(K^\star, \theta^\star) = 0$ since $K^\star$ is the solution to problem (8), we note that $h(0, t) = 0$, $g(0, 0, t) = 0$, $f(0, 0, t) = 0$ (34) for all $t \in \mathbb{N}$. 4.2 Averaged System Analysis N…
Figure 4: Block diagram of (36) with $\tilde{z}^{\mathrm{av}}_t = \operatorname{col}(\tilde{\theta}^{\mathrm{av}}_t, \tilde{K}^{\mathrm{av}}_t)$. The dynamics of $\tilde{\theta}^{\mathrm{av}}_t$ is trivially exponentially convergent to zero, while in the following we will formally show that the dynamics of $\tilde{K}^{\mathrm{av}}_t$ is input-to-state (ISS) exponentially stable (cf. [54]). For the sake of compactness, let us also introduce the (averaged) estimates $A^{\mathrm{av}}_t \in \mathbb{R}^{n \times n}$ and $B^{\mathrm{av}}_t \in \mathbb{R}^{n \times m}$ of the matrices $A$ and $B$, defined as …
Figure 5: (left) Evolution of the normalized cost error $|J(K_t, \theta^\star_t) - J^\star| / J^\star$. (right) Evolution of the normalized estimation error $\|\theta_t - \theta^\star\| / \|\theta^\star\|$.
Figure 6: State trajectory of the closed-loop system. The states $x_1, x_2, x_3, x_4$ correspond, respectively, to the forward velocity, the attack angle, the pitch rate, and the pitch angle. 5.2 Aircraft Control with Drifting Parameters. To better highlight the capabilities of our algorithm, we also consider the case where the system matrices $A^\star$, $B^\star$ slowly change over time. The new time-varying state and input matrices are d…
Figure 7: Comparison between $J(K_t, \theta^\star_t)$ and $J^\star_t$. 6 Conclusions. In this paper, we addressed infinite-horizon LQR problems with unknown state-input matrices. Specifically, we propose a procedure mixing the identification phase of the unknown matrices with the optimization of the feedback policy. We design an iterative algorithm combining a Recursive Least Squares (RLS) scheme (elaborating samples from the closed-loop system persistently excited by a ditheri…
Figure 8: Evolution of the normalized cost error $|J(K_t, \theta^\star_t) - J^\star_t| / J^\star_t$ (left). Evolution of the normalized estimation error $\|\theta_t - \theta^\star_t\| / \|\theta^\star_t\|$ (right).
Original abstract

In this paper, we investigate a data-driven framework to solve Linear Quadratic Regulator (LQR) problems when the dynamics is unknown, with the additional challenge of providing stability certificates for the overall learning and control scheme. Specifically, in the proposed on-policy learning framework, the control input is applied to the actual (unknown) linear system while iteratively optimized. We propose a learning and control procedure, termed Relearn LQR, that combines a recursive least squares method with a direct policy search based on the gradient method. The resulting scheme is analyzed by modeling it as a feedback-interconnected nonlinear dynamical system. A Lyapunov-based approach, exploiting averaging and timescale separation theories for nonlinear systems, allows us to provide formal stability guarantees for the whole interconnected scheme. The effectiveness of the proposed strategy is corroborated by numerical simulations, where Relearn LQR is deployed on an aircraft control problem, with both static and drifting parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Relearn LQR, an on-policy data-driven method for unknown LQR problems that combines recursive least squares (RLS) identification with direct policy gradient search. The control input is applied to the real system while the policy is iteratively optimized. The scheme is modeled as a feedback interconnection of nonlinear dynamical systems, and a composite Lyapunov function is constructed using averaging and two-time-scale separation arguments to certify stability of the overall closed-loop learning process. Effectiveness is illustrated on an aircraft control example with both constant and drifting parameters.

Significance. If the stability certificates hold under explicitly verifiable conditions, the result would be a useful contribution to safe on-policy learning for linear control. The modeling choice and invocation of averaging/two-time-scale theorems are standard tools that, when applicable, can deliver rigorous guarantees without requiring offline data collection. The on-policy setting and handling of drifting parameters are practically relevant.

major comments (2)
  1. [analysis paragraph] Abstract/analysis paragraph: the formal stability claim rests on the regressor remaining uniformly persistently exciting while the policy is updated on-policy, yet no explicit bounds on the RLS forgetting factor or policy-gradient step size are derived to guarantee that the separation of timescales remains valid for the chosen parameterization.
  2. [analysis paragraph] Abstract/analysis paragraph: the application of averaging and two-time-scale theorems requires uniform PE of the closed-loop regressor under the time-varying policy; the manuscript states the condition but does not verify or bound the minimum eigenvalue of the regressor covariance when the input is generated by the evolving policy estimate.
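
Both comments turn on the smallest eigenvalue of the regressor Gram matrix. A finite data record cannot prove uniform persistence of excitation, but it can be screened empirically. The sketch below uses a hypothetical helper, `min_windowed_excitation`, which is not from the paper: it slides a window of fixed length over the recorded regressors and reports the worst-case smallest eigenvalue of the windowed sum of outer products. The window length and any acceptance threshold are assumptions left to the user.

```python
import numpy as np


def min_windowed_excitation(Phi, window):
    """Worst-case smallest eigenvalue of sum_k phi_k phi_k^T over sliding windows.

    Phi has one regressor per row. A value bounded away from zero on the
    record is consistent with, but does not prove, persistence of excitation.
    """
    Phi = np.asarray(Phi, dtype=float)
    T, _ = Phi.shape
    if window > T:
        raise ValueError("window is longer than the data record")
    worst = np.inf
    for start in range(T - window + 1):
        block = Phi[start:start + window]
        gram = block.T @ block                       # windowed Gram matrix
        worst = min(worst, np.linalg.eigvalsh(gram)[0])  # smallest eigenvalue
    return worst


# Regressors cycling through the standard basis excite every direction.
rich = np.tile(np.eye(3), (4, 1))
# Regressors stuck on one direction leave two directions unexcited.
poor = np.tile([[1.0, 0.0, 0.0]], (12, 1))

rich_level = min_windowed_excitation(rich, window=3)
poor_level = min_windowed_excitation(poor, window=3)
```

In an on-policy run the rows of `Phi` would be the stacked state and input samples generated under the evolving gain, which is exactly the regime in which the referee asks for a bound.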

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the stability analysis. We address each major comment below and will revise the manuscript to clarify the assumptions.

Point-by-point responses
  1. Referee: Abstract/analysis paragraph: the formal stability claim rests on the regressor remaining uniformly persistently exciting while the policy is updated on-policy, yet no explicit bounds on the RLS forgetting factor or policy-gradient step size are derived to guarantee that the separation of timescales remains valid for the chosen parameterization.

    Authors: We acknowledge the observation. The Lyapunov analysis with averaging and two-time-scale separation is performed under the standing assumption of uniform persistent excitation (PE) of the regressor and sufficient separation between the RLS and policy-gradient timescales. Explicit, parameterization-independent bounds on the forgetting factor and step size are not derived, as they would require a detailed, system-specific characterization of the closed-loop regressor that lies outside the paper's scope. In the revised version we will add an explicit remark in the analysis section stating these assumptions and referencing standard adaptive-control practice for parameter tuning to maintain them. revision: yes

  2. Referee: Abstract/analysis paragraph: the application of averaging and two-time-scale theorems requires uniform PE of the closed-loop regressor under the time-varying policy; the manuscript states the condition but does not verify or bound the minimum eigenvalue of the regressor covariance when the input is generated by the evolving policy estimate.

    Authors: We agree that the manuscript states the uniform-PE requirement without providing an explicit lower bound on the minimum eigenvalue of the covariance matrix under the time-varying policy. Such a bound depends on the unknown plant, the initial policy, and the update rates; deriving a general expression is therefore not feasible without additional assumptions. The contribution centers on stability of the interconnected system once the PE condition holds. In revision we will insert a clarifying paragraph that reiterates the assumption, notes its practical enforcement via persistent excitation in the on-policy data, and points to related literature where analogous conditions are left as standing assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity: stability via external Lyapunov/averaging/timescale-separation theorems on modeled interconnection

full rationale

The derivation models the Relearn LQR scheme (RLS + policy gradient) as a feedback-interconnected nonlinear system and invokes standard external results (Lyapunov functions, averaging theory, two-time-scale separation) to certify stability under stated assumptions of uniform PE and rate separation. These theorems are independent mathematical tools whose hypotheses are not constructed from the paper's fitted quantities or outputs; the paper states the conditions but does not reduce the stability claim to a self-definition or self-citation chain. No self-definitional, fitted-input-as-prediction, or ansatz-smuggled steps appear in the abstract or reader's summary of the chain. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The approach implicitly relies on standard LQR quadratic-cost assumptions and the applicability of averaging/timescale-separation theorems, but none are enumerated or justified in the provided text.

pith-pipeline@v0.9.0 · 5699 in / 1047 out tokens · 30140 ms · 2026-05-24T02:41:06.663228+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

  1. [1]

    Adaptive linear quadratic control using policy iteration,

    S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in IEEE American Control Conference, vol. 3, pp. 3475–3479, 1994

  2. [2]

    A tour of reinforcement learning: The view from continuous control,

    B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019

  3. [3]

    On an iterative technique for Riccati equation computations,

    D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Transactions on Automatic Control, vol. 13, no. 1, pp. 114–115, 1968

  4. [4]

    Robust policy iteration for continuous-time linear quadratic regulation,

    B. Pang, T. Bian, and Z.-P. Jiang, “Robust policy iteration for continuous-time linear quadratic regulation,” IEEE Transactions on Automatic Control, vol. 67, no. 1, pp. 504–511, 2021

  5. [5]

    Efficient off-policy Q-learning for data-based discrete-time LQR problems,

    V. G. Lopez, M. Alsalti, and M. A. Müller, “Efficient off-policy Q-learning for data-based discrete-time LQR problems,” IEEE Transactions on Automatic Control, 2023

  6. [6]

    Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming,

    C. Qin, H. Zhang, and Y. Luo, “Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming,” International Journal of Control, vol. 87, no. 5, pp. 1000–1009, 2014

  7. [7]

    Finite-time analysis of approximate policy iteration for the linear quadratic regulator,

    K. Krauth, S. Tu, and B. Recht, “Finite-time analysis of approximate policy iteration for the linear quadratic regulator,” Advances in Neural Information Processing Systems, vol. 32, 2019

  8. [8]

    Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning,

    H. Modares, F. L. Lewis, and Z.-P. Jiang, “Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning,” IEEE Transactions on Cybernetics, vol. 46, no. 11, pp. 2401–2410, 2016

  9. [9]

    Data-driven finite-horizon optimal control for linear time-varying discrete-time systems,

    B. Pang, T. Bian, and Z.-P. Jiang, “Data-driven finite-horizon optimal control for linear time-varying discrete-time systems,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 861–866, IEEE, 2018

  10. [10]

    Q-learning for continuous-time linear systems: A data-driven implementation of the Kleinman algorithm,

    C. Possieri and M. Sassano, “Q-learning for continuous-time linear systems: A data-driven implementation of the Kleinman algorithm,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 10, pp. 6487–6497, 2022

  11. [11]

    Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design,

    T. Bian and Z.-P. Jiang, “Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design,” Automatica, vol. 71, pp. 348–360, 2016

  12. [12]

    How are policy gradient methods affected by the limits of control?,

    I. Ziemann, A. Tsiamis, H. Sandberg, and N. Matni, “How are policy gradient methods affected by the limits of control?,” in IEEE 61st Conference on Decision and Control (CDC), pp. 5992–5999, 2022

  13. [13]

    H∞ control of linear discrete-time systems: Off-policy reinforcement learning,

    B. Kiumarsi, F. L. Lewis, and Z.-P. Jiang, “H∞ control of linear discrete-time systems: Off-policy reinforcement learning,” Automatica, vol. 78, pp. 144–152, 2017

  14. [14]

    Formulas for data-driven control: Stabilization, optimality, and robustness,

    C. De Persis and P. Tesi, “Formulas for data-driven control: Stabilization, optimality, and robustness,” IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 909–924, 2019

  15. [15]

    Data informativity: a new perspective on data-driven analysis and control,

    H. J. Van Waarde, J. Eising, H. L. Trentelman, and M. K. Camlibel, “Data informativity: a new perspective on data-driven analysis and control,” IEEE Transactions on Automatic Control, vol. 65, no. 11, pp. 4753–4768, 2020

  16. [16]

    Data-driven linear quadratic regulation via semidefinite programming,

    M. Rotulo, C. De Persis, and P. Tesi, “Data-driven linear quadratic regulation via semidefinite programming,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 3995–4000, 2020

  17. [17]

    Online learning of data-driven controllers for unknown switched linear systems,

    M. Rotulo, C. De Persis, and P. Tesi, “Online learning of data-driven controllers for unknown switched linear systems,” Automatica, vol. 145, p. 110519, 2022

  18. [18]

    Low-complexity learning of linear quadratic regulators from noisy data,

    C. De Persis and P. Tesi, “Low-complexity learning of linear quadratic regulators from noisy data,” Automatica, vol. 128, p. 109548, 2021

  19. [19]

    On the certainty-equivalence approach to direct data-driven LQR design,

    F. Dörfler, P. Tesi, and C. De Persis, “On the certainty-equivalence approach to direct data-driven LQR design,” IEEE Transactions on Automatic Control, 2023

  20. [20]

    Robust data-driven state-feedback design,

    J. Berberich, A. Koch, C. W. Scherer, and F. Allgöwer, “Robust data-driven state-feedback design,” in IEEE American Control Conference (ACC), pp. 1532–1538, 2020

  21. [21]

    From noisy data to feedback controllers: Nonconservative design via a matrix s-lemma,

    H. J. van Waarde, M. K. Camlibel, and M. Mesbahi, “From noisy data to feedback controllers: Nonconservative design via a matrix s-lemma,” IEEE Transactions on Automatic Control, vol. 67, no. 1, pp. 162–175, 2020

  22. [22]

    Learning controllers for nonlinear systems from data,

    C. De Persis and P. Tesi, “Learning controllers for nonlinear systems from data,” Annual Reviews in Control, p. 100915, 2023

  23. [23]

    Safely learning to control the constrained linear quadratic regulator,

    S. Dean, S. Tu, N. Matni, and B. Recht, “Safely learning to control the constrained linear quadratic regulator,” in IEEE American Control Conference (ACC), pp. 5582–5588, 2019

  24. [24]

    Certainty equivalence is efficient for linear quadratic control,

    H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,” Advances in Neural Information Processing Systems, vol. 32, 2019

  25. [25]

    Learning robust LQ-controllers using application oriented exploration,

    M. Ferizbegovic, J. Umenberger, H. Hjalmarsson, and T. B. Schön, “Learning robust LQ-controllers using application oriented exploration,” IEEE Control Systems Letters, vol. 4, no. 1, pp. 19–24, 2019

  26. [26]

    Structured exploration in the finite horizon linear quadratic dual control problem,

    A. Iannelli, M. Khosravi, and R. S. Smith, “Structured exploration in the finite horizon linear quadratic dual control problem,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 959–964, 2020

  27. [27]

    Core: Control-oriented regularization for system identification,

    S. Formentin and A. Chiuso, “Core: Control-oriented regularization for system identification,” in IEEE Conference on Decision and Control (CDC), pp. 2253–2258, 2018

  28. [28]

    Bridging direct and indirect data-driven control formulations via regularizations and relaxations,

    F. Dörfler, J. Coulson, and I. Markovsky, “Bridging direct and indirect data-driven control formulations via regularizations and relaxations,” IEEE Transactions on Automatic Control, vol. 68, no. 2, pp. 883–897, 2022

  29. [29]

    Toward a theoretical foundation of policy optimization for learning control policies,

    B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, and T. Başar, “Toward a theoretical foundation of policy optimization for learning control policies,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 6, pp. 123–158, 2023

  30. [30]

    LQR through the lens of first order methods: Discrete-time case,

    J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi, “LQR through the lens of first order methods: Discrete-time case,” arXiv preprint arXiv:1907.08921, 2019

  31. [31]

    Global convergence of policy gradient methods for the linear quadratic regulator,

    M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning, pp. 1467–1476, PMLR, 2018

  32. [32]

    Policy optimization for H2 linear control with H∞ robustness guarantee: Implicit regularization and global convergence,

    K. Zhang, B. Hu, and T. Basar, “Policy optimization for H2 linear control with H∞ robustness guarantee: Implicit regularization and global convergence,” in Learning for Dynamics and Control, pp. 179–190, PMLR, 2020

  33. [33]

    Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem,

    H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanović, “Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem,” IEEE Transactions on Automatic Control, vol. 67, no. 5, pp. 2435–2450, 2021

  34. [34]

    On the linear convergence of random search for discrete-time LQR,

    H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, “On the linear convergence of random search for discrete-time LQR,” IEEE Control Systems Letters, vol. 5, no. 3, pp. 989–994, 2020

  35. [35]

    Regret bounds for the adaptive control of linear quadratic systems,

    Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26, JMLR Workshop and Conference Proceedings, 2011

  36. [36]

    Learning linear-quadratic regulators efficiently with only √T regret,

    A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only √T regret,” in International Conference on Machine Learning, pp. 1300–1309, PMLR, 2019

  37. [37]

    Logarithmic regret for learning linear quadratic regulators efficiently,

    A. Cassel, A. Cohen, and T. Koren, “Logarithmic regret for learning linear quadratic regulators efficiently,” in International Conference on Machine Learning, pp. 1328–1337, PMLR, 2020

  38. [38]

    Achieving logarithmic regret via hints in online learning of noisy LQR systems,

    M. Akbari, B. Gharesifard, and T. Linder, “Achieving logarithmic regret via hints in online learning of noisy LQR systems,” in IEEE 61st Conference on Decision and Control (CDC), pp. 4700–4705, 2022

  39. [39]

    On the sample complexity of the linear quadratic regulator,

    S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Foundations of Computational Mathematics, vol. 20, no. 4, pp. 633–679, 2020

  40. [40]

    Adaptive optimal control for continuous-time linear systems based on policy iteration,

    D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009

  41. [41]

    Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,

    Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, 2012

  42. [42]

    Value iteration for continuous-time linear time-invariant systems,

    C. Possieri and M. Sassano, “Value iteration for continuous-time linear time-invariant systems,” IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 3070–3077, 2022

  43. [43]

    Optimal tracking control of unknown discrete-time linear systems using input-output measured data,

    B. Kiumarsi, F. L. Lewis, M.-B. Naghibi-Sistani, and A. Karimpour, “Optimal tracking control of unknown discrete-time linear systems using input-output measured data,” IEEE Transactions on Cybernetics, vol. 45, no. 12, pp. 2770–2779, 2015

  44. [44]

    Naive exploration is optimal for online LQR,

    M. Simchowitz and D. Foster, “Naive exploration is optimal for online LQR,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, pp. 8937–8948, PMLR, 13–18 Jul 2020

  45. [45]

    Averaging analysis for discrete time and sampled data adaptive systems,

    E.-W. Bai, L.-C. Fu, and S. S. Sastry, “Averaging analysis for discrete time and sampled data adaptive systems,” IEEE Transactions on Circuits and Systems, vol. 35, no. 2, pp. 137–148, 1988

  46. [46]

    B. D. Anderson and J. B. Moore, Optimal control: linear quadratic methods. Courier Corporation, 2007

  47. [47]

    On topological properties of the set of stabilizing feedback gains,

    J. Bu, A. Mesbahi, and M. Mesbahi, “On topological properties of the set of stabilizing feedback gains,” IEEE Transactions on Automatic Control, vol. 66, no. 2, pp. 730–744, 2020

  48. [48]

    Exponential convergence of recursive least squares with exponential forgetting factor,

    R. M. Johnstone, C. R. Johnson Jr, R. R. Bitmead, and B. D. Anderson, “Exponential convergence of recursive least squares with exponential forgetting factor,” Systems & Control Letters, vol. 2, no. 2, pp. 77–82, 1982

  49. [49]

    Recursive discrete-time sinusoidal oscillators,

    C. S. Turner, “Recursive discrete-time sinusoidal oscillators,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 103–111, 2003

  50. [50]

    A note on persistency of excitation,

    J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, “A note on persistency of excitation,” Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005

  51. [51]

    Persistency of excitation, sufficient richness and parameter convergence in discrete time adaptive control,

    E.-W. Bai and S. S. Sastry, “Persistency of excitation, sufficient richness and parameter convergence in discrete time adaptive control,” Systems & Control Letters, vol. 6, no. 3, pp. 153–163, 1985

  52. [52]

    A geometric characterization of the persistence of excitation condition for the solutions of autonomous systems,

    A. Padoan, G. Scarciotti, and A. Astolfi, “A geometric characterization of the persistence of excitation condition for the solutions of autonomous systems,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5666–5677, 2017

  53. [53]

    Lectures in feedback design for multivariable systems

    A. Isidori, Lectures in feedback design for multivariable systems. Springer, 2017

  54. [54]

    Asymptotic stability equals exponential stability, and iss equals finite energy gain—if you twist your eyes,

    L. Grüne, E. D. Sontag, and F. R. Wirth, “Asymptotic stability equals exponential stability, and iss equals finite energy gain—if you twist your eyes,”Systems & Control Letters, vol. 38, no. 2, pp. 127–134, 1999

  55. [55]

    Design of feedback control systems for unstable plants with saturating actuators,

    P. Kapasouris, M. Athans, and G. Stein, “Design of feedback control systems for unstable plants with saturating actuators,” in Proc. IFAC Symp. on Nonlinear Control System Design, pp. 302–307, Pergamon Press, 1990

  56. [56]

    How and why to solve the operator equation AX − XB = Y,

    R. Bhatia and P. Rosenthal, “How and why to solve the operator equation AX − XB = Y,” Bulletin of the London Mathematical Society, vol. 29, no. 1, pp. 1–21, 1997

  57. [57]

    Nonlinear dynamical systems and control,

    W. M. Haddad and V. Chellaboina, “Nonlinear dynamical systems and control,” in Nonlinear Dynamical Systems and Control, Princeton University Press, 2011. A Proof of Lemma 4.1. We note that (27) is obtained by setting $K_t = K^\star$ in (24) (which compactly collects the updates (19a), (19b), and (23)). Hence, we start by inspecting (19a) and (19b) restricted to ...

  58. [58]

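    Let us arbitrarily choose $\nu_1, \nu_2 \in (0, 1)$. Then, for all $\gamma \in (0, \bar{\gamma}^{\mathrm{av}})$ with $\bar{\gamma}^{\mathrm{av}} := \min\left\{1, \bar{\gamma}_0, \frac{2\nu_1}{3\beta_3}, \frac{2\nu_2}{1 + \beta_3 \beta_4^2}\right\}$, we further bound (C.8) as $\Delta V(\tilde{K}^{\mathrm{av}}_t, \tilde{\theta}^{\mathrm{av}}_t) \le -\gamma\kappa\nu_1 \left\| G(\tilde{K}^{\mathrm{av}}_t + K^\star, \theta^\star) \right\|^2 + \gamma\kappa\beta_4 \left\| G(\tilde{K}^{\mathrm{av}}_t + K^\star, \theta^\star) \right\| \left\| \tilde{\theta}^{\mathrm{av}}_t \right\| - \gamma\nu_2 \left\| \tilde{\theta}^{\mathrm{av}}_t \right\|^2 \overset{(a)}{=} -\gamma \begin{bmatrix} \left\| G(\tilde{K}^{\mathrm{av}}_t + K^\star, \theta^\star) \right\| \\ \left\| \tilde{\theta}^{\mathrm{av}}_t \right\| \end{bmatrix}^\top U(\kappa) \begin{bmatrix} \left\| G(\tilde{K}^{\mathrm{av}}_t + K^\star, \theta^\star) \right\| \\ \left\| \tilde{\theta}^{\mathrm{av}}_t \right\| \end{bmatrix}$, (C.9) 7G...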