arxiv: 2512.04565 · v2 · submitted 2025-12-04 · 📡 eess.SY · cs.SY· math.OC

Recognition: 2 theorem links

· Lean Theorem

Adapt and Stabilize, Then Learn and Optimize: A New Approach to Adaptive LQR

Peter A. Fisher , Anuradha M. Annaswamy

Authors on Pith no claims yet

Pith reviewed 2026-05-17 01:43 UTC · model grok-4.3

classification 📡 eess.SY cs.SYmath.OC

keywords adaptive LQRmodel-reference adaptive controlregret boundsdiscrete-time linear systemsepoch-based adaptationclosed-loop stabilityadaptive control

0 comments

The pith

A new adaptive LQR algorithm first stabilizes the closed loop with direct MRAC then optimizes within epochs to remove the need for an initial stabilizing controller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive controller for discrete-time linear quadratic regulation that avoids three practical barriers in prior work. It pairs direct model-reference adaptive control with an epoch structure so the system stabilizes early, exploration stays limited, and computation remains modest. The method delivers a high-probability regret bound comparable to existing results while producing smaller regret when an initial stabilizer or heavy exploration is unavailable. If the claims hold, adaptive LQR becomes usable on plants where a stabilizing controller cannot be designed in advance. Simulations confirm the regret performance under both favorable and unfavorable starting conditions.

Core claim

For a class of discrete-time linear systems the algorithm uses direct MRAC inside successive epochs to drive the closed-loop state to a stable regime and then refines the control parameters, yielding a high-probability regret bound that matches the best known results without requiring an initial stabilizing controller or sustained exploration.

What carries the argument

Direct model-reference adaptive control combined with an epoch-based switching rule that progressively enforces stability before parameter optimization.

If this is right

The approach guarantees closed-loop stability from the first epoch onward without an external stabilizing controller.
Regret remains comparable to state-of-the-art methods when the usual initial-stability or exploration assumptions hold.
Regret drops markedly when those assumptions are dropped, widening the set of plants on which adaptive LQR is practical.
Computational cost stays lower than methods that rely on persistent excitation or intensive online optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The epoch-plus-MRAC template may transfer to other adaptive control problems such as adaptive MPC or nonlinear regulation where early stabilization is the bottleneck.
Hardware tests on uncertain linear plants would directly check whether the theoretical regret bound appears in practice.
Relaxing the discrete-time linear assumption to slowly varying or mildly nonlinear plants could be tested by substituting a different reference model inside the same epoch structure.

Load-bearing premise

The plant must belong to the specific class of discrete-time linear systems for which the direct MRAC stability proof and the epoch regret analysis both apply.

What would settle it

Running the algorithm on a system inside the claimed class and recording either closed-loop instability or cumulative regret that exceeds the stated high-probability bound would refute the central guarantee.

Figures

Figures reproduced from arXiv: 2512.04565 by Anuradha M. Annaswamy, Peter A. Fisher.

**Figure 2.** Figure 2: Laplacian system with stable initial controller: [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗

**Figure 3.** Figure 3: Laplacian system with stable initial controller: [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

**Figure 4.** Figure 4: 6DOF quadrotor: σexplore = 0.01. Solid lines are the median values over 1000 trials, and shaded regions are the 20%-80% confidence windows. (a) Regret (b) State magnitude [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗

**Figure 5.** Figure 5: Laplacian system with stable initial controller: [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Laplacian system with unstable initial controller: [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: 6DOF quadrotor: σexplore = 0.1, σnoise = 0.1. Solid lines are the median values over 1000 trials, and shaded regions are the 20%-80% confidence windows. E Additional proofs for regret analysis E.1 Proof of Lemma 4 From Lemma 3 and Proposition 2, we know that, with probability at least 1 − δ, as long as tk+1 − tk := Tk ≥ O[ σ 2 ck ln(92(n+m)/δ) ∥Φ −1 ck ∥−2 ], we have 1 Tk tk+ X Tk−1 t=tk ϕctϕ ⊤ ct ≥ O " ∥Φ… view at source ↗

read the original abstract

This paper focuses on adaptive control of the discrete-time linear quadratic regulator (adaptive LQR). Recent literature has made significant contributions in proving non-asymptotic convergence rates, but existing approaches have a few drawbacks that pose barriers for practical implementation. These drawbacks include (i) a requirement of an initial stabilizing controller, (ii) a reliance on exploration for closed-loop stability, and/or (iii) computationally intensive algorithms. This paper proposes a new algorithm that overcomes these drawbacks for a particular class of discrete-time systems. This algorithm leverages direct model-reference adaptive control (direct MRAC) and combines it with an epoch-based approach in order to address the drawbacks (i)-(iii) with a provable high-probability regret bound comparable to existing literature. Simulations demonstrate that the proposed approach yields regrets that are comparable to those from existing methods when the conditions (i) and (ii) are met, and yields regrets that are significantly smaller when either of these two conditions is not met.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a direct MRAC algorithm wrapped in epochs that drops the initial stabilizer and exploration requirements for adaptive LQR on a stated class of discrete-time systems while claiming a comparable high-probability regret bound.

read the letter

The main thing to know is that this paper gives an adaptive LQR method using direct MRAC in epochs that avoids needing an initial stabilizing controller or exploration for stability. It comes with a provable high-probability regret bound for certain discrete-time systems and shows good simulation results. They build directly on model reference adaptive control by adapting parameters to match a reference model. The epoch approach means they run periods of adaptation where stability is prioritized before moving to full optimization. This combination is presented as overcoming the listed drawbacks in prior non-asymptotic work. It does well by focusing on practical barriers. The simulations demonstrate that regrets stay reasonable and often better when those extra conditions are dropped. That is a useful check on the method's robustness. Soft spots include the limited system class, which means the claims hold only where the dynamics fit the assumptions. Without the full proof, it's tough to assess exactly how the epochs ensure the bound remains tight and non-circular. The simulation section would be stronger with details on multiple trials and sensitivity to design parameters. This work is for control theorists and practitioners dealing with adaptive LQR in discrete time. A reader who knows the recent literature on regret bounds in adaptive control will appreciate the algorithmic change and its reported performance. The paper shows clear engagement with the issues in the field. It deserves a serious referee. I would recommend sending this to peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes an adaptive LQR algorithm for a stated class of discrete-time linear systems. It combines direct model-reference adaptive control (MRAC) with an epoch-based scheduling mechanism to achieve closed-loop stability and a high-probability regret bound without requiring an initial stabilizing controller or explicit exploration. The approach is claimed to overcome three practical drawbacks of prior methods while delivering regret performance comparable to existing literature; supporting simulation results are presented.

Significance. If the regret bound and stability claims hold under the stated system assumptions, the work would be a meaningful contribution to adaptive control. Removing the need for an initial stabilizer or forced exploration lowers a practical barrier, and the epoch construction appears to leverage established direct MRAC analysis in a way that preserves non-asymptotic guarantees. The simulation comparison (both when prior conditions are satisfied and when they are not) provides useful empirical support.

major comments (2)

[§3.2, Theorem 4.1] §3.2 and Theorem 4.1: the high-probability regret bound is stated to be comparable to the literature, yet the dependence on the epoch length T_k and the MRAC adaptation gain is not made fully explicit. It is unclear whether the union bound over epochs preserves the claimed probability without additional logarithmic factors that would alter the comparison.
[Assumption 2.1] Assumption 2.1 and the reference-model matching condition: the analysis assumes the reference model is chosen such that the matching equation admits a solution; this is standard for direct MRAC but should be verified to hold uniformly for the LQR cost matrices used in the regret analysis, otherwise the stability claim in the first epoch may not transfer.

minor comments (3)

[Figure 3] Figure 3: the regret plots lack error bars or indication of the number of Monte-Carlo runs; adding this would strengthen the empirical comparison.
[Notation] Notation: the symbol for the epoch index is occasionally overloaded with the time index inside the epoch; a clearer distinction (e.g., k for epoch, t for intra-epoch time) would improve readability.
[Simulation section] The simulation section should state the exact system dimensions, noise variance, and how the initial condition is sampled to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive recommendation for minor revision. We address each major comment below, making revisions to improve clarity and explicitness as suggested.

read point-by-point responses

Referee: [§3.2, Theorem 4.1] §3.2 and Theorem 4.1: the high-probability regret bound is stated to be comparable to the literature, yet the dependence on the epoch length T_k and the MRAC adaptation gain is not made fully explicit. It is unclear whether the union bound over epochs preserves the claimed probability without additional logarithmic factors that would alter the comparison.

Authors: We agree that greater explicitness would benefit the reader. The proof of Theorem 4.1 accounts for the per-epoch contribution, where the regret in epoch k scales with the epoch length T_k and the adaptation gain in the MRAC update. Summing over epochs yields a bound comparable to the literature (e.g., O(sqrt(T) log T) high-probability regret). For the union bound, the number of epochs is O(log T), so the failure probability per epoch is set to δ / log T, introducing only an additional log log T factor that is dominated by the existing logarithmic terms in the bound. This preserves the comparability. In the revised version, we have added an explicit statement in §3.2 and a note in the proof sketch regarding these dependencies and the union bound. revision: yes
Referee: [Assumption 2.1] Assumption 2.1 and the reference-model matching condition: the analysis assumes the reference model is chosen such that the matching equation admits a solution; this is standard for direct MRAC but should be verified to hold uniformly for the LQR cost matrices used in the regret analysis, otherwise the stability claim in the first epoch may not transfer.

Authors: This is a valid point for ensuring the first-epoch stability. The reference model is fixed a priori to be a stable system for which the matching condition holds for all plants in the class defined by Assumption 2.1 (i.e., systems for which there exists a controller achieving the reference dynamics). Since the LQR costs are fixed and positive definite, the optimal controller for the reference model satisfies the matching equation independently of the unknown plant parameters. We have inserted a brief verification paragraph after Assumption 2.1 in the revised manuscript to confirm that this holds uniformly, thereby securing the stability transfer to the first epoch. revision: yes

Circularity Check

0 steps flagged

Derivation builds on established direct MRAC without reduction to inputs by construction

full rationale

The paper's central construction combines direct model-reference adaptive control with an epoch-based scheduling mechanism to achieve stability and a high-probability regret bound for a stated class of discrete-time systems. No step in the provided abstract or high-level argument reduces the claimed regret bound or stability result to a fitted parameter, self-definition, or self-citation chain that is itself unverified within the paper. The epoch construction is introduced to remove prior requirements (initial stabilizer or explicit exploration) rather than presupposing the target bound. Once the system class and standard MRAC matching properties are accepted, the derivation proceeds independently without the circular patterns of self-definitional equivalence or fitted-input-as-prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; full paper likely contains additional technical assumptions in the regret proof and system class definition.

axioms (1)

domain assumption The plant belongs to a particular class of discrete-time systems for which direct MRAC applies.
Explicitly stated as the scope of the algorithm in the abstract.

pith-pipeline@v0.9.0 · 5478 in / 1167 out tokens · 30366 ms · 2026-05-17T01:43:14.901410+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

direct model-reference adaptive control (direct MRAC) ... epoch-based approach ... WRLS-PROJ ... comparator system ... Regret(T) ≤ eO(T^{2/3})
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Assumption 1 (Matched Uncertainties) ... Am = A* + Bm ΘA* ... reference model updates Am(k+1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

[1]

Regret bounds for the adaptive control of linear quadratic systems,

Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” inProceedings of the 24th Annual Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26

work page 2011
[2]

Efficient reinforcement learning for high dimensional linear quadratic systems,

M. Ibrahimi, A. Javanmard, and B. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” inAdvances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012. [Online]. Available: https: //proceedings.neurips.cc/paper_files/paper/2012/file/a9e...

work page 2012
[3]

Regret bounds for robust adaptive control of the linear quadratic regulator,

S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” inAdvances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 4192–4201

work page 2018
[4]

Learning linear-quadratic regulators efficiently with only √ T regret,

A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only √ T regret,” inProceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 1300–1309. [Online]. Available: https://proceedings.m...

work page 2019
[5]

Certainty equivalence is efficient for linear quadratic control,

H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,” inAdvances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https: //proceedings.neurips.cc/paper_files/paper/2019/f...

work page 2019
[6]

Naive exploration is optimal for online LQR,

M. Simchowitz and D. Foster, “Naive exploration is optimal for online LQR,” inProceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 8937–8948. [Online]. Available: https://proceedings.mlr.press/v119/simchowitz20a.html

work page 2020
[7]

Reinforcement learning with fast stabiliza- tion in linear dynamical systems,

S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar, “Reinforcement learning with fast stabiliza- tion in linear dynamical systems,” inInternational Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 5354–5390

work page 2022
[8]

Accurate parameter estimation for safety- critical systems with unmodeled dynamics,

A. Sarker, P. Fisher, J. E. Gaudio, and A. M. Annaswamy, “Accurate parameter estimation for safety- critical systems with unmodeled dynamics,”Artificial Intelligence, p. 103857, 2023

work page 2023
[9]

On self tuning regulators,

K. Åström and B. Wittenmark, “On self tuning regulators,”Automatica, vol. 9, no. 2, pp. 185–199, 1973. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0005109873900733

work page arXiv 1973
[10]

K. S. Narendra and A. M. Annaswamy,Stable Adaptive Systems. NJ: Dover Publications, 2005, (original publication by Prentice-Hall Inc., 1989). 10

work page 2005
[11]

G. C. Goodwin and K. S. Sin,Adaptive Filtering Prediction and Control. Prentice Hall, 1984

work page 1984
[12]

R. F. Stengel,Optimal control and estimation. Courier Corporation, 1994

work page 1994
[13]

Vershynin,High-dimensional probability: An introduction with applications in data science

R. Vershynin,High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018, vol. 47

work page 2018
[14]

Subgaussian sequences in probability and fourier analysis,

G. Pisier, “Subgaussian sequences in probability and fourier analysis,”Graduate J. Math, vol. 1, pp. 60–80, 2016

work page 2016
[15]

Self-convergence of weighted least-squares with applications to stochastic adaptive control,

L. Guo, “Self-convergence of weighted least-squares with applications to stochastic adaptive control,” IEEE Transactions on Automatic Control, vol. 41, no. 1, pp. 79–89, 1996

work page 1996
[16]

Adaptive control and intersections with reinforcement learning,

A. M. Annaswamy, “Adaptive control and intersections with reinforcement learning,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 6, 2023

work page 2023
[17]

Adaptive control of the linear quadratic regulator,

S. Dean and S. Tu, “Adaptive control of the linear quadratic regulator,” https://github.com/modestyachts/ robust-adaptive-lqr, 2018

work page 2018
[18]

Adaptive linear quadratic control using policy iteration,

S. Bradtke, B. Ydstie, and A. Barto, “Adaptive linear quadratic control using policy iteration,” in Proceedings of 1994 American Control Conference - ACC ’94, vol. 3, 1994, pp. 3475–3479 vol.3

work page 1994
[19]

Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,

Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,”Automatica, vol. 48, no. 10, pp. 2699–2704, 2012. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109812003664

work page 2012
[20]

Global convergence of policy gradient methods for the linear quadratic regulator,

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” inProceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1467–1476. [Online]. Available: https://proceedi...

work page 2018
[21]

On the linear convergence of random search for discrete-time lqr,

H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, “On the linear convergence of random search for discrete-time lqr,”IEEE Control Systems Letters, vol. 5, no. 3, pp. 989–994, 2021

work page 2021
[22]

Iterative feedback tuning: theory and applications,

H. Hjalmarsson, M. Gevers, S. Gunnarsson, and O. Lequin, “Iterative feedback tuning: theory and applications,”IEEE Control Systems Magazine, vol. 18, no. 4, pp. 26–41, 1998

work page 1998
[23]

Data-driven finite-horizon optimal control for linear time-varying discrete-time systems,

B. Pang, T. Bian, and Z.-P. Jiang, “Data-driven finite-horizon optimal control for linear time-varying discrete-time systems,” in2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 861–866

work page 2018
[24]

Robust data-driven state-feedback design,

J. Berberich, A. Koch, C. W. Scherer, and F. Allgöwer, “Robust data-driven state-feedback design,” in 2020 American Control Conference (ACC), 2020, pp. 1532–1538

work page 2020
[25]

Datainformativity: anewperspective on data-driven analysis and control,

H.J.vanWaarde, J.Eising, H.L.Trentelman, andM.K.Camlibel, “Datainformativity: anewperspective on data-driven analysis and control,” 2020. [Online]. Available: https://arxiv.org/abs/1908.00468

work page arXiv 2020
[26]

On the certainty-equivalence approach to direct data-driven lqr design,

F. Dörfler, P. Tesi, and C. De Persis, “On the certainty-equivalence approach to direct data-driven lqr design,”IEEE Transactions on Automatic Control, vol. 68, no. 12, pp. 7989–7996, 2023

work page 2023
[27]

Convergence rate of least-squares identification and adaptive control for stochastic systems†,

H.-F. CHEN and L. GUO, “Convergence rate of least-squares identification and adaptive control for stochastic systems†,”International Journal of Control, vol. 44, no. 5, pp. 1459–1476, 1986. [Online]. Available: https://doi.org/10.1080/00207178608933679

work page doi:10.1080/00207178608933679 1986
[28]

Adaptive linear quadratic gaussian control: the cost-biased approach revisited,

M. C. Campi and P. R. Kumar, “Adaptive linear quadratic gaussian control: the cost-biased approach revisited,”SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998

work page 1907
[29]

I. D. Landau, R. Lozano, M. M’Saad, and A. Karimi,Adaptive Control: Algorithms, Analysis and Applications. Springer Science & Business Media, 2011. 11

work page 2011
[30]

Integration of adaptive control and reinforcement learning for real-time control and learning,

A. M. Annaswamy, A. Guha, Y. Cui, S. Tang, P. A. Fisher, and J. E. Gaudio, “Integration of adaptive control and reinforcement learning for real-time control and learning,”IEEE Transactions on Automatic Control, pp. 1–16, 2023

work page 2023
[31]

Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems

Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari, “Online least squares estimation with self-normalized processes: An application to bandit problems,” 2011. [Online]. Available: https://arxiv.org/abs/1102.2670 12 A Analysis of the comparator system A.1 Proof of Theorem 1 Define a positive definite sequence Xct =x ⊤ ctPk,lyapxct (17) where k is the index of th...

work page internal anchor Pith review Pith/arXiv arXiv 2011
[32]

TX t=0 ∥ξt∥2 # +O(1).(35) Additionally, using an argument similar to that in the proof of Lemma 1 in [15], in the limit asT→ ∞, we have T−1X t=0 w⊤ t+1B ⊤ 1 P k,lyapB1BmeΘtϕt =o

It is straightforward to verify thatP k,lyap and Qk,lyap are symmetric positive-definite and satisfy A ⊤ k P k,lyapAk − P k,lyap =− Qk,lyap.(30) Furthermore, by the same reasoning as in the proof of Theorem 1, there exist finiteP lyap, P lyap, Qlyap, Qlyap ∈ (0,∞)such thatP lyap ≤Tr[ P k,lyap]≤ P lyap andQ lyap ≤Tr[ Qk,lyap]≤ Qlyap ∀k∈Z ≥0. Now, define a ...

work page