Rate-Optimal Regret for the Safe Learning-based Control of the Constrained Linear Quadratic Regulator
Pith reviewed 2026-05-08 11:18 UTC · model grok-4.3
The pith
Adaptive control of constrained stochastic LQR achieves ãO(sqrt(T)) regret while satisfying chance constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We study the problem of adaptive control of the stochastic linear quadratic regulator (LQR) with constraints that must be satisfied at every time step. Prior work on the multidimensional problem has shown ãO(T^{2/3}) regret and satisfaction of robust constraints, leaving open the question of whether ãO(sqrt(T)) regret can be attained in the constrained LQR setting. We contribute to this problem by showing ãO(sqrt(T)) regret and satisfaction of chance constraints. This type of constraints allow us to handle unbounded noise and also enable analytical techniques not directly applicable to robust constraints. Our proposed algorithm for this problem uses an SDP to select an optimistic policy, and
What carries the argument
The covariance-bounding lemma that expresses system covariance directly in terms of the chosen policy; this lemma supplies both the regret upper bound and the probabilistic guarantee that chance constraints hold at every step.
If this is right
- Regret improves from the previous ãO(T^{2/3}) rate to the near-optimal ãO(sqrt(T)) rate that is standard for unconstrained LQR.
- Chance constraints remain satisfied with high probability for every time step despite unbounded disturbances.
- The scaling-back step converts any optimistic candidate into a verifiably safe policy without destroying the regret order.
- Covariance-based analysis replaces the cost-to-go arguments that are standard in adaptive LQR proofs.
Where Pith is reading between the lines
- Chance constraints appear to be a strictly more permissive route to optimal rates than robust constraints in this setting.
- The covariance lemma may transfer to other linear stochastic control problems where safety must be maintained online.
- The SDP-plus-scaling structure could be reused in other optimistic algorithms that must enforce hard or probabilistic safety limits.
Load-bearing premise
The lemma that bounds the closed-loop covariance as a function of the applied policy remains valid for the optimistic policies produced by the algorithm.
What would settle it
A concrete constrained LQR instance on which the algorithm either produces regret growing faster than sqrt(T) or violates the stated chance constraints on a positive fraction of sample paths.
read the original abstract
We study the problem of adaptive control of the stochastic linear quadratic regulator (LQR) with constraints that must be satisfied at every time step. Prior work on the multidimensional problem has shown $\tilde{O}(T^{2/3})$ regret and satisfaction of robust constraints, leaving open the question of whether $\tilde{O}(\sqrt{T})$ regret can be attained in the constrained LQR setting. We contribute to this problem by showing $\tilde{O}(\sqrt{T})$ regret and satisfaction of chance constraints. This type of constraints allow us to handle unbounded noise and also enable analytical techniques not directly applicable to robust constraints. Our proposed algorithm for this problem uses an SDP to select an optimistic policy, and then "scales back" this policy until it is verifiably-safe. Our theoretical analysis establishes regret and constraint guarantees via a key lemma that bounds the system covariance in terms of the chosen policy. This covariance-based analysis is in contrast with the cost-to-go based analysis that is typically used in adaptive LQR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses adaptive control of the stochastic constrained LQR, where constraints must hold at every step. Prior multidimensional results achieved only Õ(T^{2/3}) regret under robust constraints. The authors propose an algorithm that solves an SDP for an optimistic policy and then scales the gain by a factor α ≤ 1 to restore safety, claiming Õ(√T) regret together with satisfaction of chance constraints. The analysis relies on a new covariance lemma that bounds closed-loop covariance in terms of the chosen policy, in contrast to standard cost-to-go arguments.
Significance. If the covariance lemma and its application to scaled policies are correct, the result would close the gap to rate-optimal regret for chance-constrained LQR and introduce a covariance-based proof technique that avoids cost-to-go analysis. Chance constraints also permit unbounded noise, broadening applicability relative to robust formulations.
major comments (1)
- [Abstract and covariance lemma (analysis section)] The central regret and constraint guarantees rest on the covariance lemma (referenced in the abstract as bounding system covariance in terms of the chosen policy). The algorithm explicitly scales the SDP-derived gain by α ≤ 1, which changes the closed-loop matrix to A + B K α. It is not clear from the provided description whether the lemma is re-proved or adjusted for this scaling; if the bound degrades with α, the subsequent decomposition that converts covariance into Õ(√T) terms may not close. This is load-bearing for the main claim.
minor comments (1)
- [Abstract] Notation for the scaling factor α and its dependence on the safety margin should be introduced earlier and used consistently when stating the policy selection step.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting the importance of clarifying the covariance lemma's applicability under policy scaling. We address the major comment below and will incorporate revisions to improve clarity.
read point-by-point responses
-
Referee: [Abstract and covariance lemma (analysis section)] The central regret and constraint guarantees rest on the covariance lemma (referenced in the abstract as bounding system covariance in terms of the chosen policy). The algorithm explicitly scales the SDP-derived gain by α ≤ 1, which changes the closed-loop matrix to A + B K α. It is not clear from the provided description whether the lemma is re-proved or adjusted for this scaling; if the bound degrades with α, the subsequent decomposition that converts covariance into Õ(√T) terms may not close. This is load-bearing for the main claim.
Authors: The covariance lemma is stated and proved for an arbitrary stabilizing policy K (see Lemma 3 in the analysis section), with the bound expressed directly in terms of the closed-loop matrix A + BK and the associated covariance. The algorithm applies this lemma to the scaled policy αK (with α ≤ 1) that is actually implemented; no separate re-proof is required because the lemma depends only on the stability of the applied closed-loop dynamics, which is preserved under scaling. The resulting covariance bound is then substituted into the regret decomposition, where the contribution of any α < 1 is controlled by the choice of α (which is selected to satisfy the chance constraint while remaining sufficiently close to the optimistic gain). This ensures the overall regret remains Õ(√T). We acknowledge that the manuscript description could be clearer on this point and will add an explicit remark in the revised version stating that the lemma applies verbatim to scaled policies and that the scaling factor does not alter the Õ(√T) rate. revision: yes
Circularity Check
Covariance lemma provides independent bound; derivation self-contained
full rationale
The paper's central guarantees rest on a key lemma bounding closed-loop covariance in terms of the selected policy, which is then used to derive both regret and chance-constraint satisfaction. This covariance-based route is explicitly contrasted with standard cost-to-go analysis and does not reduce any claimed result to a fitted parameter, self-definition, or load-bearing self-citation by construction. The SDP-based optimistic policy selection followed by scaling is an algorithmic step whose safety and performance are analyzed via the lemma rather than presupposed by it. No step in the provided derivation chain exhibits the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard assumptions for stochastic LQR such as stabilizability and bounded noise moments
Reference graph
Works this paper leans on
-
[1]
Then, (24) implies thatdλ −1ν(µ+κ 2γ−1ηD2)≤ ϵ2 8Φ−1(1−δ)2
First, note thatexp(−˜γρ) ≲ 1 T 2, ζ≲ 1√ T, ηλ−1 ≲ 1and therefore C1 ≲ 1√ T. Then, (24) implies thatdλ −1ν(µ+κ 2γ−1ηD2)≤ ϵ2 8Φ−1(1−δ)2. Therefore, ϵ1 ≥ ϵ2 16Φ−1(1−δ) 2 ⇐=D 2(1 +κ) 2κ2γ−1(1−γ) 2T + 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 +C 1 ≤ ϵ2 16Φ−1(1−δ) 2 . (25) Note that(1 −γ )2T ≤exp (−2γT ) ≤exp (−log (2γT + 1)) = 1 2γT+1 where we use thatx≥log (1...
-
[2]
PT t=1 ∥¯Σkt −Σ t∥ ≤ 2νN ζ 3.∥Σ t+1 −Σ t∥ ≤2νζ. 4.∥ ¯Σkt −Σ t∥ ≤2ν(1−ζ) t−τk. Next, we show that the state is bounded with high probability. Lemma 6(Bounded State).In addition to the assumptions of Lemma 10, assume that: 1.W⪰σI, 2.A ⋆ is(κ, γ)-strongly stable, 3.∥Θ ⋆∥ ≤S 20 4.ν≥ κ2tr(W) γ 5.λ≥max HR 2 z log(2) , 8ην σ , η,4κ 2γ−2 4 ¯wdlog 3 d+λ −1T R2 z /...
-
[3]
¯Vk ⪯3 ¯Vk−1 for allk, 3.V t ⪯2V t−ρ for allt, 4.N≤1 + 2dlog 1 + R2 zT dλ Next, we give a lemma that shows how conditional chance-constraints can be expressed in terms of the conditional mean and conditional covariance. The proof is in Section B.6.3. Lemma 8(Reformulation of Chance Constraints).Ifz t|Fτ ∼ N(m t|τ , St|τ), then it holds that, P(α⊤ j zt ≤β|...
-
[4]
the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], 8.ρ≥2˜γ −1 log max(β,ϵ/4) (1+˜κ)˜κRzD 9.T≥γ −1 log max(β,ϵ/4) (1+κ)κR1D 10.ω≤δ/2 11.ϵ 1 >0, where, ϵ1 := ϵ2 4Φ−1(1−δ) 2 −D 2(1 +κ) 2κ2γ−1(1−γ) 2T − 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 −C 1 −dλ −1ν(µ+κ 2γ−1ηD2) 22 Then, under eventFk, it holds th...
-
[5]
the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], Then, it holds thatP(α⊤ j zt ≤β)≥1−δfor allt∈[τ 1 −1]and, P ∥ ˆΘ0 −Θ ⋆∥F ≤300 ¯wd s n+dlog(306 ¯Γ min(σ, c2/m)−1ω−1) (τ0 −1) min(σ, c2/m) ! ≥1−ω/3, where ¯Γ = κ2R1 +κ 2γ−1( p 2n¯wlog(2nT /ω) +Sc) +c 2 . B.1.3 Regret Decomposition We decompose the reg...
work page 2018
-
[6]
Then, we show thattr(Σs) ≤ν for all s∈ [τk+1 − 1]
is well-defined in the sense that there existsϕ = 0such that ϕΣo i + (1 −ϕ )Σsafe i ∈ E p i for all i≤k . Then, we show thattr(Σs) ≤ν for all s∈ [τk+1 − 1]. Indeed, the conditions for Lemma 33 are satisfied since Fk is assumed to hold, and due to the assumption onν and the assumption thatA⋆ is strongly stable. Therefore Lemma 33 says thattr(Σs) ≤ν for all...
work page 2019
-
[7]
eventF k holds 2.A ⋆ is(κ, γ)-strongly stable 3.ν≥ κ2tr(W) γ LetΣ safe ⋆,xx be the true covariance of the state under zero input, i.e.Σsafe ⋆,xx = A⋆Σsafe ⋆,xxA⊤ ⋆ + W. Then, it holds that, ∥Σsafe ⋆,xx −Σ safe k,xx∥ ≤ (n+m)κ 2ην γλ Proof.Note that, Σsafe ⋆,xx −Σ safe k,xx =A ⋆Σsafe ⋆,xxA⊤ ⋆ − ˆAkΣsafe k,xx ˆA⊤ k =A ⋆(Σsafe ⋆,xx −Σ safe k,xx)A⊤ ⋆ +A ⋆Σsafe...
-
[8]
Weusethenotation ST = I 0 ST,xx I 0 ⊤ with Sxx =PT−1 s=0 As ⋆W (As ⋆)⊤, andmt = I 0 AT−1 ⋆ x1 I 0 ⊤
the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], 6.ρ≥2˜γ −1 log max(β,ϵ/4) (1+˜κ)˜κRzD 7.T≥γ −1 log max(β,ϵ/4) (1+κ)κR1D 8.ω≤δ/2 9.ξ= β−D(1+˜κ)˜κ(1−˜γ/2)ρRz Φ−1(1−δ+ω) 2 −D 2C1 10.ϵ 2 >0, where, ϵ2 := ϵ2 4Φ−1(1−δ) 2 −D 2(1 +κ) 2κ2γ−1(1−γ) 2T − 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 −D 2C1 Then, i...
work page 2018
-
[9]
Proof.Consider the sequencez safe t = xsafe t 0 wherex safe t+1 =A ⋆xsafe t +w t andx safe 1 =x 1
the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], Then, it holds thatP(α⊤ j zt ≤β)≥1−δfor allt∈[τ 1 −1]. Proof.Consider the sequencez safe t = xsafe t 0 wherex safe t+1 =A ⋆xsafe t +w t andx safe 1 =x 1. Then, xt+1 −x safe t+1 =A ⋆(xt −x safe t ) +B ⋆ut =A t ⋆(x1 −x safe 1 ) + tX s=1 At−s ⋆ B⋆us. Th...
-
[10]
zt − Axt−1 0 zt − Axt−1 0 ⊤ |Ft−2 # v =v ⊤E
the policyK=0is(κ, γ)-strongly stable 3.∥x 1∥ ≤R 1 If, τ0 ≥1 + 134dlog 34R2 z ωmin(σ, c 2/m) . Then, it holds that, P ∥ ˆΘ0 −Θ ⋆∥F ≤300 ¯wd s n+dlog(306 ¯Γ min(σ, c2/m)−1ω−1) (τ0 −1) min(σ, c2/m) ! ≥1−ω/3, where ¯Γ = κ2R1 +κ 2γ−1( p 2n¯wlog(2nT /ω) +Sc) +c 2 . Proof. We apply Theorem 2.4 of Simchowitz et al. (2018). We verify each of the requirements for ...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.