Rate-Optimal Regret for the Safe Learning-based Control of the Constrained Linear Quadratic Regulator

Mahnoosh Alizadeh; Nanfei Jiang; Spencer Hutchinson

arxiv: 2604.22158 · v2 · submitted 2026-04-24 · 🧮 math.OC · cs.LG

Rate-Optimal Regret for the Safe Learning-based Control of the Constrained Linear Quadratic Regulator

Spencer Hutchinson , Nanfei Jiang , Mahnoosh Alizadeh This is my paper

Pith reviewed 2026-05-08 11:18 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords adaptive controllinear quadratic regulatorregret boundschance constraintssafe learningstochastic systemsconstrained LQR

0 comments

The pith

Adaptive control of constrained stochastic LQR achieves ãO(sqrt(T)) regret while satisfying chance constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that near-optimal ãO(sqrt(T)) regret is possible for adaptive control of the stochastic linear quadratic regulator subject to per-step constraints. Earlier multidimensional results were limited to ãO(T^{2/3}) regret under robust constraints; the improvement here comes from switching to chance constraints, which tolerate unbounded noise and support a new analysis technique. The algorithm picks an optimistic policy by solving a semidefinite program and then scales the policy back until safety is guaranteed. Regret and constraint bounds both follow from a single lemma that controls the state covariance as a function of the applied policy.

Core claim

What carries the argument

The covariance-bounding lemma that expresses system covariance directly in terms of the chosen policy; this lemma supplies both the regret upper bound and the probabilistic guarantee that chance constraints hold at every step.

If this is right

Regret improves from the previous ãO(T^{2/3}) rate to the near-optimal ãO(sqrt(T)) rate that is standard for unconstrained LQR.
Chance constraints remain satisfied with high probability for every time step despite unbounded disturbances.
The scaling-back step converts any optimistic candidate into a verifiably safe policy without destroying the regret order.
Covariance-based analysis replaces the cost-to-go arguments that are standard in adaptive LQR proofs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Chance constraints appear to be a strictly more permissive route to optimal rates than robust constraints in this setting.
The covariance lemma may transfer to other linear stochastic control problems where safety must be maintained online.
The SDP-plus-scaling structure could be reused in other optimistic algorithms that must enforce hard or probabilistic safety limits.

Load-bearing premise

The lemma that bounds the closed-loop covariance as a function of the applied policy remains valid for the optimistic policies produced by the algorithm.

What would settle it

A concrete constrained LQR instance on which the algorithm either produces regret growing faster than sqrt(T) or violates the stated chance constraints on a positive fraction of sample paths.

read the original abstract

We study the problem of adaptive control of the stochastic linear quadratic regulator (LQR) with constraints that must be satisfied at every time step. Prior work on the multidimensional problem has shown $\tilde{O}(T^{2/3})$ regret and satisfaction of robust constraints, leaving open the question of whether $\tilde{O}(\sqrt{T})$ regret can be attained in the constrained LQR setting. We contribute to this problem by showing $\tilde{O}(\sqrt{T})$ regret and satisfaction of chance constraints. This type of constraints allow us to handle unbounded noise and also enable analytical techniques not directly applicable to robust constraints. Our proposed algorithm for this problem uses an SDP to select an optimistic policy, and then "scales back" this policy until it is verifiably-safe. Our theoretical analysis establishes regret and constraint guarantees via a key lemma that bounds the system covariance in terms of the chosen policy. This covariance-based analysis is in contrast with the cost-to-go based analysis that is typically used in adaptive LQR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper addresses adaptive control of the stochastic constrained LQR, where constraints must hold at every step. Prior multidimensional results achieved only Õ(T^{2/3}) regret under robust constraints. The authors propose an algorithm that solves an SDP for an optimistic policy and then scales the gain by a factor α ≤ 1 to restore safety, claiming Õ(√T) regret together with satisfaction of chance constraints. The analysis relies on a new covariance lemma that bounds closed-loop covariance in terms of the chosen policy, in contrast to standard cost-to-go arguments.

Significance. If the covariance lemma and its application to scaled policies are correct, the result would close the gap to rate-optimal regret for chance-constrained LQR and introduce a covariance-based proof technique that avoids cost-to-go analysis. Chance constraints also permit unbounded noise, broadening applicability relative to robust formulations.

major comments (1)

[Abstract and covariance lemma (analysis section)] The central regret and constraint guarantees rest on the covariance lemma (referenced in the abstract as bounding system covariance in terms of the chosen policy). The algorithm explicitly scales the SDP-derived gain by α ≤ 1, which changes the closed-loop matrix to A + B K α. It is not clear from the provided description whether the lemma is re-proved or adjusted for this scaling; if the bound degrades with α, the subsequent decomposition that converts covariance into Õ(√T) terms may not close. This is load-bearing for the main claim.

minor comments (1)

[Abstract] Notation for the scaling factor α and its dependence on the safety margin should be introduced earlier and used consistently when stating the policy selection step.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the importance of clarifying the covariance lemma's applicability under policy scaling. We address the major comment below and will incorporate revisions to improve clarity.

read point-by-point responses

Referee: [Abstract and covariance lemma (analysis section)] The central regret and constraint guarantees rest on the covariance lemma (referenced in the abstract as bounding system covariance in terms of the chosen policy). The algorithm explicitly scales the SDP-derived gain by α ≤ 1, which changes the closed-loop matrix to A + B K α. It is not clear from the provided description whether the lemma is re-proved or adjusted for this scaling; if the bound degrades with α, the subsequent decomposition that converts covariance into Õ(√T) terms may not close. This is load-bearing for the main claim.

Authors: The covariance lemma is stated and proved for an arbitrary stabilizing policy K (see Lemma 3 in the analysis section), with the bound expressed directly in terms of the closed-loop matrix A + BK and the associated covariance. The algorithm applies this lemma to the scaled policy αK (with α ≤ 1) that is actually implemented; no separate re-proof is required because the lemma depends only on the stability of the applied closed-loop dynamics, which is preserved under scaling. The resulting covariance bound is then substituted into the regret decomposition, where the contribution of any α < 1 is controlled by the choice of α (which is selected to satisfy the chance constraint while remaining sufficiently close to the optimistic gain). This ensures the overall regret remains Õ(√T). We acknowledge that the manuscript description could be clearer on this point and will add an explicit remark in the revised version stating that the lemma applies verbatim to scaled policies and that the scaling factor does not alter the Õ(√T) rate. revision: yes

Circularity Check

0 steps flagged

Covariance lemma provides independent bound; derivation self-contained

full rationale

The paper's central guarantees rest on a key lemma bounding closed-loop covariance in terms of the selected policy, which is then used to derive both regret and chance-constraint satisfaction. This covariance-based route is explicitly contrasted with standard cost-to-go analysis and does not reduce any claimed result to a fitted parameter, self-definition, or load-bearing self-citation by construction. The SDP-based optimistic policy selection followed by scaling is an algorithmic step whose safety and performance are analyzed via the lemma rather than presupposed by it. No step in the provided derivation chain exhibits the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are detailed beyond standard LQR context.

axioms (1)

domain assumption Standard assumptions for stochastic LQR such as stabilizability and bounded noise moments
Implied by the problem setup and use of chance constraints for unbounded noise.

pith-pipeline@v0.9.0 · 5479 in / 1021 out tokens · 23120 ms · 2026-05-08T11:18:13.280717+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

Then, (24) implies thatdλ −1ν(µ+κ 2γ−1ηD2)≤ ϵ2 8Φ−1(1−δ)2

First, note thatexp(−˜γρ) ≲ 1 T 2, ζ≲ 1√ T, ηλ−1 ≲ 1and therefore C1 ≲ 1√ T. Then, (24) implies thatdλ −1ν(µ+κ 2γ−1ηD2)≤ ϵ2 8Φ−1(1−δ)2. Therefore, ϵ1 ≥ ϵ2 16Φ−1(1−δ) 2 ⇐=D 2(1 +κ) 2κ2γ−1(1−γ) 2T + 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 +C 1 ≤ ϵ2 16Φ−1(1−δ) 2 . (25) Note that(1 −γ )2T ≤exp (−2γT ) ≤exp (−log (2γT + 1)) = 1 2γT+1 where we use thatx≥log (1...

work page
[2]

4.∥ ¯Σkt −Σ t∥ ≤2ν(1−ζ) t−τk

PT t=1 ∥¯Σkt −Σ t∥ ≤ 2νN ζ 3.∥Σ t+1 −Σ t∥ ≤2νζ. 4.∥ ¯Σkt −Σ t∥ ≤2ν(1−ζ) t−τk. Next, we show that the state is bounded with high probability. Lemma 6(Bounded State).In addition to the assumptions of Lemma 10, assume that: 1.W⪰σI, 2.A ⋆ is(κ, γ)-strongly stable, 3.∥Θ ⋆∥ ≤S 20 4.ν≥ κ2tr(W) γ 5.λ≥max HR 2 z log(2) , 8ην σ , η,4κ 2γ−2 4 ¯wdlog 3 d+λ −1T R2 z /...

work page
[3]

The proof is in Section B.6.3

¯Vk ⪯3 ¯Vk−1 for allk, 3.V t ⪯2V t−ρ for allt, 4.N≤1 + 2dlog 1 + R2 zT dλ Next, we give a lemma that shows how conditional chance-constraints can be expressed in terms of the conditional mean and conditional covariance. The proof is in Section B.6.3. Lemma 8(Reformulation of Chance Constraints).Ifz t|Fτ ∼ N(m t|τ , St|τ), then it holds that, P(α⊤ j zt ≤β|...

work page
[4]

The proof is in Section B.6.4

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], 8.ρ≥2˜γ −1 log max(β,ϵ/4) (1+˜κ)˜κRzD 9.T≥γ −1 log max(β,ϵ/4) (1+κ)κR1D 10.ω≤δ/2 11.ϵ 1 >0, where, ϵ1 := ϵ2 4Φ−1(1−δ) 2 −D 2(1 +κ) 2κ2γ−1(1−γ) 2T − 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 −C 1 −dλ −1ν(µ+κ 2γ−1ηD2) 22 Then, under eventFk, it holds th...

work page
[5]

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], Then, it holds thatP(α⊤ j zt ≤β)≥1−δfor allt∈[τ 1 −1]and, P ∥ ˆΘ0 −Θ ⋆∥F ≤300 ¯wd s n+dlog(306 ¯Γ min(σ, c2/m)−1ω−1) (τ0 −1) min(σ, c2/m) ! ≥1−ω/3, where ¯Γ = κ2R1 +κ 2γ−1( p 2n¯wlog(2nT /ω) +Sc) +c 2 . B.1.3 Regret Decomposition We decompose the reg...

work page 2018
[6]

Then, we show thattr(Σs) ≤ν for all s∈ [τk+1 − 1]

is well-defined in the sense that there existsϕ = 0such that ϕΣo i + (1 −ϕ )Σsafe i ∈ E p i for all i≤k . Then, we show thattr(Σs) ≤ν for all s∈ [τk+1 − 1]. Indeed, the conditions for Lemma 33 are satisfied since Fk is assumed to hold, and due to the assumption onν and the assumption thatA⋆ is strongly stable. Therefore Lemma 33 says thattr(Σs) ≤ν for all...

work page 2019
[7]

eventF k holds 2.A ⋆ is(κ, γ)-strongly stable 3.ν≥ κ2tr(W) γ LetΣ safe ⋆,xx be the true covariance of the state under zero input, i.e.Σsafe ⋆,xx = A⋆Σsafe ⋆,xxA⊤ ⋆ + W. Then, it holds that, ∥Σsafe ⋆,xx −Σ safe k,xx∥ ≤ (n+m)κ 2ην γλ Proof.Note that, Σsafe ⋆,xx −Σ safe k,xx =A ⋆Σsafe ⋆,xxA⊤ ⋆ − ˆAkΣsafe k,xx ˆA⊤ k =A ⋆(Σsafe ⋆,xx −Σ safe k,xx)A⊤ ⋆ +A ⋆Σsafe...

work page
[8]

Weusethenotation ST = I 0 ST,xx I 0 ⊤ with Sxx =PT−1 s=0 As ⋆W (As ⋆)⊤, andmt = I 0 AT−1 ⋆ x1 I 0 ⊤

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], 6.ρ≥2˜γ −1 log max(β,ϵ/4) (1+˜κ)˜κRzD 7.T≥γ −1 log max(β,ϵ/4) (1+κ)κR1D 8.ω≤δ/2 9.ξ= β−D(1+˜κ)˜κ(1−˜γ/2)ρRz Φ−1(1−δ+ω) 2 −D 2C1 10.ϵ 2 >0, where, ϵ2 := ϵ2 4Φ−1(1−δ) 2 −D 2(1 +κ) 2κ2γ−1(1−γ) 2T − 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 −D 2C1 Then, i...

work page 2018
[9]

Proof.Consider the sequencez safe t = xsafe t 0 wherex safe t+1 =A ⋆xsafe t +w t andx safe 1 =x 1

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], Then, it holds thatP(α⊤ j zt ≤β)≥1−δfor allt∈[τ 1 −1]. Proof.Consider the sequencez safe t = xsafe t 0 wherex safe t+1 =A ⋆xsafe t +w t andx safe 1 =x 1. Then, xt+1 −x safe t+1 =A ⋆(xt −x safe t ) +B ⋆ut =A t ⋆(x1 −x safe 1 ) + tX s=1 At−s ⋆ B⋆us. Th...

work page
[10]

zt − Axt−1 0 zt − Axt−1 0 ⊤ |Ft−2 # v =v ⊤E

the policyK=0is(κ, γ)-strongly stable 3.∥x 1∥ ≤R 1 If, τ0 ≥1 + 134dlog 34R2 z ωmin(σ, c 2/m) . Then, it holds that, P ∥ ˆΘ0 −Θ ⋆∥F ≤300 ¯wd s n+dlog(306 ¯Γ min(σ, c2/m)−1ω−1) (τ0 −1) min(σ, c2/m) ! ≥1−ω/3, where ¯Γ = κ2R1 +κ 2γ−1( p 2n¯wlog(2nT /ω) +Sc) +c 2 . Proof. We apply Theorem 2.4 of Simchowitz et al. (2018). We verify each of the requirements for ...

work page 2018

[1] [1]

Then, (24) implies thatdλ −1ν(µ+κ 2γ−1ηD2)≤ ϵ2 8Φ−1(1−δ)2

First, note thatexp(−˜γρ) ≲ 1 T 2, ζ≲ 1√ T, ηλ−1 ≲ 1and therefore C1 ≲ 1√ T. Then, (24) implies thatdλ −1ν(µ+κ 2γ−1ηD2)≤ ϵ2 8Φ−1(1−δ)2. Therefore, ϵ1 ≥ ϵ2 16Φ−1(1−δ) 2 ⇐=D 2(1 +κ) 2κ2γ−1(1−γ) 2T + 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 +C 1 ≤ ϵ2 16Φ−1(1−δ) 2 . (25) Note that(1 −γ )2T ≤exp (−2γT ) ≤exp (−log (2γT + 1)) = 1 2γT+1 where we use thatx≥log (1...

work page

[2] [2]

4.∥ ¯Σkt −Σ t∥ ≤2ν(1−ζ) t−τk

PT t=1 ∥¯Σkt −Σ t∥ ≤ 2νN ζ 3.∥Σ t+1 −Σ t∥ ≤2νζ. 4.∥ ¯Σkt −Σ t∥ ≤2ν(1−ζ) t−τk. Next, we show that the state is bounded with high probability. Lemma 6(Bounded State).In addition to the assumptions of Lemma 10, assume that: 1.W⪰σI, 2.A ⋆ is(κ, γ)-strongly stable, 3.∥Θ ⋆∥ ≤S 20 4.ν≥ κ2tr(W) γ 5.λ≥max HR 2 z log(2) , 8ην σ , η,4κ 2γ−2 4 ¯wdlog 3 d+λ −1T R2 z /...

work page

[3] [3]

The proof is in Section B.6.3

¯Vk ⪯3 ¯Vk−1 for allk, 3.V t ⪯2V t−ρ for allt, 4.N≤1 + 2dlog 1 + R2 zT dλ Next, we give a lemma that shows how conditional chance-constraints can be expressed in terms of the conditional mean and conditional covariance. The proof is in Section B.6.3. Lemma 8(Reformulation of Chance Constraints).Ifz t|Fτ ∼ N(m t|τ , St|τ), then it holds that, P(α⊤ j zt ≤β|...

work page

[4] [4]

The proof is in Section B.6.4

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], 8.ρ≥2˜γ −1 log max(β,ϵ/4) (1+˜κ)˜κRzD 9.T≥γ −1 log max(β,ϵ/4) (1+κ)κR1D 10.ω≤δ/2 11.ϵ 1 >0, where, ϵ1 := ϵ2 4Φ−1(1−δ) 2 −D 2(1 +κ) 2κ2γ−1(1−γ) 2T − 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 −C 1 −dλ −1ν(µ+κ 2γ−1ηD2) 22 Then, under eventFk, it holds th...

work page

[5] [5]

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], Then, it holds thatP(α⊤ j zt ≤β)≥1−δfor allt∈[τ 1 −1]and, P ∥ ˆΘ0 −Θ ⋆∥F ≤300 ¯wd s n+dlog(306 ¯Γ min(σ, c2/m)−1ω−1) (τ0 −1) min(σ, c2/m) ! ≥1−ω/3, where ¯Γ = κ2R1 +κ 2γ−1( p 2n¯wlog(2nT /ω) +Sc) +c 2 . B.1.3 Regret Decomposition We decompose the reg...

work page 2018

[6] [6]

Then, we show thattr(Σs) ≤ν for all s∈ [τk+1 − 1]

is well-defined in the sense that there existsϕ = 0such that ϕΣo i + (1 −ϕ )Σsafe i ∈ E p i for all i≤k . Then, we show thattr(Σs) ≤ν for all s∈ [τk+1 − 1]. Indeed, the conditions for Lemma 33 are satisfied since Fk is assumed to hold, and due to the assumption onν and the assumption thatA⋆ is strongly stable. Therefore Lemma 33 says thattr(Σs) ≤ν for all...

work page 2019

[7] [7]

eventF k holds 2.A ⋆ is(κ, γ)-strongly stable 3.ν≥ κ2tr(W) γ LetΣ safe ⋆,xx be the true covariance of the state under zero input, i.e.Σsafe ⋆,xx = A⋆Σsafe ⋆,xxA⊤ ⋆ + W. Then, it holds that, ∥Σsafe ⋆,xx −Σ safe k,xx∥ ≤ (n+m)κ 2ην γλ Proof.Note that, Σsafe ⋆,xx −Σ safe k,xx =A ⋆Σsafe ⋆,xxA⊤ ⋆ − ˆAkΣsafe k,xx ˆA⊤ k =A ⋆(Σsafe ⋆,xx −Σ safe k,xx)A⊤ ⋆ +A ⋆Σsafe...

work page

[8] [8]

Weusethenotation ST = I 0 ST,xx I 0 ⊤ with Sxx =PT−1 s=0 As ⋆W (As ⋆)⊤, andmt = I 0 AT−1 ⋆ x1 I 0 ⊤

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], 6.ρ≥2˜γ −1 log max(β,ϵ/4) (1+˜κ)˜κRzD 7.T≥γ −1 log max(β,ϵ/4) (1+κ)κR1D 8.ω≤δ/2 9.ξ= β−D(1+˜κ)˜κ(1−˜γ/2)ρRz Φ−1(1−δ+ω) 2 −D 2C1 10.ϵ 2 >0, where, ϵ2 := ϵ2 4Φ−1(1−δ) 2 −D 2(1 +κ) 2κ2γ−1(1−γ) 2T − 2Φ−1(1−δ/2)β 2ω ϕ(Φ−1(1−δ/2))Φ −1(1−δ) 4 −D 2C1 Then, i...

work page 2018

[9] [9]

Proof.Consider the sequencez safe t = xsafe t 0 wherex safe t+1 =A ⋆xsafe t +w t andx safe 1 =x 1

the policy K =0is( κ, γ)-strongly stable and ensures thatP(α⊤ j zK,w t ≤β−ϵ ) ≥ 1 −δ for all j∈[J], t∈[T], Then, it holds thatP(α⊤ j zt ≤β)≥1−δfor allt∈[τ 1 −1]. Proof.Consider the sequencez safe t = xsafe t 0 wherex safe t+1 =A ⋆xsafe t +w t andx safe 1 =x 1. Then, xt+1 −x safe t+1 =A ⋆(xt −x safe t ) +B ⋆ut =A t ⋆(x1 −x safe 1 ) + tX s=1 At−s ⋆ B⋆us. Th...

work page

[10] [10]

zt − Axt−1 0 zt − Axt−1 0 ⊤ |Ft−2 # v =v ⊤E

the policyK=0is(κ, γ)-strongly stable 3.∥x 1∥ ≤R 1 If, τ0 ≥1 + 134dlog 34R2 z ωmin(σ, c 2/m) . Then, it holds that, P ∥ ˆΘ0 −Θ ⋆∥F ≤300 ¯wd s n+dlog(306 ¯Γ min(σ, c2/m)−1ω−1) (τ0 −1) min(σ, c2/m) ! ≥1−ω/3, where ¯Γ = κ2R1 +κ 2γ−1( p 2n¯wlog(2nT /ω) +Sc) +c 2 . Proof. We apply Theorem 2.4 of Simchowitz et al. (2018). We verify each of the requirements for ...

work page 2018