Quantized Online LQR

Babak Hassibi; Barron Han; Victoria Kostina

arxiv: 2604.11930 · v1 · submitted 2026-04-13 · 📡 eess.SY · cs.IT· cs.SY· math.IT

Quantized Online LQR

Barron Han , Victoria Kostina , Babak Hassibi This is my paper

Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3

classification 📡 eess.SY cs.ITcs.SYmath.IT

keywords online LQRquantized controlcommunication constraintscertainty equivalenceregret boundsinformation-theoretic lower boundadaptive controlnetworked systems

0 comments

The pith

Quantized dynamics estimates achieve the information-theoretic lower bound of Omega(log T) bits for sublinear regret in online LQR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that any online LQR controller achieving regret that grows slower than linear in the horizon must communicate at least logarithmically many bits about the unknown dynamics. It introduces the Quantized Certainty Equivalent algorithm in which the plant uses local full-state observations to run ordinary least squares, quantizes the resulting parameter estimates, and sends them over the rate-limited uplink. The remote controller then computes the optimal policy from the quantized model and returns it so the plant can apply the control action with its superior state knowledge. This design matches the lower bound and produces regret that smoothly approaches the unquantized certainty-equivalent baseline as the quantization codebook is refined.

Core claim

Any scheme that attains O(T^alpha) regret for alpha in [1/2,1) relative to the infinite-horizon optimal LQR controller must transmit Omega(log T) bits; the Quantized Certainty Equivalent algorithm achieves this bound by sending quantized ordinary-least-squares estimates of the system matrices, yielding explicit regret expressions whose quantization-inflation factors Q_slow and Q_fast vanish with finer codebooks.

What carries the argument

The Quantized Certainty Equivalent (QCE-LQR) algorithm, which replaces raw-state quantization with quantized transmission of locally estimated dynamics parameters so that a remote controller can return a stabilizing policy for local execution.

If this is right

Regret scales as O(T^alpha) with alpha in [1/2,1) using only logarithmic total communication.
The regret expression contains explicit multiplicative factors that approach one as the quantization resolution increases.
The same scheme recovers performance comparable to an unquantized certainty-equivalent controller on both scalar and high-dimensional benchmark systems over horizons of length 10,000.
The lower bound applies to every possible scheme, not merely to certainty-equivalent controllers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation between local state observation and remote policy computation may extend naturally to other adaptive control problems that operate under bit-rate limits.
Similar quantization of learned model parameters could reduce communication load in distributed reinforcement-learning agents that share dynamics information rather than raw trajectories.
The result highlights a possible design principle for networked control: prioritize transmission of sufficient statistics about the plant over transmission of instantaneous measurements.

Load-bearing premise

The plant has perfect local access to its full state vector and can run ordinary least squares to produce dynamics estimates that are accurate enough to support stabilization, while the communication link is noiseless but rate-limited.

What would settle it

An explicit counter-example linear system on which any controller using fewer than Omega(log T) bits over horizon T incurs regret that exceeds O(T^alpha) for some alpha less than 1, or numerical runs of QCE-LQR on the Boeing 747 model where the observed regret fails to approach the unquantized baseline as codebook size grows.

Figures

Figures reproduced from arXiv: 2604.11930 by Babak Hassibi, Barron Han, Victoria Kostina.

**Figure 2.** Figure 2: Normalized regret Regret(t)/ √ t for Practical QCE-LQR (red) vs. unquantized CE (blue), T = 104 , 50 trials. Solid lines show medians; shaded bands show the interquartile range. Both controllers use the causal bootstrap trigger with a runtime fallback shield. In all four systems the two controllers converge to comparable asymptotic values, confirming that quantization overhead is small in these simulations… view at source ↗

**Figure 3.** Figure 3: Total bits transmitted B(t) by Practical QCE-LQR over time for each system. Dashed lines show the Θ(log T) reference. All systems exhibit the characteristic three-phase structure: pre-safe flags, Elias Gamma initialization jump, and logarithmically growing tracking corrections. Total bits scale with ds: from 123 (ds = 2) to 819 (ds = 24). REFERENCES [1] M. Simchowitz and D. Foster, “Naive exploration is op… view at source ↗

read the original abstract

We study online linear-quadratic regulation (LQR) with unknown dynamics under communication rate constraints. Classical networked control quantizes the plant state at every time step, requiring $O(T)$ total bits while injecting persistent quantization noise that limits control performance. We consider a setting where the plant observes its state locally and can estimate system dynamics via ordinary least squares, while a remote controller possesses knowledge of the control cost. Rather than quantizing the raw state, the plant transmits learned dynamics estimates over a rate-limited uplink, and the controller returns the optimal control policy so that the plant can compute actions locally using its superior state knowledge. We first prove a fundamental information-theoretic lower bound: any scheme achieving $O(T^\alpha)$ regret for $\alpha \in [1/2,1)$ compared to the optimal infinite horizon LQR controller that knows the true system dynamics must transmit at least $\Omega(\log T)$ bits. We then design the \textbf{Quantized Certainty Equivalent (QCE-LQR)} algorithm, which matches this bound. The resulting regret bound contains inflation factors $Q_{\mathrm{slow}}(\varrho)$ and $Q_{\mathrm{fast}}(\varrho)$ that vanish as the codebook resolution increases, smoothly recovering the unquantized baseline regret. Numerical experiments on four benchmark systems -- from a scalar unstable plant to a 24-parameter Boeing 747 lateral model -- confirm that a variant of QCE-LQR achieves regret comparable to an unquantized certainty equivalent controller over a horizon of $T=10{,}000$ steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proves you need only Ω(log T) total bits for sublinear regret in online LQR by quantizing dynamics estimates instead of states at every step, and gives a matching algorithm.

read the letter

The main point is that classical networked LQR pays O(T) bits by quantizing the state every time step, which also adds persistent noise. This work instead has the plant run local OLS to estimate the dynamics, quantize and send that estimate over the uplink, and receive the policy back from the remote controller. The plant then applies the policy using its local state. Total communication stays logarithmic while regret stays sublinear in T.

Referee Report

2 major / 3 minor

Summary. The paper investigates online LQR with unknown dynamics under communication rate constraints. It proves an information-theoretic lower bound requiring Ω(log T) bits of communication for any scheme to achieve O(T^α) regret where α ∈ [1/2, 1), compared to the optimal infinite-horizon LQR. It then proposes the Quantized Certainty Equivalent (QCE-LQR) algorithm that achieves a matching regret bound using quantized transmission of dynamics estimates, with regret inflation factors Q_slow(ρ) and Q_fast(ρ) that approach 1 as the quantization resolution ρ increases. Numerical experiments on benchmark systems, including a 24-parameter Boeing 747 model, show performance comparable to the unquantized case over T=10,000 steps.

Significance. If the lower bound and algorithm are correct, this work makes a significant contribution by establishing tight communication requirements for sublinear regret in adaptive LQR control. The ability to achieve near-optimal regret with only logarithmic total bits, rather than linear, is important for networked and resource-limited control applications. The matching of lower and upper bounds, along with the smooth recovery of the baseline regret as resolution increases, and the validation on high-dimensional systems, strengthen the practical and theoretical value.

major comments (2)

[Lower bound proof (likely §3 or Theorem 1)] The abstract states a fundamental lower bound of Ω(log T) bits, but without explicit assumptions on the system matrices (e.g., bounds on ||A||, ||B||, noise covariance) and the precise definition of regret (including any dependence on initial state or transients), it is hard to verify if the bound is parameter-free or holds uniformly. The reader's note indicates the full proof is not available for verification, which is load-bearing for the central claim.
[QCE-LQR algorithm and regret bound (likely §4)] The regret bound includes inflation factors Q_slow(ρ) and Q_fast(ρ), but the manuscript should provide explicit expressions or bounds for these factors in terms of the quantization error and system parameters to show how they vanish with ρ. Additionally, the choice of quantization schedule to ensure total bits are O(log T) while maintaining stability during learning needs to be detailed.

minor comments (3)

[Abstract] The abstract mentions 'four benchmark systems -- from a scalar unstable plant to a 24-parameter Boeing 747 lateral model' but does not list the specific systems or their dimensions, which would help assess the generality of the experiments.
[Notation] The symbols Q_slow(ρ) and Q_fast(ρ) are introduced without prior definition in the abstract; a brief explanation or reference to their definition in the main text would improve clarity.
[Experiments] Details on how the quantization resolution ρ is selected in the numerical experiments (e.g., fixed or adaptive) are not provided in the abstract, which affects reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below, providing clarifications and committing to revisions for improved clarity.

read point-by-point responses

Referee: [Lower bound proof (likely §3 or Theorem 1)] The abstract states a fundamental lower bound of Ω(log T) bits, but without explicit assumptions on the system matrices (e.g., bounds on ||A||, ||B||, noise covariance) and the precise definition of regret (including any dependence on initial state or transients), it is hard to verify if the bound is parameter-free or holds uniformly. The reader's note indicates the full proof is not available for verification, which is load-bearing for the central claim.

Authors: The system assumptions (bounds on ||A||, ||B||, stabilizability, and noise covariance properties) are stated in the problem formulation section immediately preceding the lower bound result. The regret is defined as the expected cumulative excess cost relative to the optimal infinite-horizon LQR controller that knows the true dynamics, for arbitrary but fixed initial state, with transients handled via standard Lyapunov analysis. The lower bound itself is established via an information-theoretic argument reducing to hypothesis testing over a finite discretization of the unknown parameters. We acknowledge that the main-text presentation is condensed for brevity. In the revision we will explicitly restate the assumptions and regret definition within the theorem statement itself and add a short remark confirming uniformity over the stated parameter class. revision: yes
Referee: [QCE-LQR algorithm and regret bound (likely §4)] The regret bound includes inflation factors Q_slow(ρ) and Q_fast(ρ), but the manuscript should provide explicit expressions or bounds for these factors in terms of the quantization error and system parameters to show how they vanish with ρ. Additionally, the choice of quantization schedule to ensure total bits are O(log T) while maintaining stability during learning needs to be detailed.

Authors: We agree that explicit dependence on quantization error would strengthen readability. The factors Q_slow(ρ) and Q_fast(ρ) arise from a perturbation analysis of the infinite-horizon LQR cost under bounded dynamics mismatch; they admit the form 1 + O(ε(ρ)) where ε(ρ) is the maximum quantization error of the codebook (which decays exponentially in ρ). Consequently both factors converge to 1 as ρ → ∞. For the communication schedule we use a doubling construction: resolution is increased at exponentially spaced epochs while transmitting only when the OLS estimate changes by more than the current quantization cell size. This keeps the total number of bits transmitted O(log T) and ensures the induced closed-loop perturbation remains summable, preserving stability. In the revision we will insert the explicit O(ε(ρ)) bounds together with a concise paragraph summarizing the epoch schedule and its stability argument. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper first derives an information-theoretic lower bound requiring Ω(log T) bits for any scheme to achieve O(T^α) regret (α ∈ [1/2,1)) relative to the optimal infinite-horizon LQR, based on the need to communicate sufficient dynamics information over a rate-limited channel. This bound is independent of any specific estimation procedure or fitted values. The QCE-LQR algorithm is then constructed separately, using local OLS dynamics estimates quantized and transmitted infrequently, with regret bounds containing explicit inflation factors Q_slow(ρ) and Q_fast(ρ) that are shown to approach 1 as quantization resolution increases. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing self-citation or uniqueness theorem is invoked; the central claims remain self-contained against external information-theoretic and control-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed from abstract only; full derivation and assumption list unavailable. The central claim rests on standard stabilizability/detectability of the LQR system, ordinary least squares consistency, and a noiseless rate-limited channel.

axioms (2)

domain assumption The linear system is stabilizable and detectable.
Required for existence of optimal infinite-horizon LQR controller and for certainty-equivalence to be stabilizing.
domain assumption Ordinary least squares produces a consistent dynamics estimate when the input is sufficiently exciting.
Underlies the certainty-equivalent policy computation.

pith-pipeline@v0.9.0 · 5581 in / 1440 out tokens · 39003 ms · 2026-05-10T15:21:01.443530+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

11em plus .33em minus .07em 4000 4000 100 4000 4000 500 `\.=1000 = #1 \@IEEEnotcompsoconly \@IEEEcompsoconly #1 * [1] 0pt [0pt][0pt] #1 * [1] 0pt [0pt][0pt] #1 * \| ** #1 \@IEEEauthorblockNstyle \@IEEEcompsocnotconfonly \@IEEEauthorblockAstyle \@IEEEcompsocnotconfonly \@IEEEcompsocconfonly \@IEEEauthordefaulttextstyle \@IEEEcompsocnotconfonly \@IEEEauthor...

work page
[2]

Simchowitz and D

M. Simchowitz and D. Foster, ``Naive exploration is optimal for online LQR ,'' in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research , pp. 8937--8948, PMLR, 13--18 Jul 2020

work page 2020
[3]

Mania, S

H. Mania, S. Tu, and B. Recht, ``Certainty equivalence is efficient for linear quadratic control,'' in Advances in Neural Information Processing Systems , pp. 10154--10164, Dec 2019

work page 2019
[4]

Kargin, S

T. Kargin, S. Lale, K. Azizzadenesheli, A. Anandkumar, and B. Hassibi, ``Thompson sampling for partially observable linear-quadratic control,'' in 2023 American Control Conference (ACC) , pp. 4561--4568, May 2023

work page 2023
[5]

Kostina and B

V. Kostina and B. Hassibi, ``Rate-cost tradeoffs in control,'' IEEE Transactions on Automatic Control , vol. 64, pp. 4525--4540, Apr. 2019

work page 2019
[6]

Tatikonda and S

S. Tatikonda and S. Mitter, ``Control under communication constraints,'' IEEE Transactions on Automatic Control , vol. 49, no. 7, pp. 1056--1068, Jul 2004

work page 2004
[7]

B. Han, V. Kostina, B. Hassibi, and O. Sabag, ``Coded K alman filtering over MIMO G aussian channels with feedback,'' in 2024 IEEE International Symposium on Information Theory (ISIT) , pp. 3261--3266, Jul 2024

work page 2024
[8]

Sahai and S

A. Sahai and S. Mitter, ``The necessity and sufficiency of anytime capacity for stabilization of a linear system over a noisy communication link—part i: Scalar systems,'' IEEE Transactions on Information Theory , vol. 52, pp. 3369--3395, Jul 2006

work page 2006
[9]

Mitra, L

A. Mitra, L. Ye, and V. Gupta, ``Towards model-free LQR control over rate-limited channels,'' in Proceedings of the Learning for Dynamics and Control Conference (L4DC) , Proceedings of Machine Learning Research, pp. 1253--1265, PMLR, Jul 2024

work page 2024
[10]

Fazel, R

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, ``Global convergence of policy gradient methods for the linear quadratic regulator,'' in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research , pp. 1467--1476, PMLR, 10--15 Jul 2018

work page 2018
[11]

Friedland, Control System Design: An Introduction to State-Space Methods

B. Friedland, Control System Design: An Introduction to State-Space Methods . Dover Publications, reprint ed., 2005

work page 2005
[12]

Abbasi-Yadkori and C

Y. Abbasi-Yadkori and C. Szepesv\'ari, ``Regret bounds for the adaptive control of linear quadratic systems,'' in Proceedings of the 24th Annual Conference on Learning Theory (S. M. Kakade and U. von Luxburg, eds.), vol. 19 of Proceedings of Machine Learning Research , (Budapest, Hungary), pp. 1--26, PMLR, 09--11 Jun 2011

work page 2011

[1] [1]

11em plus .33em minus .07em 4000 4000 100 4000 4000 500 `\.=1000 = #1 \@IEEEnotcompsoconly \@IEEEcompsoconly #1 * [1] 0pt [0pt][0pt] #1 * [1] 0pt [0pt][0pt] #1 * \| ** #1 \@IEEEauthorblockNstyle \@IEEEcompsocnotconfonly \@IEEEauthorblockAstyle \@IEEEcompsocnotconfonly \@IEEEcompsocconfonly \@IEEEauthordefaulttextstyle \@IEEEcompsocnotconfonly \@IEEEauthor...

work page

[2] [2]

Simchowitz and D

M. Simchowitz and D. Foster, ``Naive exploration is optimal for online LQR ,'' in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research , pp. 8937--8948, PMLR, 13--18 Jul 2020

work page 2020

[3] [3]

Mania, S

H. Mania, S. Tu, and B. Recht, ``Certainty equivalence is efficient for linear quadratic control,'' in Advances in Neural Information Processing Systems , pp. 10154--10164, Dec 2019

work page 2019

[4] [4]

Kargin, S

T. Kargin, S. Lale, K. Azizzadenesheli, A. Anandkumar, and B. Hassibi, ``Thompson sampling for partially observable linear-quadratic control,'' in 2023 American Control Conference (ACC) , pp. 4561--4568, May 2023

work page 2023

[5] [5]

Kostina and B

V. Kostina and B. Hassibi, ``Rate-cost tradeoffs in control,'' IEEE Transactions on Automatic Control , vol. 64, pp. 4525--4540, Apr. 2019

work page 2019

[6] [6]

Tatikonda and S

S. Tatikonda and S. Mitter, ``Control under communication constraints,'' IEEE Transactions on Automatic Control , vol. 49, no. 7, pp. 1056--1068, Jul 2004

work page 2004

[7] [7]

B. Han, V. Kostina, B. Hassibi, and O. Sabag, ``Coded K alman filtering over MIMO G aussian channels with feedback,'' in 2024 IEEE International Symposium on Information Theory (ISIT) , pp. 3261--3266, Jul 2024

work page 2024

[8] [8]

Sahai and S

A. Sahai and S. Mitter, ``The necessity and sufficiency of anytime capacity for stabilization of a linear system over a noisy communication link—part i: Scalar systems,'' IEEE Transactions on Information Theory , vol. 52, pp. 3369--3395, Jul 2006

work page 2006

[9] [9]

Mitra, L

A. Mitra, L. Ye, and V. Gupta, ``Towards model-free LQR control over rate-limited channels,'' in Proceedings of the Learning for Dynamics and Control Conference (L4DC) , Proceedings of Machine Learning Research, pp. 1253--1265, PMLR, Jul 2024

work page 2024

[10] [10]

Fazel, R

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, ``Global convergence of policy gradient methods for the linear quadratic regulator,'' in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research , pp. 1467--1476, PMLR, 10--15 Jul 2018

work page 2018

[11] [11]

Friedland, Control System Design: An Introduction to State-Space Methods

B. Friedland, Control System Design: An Introduction to State-Space Methods . Dover Publications, reprint ed., 2005

work page 2005

[12] [12]

Abbasi-Yadkori and C

Y. Abbasi-Yadkori and C. Szepesv\'ari, ``Regret bounds for the adaptive control of linear quadratic systems,'' in Proceedings of the 24th Annual Conference on Learning Theory (S. M. Kakade and U. von Luxburg, eds.), vol. 19 of Proceedings of Machine Learning Research , (Budapest, Hungary), pp. 1--26, PMLR, 09--11 Jun 2011

work page 2011