Quantized Online LQR
Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3
The pith
Quantized dynamics estimates achieve the information-theoretic lower bound of Omega(log T) bits for sublinear regret in online LQR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Any scheme that attains O(T^alpha) regret for alpha in [1/2,1) relative to the infinite-horizon optimal LQR controller must transmit Omega(log T) bits; the Quantized Certainty Equivalent algorithm achieves this bound by sending quantized ordinary-least-squares estimates of the system matrices, yielding explicit regret expressions whose quantization-inflation factors Q_slow and Q_fast vanish with finer codebooks.
What carries the argument
The Quantized Certainty Equivalent (QCE-LQR) algorithm, which replaces raw-state quantization with quantized transmission of locally estimated dynamics parameters so that a remote controller can return a stabilizing policy for local execution.
If this is right
- Regret scales as O(T^alpha) with alpha in [1/2,1) using only logarithmic total communication.
- The regret expression contains explicit multiplicative factors that approach one as the quantization resolution increases.
- The same scheme recovers performance comparable to an unquantized certainty-equivalent controller on both scalar and high-dimensional benchmark systems over horizons of length 10,000.
- The lower bound applies to every possible scheme, not merely to certainty-equivalent controllers.
Where Pith is reading between the lines
- The separation between local state observation and remote policy computation may extend naturally to other adaptive control problems that operate under bit-rate limits.
- Similar quantization of learned model parameters could reduce communication load in distributed reinforcement-learning agents that share dynamics information rather than raw trajectories.
- The result highlights a possible design principle for networked control: prioritize transmission of sufficient statistics about the plant over transmission of instantaneous measurements.
Load-bearing premise
The plant has perfect local access to its full state vector and can run ordinary least squares to produce dynamics estimates that are accurate enough to support stabilization, while the communication link is noiseless but rate-limited.
What would settle it
An explicit counter-example linear system on which any controller using fewer than Omega(log T) bits over horizon T incurs regret that exceeds O(T^alpha) for some alpha less than 1, or numerical runs of QCE-LQR on the Boeing 747 model where the observed regret fails to approach the unquantized baseline as codebook size grows.
Figures
read the original abstract
We study online linear-quadratic regulation (LQR) with unknown dynamics under communication rate constraints. Classical networked control quantizes the plant state at every time step, requiring $O(T)$ total bits while injecting persistent quantization noise that limits control performance. We consider a setting where the plant observes its state locally and can estimate system dynamics via ordinary least squares, while a remote controller possesses knowledge of the control cost. Rather than quantizing the raw state, the plant transmits learned dynamics estimates over a rate-limited uplink, and the controller returns the optimal control policy so that the plant can compute actions locally using its superior state knowledge. We first prove a fundamental information-theoretic lower bound: any scheme achieving $O(T^\alpha)$ regret for $\alpha \in [1/2,1)$ compared to the optimal infinite horizon LQR controller that knows the true system dynamics must transmit at least $\Omega(\log T)$ bits. We then design the \textbf{Quantized Certainty Equivalent (QCE-LQR)} algorithm, which matches this bound. The resulting regret bound contains inflation factors $Q_{\mathrm{slow}}(\varrho)$ and $Q_{\mathrm{fast}}(\varrho)$ that vanish as the codebook resolution increases, smoothly recovering the unquantized baseline regret. Numerical experiments on four benchmark systems -- from a scalar unstable plant to a 24-parameter Boeing 747 lateral model -- confirm that a variant of QCE-LQR achieves regret comparable to an unquantized certainty equivalent controller over a horizon of $T=10{,}000$ steps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates online LQR with unknown dynamics under communication rate constraints. It proves an information-theoretic lower bound requiring Ω(log T) bits of communication for any scheme to achieve O(T^α) regret where α ∈ [1/2, 1), compared to the optimal infinite-horizon LQR. It then proposes the Quantized Certainty Equivalent (QCE-LQR) algorithm that achieves a matching regret bound using quantized transmission of dynamics estimates, with regret inflation factors Q_slow(ρ) and Q_fast(ρ) that approach 1 as the quantization resolution ρ increases. Numerical experiments on benchmark systems, including a 24-parameter Boeing 747 model, show performance comparable to the unquantized case over T=10,000 steps.
Significance. If the lower bound and algorithm are correct, this work makes a significant contribution by establishing tight communication requirements for sublinear regret in adaptive LQR control. The ability to achieve near-optimal regret with only logarithmic total bits, rather than linear, is important for networked and resource-limited control applications. The matching of lower and upper bounds, along with the smooth recovery of the baseline regret as resolution increases, and the validation on high-dimensional systems, strengthen the practical and theoretical value.
major comments (2)
- [Lower bound proof (likely §3 or Theorem 1)] The abstract states a fundamental lower bound of Ω(log T) bits, but without explicit assumptions on the system matrices (e.g., bounds on ||A||, ||B||, noise covariance) and the precise definition of regret (including any dependence on initial state or transients), it is hard to verify if the bound is parameter-free or holds uniformly. The reader's note indicates the full proof is not available for verification, which is load-bearing for the central claim.
- [QCE-LQR algorithm and regret bound (likely §4)] The regret bound includes inflation factors Q_slow(ρ) and Q_fast(ρ), but the manuscript should provide explicit expressions or bounds for these factors in terms of the quantization error and system parameters to show how they vanish with ρ. Additionally, the choice of quantization schedule to ensure total bits are O(log T) while maintaining stability during learning needs to be detailed.
minor comments (3)
- [Abstract] The abstract mentions 'four benchmark systems -- from a scalar unstable plant to a 24-parameter Boeing 747 lateral model' but does not list the specific systems or their dimensions, which would help assess the generality of the experiments.
- [Notation] The symbols Q_slow(ρ) and Q_fast(ρ) are introduced without prior definition in the abstract; a brief explanation or reference to their definition in the main text would improve clarity.
- [Experiments] Details on how the quantization resolution ρ is selected in the numerical experiments (e.g., fixed or adaptive) are not provided in the abstract, which affects reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below, providing clarifications and committing to revisions for improved clarity.
read point-by-point responses
-
Referee: [Lower bound proof (likely §3 or Theorem 1)] The abstract states a fundamental lower bound of Ω(log T) bits, but without explicit assumptions on the system matrices (e.g., bounds on ||A||, ||B||, noise covariance) and the precise definition of regret (including any dependence on initial state or transients), it is hard to verify if the bound is parameter-free or holds uniformly. The reader's note indicates the full proof is not available for verification, which is load-bearing for the central claim.
Authors: The system assumptions (bounds on ||A||, ||B||, stabilizability, and noise covariance properties) are stated in the problem formulation section immediately preceding the lower bound result. The regret is defined as the expected cumulative excess cost relative to the optimal infinite-horizon LQR controller that knows the true dynamics, for arbitrary but fixed initial state, with transients handled via standard Lyapunov analysis. The lower bound itself is established via an information-theoretic argument reducing to hypothesis testing over a finite discretization of the unknown parameters. We acknowledge that the main-text presentation is condensed for brevity. In the revision we will explicitly restate the assumptions and regret definition within the theorem statement itself and add a short remark confirming uniformity over the stated parameter class. revision: yes
-
Referee: [QCE-LQR algorithm and regret bound (likely §4)] The regret bound includes inflation factors Q_slow(ρ) and Q_fast(ρ), but the manuscript should provide explicit expressions or bounds for these factors in terms of the quantization error and system parameters to show how they vanish with ρ. Additionally, the choice of quantization schedule to ensure total bits are O(log T) while maintaining stability during learning needs to be detailed.
Authors: We agree that explicit dependence on quantization error would strengthen readability. The factors Q_slow(ρ) and Q_fast(ρ) arise from a perturbation analysis of the infinite-horizon LQR cost under bounded dynamics mismatch; they admit the form 1 + O(ε(ρ)) where ε(ρ) is the maximum quantization error of the codebook (which decays exponentially in ρ). Consequently both factors converge to 1 as ρ → ∞. For the communication schedule we use a doubling construction: resolution is increased at exponentially spaced epochs while transmitting only when the OLS estimate changes by more than the current quantization cell size. This keeps the total number of bits transmitted O(log T) and ensures the induced closed-loop perturbation remains summable, preserving stability. In the revision we will insert the explicit O(ε(ρ)) bounds together with a concise paragraph summarizing the epoch schedule and its stability argument. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper first derives an information-theoretic lower bound requiring Ω(log T) bits for any scheme to achieve O(T^α) regret (α ∈ [1/2,1)) relative to the optimal infinite-horizon LQR, based on the need to communicate sufficient dynamics information over a rate-limited channel. This bound is independent of any specific estimation procedure or fitted values. The QCE-LQR algorithm is then constructed separately, using local OLS dynamics estimates quantized and transmitted infrequently, with regret bounds containing explicit inflation factors Q_slow(ρ) and Q_fast(ρ) that are shown to approach 1 as quantization resolution increases. No step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing self-citation or uniqueness theorem is invoked; the central claims remain self-contained against external information-theoretic and control-theoretic benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The linear system is stabilizable and detectable.
- domain assumption Ordinary least squares produces a consistent dynamics estimate when the input is sufficiently exciting.
Reference graph
Works this paper leans on
-
[1]
11em plus .33em minus .07em 4000 4000 100 4000 4000 500 `\.=1000 = #1 \@IEEEnotcompsoconly \@IEEEcompsoconly #1 * [1] 0pt [0pt][0pt] #1 * [1] 0pt [0pt][0pt] #1 * \| ** #1 \@IEEEauthorblockNstyle \@IEEEcompsocnotconfonly \@IEEEauthorblockAstyle \@IEEEcompsocnotconfonly \@IEEEcompsocconfonly \@IEEEauthordefaulttextstyle \@IEEEcompsocnotconfonly \@IEEEauthor...
-
[2]
M. Simchowitz and D. Foster, ``Naive exploration is optimal for online LQR ,'' in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research , pp. 8937--8948, PMLR, 13--18 Jul 2020
work page 2020
- [3]
- [4]
-
[5]
V. Kostina and B. Hassibi, ``Rate-cost tradeoffs in control,'' IEEE Transactions on Automatic Control , vol. 64, pp. 4525--4540, Apr. 2019
work page 2019
-
[6]
S. Tatikonda and S. Mitter, ``Control under communication constraints,'' IEEE Transactions on Automatic Control , vol. 49, no. 7, pp. 1056--1068, Jul 2004
work page 2004
-
[7]
B. Han, V. Kostina, B. Hassibi, and O. Sabag, ``Coded K alman filtering over MIMO G aussian channels with feedback,'' in 2024 IEEE International Symposium on Information Theory (ISIT) , pp. 3261--3266, Jul 2024
work page 2024
-
[8]
A. Sahai and S. Mitter, ``The necessity and sufficiency of anytime capacity for stabilization of a linear system over a noisy communication link—part i: Scalar systems,'' IEEE Transactions on Information Theory , vol. 52, pp. 3369--3395, Jul 2006
work page 2006
- [9]
-
[10]
M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, ``Global convergence of policy gradient methods for the linear quadratic regulator,'' in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research , pp. 1467--1476, PMLR, 10--15 Jul 2018
work page 2018
-
[11]
Friedland, Control System Design: An Introduction to State-Space Methods
B. Friedland, Control System Design: An Introduction to State-Space Methods . Dover Publications, reprint ed., 2005
work page 2005
-
[12]
Y. Abbasi-Yadkori and C. Szepesv\'ari, ``Regret bounds for the adaptive control of linear quadratic systems,'' in Proceedings of the 24th Annual Conference on Learning Theory (S. M. Kakade and U. von Luxburg, eds.), vol. 19 of Proceedings of Machine Learning Research , (Budapest, Hungary), pp. 1--26, PMLR, 09--11 Jun 2011
work page 2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.