Bridging Continuous-time LQR and Reinforcement Learning via Gradient Flow of the Bellman Error
Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3
The pith
A gradient flow on the continuous-time Bellman error finds the optimal LQR feedback gain while keeping every policy along the path stabilizing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The continuous-time Bellman error, parametrized by the feedback gain, is coercive and has a unique stationary point inside the stability region. Its closed-form gradient induces a gradient flow that converges to the optimal stabilizing feedback from any initial stabilizing gain, and every gain along the resulting trajectory is itself stabilizing.
What carries the argument
The continuous-time Bellman error derived from the HJB equation and parametrized by the feedback gain, whose gradient generates the ODE flow that solves the LQR problem.
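One way to see why an ARE residual can serve as a suboptimality measure is the standard identity below. It is a sketch under assumed notation that does not come from the paper: dynamics \dot{x} = Ax + Bu, weights Q \succeq 0 and R \succ 0, feedback u = -Kx, and P_K the Lyapunov solution for a stabilizing K. The paper's Bellman error may be defined differently; this only illustrates the mechanism.

```latex
% Assumed notation (not taken from the paper): \dot{x} = Ax + Bu, \; u = -Kx
\begin{align*}
&\text{Lyapunov equation for a stabilizing } K: \\
&\quad (A - BK)^\top P_K + P_K (A - BK) + Q + K^\top R K = 0, \\
&\text{ARE residual evaluated at } P_K: \\
&\quad \mathcal{R}(P_K) := A^\top P_K + P_K A - P_K B R^{-1} B^\top P_K + Q \\
&\quad\phantom{\mathcal{R}(P_K) :}= -\,(K - R^{-1} B^\top P_K)^\top R \,(K - R^{-1} B^\top P_K) \preceq 0 .
\end{align*}
```

Because R is positive definite, the residual vanishes exactly when K = R^{-1} B^\top P_K, which is the familiar optimality condition for the LQR gain.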
Load-bearing premise
The continuous-time Bellman error is coercive and has a unique stationary point inside the stability region.
What would settle it
A concrete linear system and initial stabilizing gain for which the gradient flow diverges, oscillates, or converges to a non-optimal point would disprove the global convergence claim.
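A numerical check of this kind could look like the sketch below. It does not use the paper's Bellman error, whose exact form is not given here; it substitutes the standard LQR cost J(K) = tr(P_K Σ) with its well-known gradient 2(RK − BᵀP_K)Y_K, and discretizes the flow with forward Euler. The double-integrator test case, the helper names (`lyap`, `euler_flow`), the step size, and Σ = I are all assumptions chosen for illustration.

```python
import numpy as np

def lyap(F, W):
    """Solve F X + X F^T + W = 0 by Kronecker vectorization (fine for small n)."""
    n = F.shape[0]
    I = np.eye(n)
    M = np.kron(F, I) + np.kron(I, F)  # row-major vec of F X + X F^T
    return np.linalg.solve(M, -W.reshape(-1)).reshape(n, n)

def is_stabilizing(A, B, K):
    """True if every eigenvalue of the closed loop A - B K has negative real part."""
    return np.max(np.linalg.eigvals(A - B @ K).real) < 0.0

def cost_and_grad(A, B, Q, R, Sigma, K):
    """Standard LQR cost tr(P_K Sigma) and its gradient 2 (R K - B^T P_K) Y_K."""
    Acl = A - B @ K
    P = lyap(Acl.T, Q + K.T @ R @ K)   # Acl^T P + P Acl + Q + K^T R K = 0
    Y = lyap(Acl, Sigma)               # Acl Y + Y Acl^T + Sigma = 0
    return np.trace(P @ Sigma), 2.0 * (R @ K - B.T @ P) @ Y

def euler_flow(A, B, Q, R, Sigma, K0, step=1e-2, n_steps=5000):
    """Forward-Euler discretization of dK/dt = -grad J(K); logs cost and stability."""
    K, costs, all_stable = K0.copy(), [], True
    for _ in range(n_steps):
        all_stable = all_stable and is_stabilizing(A, B, K)
        J, G = cost_and_grad(A, B, Q, R, Sigma, K)
        costs.append(J)
        K = K - step * G
    return K, costs, all_stable

# Hypothetical test case (assumption): double integrator with unit weights.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma = np.eye(2), np.eye(1), np.eye(2)
K0 = np.array([[1.0, 1.0]])              # closed-loop eigenvalues have real part -0.5

K_final, costs, all_stable = euler_flow(A, B, Q, R, Sigma, K0)
K_opt = np.array([[1.0, np.sqrt(3.0)]])  # closed-form LQR gain for this system
print(K_final, all_stable, np.linalg.norm(K_final - K_opt))
```

A counterexample in the sense described above would show up here as `all_stable` turning false, the logged cost increasing, or the final gain settling away from `K_opt`. A discretized run can only suggest such behavior, since Euler steps are not the continuous flow.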
Original abstract
In this paper, we present a novel method for computing the optimal feedback gain of the infinite-horizon Linear Quadratic Regulator (LQR) problem via an ordinary differential equation. We introduce a novel continuous-time Bellman error, derived from the Hamilton-Jacobi-Bellman (HJB) equation, which quantifies the suboptimality of stabilizing policies and is parametrized in terms of the feedback gain. We analyze its properties, including its effective domain, smoothness, coerciveness and show the existence of a unique stationary point within the stability region. Furthermore, we derive a closed-form gradient expression of the Bellman error that induces a gradient flow. This converges to the optimal feedback and generates a unique trajectory which exclusively comprises stabilizing feedback policies. Additionally, this work advances interesting connections between LQR theory and Reinforcement Learning (RL) by redefining suboptimality of the Algebraic Riccati Equation (ARE) as a Bellman error, adapting a state-independent formulation, and leveraging Lyapunov equations to overcome the infinite-horizon challenge. We validate our method in a simulation and compare it to the state of the art.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a continuous-time Bellman error for the infinite-horizon LQR problem, parametrized by the feedback gain K. It establishes properties such as smoothness, coerciveness, and a unique stationary point within the stability region. A closed-form gradient is derived, leading to a gradient flow ODE that is claimed to converge to the optimal gain while generating trajectories consisting only of stabilizing policies. The approach is positioned as a bridge between LQR theory and reinforcement learning via redefinition of ARE suboptimality, with numerical validation provided.
Significance. If the theoretical claims hold, particularly the global convergence and invariance properties of the flow, this could provide a new dynamical-systems method for solving the ARE, with potential implications for continuous-time RL algorithms. The use of Lyapunov equations to handle the infinite horizon and the state-independent formulation are notable connections. The simulation results indicate practical feasibility, but the contribution hinges on rigorous establishment of the flow's global behavior.
major comments (1)
- [§4] §4 (Gradient Flow and Convergence): The central claim that the ODE generates a unique trajectory exclusively comprising stabilizing feedback policies requires an invariance argument for the open set of stabilizing gains. While coerciveness and uniqueness of the stationary point inside the stability region are shown, the manuscript does not explicitly analyze the vector field's behavior near the stability boundary (where the Lyapunov solution ceases to exist) to ensure trajectories cannot escape in finite time. This is load-bearing for the assertion of global convergence from arbitrary stabilizing initial conditions.
minor comments (2)
- [§3] Notation in §3: The continuous-time Bellman error definition could more explicitly distinguish the state-independent formulation from standard state-dependent versions to aid readability.
- [Simulation] Simulation section: The comparison to state-of-the-art methods would be strengthened by reporting quantitative metrics such as convergence time or final cost error alongside the qualitative plots.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for identifying this important point regarding the invariance of the stabilizing set. We address the comment directly below and have incorporated a strengthened analysis in the revision.
Point-by-point responses
- Referee: [§4] §4 (Gradient Flow and Convergence): The central claim that the ODE generates a unique trajectory exclusively comprising stabilizing feedback policies requires an invariance argument for the open set of stabilizing gains. While coerciveness and uniqueness of the stationary point inside the stability region are shown, the manuscript does not explicitly analyze the vector field's behavior near the stability boundary (where the Lyapunov solution ceases to exist) to ensure trajectories cannot escape in finite time. This is load-bearing for the assertion of global convergence from arbitrary stabilizing initial conditions.
Authors: We agree that an explicit invariance argument is necessary for rigor and thank the referee for this observation. In the revised manuscript we have added a new lemma in §4 that directly addresses the vector-field behavior at the boundary. The continuous-time Bellman error, denoted J(K) here, is shown to be coercive on the open stability set (i.e., J(K) → +∞ as K approaches any point on the boundary where the Lyapunov equation ceases to admit a positive-definite solution). Consequently, every sublevel set {K : J(K) ≤ c} is compact and lies strictly inside the stability region. The flow follows the negative gradient of this smooth, coercive function, so J is non-increasing along trajectories; the flow therefore cannot reach the boundary in finite time, as that would require J to become infinite. We supply a self-contained proof that the open stability set is forward-invariant under the flow and that solutions exist globally for any initial stabilizing gain. Global convergence to the unique stationary point then follows from standard Lyapunov arguments on the compact sublevel sets. This addition makes the invariance claim fully rigorous without altering any other results.
Revision: yes
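The descent step behind the rebuttal's argument can be written out as the short chain below. This is a sketch, not the lemma from the revised manuscript: J is the Bellman error as named in the rebuttal, while the Frobenius norm and inner product on the space of gains are assumptions made here.

```latex
% J as in the rebuttal; \langle\cdot,\cdot\rangle and \|\cdot\|_F on gain matrices are assumptions
\begin{align*}
\dot{K}(t) &= -\nabla J(K(t)), \qquad K(0) = K_0 \ \text{stabilizing}, \\
\frac{d}{dt} J(K(t)) &= \langle \nabla J(K(t)), \dot{K}(t) \rangle
                      = -\,\|\nabla J(K(t))\|_F^2 \;\le\; 0, \\
\Rightarrow\quad K(t) &\in \{K : J(K) \le J(K_0)\}
  \quad \text{for all } t \text{ in the maximal interval of existence.}
\end{align*}
```

Since that sublevel set is compact and contained in the stability region, the solution can neither blow up nor reach the boundary in finite time, which extends the maximal interval to all t ≥ 0.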
Circularity Check
No circularity: derivation rests on standard HJB and Lyapunov equations without self-referential reduction.
Full rationale
The paper begins from the established continuous-time Hamilton-Jacobi-Bellman equation and the infinite-horizon Lyapunov equation for LQR cost, both of which are classical results external to this manuscript. The continuous-time Bellman error is introduced as a direct re-expression of ARE suboptimality for a stabilizing gain K; its gradient is then computed in closed form and the resulting ODE is analyzed for coerciveness and a unique critical point inside the stability region. None of these steps equates the target convergence result to its own inputs by construction, nor does any load-bearing premise collapse to a self-citation whose validity depends on the present paper. The claimed invariance of the stabilizing-gain set under the flow is asserted via the derived vector field, but this is an independent analytic claim rather than a definitional tautology. The overall chain therefore remains non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] The Hamilton-Jacobi-Bellman equation governs the optimal value function for the LQR problem.
- [domain assumption] Lyapunov equations yield the infinite-horizon quadratic cost for any stabilizing linear feedback.
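The second axiom above can be checked numerically on a small example. The sketch below compares the Lyapunov-based cost x0ᵀ P x0 with a direct RK4 simulation of the running cost under u = −Kx. The system matrices, the gain, the initial state, and the helper name `lyap` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def lyap(F, W):
    """Solve F X + X F^T + W = 0 by Kronecker vectorization (fine for small n)."""
    n = F.shape[0]
    I = np.eye(n)
    M = np.kron(F, I) + np.kron(I, F)  # row-major vec of F X + X F^T
    return np.linalg.solve(M, -W.reshape(-1)).reshape(n, n)

# Hypothetical example (assumption): double integrator with a stabilizing gain.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[1.0, 1.0]])
x0 = np.array([1.0, -0.5])

Acl = A - B @ K                 # closed-loop dynamics under u = -K x
W = Q + K.T @ R @ K             # running cost x^T W x
P = lyap(Acl.T, W)              # Acl^T P + P Acl + W = 0
cost_lyapunov = float(x0 @ P @ x0)

def rhs(z):
    """Augmented dynamics: state x and accumulated cost c with dc/dt = x^T W x."""
    x = z[:2]
    return np.concatenate([Acl @ x, [x @ W @ x]])

# Classical RK4 over a horizon long enough for the state to decay.
z = np.concatenate([x0, [0.0]])
dt, T = 2e-3, 40.0
for _ in range(int(round(T / dt))):
    k1 = rhs(z)
    k2 = rhs(z + 0.5 * dt * k1)
    k3 = rhs(z + 0.5 * dt * k2)
    k4 = rhs(z + dt * k3)
    z = z + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
cost_simulated = float(z[2])

print(cost_lyapunov, cost_simulated)
```

The finite horizon T is a truncation of the infinite-horizon integral, so agreement holds only up to the neglected tail and the integrator error; for a non-stabilizing K the simulated cost would grow without bound and the comparison would be meaningless.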
invented entities (1)
- Continuous-time Bellman error parametrized by feedback gain (no independent evidence)