Bridging Continuous-time LQR and Reinforcement Learning via Gradient Flow of the Bellman Error

Albertus Johannes Malan; Armin Gießler; Sören Hohmann

arxiv: 2506.09685 · v2 · submitted 2025-06-11 · 📡 eess.SY · cs.SY

Armin Gießler, Albertus Johannes Malan, Sören Hohmann

Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3

keywords LQR · Bellman error · gradient flow · optimal feedback · reinforcement learning · stabilizing policies · continuous-time control · Riccati equation

The pith

A gradient flow on the continuous-time Bellman error finds the optimal LQR feedback gain while keeping every policy along the path stabilizing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a continuous-time method to compute the optimal feedback gain for the infinite-horizon linear quadratic regulator problem. It introduces a Bellman error drawn from the Hamilton-Jacobi-Bellman equation that quantifies the suboptimality of any stabilizing feedback policy and is expressed directly in terms of the gain matrix. The authors establish that this error is smooth and coercive over the stability region and possesses a unique stationary point at the optimal gain. They obtain a closed-form expression for its gradient, which defines an ordinary differential equation whose solutions converge to the optimum. The flow stays inside the set of stabilizing policies at every instant, providing a bridge between classical LQR theory and reinforcement learning ideas through the use of Lyapunov equations.
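For orientation, the standard continuous-time LQR objects that this summary refers to can be written out as follows. The symbols $A$, $B$, $Q$, $R$, $P_K$, $P^*$ and $V_K$ are conventional textbook notation assumed here, not necessarily the paper's own. In particular, the paper is described as using a state-independent formulation, so its cost may not carry the $x_0$ dependence shown.

```latex
\begin{align*}
\dot{x}(t) &= A x(t) + B u(t), \qquad u(t) = -K x(t), \\
V_K(x_0) &= \int_0^\infty x(t)^\top \left( Q + K^\top R K \right) x(t)\, dt = x_0^\top P_K x_0, \\
0 &= (A - BK)^\top P_K + P_K (A - BK) + Q + K^\top R K, \\
0 &= A^\top P^* + P^* A - P^* B R^{-1} B^\top P^* + Q, \qquad K^* = R^{-1} B^\top P^*.
\end{align*}
```

The third line is a Lyapunov equation, which has a unique solution whenever $A - BK$ is Hurwitz (the meaning of "stabilizing" here). The fourth line is the algebraic Riccati equation, whose stabilizing solution gives the optimal gain under the usual stabilizability and detectability assumptions.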

Core claim

The central claim is that the continuous-time Bellman error, parametrized by the feedback gain, is coercive with a unique stationary point inside the stability region; its closed-form gradient induces a gradient flow that converges globally to the optimal stabilizing feedback from any initial stabilizing gain, with the entire trajectory consisting exclusively of stabilizing policies.
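A scalar toy problem makes the claimed behaviour easy to inspect numerically. The sketch below is an illustration under loud assumptions, not the paper's method: the plant parameters `A`, `B`, `Q`, `R` are hypothetical, `bellman_error` is a stand-in (the magnitude of the scalar Riccati residual evaluated at the Lyapunov solution), the gradient is a finite difference rather than the paper's closed form, and the flow is discretized with forward Euler.

```python
import math

# Hypothetical scalar plant x' = A*x + B*u with u = -k*x and cost weights Q, R.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0


def lyapunov_p(k):
    """Cost coefficient p_k from the scalar Lyapunov equation; needs A - B*k < 0."""
    closed_loop = A - B * k
    if closed_loop >= 0.0:
        raise ValueError("gain is not stabilizing")
    return (Q + R * k * k) / (-2.0 * closed_loop)


def bellman_error(k):
    """Stand-in error (assumption): |Riccati residual| evaluated at p_k."""
    p = lyapunov_p(k)
    return abs(2.0 * A * p + Q - (B * p) ** 2 / R)


def gradient(k, h=1e-6):
    """Central finite difference; the paper derives a closed-form gradient instead."""
    return (bellman_error(k + h) - bellman_error(k - h)) / (2.0 * h)


def gradient_flow(k0, dt=1e-3, steps=20000):
    """Forward-Euler discretization of the flow k' = -d(error)/dk."""
    k, path = k0, [k0]
    for _ in range(steps):
        k -= dt * gradient(k)
        path.append(k)
    return path


path = gradient_flow(k0=4.0)
# Optimal gain from the scalar Riccati equation, used only as a reference value.
k_star = (A + math.sqrt(A * A + B * B * Q / R)) / B
print(f"final gain {path[-1]:.6f}, Riccati gain {k_star:.6f}")
```

For this toy problem the iterates approach $1 + \sqrt{2}$ and every iterate keeps `A - B * k` negative. A forward-Euler step that is too large could break either property even if the continuous flow does not.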

What carries the argument

The continuous-time Bellman error derived from the HJB equation and parametrized by the feedback gain, whose gradient generates the ODE flow that solves the LQR problem.
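A short textbook calculation shows why a Riccati residual can measure the suboptimality of a gain. Assume conventional notation: system matrices $A$, $B$, weights $Q \succeq 0$, $R \succ 0$, and, for a stabilizing $K$, the Lyapunov solution $P_K$ of $(A-BK)^\top P_K + P_K(A-BK) + Q + K^\top R K = 0$. Substituting $P_K$ into the Riccati operator then gives a sign-definite expression. Whether the paper's Bellman error is built from exactly this residual is an assumption of this sketch.

```latex
\begin{align*}
\mathcal{R}(P) &:= A^\top P + P A + Q - P B R^{-1} B^\top P, \\
\mathcal{R}(P_K)
  &= K^\top B^\top P_K + P_K B K - K^\top R K - P_K B R^{-1} B^\top P_K \\
  &= -\left( K - R^{-1} B^\top P_K \right)^\top R \left( K - R^{-1} B^\top P_K \right) \preceq 0.
\end{align*}
```

The residual vanishes exactly when $K = R^{-1} B^\top P_K$. In that case $P_K$ solves the Riccati equation with a Hurwitz closed loop, so $K$ is the optimal gain.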

Load-bearing premise

The continuous-time Bellman error is coercive and has a unique stationary point inside the stability region.

What would settle it

A concrete linear system and initial stabilizing gain for which the gradient flow diverges, oscillates, or converges to a non-optimal point would disprove the global convergence claim.
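A search for such a counterexample can be organized as a small harness that runs a discretized flow and reports the first violated claim. The version below is a sketch restricted to scalar systems: the function names, the test cases, and the stand-in error (magnitude of the scalar Riccati residual at the Lyapunov solution) are all assumptions, and the gradient is a finite difference rather than the paper's closed form. A clean report for a few toy cases proves nothing about the general claim, and a violation could also come from the Euler discretization rather than the continuous flow.

```python
import math


def stand_in_error(a, b, q, r, k):
    """|Riccati residual| at the Lyapunov solution p_k; None if k is not stabilizing."""
    if a - b * k >= 0.0:
        return None
    p = (q + r * k * k) / (2.0 * (b * k - a))
    return abs(2.0 * a * p + q - (b * p) ** 2 / r)


def find_counterexample(a, b, q, r, k0, dt=1e-4, steps=200000, h=1e-6, tol=1e-4):
    """Run a forward-Euler flow and describe the first violated claim, else None."""
    k_star = (a + math.sqrt(a * a + b * b * q / r)) / b  # scalar Riccati gain
    k = k0
    for step in range(steps):
        e_plus = stand_in_error(a, b, q, r, k + h)
        e_minus = stand_in_error(a, b, q, r, k - h)
        if e_plus is None or e_minus is None:
            return f"left the stabilizing set at step {step}"
        k -= dt * (e_plus - e_minus) / (2.0 * h)
    if abs(k - k_star) > tol:
        return f"ended at k={k:.6f}, optimum is {k_star:.6f}"
    return None


# Hypothetical test cases: (a, b, q, r, initial stabilizing gain).
cases = [
    (1.0, 1.0, 1.0, 1.0, 4.0),
    (-0.5, 2.0, 1.0, 0.5, 0.1),
    (2.0, 1.0, 3.0, 1.0, 2.5),
]
reports = [find_counterexample(*case) for case in cases]
for case, report in zip(cases, reports):
    print(case, "->", report or "no violation found")
```

Extending the harness to matrix gains would require a Lyapunov solver and the paper's actual error definition, neither of which is reproduced here.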

Figures

Figures reproduced from arXiv: 2506.09685 by Albertus Johannes Malan, Armin Gießler, Sören Hohmann.

**Figure 2.** Contour plot of the LQR cost $f_K - f_{K^*}$, with $f_K = \left(2K_1^3 + 2K_1^2 K_2 + 5K_1^2 + 2K_1 K_2^2 + 4K_1 K_2 + 4K_1 + 2K_2^3 + 7K_2^2 + 2K_2 + 5\right) / \ldots$ (denominator truncated in the extracted caption).

**Figure 3.** Plot of the residuals $\|K(t) - K^*\|_F / \|K(0) - K^*\|_F$ for the gradient flows of the LQR cost $f_K$ and the Bellman error $e_K$. For the given system (68), the gradient flows of the Bellman error $e_K$ converge faster than that of the LQR cost $f_K$. Due to numerical limitations, the residuals of the gradient flows stabilize at different values.

Original abstract

In this paper, we present a novel method for computing the optimal feedback gain of the infinite-horizon Linear Quadratic Regulator (LQR) problem via an ordinary differential equation. We introduce a novel continuous-time Bellman error, derived from the Hamilton-Jacobi-Bellman (HJB) equation, which quantifies the suboptimality of stabilizing policies and is parametrized in terms of the feedback gain. We analyze its properties, including its effective domain, smoothness, coerciveness and show the existence of a unique stationary point within the stability region. Furthermore, we derive a closed-form gradient expression of the Bellman error that induces a gradient flow. This converges to the optimal feedback and generates a unique trajectory which exclusively comprises stabilizing feedback policies. Additionally, this work advances interesting connections between LQR theory and Reinforcement Learning (RL) by redefining suboptimality of the Algebraic Riccati Equation (ARE) as a Bellman error, adapting a state-independent formulation, and leveraging Lyapunov equations to overcome the infinite-horizon challenge. We validate our method in a simulation and compare it to the state of the art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds a gradient flow on a continuous-time Bellman error for LQR that is meant to stay inside stabilizing gains, but the invariance argument looks incomplete.

The main thing to know is that this work defines a continuous-time Bellman error directly in terms of the feedback gain K, shows it is coercive with a unique minimum at the optimal gain, and derives a closed-form gradient that produces an ODE whose solutions are claimed to remain stabilizing and converge to the Riccati solution. That is the concrete new piece: a dynamical system on the gain matrix that reframes the ARE as a gradient flow while keeping the infinite-horizon setup via Lyapunov equations. The RL connection is mostly a rephrasing of existing suboptimality measures, but the parametrization and the explicit gradient are not routine extensions of prior LQR or policy-gradient results.

The simulation is simple and does what it needs to do by comparing trajectories and final costs against standard solvers. That part is fine and reproducible on the reported example.

The soft spot is the invariance claim. The paper establishes the domain, smoothness, and coerciveness inside the open set of stabilizing gains and shows a unique stationary point there. Those properties do not by themselves rule out the flow reaching the boundary in finite time, where the Lyapunov solution stops existing. Without an explicit inward-pointing or barrier argument near that boundary, the statement that every trajectory consists only of stabilizing policies rests on an unshown step. If the full proofs contain a clean invariance argument, the result tightens up; otherwise the global convergence guarantee is weaker than stated.

This is aimed at people who already work on continuous-time optimal control or on bridging LQR with RL. A reader looking for a fresh dynamical-systems view of the Riccati equation will find the construction useful and the math mostly self-contained. It is coherent enough and grounded enough to deserve referee time rather than a desk reject, even though the boundary behavior needs tightening before publication.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a continuous-time Bellman error for the infinite-horizon LQR problem, parametrized by the feedback gain K. It establishes properties such as smoothness, coerciveness, and a unique stationary point within the stability region. A closed-form gradient is derived, leading to a gradient flow ODE that is claimed to converge to the optimal gain while generating trajectories consisting only of stabilizing policies. The approach is positioned as a bridge between LQR theory and reinforcement learning via redefinition of ARE suboptimality, with numerical validation provided.

Significance. If the theoretical claims hold, particularly the global convergence and invariance properties of the flow, this could provide a new dynamical-systems method for solving the ARE, with potential implications for continuous-time RL algorithms. The use of Lyapunov equations to handle the infinite horizon and the state-independent formulation are notable connections. The simulation results indicate practical feasibility, but the contribution hinges on rigorous establishment of the flow's global behavior.

major comments (1)

[§4] §4 (Gradient Flow and Convergence): The central claim that the ODE generates a unique trajectory exclusively comprising stabilizing feedback policies requires an invariance argument for the open set of stabilizing gains. While coerciveness and uniqueness of the stationary point inside the stability region are shown, the manuscript does not explicitly analyze the vector field's behavior near the stability boundary (where the Lyapunov solution ceases to exist) to ensure trajectories cannot escape in finite time. This is load-bearing for the assertion of global convergence from arbitrary stabilizing initial conditions.
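The boundary behaviour at issue can at least be probed numerically on a toy problem. The sketch below evaluates a stand-in error along gains that approach the stability boundary of a hypothetical scalar plant. The plant parameters and the error definition (magnitude of the scalar Riccati residual at the Lyapunov solution) are assumptions, and numerical growth of this kind is evidence, not the invariance proof the comment asks for.

```python
def stand_in_error(k, a=1.0, b=1.0, q=1.0, r=1.0):
    """|Riccati residual| at the Lyapunov solution p_k (requires a - b*k < 0)."""
    p = (q + r * k * k) / (2.0 * (b * k - a))
    return abs(2.0 * a * p + q - (b * p) ** 2 / r)


# For a = b = 1 the stabilizing gains are k > 1, so the boundary sits at k = 1.
boundary = 1.0
distances = [10.0 ** (-j) for j in range(1, 7)]
errors = [stand_in_error(boundary + d) for d in distances]

for d, e in zip(distances, errors):
    print(f"distance to boundary {d:.0e}   stand-in error {e:.3e}")
```

In this example the error grows roughly like the inverse square of the distance to the boundary, which is the kind of blow-up a coercivity-based invariance argument relies on.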

minor comments (2)

[§3] Notation in §3: The continuous-time Bellman error definition could more explicitly distinguish the state-independent formulation from standard state-dependent versions to aid readability.
[Simulation] Simulation section: The comparison to state-of-the-art methods would be strengthened by reporting quantitative metrics such as convergence time or final cost error alongside the qualitative plots.
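Metrics of the kind requested here are cheap to compute from a stored gain trajectory. The helpers below are a sketch with hypothetical names: they form the normalized Frobenius residual used in Figure 3 and the first time after which it stays below a tolerance. The trajectory is synthetic and exists only to exercise the functions; it is not data from the paper.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def subtract(left, right):
    """Entrywise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(row_l, row_r)] for row_l, row_r in zip(left, right)]


def normalized_residuals(gains, k_star):
    """||K(t) - K*||_F / ||K(0) - K*||_F for each stored gain."""
    initial = frobenius_norm(subtract(gains[0], k_star))
    return [frobenius_norm(subtract(k, k_star)) / initial for k in gains]


def convergence_time(times, residuals, tol):
    """First time after which the residual stays below tol, or None."""
    settled = None
    for t, res in zip(times, residuals):
        if res < tol:
            if settled is None:
                settled = t
        else:
            settled = None
    return settled


# Synthetic, hypothetical trajectory K(t) = K* + exp(-t) * OFFSET.
K_STAR = [[1.0, 2.0]]
OFFSET = [[3.0, 4.0]]
times = [0.5 * i for i in range(21)]
gains = [[[ks + math.exp(-t) * d for ks, d in zip(K_STAR[0], OFFSET[0])]] for t in times]

residuals = normalized_residuals(gains, K_STAR)
settle = convergence_time(times, residuals, tol=1e-2)
print(f"settling time {settle}, final residual {residuals[-1]:.2e}")
```

Reporting the settling time and the final residual for each method in a table would complement the qualitative residual plot.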

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point regarding the invariance of the stabilizing set. We address the comment directly below and have incorporated a strengthened analysis in the revision.

Referee: [§4] §4 (Gradient Flow and Convergence): The central claim that the ODE generates a unique trajectory exclusively comprising stabilizing feedback policies requires an invariance argument for the open set of stabilizing gains. While coerciveness and uniqueness of the stationary point inside the stability region are shown, the manuscript does not explicitly analyze the vector field's behavior near the stability boundary (where the Lyapunov solution ceases to exist) to ensure trajectories cannot escape in finite time. This is load-bearing for the assertion of global convergence from arbitrary stabilizing initial conditions.

Authors: We agree that an explicit invariance argument is necessary for rigor and thank the referee for this observation. In the revised manuscript we have added a new lemma in §4 that directly addresses the vector-field behavior at the boundary. Because the continuous-time Bellman error is shown to be coercive on the open stability set (i.e., J(K) → +∞ as K approaches any point on the boundary where the Lyapunov equation ceases to admit a positive-definite solution), every sublevel set {K : J(K) ≤ c} is compact and lies strictly inside the stability region. The gradient flow is the negative gradient of this smooth, coercive function; consequently, the flow cannot reach the boundary in finite time, as that would require J to become infinite while decreasing along trajectories. We supply a self-contained proof that the open stability set is forward-invariant under the flow and that solutions exist globally for any initial stabilizing gain. Global convergence to the unique stationary point then follows from standard Lyapunov arguments on the compact sublevel sets. This addition makes the invariance claim fully rigorous without altering any other results.

revision: yes
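The descent step behind this argument can be written in a few lines. The notation follows the rebuttal's $J(K)$; the Frobenius inner product and the symbol $\mathcal{K}$ for the open set of stabilizing gains are assumptions of this sketch, and the paper's lemma may be phrased differently.

```latex
\begin{align*}
\dot{K}(t) &= -\nabla J(K(t)), \qquad K(0) = K_0 \in \mathcal{K}, \\
\frac{d}{dt} J(K(t)) &= \left\langle \nabla J(K(t)),\, \dot{K}(t) \right\rangle_F
  = -\left\| \nabla J(K(t)) \right\|_F^2 \le 0, \\
K(t) &\in \mathcal{S}_{K_0} := \left\{ K \in \mathcal{K} : J(K) \le J(K_0) \right\}
  \quad \text{for all } t \text{ in the maximal interval of existence}.
\end{align*}
```

If $\mathcal{S}_{K_0}$ is compact, as coercivity is claimed to guarantee, the solution cannot leave every compact subset of $\mathcal{K}$ in finite time, so the maximal interval of existence is $[0, \infty)$.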

Circularity Check

0 steps flagged

No circularity: derivation rests on standard HJB and Lyapunov equations without self-referential reduction.

The paper begins from the established continuous-time Hamilton-Jacobi-Bellman equation and the infinite-horizon Lyapunov equation for LQR cost, both of which are classical results external to this manuscript. The continuous-time Bellman error is introduced as a direct re-expression of ARE suboptimality for a stabilizing gain K; its gradient is then computed in closed form and the resulting ODE is analyzed for coerciveness and a unique critical point inside the stability region. None of these steps equates the target convergence result to its own inputs by construction, nor does any load-bearing premise collapse to a self-citation whose validity depends on the present paper. The claimed invariance of the stabilizing-gain set under the flow is asserted via the derived vector field, but this is an independent analytic claim rather than a definitional tautology. The overall chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard optimal-control background plus the newly introduced Bellman error; no free parameters are indicated and the invented entity is the error measure itself.

axioms (2)

[standard math] The Hamilton-Jacobi-Bellman equation governs the optimal value function for the LQR problem.
Invoked to derive the continuous-time Bellman error from the HJB PDE.
[domain assumption] Lyapunov equations yield the infinite-horizon quadratic cost for any stabilizing linear feedback.
Used to overcome the infinite-horizon difficulty when adapting the formulation to reinforcement learning.
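The second axiom can be checked numerically on a scalar example: the closed-form Lyapunov cost should match a long-horizon quadrature of the running cost. The sketch below uses hypothetical plant values and function names, and it truncates the infinite horizon at a finite time, which the Lyapunov expression avoids.

```python
import math


def lyapunov_cost(a, b, q, r, k, x0):
    """Infinite-horizon cost p_k * x0^2 from the scalar Lyapunov equation."""
    if a - b * k >= 0.0:
        raise ValueError("gain is not stabilizing")
    p = (q + r * k * k) / (2.0 * (b * k - a))
    return p * x0 * x0


def simulated_cost(a, b, q, r, k, x0, horizon=20.0, dt=1e-3):
    """Trapezoidal quadrature of the running cost along the exact closed-loop state."""
    rate = a - b * k

    def running(t):
        x = x0 * math.exp(rate * t)  # closed-loop state x(t) for u = -k*x
        return (q + r * k * k) * x * x

    n = int(round(horizon / dt))
    total = 0.5 * (running(0.0) + running(horizon))
    total += sum(running(i * dt) for i in range(1, n))
    return total * dt


# Hypothetical example: x' = x + u, u = -3x, unit weights, x(0) = 2.
exact = lyapunov_cost(1.0, 1.0, 1.0, 1.0, 3.0, 2.0)
approx = simulated_cost(1.0, 1.0, 1.0, 1.0, 3.0, 2.0)
print(f"Lyapunov cost {exact:.6f}, quadrature {approx:.6f}")
```

The agreement only holds for stabilizing gains; for a non-stabilizing gain the integral diverges, which is why the helper refuses such inputs.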

invented entities (1)

Continuous-time Bellman error parametrized by feedback gain [no independent evidence]
Purpose: Quantifies suboptimality of any stabilizing policy and supplies the objective whose gradient induces the stabilizing flow.
Newly defined in the paper as the central object bridging LQR and RL.

pith-pipeline@v0.9.0 · 5739 in / 1508 out tokens · 48058 ms · 2026-05-19T09:49:28.122260+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1] R. E. Kalman, “Contributions to the theory of optimal control,” Bol. Soc. Mat. Mex., no. 1, 1960

[2] G. Hewer, “An iterative technique for the computation of the steady state gains for the discrete optimal regulator,” IEEE Transactions on Automatic Control, no. 4, 1971

[3] D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Trans. Autom. Control, no. 1, 1968

[4] P. Lancaster and L. Rodman, Algebraic Riccati Equations. New York, USA: Oxford University Press Inc., 1995

[5] V. Balakrishnan and L. Vandenberghe, “Semidefinite programming duality and linear time-invariant systems,” IEEE Transactions on Automatic Control, no. 1, 2003

[6] D. Yao, S. Zhang, and X. Y. Zhou, “LQ control via semidefinite programming,” in Proceedings of the 38th IEEE Conference on Decision and Control, 1999

[7] T. Chen and B. A. Francis, Optimal Sampled-Data Control Systems. Springer London, 1995

[8] E. Feron et al., “Numerical Methods for H2 Related Problems,” in IEEE American Control Conference, 1992

[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018

[10] B. Kiumarsi et al., “Optimal and Autonomous Control Using Reinforcement Learning: A Survey,” IEEE Transactions on Neural Networks and Learning Systems, no. 6, 2018

[11] B. Recht, “A Tour of Reinforcement Learning: The View from Continuous Control,” Annual Review of Control, Robotics, and Autonomous Systems, no. 1, 2019

[12] D. Vrabie, K. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. London, UK: The Institution of Engineering and Technology, 2013

[13] W. Levine and M. Athans, “On the determination of the optimal constant output feedback gains for linear multivariable systems,” IEEE Trans. Autom. Control, no. 1, 1970

[14] J. Geromel and J. Bernussou, “Optimal decentralized control of dynamic systems,” Automatica, no. 5, 1982

[15] K. Mårtensson, “Gradient Methods for Large-Scale and Distributed Linear Quadratic Control,” Ph.D. dissertation, Lund University, Lund, 2012

[16] M. Fazel et al., “Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator,” in Proceedings of the 35th ICML, 2018

[17] J. Bu et al., LQR through the Lens of First Order Methods: Discrete-time Case, 2019. arXiv: 1907.08921 [eess.SY]

[18] J. Bu, A. Mesbahi, and M. Mesbahi, Policy Gradient-based Algorithms for Continuous-time Linear Quadratic Control,

[19] arXiv: 2006.09178 [eess.SY]

[20] J. Bu, A. Mesbahi, and M. Mesbahi, “LQR via First Order Flows,” in American Control Conference, 2020

[21] D. Vrabie et al., “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, no. 2, 2009

[22] D. Vrabie and F. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems,” Neural Networks, no. 3, 2009

[23] R. H. Bartels and G. W. Stewart, “Solution of the Matrix Equation AX + XB = C,” Communications of the ACM, no. 9, 1972

[24] J. P. Hespanha, Linear Systems Theory, 2nd ed. Princeton, USA: Princeton University Press, 2018

[25] J. Bu, A. Mesbahi, and M. Mesbahi, On Topological and Metrical Properties of Stabilizing Feedback Gains: the MIMO Case, 2019. arXiv: 1904.02737 [cs.SY]

[26] B. D. O. Anderson and J. B. Moore, Optimal Control: Linear Quadratic methods. Englewood Cliffs, USA: Prentice-Hall, Inc., 1990

[27] R. J. Leake and R.-W. Liu, “Construction of Suboptimal Control Sequences,” SIAM Journal on Control, no. 1, 1967

[28] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge University Press, 2012

[29] J. Munkres, Topology, 2nd ed. Harlow, UK: Pearson Education Limited, 2014

[30] T. Minka, “Old and New Matrix Algebra Useful for Statistics,” 2000

[31] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York, USA: McGraw-Hill, 1964
