
arxiv: 2506.09685 · v2 · submitted 2025-06-11 · 📡 eess.SY · cs.SY

Bridging Continuous-time LQR and Reinforcement Learning via Gradient Flow of the Bellman Error

Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords LQR · Bellman error · gradient flow · optimal feedback · reinforcement learning · stabilizing policies · continuous-time control · Riccati equation

The pith

A gradient flow on the continuous-time Bellman error finds the optimal LQR feedback gain while keeping every policy along the path stabilizing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a continuous-time method to compute the optimal feedback gain for the infinite-horizon linear quadratic regulator problem. It introduces a Bellman error drawn from the Hamilton-Jacobi-Bellman equation that quantifies the suboptimality of any stabilizing feedback policy and is expressed directly in terms of the gain matrix. The authors establish that this error is smooth and coercive over the stability region and possesses a unique stationary point at the optimal gain. They obtain a closed-form expression for its gradient, which defines an ordinary differential equation whose solutions converge to the optimum. The flow stays inside the set of stabilizing policies at every instant, providing a bridge between classical LQR theory and reinforcement learning ideas through the use of Lyapunov equations.
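The role of Lyapunov equations mentioned above can be illustrated with a minimal sketch. It assumes the standard setup, which this page does not spell out: dynamics x' = Ax + Bu, feedback u = -Kx, and cost matrices Q and R. The double-integrator system, the gain `K`, and the helper name `lqr_cost_matrix` are hypothetical choices for illustration and are not taken from the paper. The sketch evaluates the infinite-horizon cost matrix of a stabilizing gain from a single Lyapunov solve and compares it with the Riccati solution.

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov


def lqr_cost_matrix(A, B, Q, R, K):
    """Return P_K solving (A-BK)^T P + P (A-BK) + Q + K^T R K = 0.

    For a stabilizing K, x0^T P_K x0 is the infinite-horizon cost of u = -K x.
    """
    A_K = A - B @ K
    if np.max(np.linalg.eigvals(A_K).real) >= 0:
        raise ValueError("K is not stabilizing; the cost is infinite")
    # SciPy solves a X + X a^H = q, so pass a = A_K^T and q = -(Q + K^T R K).
    return solve_continuous_lyapunov(A_K.T, -(Q + K.T @ R @ K))


# Hypothetical example system (double integrator); not from the paper.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Reference solution from the algebraic Riccati equation.
P_star = solve_continuous_are(A, B, Q, R)
K_star = np.linalg.solve(R, B.T @ P_star)

# An arbitrary stabilizing but suboptimal gain.
K = np.array([[2.0, 2.0]])
P_K = lqr_cost_matrix(A, B, Q, R, K)

# Suboptimality gap measured by the trace (unit-covariance initial state).
gap = np.trace(P_K) - np.trace(P_star)
print("optimal gain:", K_star)
print("cost gap of K:", gap)
```

The stability check comes first because the Lyapunov equation can still return a matrix for a non-stabilizing gain, but that matrix no longer represents a cost.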

Core claim

The central claim is that the continuous-time Bellman error, parametrized by the feedback gain, is coercive with a unique stationary point inside the stability region; its closed-form gradient induces a gradient flow that converges globally to the optimal stabilizing feedback from any initial stabilizing gain, with the entire trajectory consisting exclusively of stabilizing policies.

What carries the argument

The continuous-time Bellman error derived from the HJB equation and parametrized by the feedback gain, whose gradient generates the ODE flow that solves the LQR problem.
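This page does not reproduce the paper's exact definition of the Bellman error. As a hedged illustration of why an ARE residual can quantify suboptimality, the standard identity below evaluates the Riccati residual at the Lyapunov solution of a stabilizing gain. The symbols A, B, Q, R, P_K and the residual operator \mathcal{R} are assumptions introduced here and may differ from the paper's notation.

```latex
% Assumed setup (not stated on this page):
% \dot{x} = Ax + Bu, \quad u = -Kx, \quad
% \text{cost } \int_0^\infty (x^\top Q x + u^\top R u)\,dt, \quad Q \succeq 0,\; R \succ 0.
\begin{align}
  0 &= (A - BK)^\top P_K + P_K (A - BK) + Q + K^\top R K
      && \text{(Lyapunov equation for a stabilizing } K\text{)} \\
  \mathcal{R}(P) &:= A^\top P + P A + Q - P B R^{-1} B^\top P
      && \text{(ARE residual)} \\
  \mathcal{R}(P_K) &= -\bigl(K - R^{-1} B^\top P_K\bigr)^\top R \,\bigl(K - R^{-1} B^\top P_K\bigr) \preceq 0
      && \text{(substitute the first line into the second)}
\end{align}
```

The residual vanishes exactly when K = R^{-1} B^\top P_K, in which case P_K solves the ARE and K is the optimal gain. Any scalar error built from this residual therefore depends on the gain alone and not on the state.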

Load-bearing premise

The continuous-time Bellman error is coercive and has a unique stationary point inside the stability region.

What would settle it

A concrete linear system and initial stabilizing gain for which the gradient flow diverges, oscillates, or converges to a non-optimal point would disprove the global convergence claim.
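A numerical harness for this kind of check might look like the sketch below. Because the closed-form gradient of the Bellman error is not reproduced on this page, the sketch swaps in the well-known gradient of the LQR cost f_K = tr(P_K Σ), i.e. the baseline flow the paper compares against, not the paper's Bellman-error flow. The system matrices, the initial gain `K0`, the covariance `Sigma`, and the integration horizon are hypothetical choices.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

# Hypothetical example system (double integrator); not from the paper.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
Sigma = np.eye(2)  # assumed initial-state covariance


def cost_gradient(K):
    """Gradient of f_K = tr(P_K Sigma) for u = -K x: 2 (R K - B^T P_K) Y_K."""
    A_K = A - B @ K
    P = solve_continuous_lyapunov(A_K.T, -(Q + K.T @ R @ K))
    Y = solve_continuous_lyapunov(A_K, -Sigma)
    return 2.0 * (R @ K - B.T @ P) @ Y


def flow(t, k_flat):
    """Right-hand side of the gradient flow dK/dt = -grad f(K)."""
    K = k_flat.reshape(B.shape[1], A.shape[0])
    return -cost_gradient(K).ravel()


K0 = np.array([[2.0, 2.0]])  # stabilizing initial gain
sol = solve_ivp(flow, (0.0, 30.0), K0.ravel(), rtol=1e-8, atol=1e-10)

# Reference optimum from the algebraic Riccati equation.
K_star = np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R))

# Largest closed-loop eigenvalue real part at every accepted step.
max_real_parts = [
    np.max(np.linalg.eigvals(A - B @ k.reshape(1, 2)).real) for k in sol.y.T
]
stayed_stabilizing = all(m < 0 for m in max_real_parts)
final_error = np.linalg.norm(sol.y[:, -1].reshape(1, 2) - K_star)

print("stayed stabilizing:", stayed_stabilizing)
print("final distance to K*:", final_error)
```

Replacing `cost_gradient` with the paper's Bellman-error gradient would turn the same loop into a direct test of the claim; a run where `stayed_stabilizing` is false or `final_error` does not shrink would be the counterexample described above.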

Figures

Figures reproduced from arXiv: 2506.09685 by Albertus Johannes Malan, Armin Gießler, Sören Hohmann.

Figure 1. Contour plot of the Bellman error e_K (axes: K1, K2).
Figure 2. Contour plot of the LQR cost f_K − f_K*.
Figure 3. Plot of the residuals ∥K(t) − K*∥_F / ∥K(0) − K*∥_F for the gradient flows of the LQR cost f_K and the Bellman error e_K.
Original abstract

In this paper, we present a novel method for computing the optimal feedback gain of the infinite-horizon Linear Quadratic Regulator (LQR) problem via an ordinary differential equation. We introduce a novel continuous-time Bellman error, derived from the Hamilton-Jacobi-Bellman (HJB) equation, which quantifies the suboptimality of stabilizing policies and is parametrized in terms of the feedback gain. We analyze its properties, including its effective domain, smoothness, coerciveness and show the existence of a unique stationary point within the stability region. Furthermore, we derive a closed-form gradient expression of the Bellman error that induces a gradient flow. This converges to the optimal feedback and generates a unique trajectory which exclusively comprises stabilizing feedback policies. Additionally, this work advances interesting connections between LQR theory and Reinforcement Learning (RL) by redefining suboptimality of the Algebraic Riccati Equation (ARE) as a Bellman error, adapting a state-independent formulation, and leveraging Lyapunov equations to overcome the infinite-horizon challenge. We validate our method in a simulation and compare it to the state of the art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a continuous-time Bellman error for the infinite-horizon LQR problem, parametrized by the feedback gain K. It establishes properties such as smoothness, coerciveness, and a unique stationary point within the stability region. A closed-form gradient is derived, leading to a gradient flow ODE that is claimed to converge to the optimal gain while generating trajectories consisting only of stabilizing policies. The approach is positioned as a bridge between LQR theory and reinforcement learning via redefinition of ARE suboptimality, with numerical validation provided.

Significance. If the theoretical claims hold, particularly the global convergence and invariance properties of the flow, this could provide a new dynamical-systems method for solving the ARE, with potential implications for continuous-time RL algorithms. The use of Lyapunov equations to handle the infinite horizon and the state-independent formulation are notable connections. The simulation results indicate practical feasibility, but the contribution hinges on rigorous establishment of the flow's global behavior.

major comments (1)
  1. [§4] §4 (Gradient Flow and Convergence): The central claim that the ODE generates a unique trajectory exclusively comprising stabilizing feedback policies requires an invariance argument for the open set of stabilizing gains. While coerciveness and uniqueness of the stationary point inside the stability region are shown, the manuscript does not explicitly analyze the vector field's behavior near the stability boundary (where the Lyapunov solution ceases to exist) to ensure trajectories cannot escape in finite time. This is load-bearing for the assertion of global convergence from arbitrary stabilizing initial conditions.
minor comments (2)
  1. [§3] Notation in §3: The continuous-time Bellman error definition could more explicitly distinguish the state-independent formulation from standard state-dependent versions to aid readability.
  2. [Simulation] Simulation section: The comparison to state-of-the-art methods would be strengthened by reporting quantitative metrics such as convergence time or final cost error alongside the qualitative plots.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point regarding the invariance of the stabilizing set. We address the comment directly below and have incorporated a strengthened analysis in the revision.

  1. Referee: [§4] §4 (Gradient Flow and Convergence): The central claim that the ODE generates a unique trajectory exclusively comprising stabilizing feedback policies requires an invariance argument for the open set of stabilizing gains. While coerciveness and uniqueness of the stationary point inside the stability region are shown, the manuscript does not explicitly analyze the vector field's behavior near the stability boundary (where the Lyapunov solution ceases to exist) to ensure trajectories cannot escape in finite time. This is load-bearing for the assertion of global convergence from arbitrary stabilizing initial conditions.

    Authors: We agree that an explicit invariance argument is necessary for rigor and thank the referee for this observation. In the revised manuscript we have added a new lemma in §4 that directly addresses the vector-field behavior at the boundary. The continuous-time Bellman error e_K is shown to be coercive on the open stability set: e_K → +∞ as K approaches any boundary point, where the Lyapunov equation ceases to admit a positive-definite solution, and as ∥K∥ → ∞. Every sublevel set {K : e_K ≤ c} is therefore compact and lies strictly inside the stability region. The flow follows the negative gradient of this smooth, coercive function, so it cannot reach the boundary in finite time: doing so would require e_K to grow without bound, whereas e_K is nonincreasing along trajectories. We supply a self-contained proof that the open stability set is forward-invariant under the flow and that solutions exist globally for any initial stabilizing gain. Global convergence to the unique stationary point then follows from standard Lyapunov arguments on the compact sublevel sets. This addition makes the invariance claim fully rigorous without altering any other results.
    revision: yes
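The descent argument in this response can be written out in a few lines. The sketch below is the generic gradient-flow calculation, with e_K denoting the Bellman error as in the figure captions; it takes the smoothness and coerciveness claimed in the paper as given and is not a reproduction of the lemma the authors say they added.

```latex
\begin{align}
  \dot{K}(t) &= -\nabla e_{K(t)}, \qquad K(0) = K_0 \ \text{stabilizing}, \\
  \frac{d}{dt}\, e_{K(t)}
    &= \bigl\langle \nabla e_{K(t)},\, \dot{K}(t) \bigr\rangle
     = -\bigl\lVert \nabla e_{K(t)} \bigr\rVert_F^2 \;\le\; 0, \\
  \Longrightarrow\quad
  K(t) &\in \mathcal{S}_0 := \{\, K : e_K \le e_{K_0} \,\}
  \quad \text{for all } t \ge 0 \text{ in the interval of existence.}
\end{align}
```

If coerciveness holds, \mathcal{S}_0 is a compact subset of the open set of stabilizing gains, so the trajectory cannot approach the stability boundary or escape to infinity, and the solution extends to all t ≥ 0.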

Circularity Check

0 steps flagged

No circularity: derivation rests on standard HJB and Lyapunov equations without self-referential reduction.

The paper begins from the established continuous-time Hamilton-Jacobi-Bellman equation and the infinite-horizon Lyapunov equation for LQR cost, both of which are classical results external to this manuscript. The continuous-time Bellman error is introduced as a direct re-expression of ARE suboptimality for a stabilizing gain K; its gradient is then computed in closed form and the resulting ODE is analyzed for coerciveness and a unique critical point inside the stability region. None of these steps equates the target convergence result to its own inputs by construction, nor does any load-bearing premise collapse to a self-citation whose validity depends on the present paper. The claimed invariance of the stabilizing-gain set under the flow is asserted via the derived vector field, but this is an independent analytic claim rather than a definitional tautology. The overall chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard optimal-control background plus the newly introduced Bellman error; no free parameters are indicated and the invented entity is the error measure itself.

axioms (2)
  • [standard math] The Hamilton-Jacobi-Bellman equation governs the optimal value function for the LQR problem.
    Invoked to derive the continuous-time Bellman error from the HJB PDE.
  • [domain assumption] Lyapunov equations yield the infinite-horizon quadratic cost for any stabilizing linear feedback.
    Used to overcome the infinite-horizon difficulty when adapting the formulation to reinforcement learning.
invented entities (1)
  • [no independent evidence] Continuous-time Bellman error parametrized by feedback gain
    purpose: Quantifies suboptimality of any stabilizing policy and supplies the objective whose gradient induces the stabilizing flow.
    Newly defined in the paper as the central object bridging LQR and RL.

pith-pipeline@v0.9.0 · 5739 in / 1508 out tokens · 48058 ms · 2026-05-19T09:49:28.122260+00:00 · methodology

