Global Convergence of Policy Gradient Methods for ReLU Controllers in Linear Quadratic Regulation

César A. Uribe; Jhojan A. Rodriguez-Gil

arXiv: 2604.22138 · v1 · submitted 2026-04-24 · 🧮 math.OC · cs.SY · eess.SY

Pith reviewed 2026-05-08 11:22 UTC · model grok-4.3

classification 🧮 math.OC · cs.SY · eess.SY

keywords: policy gradient, linear quadratic regulator, ReLU network, global convergence, nonconvex optimization, scalar LQR, model-based reinforcement learning

The pith

Model-based policy gradient on overparameterized ReLU networks converges globally to the optimal scalar LQR gain with high probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that policy gradient can solve the scalar discounted linear-quadratic regulator even when the controller is a redundant, nonconvex one-hidden-layer ReLU network. The key reduction is to track only two effective gains, one acting on positive states and one on negative states, which together determine the entire closed-loop dynamics. With random initialization, a wide enough network, and sufficiently small steps, the method keeps the system stable, drives the cost down at a geometric rate, and pushes both effective gains to the single optimal linear gain with high probability. This matters because it gives an exact global-convergence guarantee for a nonconvex control parameterization that is still simple enough to analyze completely.

Core claim

For the scalar deterministic discounted LQR, a one-hidden-layer ReLU network without biases induces two effective gains on the positive and negative half-lines. Under random initialization, sufficient width, and small step size, model-based policy gradient keeps the closed-loop system stable, decreases the cost geometrically, and drives the two effective gains to the unique optimal LQR gain with high probability.

What carries the argument

The pair of effective gains induced by the ReLU network on the positive and negative half-lines, which completely capture the piecewise-linear controller and allow exact tracking of the nonconvex policy-gradient flow.
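The reduction to two gains can be checked numerically. The following is a minimal sketch, assuming the controller has the form u(x) = sum_i a_i * relu(w_i * x); the names `relu_controller`, `effective_gains`, `a`, and `w` are illustrative and not taken from the paper, and the sign convention (u = k x rather than u = -k x) is also an assumption.

```python
import numpy as np

def relu_controller(x, a, w):
    """Bias-free one-hidden-layer ReLU controller: u(x) = sum_i a_i * relu(w_i * x)."""
    return float(np.sum(a * np.maximum(w * x, 0.0)))

def effective_gains(a, w):
    """Return (k_plus, k_minus) such that u(x) = k_plus * x for x >= 0
    and u(x) = k_minus * x for x < 0."""
    # A neuron with w_i > 0 is active only for x > 0; one with w_i < 0 only for x < 0.
    k_plus = float(np.sum(a[w > 0] * w[w > 0]))
    k_minus = float(np.sum(a[w < 0] * w[w < 0]))
    return k_plus, k_minus

# Illustrative check with random weights (width and seed are arbitrary choices).
rng = np.random.default_rng(0)
a = rng.normal(size=32)
w = rng.normal(size=32)
k_plus, k_minus = effective_gains(a, w)
for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    k = k_plus if x >= 0 else k_minus
    assert np.isclose(relu_controller(x, a, w), k * x)
```

Many different weight vectors map to the same pair of gains, which is the redundancy the summary refers to.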

If this is right

The cost-to-go decreases geometrically to the optimal LQR cost.
The closed-loop system remains stable for every iterate with high probability.
Both effective gains converge to the unique optimal scalar LQR gain.
The result holds for any sufficiently wide network and sufficiently small step size.
The analysis applies exactly because the two effective gains reduce the nonconvex problem to a two-dimensional dynamical system.
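As a worked illustration of how stability and cost depend only on the two gains, consider the simplest case, where the closed-loop multiplier on the positive half-line is nonnegative so the state never changes sign. The notation below (plant coefficients a, b, cost weights q, r, discount γ, and the convention u = k x) is assumed for illustration and may differ from the paper's.

```latex
% Assumed notation (not taken from the paper):
% plant x_{t+1} = a x_t + b u_t, stage cost q x_t^2 + r u_t^2, discount \gamma \in (0,1),
% controller u_t = k_+ x_t for x_t \ge 0 and u_t = k_- x_t for x_t < 0.
\[
  c_\pm = a + b\,k_\pm, \qquad
  J(x_0) = \sum_{t=0}^{\infty} \gamma^t \left(q + r k_+^2\right) c_+^{2t} x_0^2
         = \frac{\left(q + r k_+^2\right) x_0^2}{1 - \gamma c_+^2}
  \quad \text{for } x_0 > 0,\; c_+ \ge 0,\; \gamma c_+^2 < 1 .
\]
```

The same expression with k_- and c_- covers x_0 < 0 under the analogous conditions. When a multiplier is negative the state alternates between half-lines and the cost couples both gains, a case this sketch does not cover.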

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The effective-gain reduction may extend to vector LQR if similar low-dimensional summaries of the piecewise-linear controller can be found.
The same proof technique could be used to analyze policy gradient on other piecewise-linear or ReLU-based controllers whose closed-loop map admits a low-dimensional parameterization.
Wider networks improve the probability of success but are not strictly necessary once the initialization lands in the basin that the analysis covers.
Model-free variants would require additional concentration arguments to replace the exact model-based gradient.

Load-bearing premise

The plant is scalar, deterministic and discounted, the controller is a bias-free one-hidden-layer ReLU network, gradients are model-based, and the two effective gains fully determine stability and cost.

What would settle it

An explicit scalar LQR instance and ReLU network where, after random initialization of sufficient width, a small-step model-based policy-gradient update either destabilizes the closed loop or leaves at least one effective gain bounded away from the optimal value.
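One way to probe for such a counterexample is a small numerical experiment. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the instance (`A`, `B`, `Q`, `R`, `GAMMA`), the width, the initialization scale, the step size, and the choice to weight initial states x0 = +1 and x0 = -1 equally are all hypothetical. It uses the closed-form half-line cost, which holds only while both closed-loop multipliers stay nonnegative, and it stops and reports instability otherwise.

```python
import numpy as np

# Hypothetical instance: x' = A*x + B*u, stage cost Q*x^2 + R*u^2, discount GAMMA.
A, B, Q, R, GAMMA = 0.8, 1.0, 1.0, 1.0, 0.9

def half_line_cost(k):
    """Discounted cost from |x0| = 1 under u = k*x (needs 0 <= A+B*k, GAMMA*(A+B*k)^2 < 1)."""
    c = A + B * k
    return (Q + R * k**2) / (1.0 - GAMMA * c**2)

def half_line_grad(k):
    """Exact derivative of half_line_cost with respect to the gain k."""
    c = A + B * k
    d = 1.0 - GAMMA * c**2
    return (2.0 * R * k * d + (Q + R * k**2) * 2.0 * GAMMA * B * c) / d**2

def optimal_gain(iters=2000):
    """Optimal linear gain from value iteration on the scalar discounted Riccati recursion."""
    P = 0.0
    for _ in range(iters):
        P = Q + GAMMA * A**2 * P - (GAMMA * A * B * P) ** 2 / (R + GAMMA * B**2 * P)
    return -GAMMA * A * B * P / (R + GAMMA * B**2 * P)

def effective_gains(a, w):
    """Gains of u(x) = sum_i a_i * relu(w_i * x) on x > 0 and x < 0."""
    return float(np.sum(a[w > 0] * w[w > 0])), float(np.sum(a[w < 0] * w[w < 0]))

def run_policy_gradient(m=64, eta=0.01, steps=3000, seed=0):
    """Gradient descent on the weights using the exact (model-based) gradient."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(0.1 / m)  # small random initialization (an assumption)
    a = rng.normal(scale=scale, size=m)
    w = rng.normal(scale=scale, size=m)
    costs, stable = [], True
    for _ in range(steps):
        k_plus, k_minus = effective_gains(a, w)
        c_plus, c_minus = A + B * k_plus, A + B * k_minus
        if min(c_plus, c_minus) < 0.0 or GAMMA * max(c_plus, c_minus) ** 2 >= 1.0:
            stable = False  # outside the regime where the closed form applies
            break
        costs.append(half_line_cost(k_plus) + half_line_cost(k_minus))
        # Chain rule: dJ/da_i = g_i * w_i and dJ/dw_i = g_i * a_i, where g_i is the
        # cost derivative for the half-line on which neuron i is active.
        g = (np.where(w > 0, half_line_grad(k_plus), 0.0)
             + np.where(w < 0, half_line_grad(k_minus), 0.0))
        a, w = a - eta * g * w, w - eta * g * a
    k_plus, k_minus = effective_gains(a, w)
    return k_plus, k_minus, costs, stable

if __name__ == "__main__":
    k_plus, k_minus, costs, stable = run_policy_gradient()
    print("stable:", stable, "gains:", k_plus, k_minus, "optimal:", optimal_gain())
```

A run in which `stable` is false, or in which either returned gain stays away from `optimal_gain()`, would only be evidence against the claim if the width, initialization, and step size also satisfy the paper's conditions, which this sketch does not encode.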

Figures

Figures reproduced from arXiv: 2604.22138 by César A. Uribe, Jhojan A. Rodriguez-Gil.

**Figure 1.** (Left) Difference with respect to the cost of the …

**Figure 2.** Movement of 50 randomly selected neurons in the …

Original abstract

We study the convergence of model-based policy gradient for the deterministic, scalar, discounted linear-quadratic regulator when the controller is an overparameterized one-hidden-layer ReLU network without biases. Although the optimal LQR controller is linear, neural parameterization creates a redundant nonconvex weight space with a possibly asymmetric piecewise-linear controller. We show that this structure can still be analyzed exactly through the two effective gains induced on the positive and negative half-lines. Under suitable random initialization, sufficient width, and a small step size, the model-based policy gradient remains stable, decreases the cost geometrically, and drives the effective gains to the unique optimal scalar LQR gain with high probability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean global convergence result for model-based policy gradient on overparameterized ReLU controllers in scalar LQR by reducing everything to two effective gains.

The main takeaway is that under random initialization, enough width, and small steps, the policy gradient stays stable, cuts the cost geometrically, and pushes the two effective gains to the optimal LQR value with high probability. The authors pull this off by observing that a one-hidden-layer ReLU net without biases on a scalar state simply creates two linear gains, one on each half-line, so the nonconvex weight space collapses to a two-dimensional problem they can analyze directly.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that for the scalar deterministic discounted LQR problem, model-based policy gradient applied to an overparameterized one-hidden-layer ReLU network without biases achieves global convergence: under random initialization, sufficient width, and sufficiently small step size, the iterates remain stable, the cost decreases geometrically, and the two effective gains (induced on the positive and negative half-lines) converge to the unique optimal scalar LQR gain with high probability. The analysis exploits the exact reduction of the piecewise-linear controller to these two effective gains.

Significance. If the proofs are complete, the result supplies a fully rigorous, structure-exploiting global-convergence guarantee for a nonconvex neural parameterization of a classic control problem. The reduction to two effective gains is exact and allows direct analysis of the landscape, which is a clear technical strength. The work therefore contributes a concrete, verifiable example to the literature on optimization landscapes for policy optimization. Its immediate scope is limited by the scalar deterministic setting, but the techniques may seed extensions to higher-dimensional or stochastic cases.

minor comments (2)

[Abstract and §1] The abstract and introduction should explicitly state the admissible range for the discount factor and the precise definition of 'sufficient width' (e.g., a lower bound in terms of problem parameters) so that the high-probability statement is immediately checkable.
[Main convergence theorem] In the convergence theorem, the geometric rate is stated in terms of the effective gains; a short remark clarifying whether both gains converge to the same scalar value (as required for the optimal linear controller) would remove any ambiguity.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. We appreciate the recognition that the exact reduction to two effective gains provides a clear technical strength and a verifiable example for policy optimization landscapes. We will incorporate any minor clarifications or improvements in the revised version.

Circularity Check

0 steps flagged

No significant circularity: derivation is self-contained via explicit ReLU structure and standard random-init analysis

The paper's central argument reduces the nonconvex ReLU controller to two effective gains on the positive/negative half-lines, then proves geometric cost decrease and convergence to the unique optimal LQR gain under random initialization, sufficient width, and small step size. This reduction follows directly from the one-hidden-layer ReLU parameterization without biases (an exact structural property, not a fitted or self-defined quantity). No load-bearing step relies on self-citation, parameter fitting to the target result, or an ansatz imported from prior work by the same authors. The proof is a direct mathematical analysis of the induced landscape in the scalar deterministic discounted LQR setting, with all assumptions stated explicitly and the convergence claim derived from those assumptions rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard LQR background facts and the structural property that a bias-free ReLU network induces exactly two effective gains.

axioms (2)

domain assumption: The optimal controller for scalar LQR is linear.
Invoked as a background fact in the abstract.
domain assumption: A one-hidden-layer ReLU network without biases induces two effective gains on the positive and negative half-lines.
Stated in the abstract as the central modeling step.
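The first assumption can be made concrete with a short sketch that computes the optimal linear gain for a scalar discounted LQR and compares it against a grid of linear gains. The notation (x' = a*x + b*u, stage cost q*x^2 + r*u^2, discount gamma, u = k*x) and the function names are assumptions for illustration. The comparison only ranks linear gains against each other; it does not by itself show that no nonlinear controller does better.

```python
def optimal_gain(a, b, q, r, gamma, iters=2000):
    """Value iteration on the scalar discounted Riccati recursion.

    Returns (P, k): the optimal cost is P * x0^2 and the optimal control is u = k * x.
    """
    P = 0.0
    for _ in range(iters):
        P = q + gamma * a**2 * P - (gamma * a * b * P) ** 2 / (r + gamma * b**2 * P)
    k = -gamma * a * b * P / (r + gamma * b**2 * P)
    return P, k

def linear_gain_cost(k, a, b, q, r, gamma):
    """Discounted cost from x0 = 1 under u = k*x; infinite when gamma*(a+b*k)^2 >= 1."""
    c = a + b * k
    if gamma * c**2 >= 1.0:
        return float("inf")
    return (q + r * k**2) / (1.0 - gamma * c**2)

if __name__ == "__main__":
    params = (0.8, 1.0, 1.0, 1.0, 0.9)  # hypothetical instance (a, b, q, r, gamma)
    P, k = optimal_gain(*params)
    grid = [i / 1000.0 - 2.0 for i in range(3001)]  # candidate gains in [-2, 1]
    best = min(grid, key=lambda g: linear_gain_cost(g, *params))
    print("Riccati gain:", k, "best grid gain:", best, "optimal cost:", P)
```

On a given instance, the gain returned here is the target that both effective gains should approach.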

pith-pipeline@v0.9.0 · 5416 in / 1277 out tokens · 57703 ms · 2026-05-08T11:22:01.897369+00:00 · methodology

