Global Convergence of Policy Gradient Methods for ReLU Controllers in Linear Quadratic Regulation
Pith reviewed 2026-05-08 11:22 UTC · model grok-4.3
The pith
Model-based policy gradient on overparameterized ReLU networks converges globally to the optimal scalar LQR gain with high probability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the scalar deterministic discounted LQR, a one-hidden-layer ReLU network without biases induces two effective gains on the positive and negative half-lines. Under random initialization, sufficient width, and small step size, model-based policy gradient keeps the closed-loop system stable, decreases the cost geometrically, and drives the two effective gains to the unique optimal LQR gain with high probability.
What carries the argument
The pair of effective gains induced by the ReLU network on the positive and negative half-lines, which completely capture the piecewise-linear controller and allow exact tracking of the nonconvex policy-gradient flow.
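A minimal sketch of this reduction, assuming the controller has the form u(x) = sum_i v_i * max(w_i * x, 0) with hidden weights `w` and output weights `v`. The names, the width, and the initialization scale are illustrative assumptions, not taken from the paper. For x > 0 only units with w_i > 0 are active, and for x < 0 only units with w_i < 0, so the network collapses to one gain per half-line.

```python
import numpy as np

def relu_controller(x, w, v):
    """Bias-free one-hidden-layer ReLU network: u(x) = sum_i v_i * max(w_i * x, 0)."""
    return float(np.sum(v * np.maximum(w * x, 0.0)))

def effective_gains(w, v):
    """Return (k_plus, k_minus) such that u(x) = k_plus * x for x > 0
    and u(x) = k_minus * x for x < 0."""
    pos, neg = w > 0, w < 0
    k_plus = float(np.sum(v[pos] * w[pos]))    # units active when x > 0
    k_minus = float(np.sum(v[neg] * w[neg]))   # units active when x < 0
    return k_plus, k_minus

# Illustrative random weights (assumed scale, not the paper's initialization).
rng = np.random.default_rng(0)
m = 64
w = rng.normal(size=m)
v = rng.normal(size=m) / m
k_plus, k_minus = effective_gains(w, v)
```

In general `k_plus` and `k_minus` differ, which is the asymmetry the abstract refers to; a linear controller corresponds to the special case where they coincide.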
If this is right
- The cost-to-go decreases geometrically to the optimal LQR cost.
- The closed-loop system remains stable for every iterate with high probability.
- Both effective gains converge to the unique optimal scalar LQR gain.
- The result holds for any sufficiently wide network and sufficiently small step size.
- The analysis is exact because the two effective gains reduce the nonconvex weight-space problem to a two-dimensional dynamical system.
Where Pith is reading between the lines
- The effective-gain reduction may extend to vector LQR if similar low-dimensional summaries of the piecewise-linear controller can be found.
- The same proof technique could be used to analyze policy gradient on other piecewise-linear or ReLU-based controllers whose closed-loop map admits a low-dimensional parameterization.
- Wider networks improve the probability of success but are not strictly necessary once the initialization lands in the basin that the analysis covers.
- Model-free variants would require additional concentration arguments to replace the exact model-based gradient.
Load-bearing premise
The plant is scalar, deterministic and discounted, the controller is a bias-free one-hidden-layer ReLU network, gradients are model-based, and the two effective gains fully determine stability and cost.
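As a worked illustration of how a single effective gain can determine stability and cost, consider the following sketch. The notation (plant coefficients A and B, cost weights q and r, discount factor γ, and the sign convention u = k x) is assumed here for illustration and may differ from the paper's.

```latex
x_{t+1} = A x_t + B u_t, \qquad
J = \sum_{t=0}^{\infty} \gamma^{t}\left(q x_t^{2} + r u_t^{2}\right), \qquad
u_t = \begin{cases} k_{+} x_t, & x_t > 0,\\ k_{-} x_t, & x_t < 0. \end{cases}

% If x_0 > 0 and A + B k_+ \ge 0, the state never changes sign, so only k_+ acts:
x_t = (A + B k_{+})^{t} x_0, \qquad
J(x_0) = \frac{\left(q + r k_{+}^{2}\right) x_0^{2}}{1 - \gamma\,(A + B k_{+})^{2}}
\quad \text{provided } \gamma\,(A + B k_{+})^{2} < 1 .
```

The symmetric statement holds for x_0 < 0 with k_-. If A + B k_+ < 0, the state alternates sign and both gains enter the cost from a single initial condition, so the two gains are coupled in that regime.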
What would settle it
An explicit scalar LQR instance and ReLU network where, after random initialization of sufficient width, a small-step model-based policy-gradient update either destabilizes the closed loop or leaves at least one effective gain bounded away from the optimal value.
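A numerical check of this kind could be sketched as follows. Everything specific here is an assumption made for illustration: the plant data, the width, the step size, the small-output-weight initialization, the truncated rollout used as the cost, and the central finite differences that stand in for the exact model-based gradient. It is not the paper's algorithm or experiment, only one way to look for a counterexample.

```python
import numpy as np

# Hypothetical problem data (not from the paper): x_{t+1} = A x_t + B u_t,
# cost = sum_t gamma^t (q x_t^2 + r u_t^2).
A, B, q, r, gamma = 0.9, 1.0, 1.0, 1.0, 0.9

def optimal_gain():
    """Discounted scalar Riccati fixed point; returns k* with u = k* x."""
    P = 0.0
    for _ in range(1000):
        P = q + gamma * A**2 * P * r / (r + gamma * B**2 * P)
    return -gamma * A * B * P / (r + gamma * B**2 * P)

def cost(k_plus, k_minus, horizon=80):
    """Truncated discounted cost averaged over x0 = +1 and x0 = -1."""
    total = 0.0
    for x in (1.0, -1.0):
        disc = 1.0
        for _ in range(horizon):
            u = (k_plus if x > 0 else k_minus) * x
            total += 0.5 * disc * (q * x * x + r * u * u)
            x = A * x + B * u
            disc *= gamma
    return total

def gains(w, v):
    """Effective gains of u(x) = sum_i v_i * max(w_i * x, 0)."""
    pos, neg = w > 0, w < 0
    return float(np.sum(v[pos] * w[pos])), float(np.sum(v[neg] * w[neg]))

def run(width=64, step=4e-4, iters=1000, seed=0, h=1e-6):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=width)
    v = 1e-3 * rng.normal(size=width)  # small output weights: assumed init
    stable = True
    for _ in range(iters):
        kp, km = gains(w, v)
        # Central finite differences stand in for the exact model-based gradient.
        g_p = (cost(kp + h, km) - cost(kp - h, km)) / (2 * h)
        g_m = (cost(kp, km + h) - cost(kp, km - h)) / (2 * h)
        # Chain rule: dk_+/dv_i = w_i and dk_+/dw_i = v_i on units with w_i > 0
        # (and likewise for k_- on units with w_i < 0).
        g_k = np.where(w > 0, g_p, g_m)
        w, v = w - step * g_k * v, v - step * g_k * w
        kp, km = gains(w, v)
        # Finite-cost condition per half-line when the state keeps its sign.
        closed = (A + B * kp, A + B * km)
        stable = stable and all(gamma * c * c < 1.0 for c in closed)
    return gains(w, v), stable

(k_plus, k_minus), stayed_stable = run()
k_star = optimal_gain()
```

Comparing `k_plus` and `k_minus` against `k_star`, and inspecting `stayed_stable`, gives a single data point. A counterexample in the sense above would need the run to satisfy the theorem's width, initialization, and step-size conditions, which this sketch does not verify.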
Original abstract
We study the convergence of model-based policy gradient for the deterministic, scalar, discounted linear-quadratic regulator when the controller is an overparameterized one-hidden-layer ReLU network without biases. Although the optimal LQR controller is linear, neural parameterization creates a redundant nonconvex weight space with a possibly asymmetric piecewise-linear controller. We show that this structure can still be analyzed exactly through the two effective gains induced on the positive and negative half-lines. Under suitable random initialization, sufficient width, and a small step size, the model-based policy gradient remains stable, decreases the cost geometrically, and drives the effective gains to the unique optimal scalar LQR gain with high probability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that for the scalar deterministic discounted LQR problem, model-based policy gradient applied to an overparameterized one-hidden-layer ReLU network without biases achieves global convergence: under random initialization, sufficient width, and sufficiently small step size, the iterates remain stable, the cost decreases geometrically, and the two effective gains (induced on the positive and negative half-lines) converge to the unique optimal scalar LQR gain with high probability. The analysis exploits the exact reduction of the piecewise-linear controller to these two effective gains.
Significance. If the proofs are complete, the result supplies a fully rigorous, structure-exploiting global-convergence guarantee for a nonconvex neural parameterization of a classic control problem. The reduction to two effective gains is exact and allows direct analysis of the landscape, which is a clear technical strength. The work therefore contributes a concrete, verifiable example to the literature on optimization landscapes for policy optimization. Its immediate scope is limited by the scalar deterministic setting, but the techniques may seed extensions to higher-dimensional or stochastic cases.
minor comments (2)
- [Abstract and §1] The abstract and introduction should explicitly state the admissible range for the discount factor and the precise definition of 'sufficient width' (e.g., a lower bound in terms of problem parameters) so that the high-probability statement is immediately checkable.
- [Main convergence theorem] In the convergence theorem, the geometric rate is stated in terms of the effective gains; a short remark clarifying whether both gains converge to the same scalar value (as required for the optimal linear controller) would remove any ambiguity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. We appreciate the recognition that the exact reduction to two effective gains provides a clear technical strength and a verifiable example for policy optimization landscapes. We will incorporate any minor clarifications or improvements in the revised version.
Circularity Check
No significant circularity: derivation is self-contained via explicit ReLU structure and standard random-init analysis
full rationale
The paper's central argument reduces the nonconvex ReLU controller to two effective gains on the positive/negative half-lines, then proves geometric cost decrease and convergence to the unique optimal LQR gain under random initialization, sufficient width, and small step size. This reduction follows directly from the one-hidden-layer ReLU parameterization without biases (an exact structural property, not a fitted or self-defined quantity). No load-bearing step relies on self-citation, parameter fitting to the target result, or an ansatz imported from prior work by the same authors. The proof is a direct mathematical analysis of the induced landscape in the scalar deterministic discounted LQR setting, with all assumptions stated explicitly and the convergence claim derived from those assumptions rather than presupposed.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The optimal controller for scalar LQR is linear
- domain assumption: A one-hidden-layer ReLU network without biases induces two effective gains on the positive and negative half-lines
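The first assumption can be illustrated numerically. The sketch below assumes a plant x_{t+1} = A x_t + B u_t with discounted cost sum_t gamma^t (q x_t^2 + r u_t^2) and the convention u = k x; these symbols and the function names are hypothetical and may not match the paper's notation. It iterates the discounted scalar Riccati recursion to obtain the optimal gain, and evaluates the closed-form cost of any linear gain for comparison.

```python
def discounted_lqr_gain(A, B, q, r, gamma, iters=10_000, tol=1e-13):
    """Value iteration on the scalar discounted Riccati equation.

    Returns (k_star, P) where V(x) = P * x**2 and the optimal control is u = k_star * x.
    """
    P = 0.0
    for _ in range(iters):
        # Bellman update after minimizing over u in closed form.
        P_next = q + gamma * A**2 * P * r / (r + gamma * B**2 * P)
        if abs(P_next - P) < tol:
            P = P_next
            break
        P = P_next
    k_star = -gamma * A * B * P / (r + gamma * B**2 * P)
    return k_star, P

def linear_gain_cost(k, A, B, q, r, gamma, x0=1.0):
    """Discounted cost of the linear controller u = k * x from initial state x0."""
    rho = gamma * (A + B * k) ** 2
    if rho >= 1.0:
        return float("inf")  # the discounted sum diverges
    return (q + r * k**2) * x0**2 / (1.0 - rho)

# Example with hypothetical plant data.
k_star, P = discounted_lqr_gain(1.2, 1.0, 1.0, 1.0, 0.9)
```

Sweeping `linear_gain_cost` over a grid of gains and comparing with `P` shows that no linear gain beats `k_star`; the stronger statement that no nonlinear controller does either is the standard LQR result the axiom refers to.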