Sample-Efficient Model-Free Policy Gradient Methods for Stochastic LQR via Robust Linear Regression
Pith reviewed 2026-05-17 02:30 UTC · model grok-4.3
The pith
A primal-dual estimation procedure yields unbiased gradients for policy gradient methods in stochastic LQR despite errors-in-variables, achieving O(1/ε) sample complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a primal-dual estimation procedure, the paper derives unbiased gradient estimates for the Natural Policy Gradient and Gauss-Newton methods applied to the stochastic LQR problem in unknown linear systems, even though the linear regression step suffers from errors-in-variables. It then proves convergence guarantees with sample complexity scaling as O(1/ε).
What carries the argument
The primal-dual estimation procedure that corrects for errors-in-variables to produce unbiased gradient estimates from noisy data in the stochastic LQR setting.
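To make the errors-in-variables problem concrete, the sketch below shows the textbook attenuation effect in ordinary least squares when the regressor itself is observed with noise. It is a generic scalar illustration, not the paper's primal-dual estimator or its LQR regression; the names `ols_slope` and `eiv_demo` and all numeric values are assumptions chosen for the example.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x, without an intercept."""
    return float(x @ y / (x @ x))

def eiv_demo(n=200_000, theta=2.0, regressor_noise_std=1.0, seed=0):
    """Compare OLS on a clean regressor with OLS on a noisy copy of it."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)                    # clean regressor, unit variance
    y = theta * x_true + 0.1 * rng.normal(size=n)  # response with small output noise
    # Errors-in-variables: the regressor is only seen through additive noise.
    x_noisy = x_true + regressor_noise_std * rng.normal(size=n)
    return ols_slope(x_true, y), ols_slope(x_noisy, y)

clean_slope, noisy_slope = eiv_demo()
print(f"slope from clean regressor: {clean_slope:.3f}")
print(f"slope from noisy regressor: {noisy_slope:.3f}")
```

With unit-variance signal and unit-variance regressor noise, the noisy fit concentrates near half the true slope no matter how many samples are used. This is the kind of persistent bias that a corrected estimator has to remove.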
If this is right
- The natural policy gradient and Gauss-Newton methods both achieve the stated sample complexity when equipped with the primal-dual estimator.
- Convergence holds for unknown stochastic linear systems under the derived guarantees.
- Numerical experiments confirm practical effectiveness on representative stochastic LQR instances.
Where Pith is reading between the lines
- The same robust-regression correction may reduce sample needs in other linear-quadratic control settings that rely on policy gradients.
- Real-world noisy sensors could benefit from similar primal-dual debiasing when model-free methods are deployed.
- Extensions to partially observed or switched linear systems could be tested by replacing the current regression step with the same estimator.
Load-bearing premise
The primal-dual estimation procedure produces unbiased gradient estimates from noisy data despite errors-in-variables in the linear regression step for the stochastic LQR setting.
What would settle it
Apply the proposed gradient estimator to a known stochastic LQR instance with controlled noise levels and measure whether the observed sample complexity for epsilon-accurate convergence follows the O(1/epsilon) scaling or whether residual bias appears in the estimates.
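A minimal sketch of the two ingredients such a check needs, assuming a known instance: a model-based reference gain from Riccati value iteration (for additive process noise the optimal linear gain coincides with the deterministic LQR gain), and a log-log fit of samples needed against target accuracy. The helper names `lqr_gain` and `scaling_exponent` and the scalar instance are hypothetical; the paper's own estimator is not reproduced here.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=1000):
    """Reference gain K (u = K x) from discrete-time Riccati value iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A'PA - A'PB (R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

def scaling_exponent(eps, n_samples):
    """Slope of log(n_samples) against log(eps); about -1 for O(1/eps) scaling."""
    slope, _ = np.polyfit(np.log(eps), np.log(n_samples), 1)
    return float(slope)

# Hypothetical scalar instance used only to exercise the helpers.
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
K_ref, P_ref = lqr_gain(A, B, Q, R)
print("reference gain:", K_ref, "closed loop:", A + B @ K_ref)
```

In an experiment, one would record the number of samples the model-free method needs to bring the cost gap to the reference policy below each ε, then pass those pairs to `scaling_exponent`. A fitted slope near -1 is consistent with O(1/ε), while a slope near -2 or a plateau in accuracy would point to residual bias.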
Original abstract
Policy gradient algorithms are widely used in reinforcement learning and belong to the class of approximate dynamic programming methods. This paper studies two key policy gradient algorithms, the Natural Policy Gradient and the Gauss-Newton Method, for solving the Linear Quadratic Regulator (LQR) problem in unknown stochastic linear systems. The main challenge lies in obtaining an unbiased gradient estimate from noisy data due to errors-in-variables in linear regression. This issue is addressed by employing a primal-dual estimation procedure. Using this novel gradient estimation scheme, the paper establishes convergence guarantees with a sample complexity of order O(1/epsilon). Theoretical results are further supported by numerical experiments, which demonstrate the effectiveness of the proposed algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a primal-dual robust linear regression procedure to obtain unbiased estimates of the policy gradient for the Natural Policy Gradient and Gauss-Newton algorithms applied to the stochastic LQR problem with unknown linear dynamics. The central contribution is a sample-complexity bound of order O(1/ε) for ε-optimal policies, derived from the new gradient estimator that corrects for errors-in-variables bias in noisy trajectory data; the claims are supported by convergence theorems and numerical experiments on synthetic LQR instances.
Significance. If the unbiasedness of the primal-dual estimator is rigorously established under standard stochastic LQR assumptions (stationary closed-loop distribution, sufficient excitation, and independence of process noise from past states), the result would strengthen the theoretical foundation for sample-efficient model-free policy optimization in linear systems, offering a concrete improvement over prior model-free bounds that typically scale as O(1/ε²). The numerical validation provides useful empirical support for practical applicability.
Major comments (2)
- [§3.2 and Theorem 1] §3.2 (Primal-Dual Estimator) and Theorem 1: The proof that the dual correction produces unbiased gradient estimates must explicitly verify cancellation of the cross term E[w_t ϕ_t] (where ϕ_t = [x_t; u_t] is the regressor and w_t is process noise) under the stationary distribution induced by the closed-loop stochastic dynamics. The current argument assumes independence but does not bound the residual bias arising from the dependence of x_t on past noise realizations; without this step the O(1/ε) sample complexity does not follow from standard stochastic approximation arguments.
- [Theorem 2] Theorem 2 (Convergence): The sample-complexity claim of O(1/ε) relies on the gradient estimator having bias o(1/√N) and variance O(1/N). If the primal-dual correction leaves a persistent O(1) bias term under correlated noise, the iteration complexity would degrade; the analysis should include an explicit bias bound or a counter-example showing why correlation is precluded by the excitation assumption.
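The bias and variance conditions in the second major comment can be read through the standard mean-squared-error decomposition. The following is a sketch of that accounting for a single gradient estimate, with \hat g_N denoting an estimate built from N samples, g the true gradient, and b a hypothetical persistent bias; these symbols are introduced here for illustration and the paper's full argument is not reproduced.

```latex
\begin{align*}
\mathbb{E}\,\|\hat g_N - g\|^2
  &= \underbrace{\|\mathbb{E}\,\hat g_N - g\|^2}_{\text{squared bias}}
   + \underbrace{\mathbb{E}\,\|\hat g_N - \mathbb{E}\,\hat g_N\|^2}_{\text{variance}} \\
  &= o(1/N) + O(1/N) \;=\; O(1/N)
  \qquad \text{if the bias is } o(1/\sqrt{N}) \text{ and the variance is } O(1/N), \\[4pt]
\mathbb{E}\,\|\hat g_N - g\|^2 \le \varepsilon
  &\quad \text{is then achievable with } N = O(1/\varepsilon), \\[4pt]
\mathbb{E}\,\|\hat g_N - g\|^2 \ge \|b\|^2
  &\quad \text{for every } N \text{ if } \mathbb{E}\,\hat g_N - g = b \neq 0 .
\end{align*}
```

The last line is the degradation the comment warns about: with an O(1) bias, no number of samples drives the gradient error below the bias floor.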
Minor comments (2)
- [§3] Notation for the dual variable and the robust regression objective should be introduced with a clear equation reference before their first use in the main theorems.
- [Numerical Experiments] Figure 1 (numerical results): axis labels and legend entries are too small for readability; increase font size and add error bars or shaded regions indicating variability across the 10 random seeds.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on the unbiasedness of the primal-dual estimator and its implications for the sample-complexity analysis. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and bounds.
- Referee: [§3.2 and Theorem 1] §3.2 (Primal-Dual Estimator) and Theorem 1: The proof that the dual correction produces unbiased gradient estimates must explicitly verify cancellation of the cross term E[w_t ϕ_t] (where ϕ_t = [x_t; u_t] is the regressor and w_t is process noise) under the stationary distribution induced by the closed-loop stochastic dynamics. The current argument assumes independence but does not bound the residual bias arising from the dependence of x_t on past noise realizations; without this step the O(1/ε) sample complexity does not follow from standard stochastic approximation arguments.
Authors: We agree that an explicit verification is required. Under the standard stochastic LQR assumptions (stationary closed-loop distribution and independence of process noise w_t from the history up to time t), the regressor ϕ_t depends only on noises up to t-1, so w_t is independent of ϕ_t. This implies E[w_t ϕ_t] = E[w_t] E[ϕ_t] = 0. We will add a supporting lemma in the revised §3.2 that formally establishes this cancellation, accounts for any transient effects, and confirms that the estimator remains unbiased. This will directly support the O(1/ε) bound via standard stochastic approximation.
Revision: yes
- Referee: [Theorem 2] Theorem 2 (Convergence): The sample-complexity claim of O(1/ε) relies on the gradient estimator having bias o(1/√N) and variance O(1/N). If the primal-dual correction leaves a persistent O(1) bias term under correlated noise, the iteration complexity would degrade; the analysis should include an explicit bias bound or a counter-example showing why correlation is precluded by the excitation assumption.
Authors: The primal-dual correction is designed to yield an unbiased estimator of the true policy gradient in expectation. The finite-sample bias is O(1/N) by ergodicity and the law of large numbers under the stationary distribution, satisfying the o(1/√N) condition needed for convergence. The sufficient excitation assumption ensures the regressor covariance is positive definite, which rules out persistent bias-inducing correlations. We will add an explicit bias bound as a corollary to Theorem 2 in the revision, together with a short argument showing why the excitation condition precludes the O(1) bias scenario raised.
Revision: yes
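The independence argument in the first response can be checked numerically on a toy system. The sketch below simulates a hypothetical scalar closed loop x_{t+1} = a x_t + b u_t + w_t with u_t = k x_t + e_t, where e_t is an assumed exploration noise, and estimates both the same-time cross moment E[w_t ϕ_t] and the lagged moment E[w_{t-1} ϕ_t]. The system values and the function name `cross_moments` are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def cross_moments(a=0.9, b=1.0, k=-0.5, n=200_000, seed=0):
    """Empirical E[w_t * phi_t] and E[w_{t-1} * phi_t] for phi_t = [x_t; u_t]."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)   # process noise, independent of the past
    e = rng.normal(size=n)   # exploration noise added to the input (assumed)
    x = np.zeros(n + 1)
    u = np.zeros(n)
    for t in range(n):
        u[t] = k * x[t] + e[t]
        x[t + 1] = a * x[t] + b * u[t] + w[t]
    phi = np.stack([x[:-1], u])               # shape (2, n), column t is phi_t
    same_time = phi @ w / n                   # estimates E[w_t phi_t]
    lagged = phi[:, 1:] @ w[:-1] / (n - 1)    # estimates E[w_{t-1} phi_t]
    return same_time, lagged

same_time, lagged = cross_moments()
print("E[w_t phi_t]     ~", same_time)
print("E[w_{t-1} phi_t] ~", lagged)
```

In this toy setting the same-time moment is close to zero while the lagged moment is not, which matches both sides of the exchange: w_t is uncorrelated with the current regressor, yet x_t does carry past noise. Any regression step whose regressors involve x_{t+1} would therefore need the kind of correction the referee asks the authors to verify.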
Circularity Check
No circularity: novel primal-dual estimator and convergence analysis are independent of target result
The paper introduces a primal-dual estimation procedure specifically to obtain unbiased gradient estimates from noisy data in the presence of errors-in-variables for stochastic LQR. This is presented as a new scheme rather than a re-derivation or fit of quantities already present in the target convergence bound. The O(1/epsilon) sample complexity then follows from standard policy-gradient arguments applied to these estimates. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation chain. The central claim rests on the independence of the proposed estimator from the final complexity result, which is not reduced by construction to its own inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The underlying system is a stochastic linear system with quadratic cost that admits a stabilizing feedback policy.
- Ad hoc to paper: The primal-dual procedure yields unbiased estimates of the policy gradient despite errors-in-variables.
Forward citations
Cited by 1 Pith paper
- Global Convergence of Policy Gradient Methods for ReLU Controllers in Linear Quadratic Regulation
Model-based policy gradient converges globally to the optimal scalar LQR gain for discounted LQR using overparameterized ReLU networks by reducing the controller to two effective gains on positive and negative half-lines.