Sample Complexity of Policy Gradient for Log-Growth Control
Pith reviewed 2026-06-29 16:01 UTC · model grok-4.3
The pith
Projected mini-batch policy gradient attains Õ(1/η) sample complexity for log-growth control when the noise density is known.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity Õ(1/η) when the noise density is known and Õ(η^{-(2s+1)/(2s)}) when it must be estimated, for C^s noise densities with s ≥ 2. The reflection pairing through the singularity controls the three quantities needed for the complexity bound: curvature of the population objective, variance of the single-transition gradient estimator, and bias from density estimation.
What carries the argument
Reflection pairing of each observed transition with its image under reflection through the moving pole b_sing(K) = -1/K, which cancels the divergent part of the Cauchy principal-value gradient estimator.
If this is right
- The algorithm converges from any compact subset of the stabilizing gains without requiring special initialization.
- Total sample complexity scales linearly with 1/η when the noise density is known exactly.
- When the density must be estimated from data the rate degrades to Õ(η^{-(2s+1)/(2s)}) for C^s densities.
- The same pairing step that removes infinite variance also removes the leading bias term that would otherwise arise from density estimation.
Where Pith is reading between the lines
- The symmetry argument suggests that other singular-gradient problems whose kernels possess an odd part around a moving pole may admit similar variance-reduction pairings.
- The derived rates assume the structural placement of the singularity inside the noise support; relaxing that placement would require a different estimator.
- The closed-form single-transition oracle used here is specific to the scalar multiplicative-noise case and would need replacement for vector or additive-noise variants.
Load-bearing premise
The optimal gain always places the noise singularity inside the interior of the support of the noise distribution, allowing the reflection symmetry to cancel the divergent contribution to the gradient.
What would settle it
A direct Monte-Carlo check near the optimum showing that the single-sample gradient estimator without reflection pairing has infinite variance, or a run of the algorithm whose observed sample count to reach accuracy η exceeds the claimed Õ(1/η) bound when the density is known.
read the original abstract
We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/\eta)$ when the noise density is known and $\tilde{O}(\eta^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies sample complexity of projected mini-batch policy gradient for log-growth control of a scalar linear system with multiplicative noise, where the objective is the top Lyapunov exponent J(K) = E[log|1+BK|]. It identifies a 'cusp obstruction' at the optimum K* where the noise singularity b_sing(K) = -1/K lies in the interior of the noise support, rendering the gradient a Cauchy principal value with infinite-variance single-sample estimators. Exploiting the odd symmetry of the Cauchy kernel via reflection pairing across the moving pole, combined with a closed-form single-transition oracle, the authors claim that the method attains total sample complexity Õ(1/η) when the noise density is known and Õ(η^{-(2s+1)/(2s)}) when the density must be estimated, for any initialization in a compact subset of the stabilizing region and for C^s noise densities with s ≥ 2.
Significance. If the central claims hold, the work supplies the first explicit sample-complexity guarantees for policy gradient on a problem whose gradient is singular at the optimum. The technical contribution lies in showing that a single structural symmetry (reflection pairing) simultaneously controls population curvature, finite variance of the paired estimator, and bias from nonparametric density estimation. This approach may be relevant to other control or reinforcement-learning settings that encounter principal-value gradients or infinite-variance estimators.
major comments (2)
- [Abstract (proof strategy)] The abstract asserts that the reflection-pairing cancellation simultaneously controls curvature, estimator variance, and density-estimation bias to produce the stated Õ(1/η) and Õ(η^{-(2s+1)/(2s)}) bounds, yet the manuscript supplies neither the detailed derivation of these bounds nor the verification that the paired estimator remains unbiased after the principal-value cancellation. Without these steps the central rate claims cannot be inspected.
- [Abstract (cusp obstruction)] The structural claim that b_sing(K*) always lies in the interior of the noise support (enabling the symmetry) is stated as always true at the optimum but is not accompanied by a supporting lemma or explicit verification that this interior placement holds for all admissible noise densities under consideration.
minor comments (2)
- Notation for the single-transition oracle and the precise definition of the mini-batch size in terms of η should be introduced earlier and used consistently.
- The dependence of the hidden constants in the Õ notation on the smoothness index s and on the compact initialization set should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the potential significance of the cusp-obstruction analysis. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract (proof strategy)] The abstract asserts that the reflection-pairing cancellation simultaneously controls curvature, estimator variance, and density-estimation bias to produce the stated Õ(1/η) and Õ(η^{-(2s+1)/(2s)}) bounds, yet the manuscript supplies neither the detailed derivation of these bounds nor the verification that the paired estimator remains unbiased after the principal-value cancellation. Without these steps the central rate claims cannot be inspected.
Authors: The full derivations appear in Theorems 4.1 (known density) and 5.1 (estimated density), with the three-way control of curvature/variance/bias shown via the same pairing argument in Section 3. Unbiasedness of the paired estimator (preserving the Cauchy principal value) is verified in Lemma 3.3 and Appendix B. We agree the abstract is too terse for inspection and will insert a one-sentence proof outline. revision: partial
-
Referee: [Abstract (cusp obstruction)] The structural claim that b_sing(K*) always lies in the interior of the noise support (enabling the symmetry) is stated as always true at the optimum but is not accompanied by a supporting lemma or explicit verification that this interior placement holds for all admissible noise densities under consideration.
Authors: We will add Lemma 2.4, which proves that any minimizer K* of J places b_sing(K*) strictly inside the support for every C^s density (s≥2) whose support contains an interval around zero; the argument uses the fact that moving the pole outside the support strictly increases the Lyapunov exponent. revision: yes
Circularity Check
No significant circularity detected
full rationale
The derivation chain begins from the explicitly stated structural assumption that the optimal gain places the singularity in the interior of the noise support, enabling the odd symmetry of the Cauchy kernel under reflection pairing. This symmetry is an external property of principal-value integrals and is not defined in terms of the sample-complexity result or any fitted quantity. The bounds on curvature, estimator variance, and estimation bias follow directly from the cancellation, combined with the closed-form single-transition oracle; none of these steps reduce to a self-definition, a fitted input relabeled as prediction, or a load-bearing self-citation. No uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption The system is scalar linear with multiplicative-noise actuation channel.
- domain assumption The optimal gain K* places the singularity b_sing(K*) = -1/K* inside the interior of the noise support.
- domain assumption Noise density belongs to C^s with s ≥ 2 when density estimation is required.
Reference graph
Works this paper leans on
-
[1]
Anantharam and V
V. Anantharam and V. S. Borkar , A variational formula for risk-sensitive reward , SIAM Journal on Control and Optimization, 55 (2017), pp. 961--988
2017
-
[2]
L. Arnold , Random dynamical systems , in Dynamical Systems: Lectures Given at the 2nd Session of the Centro Internazionale Matematico Estivo (CIME) held in Montecatini Terme, Italy, June 13--22, 1994, Springer, 2006, pp. 1--43
1994
-
[3]
K. E. Atkinson , An introduction to numerical analysis , John wiley & sons, 2008
2008
-
[4]
Bolte and E
J. Bolte and E. Pauwels , Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning , Mathematical Programming, 188 (2021), pp. 19--51
2021
-
[5]
V. S. Borkar , A sensitivity formula for risk-sensitive cost and the actor--critic algorithm , Systems & Control Letters, 44 (2001), pp. 339--346
2001
-
[6]
V. S. Borkar , Stochastic approximation: a dynamical systems viewpoint , vol. 100, Springer, 2008
2008
-
[7]
V. S. Borkar and S. P. Meyn , Risk-sensitive optimal control for markov decision processes with monotone cost , Mathematics of Operations Research, 27 (2002), pp. 192--209
2002
-
[8]
Bottou, F
L. Bottou, F. E. Curtis, and J. Nocedal , Optimization methods for large-scale machine learning , SIAM review, 60 (2018), pp. 223--311
2018
- [9]
-
[10]
O. L. V. Costa, R. P. Marques, and M. D. Fragoso , Discrete-Time Markov Jump Linear Systems , Probability and Its Applications, Springer London, 2005, https://doi.org/10.1007/b138575
-
[11]
Damm , Rational matrix equations in stochastic control , vol
T. Damm , Rational matrix equations in stochastic control , vol. 297, Springer Science & Business Media, 2004
2004
-
[12]
Davis and D
D. Davis and D. Drusvyatskiy , Stochastic model-based minimization of weakly convex functions , SIAM Journal on Optimization, 29 (2019), pp. 207--239
2019
-
[13]
P. J. Davis and P. Rabinowitz , Methods of numerical integration , Courier Corporation, 2007
2007
-
[14]
Fazel, R
M. Fazel, R. Ge, S. Kakade, and M. Mesbahi , Global convergence of policy gradient methods for the linear quadratic regulator , in International conference on machine learning, PMLR, 2018, pp. 1467--1476
2018
-
[15]
M. Giegrich, C. Reisinger, and Y. Zhang , Convergence of policy gradient methods for finite-horizon exploratory linear--quadratic control problems , SIAM Journal on Control and Optimization, 62 (2024), pp. 1060--1092, https://doi.org/10.1137/22M1533517
-
[16]
Giné and R
E. Giné and R. Nickl , Mathematical Foundations of Infinite-Dimensional Statistical Models , Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2021
2021
-
[17]
Gravell, P
B. Gravell, P. M. Esfahani, and T. Summers , Learning optimal controllers for linear systems with multiplicative noise via policy gradient , IEEE Transactions on Automatic Control, 66 (2020), pp. 5283--5298
2020
-
[18]
Hambly, R
B. Hambly, R. Xu, and H. Yang , Policy gradient methods for the noisy linear quadratic regulator over a finite horizon , SIAM Journal on Control and Optimization, 59 (2021), pp. 3359--3391
2021
-
[19]
J. M. Hammersley and K. W. Morton , A new monte carlo technique: antithetic variates , in Mathematical proceedings of the Cambridge philosophical society, vol. 52, Cambridge University Press, 1956, pp. 449--475
1956
-
[20]
D. Jacobson , Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games , IEEE Transactions on Automatic control, 18 (1973), pp. 124--131
1973
-
[21]
Karimi, J
H. Karimi, J. Nutini, and M. Schmidt , Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition , in Joint European conference on machine learning and knowledge discovery in databases, Springer, 2016, pp. 795--811
2016
-
[22]
F. W. King , Hilbert Transforms: Volume 2 , vol. 2, Cambridge University Press, 2009
2009
-
[23]
Malik, A
D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright , Derivative-free methods for policy optimization: Guarantees for linear quadratic systems , Journal of Machine Learning Research, 21 (2020), pp. 1--51
2020
-
[24]
Mania, S
H. Mania, S. Tu, and B. Recht , Certainty equivalence is efficient for linear quadratic control , Advances in neural information processing systems, 32 (2019)
2019
-
[25]
Mohammadi, A
H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanovi \'c , Convergence and sample complexity of gradient methods for the model-free linear--quadratic regulator problem , IEEE Transactions on Automatic Control, 67 (2021), pp. 2435--2450
2021
-
[26]
A. S. Nemirovskij and D. B. Yudin , Problem complexity and method efficiency in optimization , Wiley-Interscience, 1983
1983
-
[27]
Nesterov , Introductory lectures on convex optimization: A basic course , vol
Y. Nesterov , Introductory lectures on convex optimization: A basic course , vol. 87, Springer Science & Business Media, 2013
2013
-
[28]
Nesterov and V
Y. Nesterov and V. Spokoiny , Random gradient-free minimization of convex functions , Foundations of Computational Mathematics, 17 (2017), pp. 527--566
2017
-
[29]
A. B. Owen , Monte carlo theory, methods and examples , 2013
2013
-
[30]
B. T. Polyak , Gradient methods for the minimisation of functionals , USSR Computational Mathematics and Mathematical Physics, 3 (1963), pp. 864--878
1963
-
[31]
Ranade and A
G. Ranade and A. Sahai , Control capacity , IEEE Transactions on Information Theory, 65 (2018), pp. 235--254
2018
-
[32]
Robbins and S
H. Robbins and S. Monro , A stochastic approximation method , The annals of mathematical statistics, (1951), pp. 400--407
1951
-
[33]
Sahai and S
A. Sahai and S. Mitter , The necessity and sufficiency of anytime capacity for stabilization of a linear system over a noisy communication link—part i: Scalar systems , IEEE transactions on Information Theory, 52 (2006), pp. 3369--3395
2006
-
[34]
Tu and B
S. Tu and B. Recht , The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint , in Conference on learning theory, PMLR, 2019, pp. 3036--3083
2019
-
[35]
H. Wang, T. Zariphopoulou, and X. Y. Zhou , Reinforcement learning in continuous time and space: A stochastic control approach , Journal of Machine Learning Research, 21 (2020), pp. 1--34
2020
-
[36]
Whittle , Risk-sensitive optimal control , Wiley-Interscience Series in Systems and Optimization, John Wiley & Sons, Chichester, 1990
P. Whittle , Risk-sensitive optimal control , Wiley-Interscience Series in Systems and Optimization, John Wiley & Sons, Chichester, 1990
1990
-
[37]
W. M. Wonham , Optimal stationary control of a linear system with state-dependent noise , SIAM Journal on Control, 5 (1967), pp. 486--500
1967
-
[38]
write newline
" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTION or pop #1 'skip if FUNCTION new.block.checka empty 'skip 'new.block if FUNCTION field.or.null duplicate empty pop "" 'skip ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.