pith. sign in

arxiv: 2605.26640 · v1 · pith:BC3WI272new · submitted 2026-05-26 · 📡 eess.SY · cs.LG· cs.SY· math.OC· stat.ML

Sample Complexity of Policy Gradient for Log-Growth Control

Pith reviewed 2026-06-29 16:01 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SYmath.OCstat.ML
keywords policy gradientsample complexitylog-growth controlmultiplicative noiseLyapunov exponentfeedback gainCauchy principal value
0
0 comments X

The pith

Projected mini-batch policy gradient attains Õ(1/η) sample complexity for log-growth control when the noise density is known.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies learning an optimal feedback gain for a scalar linear system with multiplicative noise by maximizing the expected log absolute value of the closed-loop factor, which is the top Lyapunov exponent. A structural cusp obstruction makes the policy gradient exist only as a Cauchy principal value with infinite variance at the optimum, so standard stochastic gradient analysis fails. The authors exploit the fact that the Cauchy kernel is odd with respect to displacement from the moving pole and cancel the divergent contribution by pairing each transition with its reflection across that pole. This single symmetry simultaneously bounds population curvature, estimator variance, and estimation bias, allowing projected mini-batch policy gradient started anywhere in a compact stabilizing set to reach the stated rates.

Core claim

Projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity Õ(1/η) when the noise density is known and Õ(η^{-(2s+1)/(2s)}) when it must be estimated, for C^s noise densities with s ≥ 2. The reflection pairing through the singularity controls the three quantities needed for the complexity bound: curvature of the population objective, variance of the single-transition gradient estimator, and bias from density estimation.

What carries the argument

Reflection pairing of each observed transition with its image under reflection through the moving pole b_sing(K) = -1/K, which cancels the divergent part of the Cauchy principal-value gradient estimator.

If this is right

  • The algorithm converges from any compact subset of the stabilizing gains without requiring special initialization.
  • Total sample complexity scales linearly with 1/η when the noise density is known exactly.
  • When the density must be estimated from data the rate degrades to Õ(η^{-(2s+1)/(2s)}) for C^s densities.
  • The same pairing step that removes infinite variance also removes the leading bias term that would otherwise arise from density estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The symmetry argument suggests that other singular-gradient problems whose kernels possess an odd part around a moving pole may admit similar variance-reduction pairings.
  • The derived rates assume the structural placement of the singularity inside the noise support; relaxing that placement would require a different estimator.
  • The closed-form single-transition oracle used here is specific to the scalar multiplicative-noise case and would need replacement for vector or additive-noise variants.

Load-bearing premise

The optimal gain always places the noise singularity inside the interior of the support of the noise distribution, allowing the reflection symmetry to cancel the divergent contribution to the gradient.

What would settle it

A direct Monte-Carlo check near the optimum showing that the single-sample gradient estimator without reflection pairing has infinite variance, or a run of the algorithm whose observed sample count to reach accuracy η exceeds the claimed Õ(1/η) bound when the density is known.

read the original abstract

We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/\eta)$ when the noise density is known and $\tilde{O}(\eta^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies sample complexity of projected mini-batch policy gradient for log-growth control of a scalar linear system with multiplicative noise, where the objective is the top Lyapunov exponent J(K) = E[log|1+BK|]. It identifies a 'cusp obstruction' at the optimum K* where the noise singularity b_sing(K) = -1/K lies in the interior of the noise support, rendering the gradient a Cauchy principal value with infinite-variance single-sample estimators. Exploiting the odd symmetry of the Cauchy kernel via reflection pairing across the moving pole, combined with a closed-form single-transition oracle, the authors claim that the method attains total sample complexity Õ(1/η) when the noise density is known and Õ(η^{-(2s+1)/(2s)}) when the density must be estimated, for any initialization in a compact subset of the stabilizing region and for C^s noise densities with s ≥ 2.

Significance. If the central claims hold, the work supplies the first explicit sample-complexity guarantees for policy gradient on a problem whose gradient is singular at the optimum. The technical contribution lies in showing that a single structural symmetry (reflection pairing) simultaneously controls population curvature, finite variance of the paired estimator, and bias from nonparametric density estimation. This approach may be relevant to other control or reinforcement-learning settings that encounter principal-value gradients or infinite-variance estimators.

major comments (2)
  1. [Abstract (proof strategy)] The abstract asserts that the reflection-pairing cancellation simultaneously controls curvature, estimator variance, and density-estimation bias to produce the stated Õ(1/η) and Õ(η^{-(2s+1)/(2s)}) bounds, yet the manuscript supplies neither the detailed derivation of these bounds nor the verification that the paired estimator remains unbiased after the principal-value cancellation. Without these steps the central rate claims cannot be inspected.
  2. [Abstract (cusp obstruction)] The structural claim that b_sing(K*) always lies in the interior of the noise support (enabling the symmetry) is stated as always true at the optimum but is not accompanied by a supporting lemma or explicit verification that this interior placement holds for all admissible noise densities under consideration.
minor comments (2)
  1. Notation for the single-transition oracle and the precise definition of the mini-batch size in terms of η should be introduced earlier and used consistently.
  2. The dependence of the hidden constants in the Õ notation on the smoothness index s and on the compact initialization set should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the potential significance of the cusp-obstruction analysis. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract (proof strategy)] The abstract asserts that the reflection-pairing cancellation simultaneously controls curvature, estimator variance, and density-estimation bias to produce the stated Õ(1/η) and Õ(η^{-(2s+1)/(2s)}) bounds, yet the manuscript supplies neither the detailed derivation of these bounds nor the verification that the paired estimator remains unbiased after the principal-value cancellation. Without these steps the central rate claims cannot be inspected.

    Authors: The full derivations appear in Theorems 4.1 (known density) and 5.1 (estimated density), with the three-way control of curvature/variance/bias shown via the same pairing argument in Section 3. Unbiasedness of the paired estimator (preserving the Cauchy principal value) is verified in Lemma 3.3 and Appendix B. We agree the abstract is too terse for inspection and will insert a one-sentence proof outline. revision: partial

  2. Referee: [Abstract (cusp obstruction)] The structural claim that b_sing(K*) always lies in the interior of the noise support (enabling the symmetry) is stated as always true at the optimum but is not accompanied by a supporting lemma or explicit verification that this interior placement holds for all admissible noise densities under consideration.

    Authors: We will add Lemma 2.4, which proves that any minimizer K* of J places b_sing(K*) strictly inside the support for every C^s density (s≥2) whose support contains an interval around zero; the argument uses the fact that moving the pole outside the support strictly increases the Lyapunov exponent. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation chain begins from the explicitly stated structural assumption that the optimal gain places the singularity in the interior of the noise support, enabling the odd symmetry of the Cauchy kernel under reflection pairing. This symmetry is an external property of principal-value integrals and is not defined in terms of the sample-complexity result or any fitted quantity. The bounds on curvature, estimator variance, and estimation bias follow directly from the cancellation, combined with the closed-form single-transition oracle; none of these steps reduce to a self-definition, a fitted input relabeled as prediction, or a load-bearing self-citation. No uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on the problem being scalar linear, the noise density belonging to C^s (s≥2), and the structural fact that the optimum always places the singularity inside the support so the odd-kernel cancellation applies.

axioms (3)
  • domain assumption The system is scalar linear with multiplicative-noise actuation channel.
    Defines the log-growth control problem and the form of J(K).
  • domain assumption The optimal gain K* places the singularity b_sing(K*) = -1/K* inside the interior of the noise support.
    Required for the cusp obstruction to be present and for the reflection pairing to be well-defined.
  • domain assumption Noise density belongs to C^s with s ≥ 2 when density estimation is required.
    Used to obtain the nonparametric rate Õ(η^{-(2s+1)/(2s)}).

pith-pipeline@v0.9.1-grok · 5849 in / 1565 out tokens · 47648 ms · 2026-06-29T16:01:50.844365+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 3 canonical work pages

  1. [1]

    Anantharam and V

    V. Anantharam and V. S. Borkar , A variational formula for risk-sensitive reward , SIAM Journal on Control and Optimization, 55 (2017), pp. 961--988

  2. [2]

    L. Arnold , Random dynamical systems , in Dynamical Systems: Lectures Given at the 2nd Session of the Centro Internazionale Matematico Estivo (CIME) held in Montecatini Terme, Italy, June 13--22, 1994, Springer, 2006, pp. 1--43

  3. [3]

    K. E. Atkinson , An introduction to numerical analysis , John wiley & sons, 2008

  4. [4]

    Bolte and E

    J. Bolte and E. Pauwels , Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning , Mathematical Programming, 188 (2021), pp. 19--51

  5. [5]

    V. S. Borkar , A sensitivity formula for risk-sensitive cost and the actor--critic algorithm , Systems & Control Letters, 44 (2001), pp. 339--346

  6. [6]

    V. S. Borkar , Stochastic approximation: a dynamical systems viewpoint , vol. 100, Springer, 2008

  7. [7]

    V. S. Borkar and S. P. Meyn , Risk-sensitive optimal control for markov decision processes with monotone cost , Mathematics of Operations Research, 27 (2002), pp. 192--209

  8. [8]

    Bottou, F

    L. Bottou, F. E. Curtis, and J. Nocedal , Optimization methods for large-scale machine learning , SIAM review, 60 (2018), pp. 223--311

  9. [9]

    J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi , Lqr through the lens of first order methods: Discrete-time case , arXiv preprint arXiv:1907.08921, (2019)

  10. [10]

    O. L. V. Costa, R. P. Marques, and M. D. Fragoso , Discrete-Time Markov Jump Linear Systems , Probability and Its Applications, Springer London, 2005, https://doi.org/10.1007/b138575

  11. [11]

    Damm , Rational matrix equations in stochastic control , vol

    T. Damm , Rational matrix equations in stochastic control , vol. 297, Springer Science & Business Media, 2004

  12. [12]

    Davis and D

    D. Davis and D. Drusvyatskiy , Stochastic model-based minimization of weakly convex functions , SIAM Journal on Optimization, 29 (2019), pp. 207--239

  13. [13]

    P. J. Davis and P. Rabinowitz , Methods of numerical integration , Courier Corporation, 2007

  14. [14]

    Fazel, R

    M. Fazel, R. Ge, S. Kakade, and M. Mesbahi , Global convergence of policy gradient methods for the linear quadratic regulator , in International conference on machine learning, PMLR, 2018, pp. 1467--1476

  15. [15]

    Giegrich, C

    M. Giegrich, C. Reisinger, and Y. Zhang , Convergence of policy gradient methods for finite-horizon exploratory linear--quadratic control problems , SIAM Journal on Control and Optimization, 62 (2024), pp. 1060--1092, https://doi.org/10.1137/22M1533517

  16. [16]

    Giné and R

    E. Giné and R. Nickl , Mathematical Foundations of Infinite-Dimensional Statistical Models , Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2021

  17. [17]

    Gravell, P

    B. Gravell, P. M. Esfahani, and T. Summers , Learning optimal controllers for linear systems with multiplicative noise via policy gradient , IEEE Transactions on Automatic Control, 66 (2020), pp. 5283--5298

  18. [18]

    Hambly, R

    B. Hambly, R. Xu, and H. Yang , Policy gradient methods for the noisy linear quadratic regulator over a finite horizon , SIAM Journal on Control and Optimization, 59 (2021), pp. 3359--3391

  19. [19]

    J. M. Hammersley and K. W. Morton , A new monte carlo technique: antithetic variates , in Mathematical proceedings of the Cambridge philosophical society, vol. 52, Cambridge University Press, 1956, pp. 449--475

  20. [20]

    D. Jacobson , Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games , IEEE Transactions on Automatic control, 18 (1973), pp. 124--131

  21. [21]

    Karimi, J

    H. Karimi, J. Nutini, and M. Schmidt , Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition , in Joint European conference on machine learning and knowledge discovery in databases, Springer, 2016, pp. 795--811

  22. [22]

    F. W. King , Hilbert Transforms: Volume 2 , vol. 2, Cambridge University Press, 2009

  23. [23]

    Malik, A

    D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright , Derivative-free methods for policy optimization: Guarantees for linear quadratic systems , Journal of Machine Learning Research, 21 (2020), pp. 1--51

  24. [24]

    Mania, S

    H. Mania, S. Tu, and B. Recht , Certainty equivalence is efficient for linear quadratic control , Advances in neural information processing systems, 32 (2019)

  25. [25]

    Mohammadi, A

    H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanovi \'c , Convergence and sample complexity of gradient methods for the model-free linear--quadratic regulator problem , IEEE Transactions on Automatic Control, 67 (2021), pp. 2435--2450

  26. [26]

    A. S. Nemirovskij and D. B. Yudin , Problem complexity and method efficiency in optimization , Wiley-Interscience, 1983

  27. [27]

    Nesterov , Introductory lectures on convex optimization: A basic course , vol

    Y. Nesterov , Introductory lectures on convex optimization: A basic course , vol. 87, Springer Science & Business Media, 2013

  28. [28]

    Nesterov and V

    Y. Nesterov and V. Spokoiny , Random gradient-free minimization of convex functions , Foundations of Computational Mathematics, 17 (2017), pp. 527--566

  29. [29]

    A. B. Owen , Monte carlo theory, methods and examples , 2013

  30. [30]

    B. T. Polyak , Gradient methods for the minimisation of functionals , USSR Computational Mathematics and Mathematical Physics, 3 (1963), pp. 864--878

  31. [31]

    Ranade and A

    G. Ranade and A. Sahai , Control capacity , IEEE Transactions on Information Theory, 65 (2018), pp. 235--254

  32. [32]

    Robbins and S

    H. Robbins and S. Monro , A stochastic approximation method , The annals of mathematical statistics, (1951), pp. 400--407

  33. [33]

    Sahai and S

    A. Sahai and S. Mitter , The necessity and sufficiency of anytime capacity for stabilization of a linear system over a noisy communication link—part i: Scalar systems , IEEE transactions on Information Theory, 52 (2006), pp. 3369--3395

  34. [34]

    Tu and B

    S. Tu and B. Recht , The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint , in Conference on learning theory, PMLR, 2019, pp. 3036--3083

  35. [35]

    H. Wang, T. Zariphopoulou, and X. Y. Zhou , Reinforcement learning in continuous time and space: A stochastic control approach , Journal of Machine Learning Research, 21 (2020), pp. 1--34

  36. [36]

    Whittle , Risk-sensitive optimal control , Wiley-Interscience Series in Systems and Optimization, John Wiley & Sons, Chichester, 1990

    P. Whittle , Risk-sensitive optimal control , Wiley-Interscience Series in Systems and Optimization, John Wiley & Sons, Chichester, 1990

  37. [37]

    W. M. Wonham , Optimal stationary control of a linear system with state-dependent noise , SIAM Journal on Control, 5 (1967), pp. 486--500

  38. [38]

    write newline

    " write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTION or pop #1 'skip if FUNCTION new.block.checka empty 'skip 'new.block if FUNCTION field.or.null duplicate empty pop "" 'skip ...