Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

Haoyu Han; Heng Yang

arxiv: 2602.01460 · v3 · pith:KJW4776Snew · submitted 2026-02-01 · 🧮 math.OC · cs.LG

keywords estimator policies policy policy-gradient across finite-horizon gaussian general

Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly, either in closed form or via numerical moment-evaluation algorithms, without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g., under SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.
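
To make the NSR definition in the abstract concrete, here is a minimal Monte Carlo sketch for a scalar linear system with a Gaussian linear-feedback policy. It is an illustration only and not the paper's exact characterization: the toy dynamics, the cost weights, all default parameter values, and the function names `reinforce_sample` and `empirical_nsr` are assumptions made for this example, and the true gradient in the denominator is replaced by a sample mean.

```python
import random
from statistics import fmean, variance


def reinforce_sample(k, rng, horizon=5, a=0.9, b=0.5, q=1.0, r=0.1,
                     sigma=0.5, x0=1.0):
    """One single-rollout REINFORCE sample of d(expected cost)/dk.

    Hypothetical toy setup: x' = a*x + b*u, u ~ N(k*x, sigma^2),
    stage cost q*x^2 + r*u^2, terminal cost q*x^2.
    """
    x, score, cost = x0, 0.0, 0.0
    for _ in range(horizon):
        eps = rng.gauss(0.0, 1.0)
        u = k * x + sigma * eps
        # d/dk log N(u; k*x, sigma^2) = (u - k*x) * x / sigma^2 = eps * x / sigma
        score += eps * x / sigma
        cost += q * x * x + r * u * u
        x = a * x + b * u
    cost += q * x * x  # terminal cost
    return score * cost  # score-function (REINFORCE) gradient sample


def empirical_nsr(k, n_rollouts=20000, seed=0, **system):
    """Sample variance of the estimator divided by the squared sample-mean gradient."""
    rng = random.Random(seed)
    samples = [reinforce_sample(k, rng, **system) for _ in range(n_rollouts)]
    grad_estimate = fmean(samples)  # Monte Carlo stand-in for the true gradient
    return variance(samples) / grad_estimate ** 2


if __name__ == "__main__":
    for gain in (0.5, 0.0, -0.5):
        print(gain, empirical_nsr(gain))
```

Evaluating `empirical_nsr` over a grid of gains gives a rough picture of how the ratio changes with the policy parameter. Near a stationary point the sample-mean gradient in the denominator is itself dominated by noise, so this estimate becomes unreliable there; that is the regime where an exact characterization is most useful.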

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning
cs.MA · 2026-05 · unverdicted · novelty 7.0
A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.

On Training in Imagination
cs.LG · 2026-05 · unverdicted · novelty 6.0
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...

Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
cs.LG · 2026-04 · unverdicted · novelty 6.0
Tempered sequential Monte Carlo samples efficiently from a temperature-annealed distribution over controller parameters to solve trajectory and policy optimization under differentiable dynamics.

Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
cs.LG · 2026-04 · unverdicted · novelty 6.0
Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.

On Training in Imagination
cs.LG · 2026-05 · unverdicted · novelty 5.0
The paper derives the optimal dynamics-to-reward sample ratio minimizing return error under power-law scaling and proves that zero-mean reward noise in REINFORCE adds only variance that shrinks with more rollouts.