
arxiv: 2510.08539 · v4 · submitted 2025-10-09 · 💻 cs.LG · cs.AI · cs.IT · math.IT · math.OC · stat.ML

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Pith reviewed 2026-05-18 08:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · math.IT · math.OC · stat.ML
keywords RLVR · Gradient Gap · step size threshold · policy gradient · convergence · REINFORCE · GRPO · language model post-training

The pith

RLVR convergence requires policy updates to align with the Gradient Gap or training collapses above a sharp step-size threshold

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a theoretical foundation for Reinforcement Learning with Verifiable Rewards by examining optimization at full-response and token levels. It defines the Gradient Gap as the directional improvement from low-reward to high-reward response regions. The central result proves that update directions must align with this gap for convergence to occur. A precise step-size threshold follows directly from the gap magnitude: steps below it produce convergence while steps above cause performance to collapse. The analysis accounts for observed effects such as improved stability from length normalization and the possibility of success rates plateauing below 100 percent under fixed learning rates, and it applies to standard policy-gradient methods including REINFORCE and GRPO.

Core claim

We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below 100 percent. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches.

What carries the argument

The Gradient Gap, a quantity that formalizes the direction of improvement from low-reward to high-reward regions of the response space
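
One way to make this quantity concrete is sketched below. The notation is an assumption made for illustration and is not taken from the paper's text: $r \in \{0,1\}$ is the binary reward of a response $o$ sampled from the policy $\pi_\theta$, $J = \Pr(r = 1)$ is the success rate with $0 < J < 1$, and $g^+$, $g^-$ are the mean score functions over high-reward and low-reward responses. Under those assumptions, the standard score-function identity ties the policy gradient to the difference $g^+ - g^-$:

```latex
\begin{align*}
g^{+} &= \mathbb{E}\big[\nabla_\theta \log \pi_\theta(o) \,\big|\, r = 1\big],
\qquad
g^{-} = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(o) \,\big|\, r = 0\big], \\
\nabla_\theta J &= \mathbb{E}\big[r \,\nabla_\theta \log \pi_\theta(o)\big] = J\, g^{+}, \\
0 &= \mathbb{E}\big[\nabla_\theta \log \pi_\theta(o)\big] = J\, g^{+} + (1 - J)\, g^{-}
\;\;\Longrightarrow\;\;
g^{+} - g^{-} = \frac{g^{+}}{1 - J}, \\
\nabla_\theta J &= J\,(1 - J)\,\big(g^{+} - g^{-}\big).
\end{align*}
```

Read this way, the exact policy gradient always points along $g^+ - g^-$, and the factor $J(1-J)$ shows how the success rate rescales it. Whether the paper's Gradient Gap coincides exactly with $g^+ - g^-$ should be checked against its definition.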

If this is right

  • Updates whose magnitude stays below the Gradient Gap threshold produce convergence.
  • The critical step size depends on response length and shrinks as the success rate rises.
  • Length normalization stabilizes training by offsetting the length-dependent scaling of the threshold.
  • Fixed learning rates allow success rates to plateau strictly below 100 percent.
  • The same alignment and threshold conditions govern any policy-gradient method such as REINFORCE or GRPO.
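
The length-normalization point in the list above can be illustrated with a toy policy. This is a minimal sketch, not the paper's setup: it assumes every token of a response is drawn i.i.d. from one softmax over shared logits, and the names `softmax` and `response_score` are hypothetical helpers introduced here. It compares the gradient of the summed log-probability of a response with its length-normalized counterpart.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def response_score(logits, tokens, length_normalize=False):
    """Gradient of a response's log-probability w.r.t. the shared logits.

    Toy assumption: each token is drawn i.i.d. from softmax(logits), so the
    gradient of sum_t log pi(o_t) is sum_t (onehot(o_t) - probs).
    """
    probs = softmax(logits)
    counts = np.bincount(tokens, minlength=logits.size)
    score = counts - len(tokens) * probs
    if length_normalize:
        # Length normalization: average over tokens instead of summing.
        score = score / len(tokens)
    return score

rng = np.random.default_rng(0)
logits = np.array([0.5, 0.0, -0.5, 1.0])
for T in (8, 64, 512):
    tokens = rng.choice(logits.size, size=T, p=softmax(logits))
    raw = np.linalg.norm(response_score(logits, tokens))
    normalized = np.linalg.norm(response_score(logits, tokens, length_normalize=True))
    print(f"T={T:4d}  sum-aggregated norm={raw:.3f}  length-normalized norm={normalized:.3f}")
```

The normalized score is the summed score divided by the response length $T$, so under a fixed learning rate it takes a step that is $T$ times smaller for the same response. How this interacts with the paper's threshold depends on the exact scaling derived there, which this toy does not reproduce.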

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • An adaptive scheduler that shrinks the step size as the Gradient Gap evolves could keep training inside the convergent regime throughout.
  • Token-level gap measurements might enable targeted interventions on specific response segments rather than whole trajectories.
  • The scaling predictions could be tested directly on models of varying sizes to check whether the same length and success-rate dependencies appear at larger scales.

Load-bearing premise

The response space can be partitioned into low-reward and high-reward regions whose gradient directions remain stable enough for the derived threshold to govern the entire training trajectory.

What would settle it

Measure the Gradient Gap magnitude at an early training checkpoint, then run controlled training with a step size set just above the predicted threshold and check whether performance collapses as claimed or continues improving.
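
The protocol above can be sketched as a step-size sweep on a toy problem. Everything here is an assumption made for illustration: the softmax bandit, the group-mean baseline (a simplified GRPO-like advantage without standard-deviation scaling), the function names `run_bandit` and `sweep`, and the placeholder value `eta_star`, which stands in for a threshold that would have to be computed from the paper's formula. The toy is not guaranteed to show the predicted collapse; it only shows the shape of the experiment.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def run_bandit(eta, steps=200, group=8, n_arms=10, seed=0):
    """Train a softmax bandit with a group-mean-baseline policy gradient.

    Arm 0 is the only rewarded arm (binary verifiable reward).
    Returns the final success probability pi(arm 0).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_arms)
    for _ in range(steps):
        p = softmax(theta)
        arms = rng.choice(n_arms, size=group, p=p)
        rewards = (arms == 0).astype(float)
        advantages = rewards - rewards.mean()      # group-mean baseline
        scores = np.eye(n_arms)[arms] - p          # grad log pi for each sample
        theta += eta * (advantages[:, None] * scores).mean(axis=0)
    return softmax(theta)[0]

def sweep(eta_star, multipliers=(0.5, 0.9, 1.1, 2.0), **kwargs):
    """Run training at step sizes just below and just above eta_star."""
    return {m: run_bandit(m * eta_star, **kwargs) for m in multipliers}

eta_star = 1.0  # hypothetical placeholder, NOT derived from the paper
for multiplier, success in sweep(eta_star).items():
    print(f"eta = {multiplier:.1f} * eta_star -> final success rate {success:.3f}")
```

In a real test, `eta_star` would come from a Gradient Gap measurement at an early checkpoint, and the comparison of interest is between the runs just below and just above it, repeated over several seeds.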

Figures

Figures reproduced from arXiv: 2510.08539 by Joe Suk, Yaqi Duan.

Figure 1: Contextual Bandit Experiments.
Figure 3: MAB Experiments.

F.2 Further Elaboration on Contextual Bandit Experiments. Our theoretical analysis so far has focused on convergence for a single prompt $q$. A natural question is: how does the theory extend to the case of multiple prompts or questions? To illustrate this, we consider a contextual bandit simulation. In this setup, each iteration $k$ draws a random context $x_k$ (equivalent to a prompt $q_k$ in our f…

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Gradient Gap as a quantity formalizing the directional difference between gradients in low-reward and high-reward regions of the response space in RLVR. It claims to prove that policy update alignment with this Gradient Gap is required for convergence and derives a sharp step-size threshold based on the Gap's magnitude: training converges below the threshold and collapses above it. The theory further predicts how the critical step size scales with response length and success rate (explaining length normalization), shows that fixed learning rates can cause success rates to stagnate below 100%, and applies to general policy-gradient methods including REINFORCE and GRPO. Predictions are validated via controlled bandit simulations and post-training experiments on Qwen2.5-Math-7B with GRPO.

Significance. If the derivations are rigorous and the stability assumptions hold, the work supplies a useful theoretical lens on RLVR dynamics that could inform more reliable hyperparameter selection and algorithm design for LLM post-training. The generality across policy-gradient methods and the combination of analysis with both simulation and real-model experiments are strengths.

major comments (2)
  1. [Abstract and Gradient Gap definition / convergence proof] Abstract and section on Gradient Gap definition and convergence proof: the derivation of the sharp step-size threshold treats gradient directions within the low- and high-reward partitions as sufficiently stable that a fixed Gap magnitude governs convergence for the entire training trajectory. Policy updates necessarily shift probability mass toward high-reward sequences, changing the sampling distribution and therefore the expected gradients inside each region. If the effective Gap rotates or shrinks appreciably after the initial steps, the threshold derived under the initial partition no longer controls later dynamics. The manuscript should either prove approximate stability of the Gap or supply empirical measurements (e.g., plots of Gap magnitude versus training step) from the GRPO experiments to substantiate the uniform application of the bound.
  2. [Scaling predictions with success rate] Scaling predictions paragraph: the claim that success rate stagnates strictly below 100% with a fixed learning rate follows from the threshold interacting with the Gap. It is unclear whether this prediction remains valid once the Gap itself evolves with the increasing success rate; an explicit derivation or additional simulation isolating this interaction would be needed to support the claim.
minor comments (2)
  1. [Experiments] The bandit and LM experiment sections would benefit from explicit pseudocode or formulas showing how the Gradient Gap is estimated from sampled responses in practice.
  2. [Notation] Notation for the Gradient Gap and related quantities should be checked for consistency between the theoretical sections and the experimental figures.
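
As an illustration of the kind of estimator the first minor comment asks for, the following is a minimal sketch, not the authors' procedure. It assumes the Gradient Gap can be estimated as the difference between the mean score function $\nabla_\theta \log \pi_\theta(o)$ over successful and failed samples; the function name `gradient_gap_estimate` and the four-armed softmax bandit are hypothetical. The toy also checks the exact identity $\nabla_\theta J = J(1-J)(g^+ - g^-)$, which holds for any policy with binary rewards and $0 < J < 1$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def gradient_gap_estimate(scores, rewards):
    """Mean score over successes minus mean score over failures.

    scores:  (n, d) array, row i is grad log pi(o_i) for sampled response o_i.
    rewards: (n,) binary array of verifiable rewards.
    """
    rewards = np.asarray(rewards, dtype=bool)
    if rewards.all() or not rewards.any():
        raise ValueError("need at least one success and one failure")
    return scores[rewards].mean(axis=0) - scores[~rewards].mean(axis=0)

# Toy softmax bandit: arms 0 and 2 are rewarded (an assumption for the demo).
theta = np.array([0.3, -0.2, 0.1, 0.0])
p = softmax(theta)
correct = np.array([True, False, True, False])
J = p[correct].sum()                       # exact success rate
exact_scores = np.eye(4) - p               # row a is grad log pi(a)

# Exact conditional means and exact policy gradient.
g_plus = (p[correct][:, None] * exact_scores[correct]).sum(axis=0) / J
g_minus = (p[~correct][:, None] * exact_scores[~correct]).sum(axis=0) / (1 - J)
exact_gap = g_plus - g_minus
grad_J = (p[correct][:, None] * exact_scores[correct]).sum(axis=0)

# Monte Carlo estimate from sampled responses.
rng = np.random.default_rng(0)
arms = rng.choice(4, size=20000, p=p)
est_gap = gradient_gap_estimate(exact_scores[arms], correct[arms])

print("identity holds:", np.allclose(grad_J, J * (1 - J) * exact_gap))
print("max estimation error:", np.abs(est_gap - exact_gap).max())
```

For a language model, each row of `scores` would be a per-response gradient of the log-probability, which is usually too large to store; projecting onto a fixed direction or a parameter subset is a common workaround, though the source does not say how the paper handles this.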

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the stability assumptions underlying our theoretical results. We respond to each major comment below and commit to revisions that strengthen the empirical support without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract and Gradient Gap definition / convergence proof] Abstract and section on Gradient Gap definition and convergence proof: the derivation of the sharp step-size threshold treats gradient directions within the low- and high-reward partitions as sufficiently stable that a fixed Gap magnitude governs convergence for the entire training trajectory. Policy updates necessarily shift probability mass toward high-reward sequences, changing the sampling distribution and therefore the expected gradients inside each region. If the effective Gap rotates or shrinks appreciably after the initial steps, the threshold derived under the initial partition no longer controls later dynamics. The manuscript should either prove approximate stability of the Gap or supply empirical measurements (e.g., plots of Gap magnitude versus training step) from the GRPO experiments to substantiate the uniform application of the bound.

    Authors: We agree that the derivation relies on the Gradient Gap remaining directionally dominant for the bound to apply uniformly. The analysis is formulated with respect to the current partition at each step, but the referee correctly notes that distribution shifts could affect later dynamics. Rather than claiming a full proof of global stability, we will add empirical measurements in the revision: specifically, plots of Gap magnitude and direction versus training step from the Qwen2.5-Math-7B GRPO runs. These will show that the Gap stays positive and does not rotate or shrink enough to invalidate the threshold during the observed convergence/collapse regimes. A short discussion of the conditions (binary verifiable rewards and moderate success rates) under which approximate stability holds will also be included. revision: yes

  2. Referee: [Scaling predictions with success rate] Scaling predictions paragraph: the claim that success rate stagnates strictly below 100% with a fixed learning rate follows from the threshold interacting with the Gap. It is unclear whether this prediction remains valid once the Gap itself evolves with the increasing success rate; an explicit derivation or additional simulation isolating this interaction would be needed to support the claim.

    Authors: The stagnation claim is derived by noting that higher success rates reduce the effective Gap magnitude (fewer low-reward trajectories contribute to the contrast), which lowers the critical step-size threshold. A fixed learning rate that was initially safe then exceeds the new threshold, producing misalignment. We acknowledge that an explicit derivation of this feedback loop was only sketched. In the revision we will supply a short appendix derivation showing how Gap magnitude scales with success rate under the binary reward model, together with additional controlled bandit simulations that vary success rate while holding other factors fixed to isolate the stagnation effect. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation of Gradient Gap and step-size threshold is self-contained

full rationale

The paper defines the Gradient Gap as a formalization of directional improvement between low- and high-reward response regions, then derives the step-size threshold and scaling predictions directly from policy-gradient update analysis and stability assumptions on those regions. These steps rely on standard RL bounds rather than redefining the Gap or threshold in terms of the same fitted quantities or self-citations. The claims are presented as holding for arbitrary policy-gradient methods with external validation via bandit simulations and LM experiments, keeping the central derivation independent of its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the existence of a well-defined Gradient Gap that remains directionally stable across training steps and on the validity of the policy-gradient update rule for binary rewards; no free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: The response space admits a stable partition into low-reward and high-reward regions whose directional gradient remains consistent enough for the threshold derivation to hold throughout training.
    Invoked in the definition of Gradient Gap and the convergence proof (abstract).

pith-pipeline@v0.9.0 · 5771 in / 1351 out tokens · 27071 ms · 2026-05-18T08:30:42.120025+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
