On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Pith reviewed 2026-05-18 08:30 UTC · model grok-4.3
The pith
RLVR converges only when policy updates align with the Gradient Gap, and training collapses once the step size exceeds a sharp threshold
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below 100 percent. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches.
What carries the argument
The Gradient Gap, a quantity that formalizes the direction of improvement from low-reward to high-reward regions of the response space
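A short worked identity may help show why this quantity is tied to the policy gradient. It is a sketch under stated assumptions, not the paper's own derivation: rewards are binary, $O_q^+$ and $O_q^-$ are the high-reward and low-reward response sets for a prompt $q$, $p = \Pr_{o \sim \pi_\theta}(o \in O_q^+)$ lies strictly between 0 and 1, and $g_q^\pm(\pi_\theta)$ are assumed to be the conditional means of the score function on each set (this page only quotes the difference $g_q^+ - g_q^-$, so that reading is an assumption).

$$
\begin{aligned}
g_q^{\pm}(\pi_\theta) &= \mathbb{E}_{o \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(o \mid q) \,\big|\, o \in O_q^{\pm}\big] && \text{(assumed definition)} \\
0 &= \mathbb{E}_{o \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(o \mid q)\big] = p\, g_q^{+} + (1-p)\, g_q^{-} && \text{(score function has zero mean)} \\
\nabla_\theta J_q(\theta) &= \mathbb{E}_{o \sim \pi_\theta}\big[\mathbf{1}\{o \in O_q^{+}\}\, \nabla_\theta \log \pi_\theta(o \mid q)\big] = p\, g_q^{+} && \text{(policy gradient of the success rate)} \\
g_q^{+} - g_q^{-} &= g_q^{+} + \tfrac{p}{1-p}\, g_q^{+} = \tfrac{1}{1-p}\, g_q^{+} && \text{(substitute } g_q^{-} = -\tfrac{p}{1-p}\, g_q^{+}\text{)} \\
\nabla_\theta J_q(\theta) &= p\,(1-p)\,\big(g_q^{+} - g_q^{-}\big)
\end{aligned}
$$

Under these assumptions the exact policy gradient points along the Gradient Gap, scaled by $p(1-p)$, which vanishes as the success rate approaches 0 or 1.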
If this is right
- Updates that align with the Gradient Gap and use a step size below the threshold set by its magnitude converge.
- The critical step size shrinks as response length grows and as the success rate rises.
- Length normalization stabilizes training by offsetting the length-dependent scaling of the threshold.
- Fixed learning rates allow success rates to plateau strictly below 100 percent.
- The same alignment and threshold conditions govern any policy-gradient method such as REINFORCE or GRPO.
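The length-normalization item in the list above can be illustrated with a toy calculation. The sketch below is not the paper's threshold and makes a strong simplifying assumption, labeled in the code: every token position shares one softmax over a tiny vocabulary, so the sequence-level score is a sum of per-token terms. The names `sequence_score`, `raw_norms`, and `normalized_norms` are hypothetical. When the per-token terms point the same way, the score norm grows linearly with length, and dividing by length removes that growth.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()


def sequence_score(tokens, logits):
    """Gradient of log pi(sequence) with respect to the shared logits.

    Toy assumption (not from the paper): each position draws its token from
    the same softmax, so the score is sum_t (one_hot(token_t) - pi).
    """
    pi = softmax(logits)
    counts = np.bincount(tokens, minlength=len(logits))
    return counts - len(tokens) * pi


logits = np.array([0.3, -0.1, 0.0])
lengths = (8, 16, 32)
raw_norms, normalized_norms = [], []
for T in lengths:
    tokens = np.zeros(T, dtype=int)  # repeat token 0 so per-token terms align
    score = sequence_score(tokens, logits)
    raw_norms.append(np.linalg.norm(score))
    normalized_norms.append(np.linalg.norm(score / T))  # length-normalized

for T, raw, norm in zip(lengths, raw_norms, normalized_norms):
    print(f"T={T:3d}  raw score norm={raw:.4f}  length-normalized={norm:.4f}")
```

For tokens sampled independently from the policy the per-token terms partly cancel, so the growth is slower than linear; the aligned case shown here is the unfavorable end of that range.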
Where Pith is reading between the lines
- An adaptive scheduler that shrinks the step size as the Gradient Gap evolves could keep training inside the convergent regime throughout.
- Token-level gap measurements might enable targeted interventions on specific response segments rather than whole trajectories.
- The scaling predictions could be tested directly on models of varying sizes to check whether the same length and success-rate dependencies appear at larger scales.
Load-bearing premise
The response space can be partitioned into low-reward and high-reward regions whose gradient directions remain stable enough for the derived threshold to govern the entire training trajectory.
What would settle it
Measure the Gradient Gap magnitude at an early training checkpoint, then run controlled training with a step size set just above the predicted threshold and check whether performance collapses as claimed or continues improving.
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Gradient Gap as a quantity formalizing the directional difference between gradients in low-reward and high-reward regions of the response space in RLVR. It claims to prove that policy update alignment with this Gradient Gap is required for convergence and derives a sharp step-size threshold based on the Gap's magnitude: training converges below the threshold and collapses above it. The theory further predicts how the critical step size scales with response length and success rate (explaining length normalization), shows that fixed learning rates can cause success rates to stagnate below 100%, and applies to general policy-gradient methods including REINFORCE and GRPO. Predictions are validated via controlled bandit simulations and post-training experiments on Qwen2.5-Math-7B with GRPO.
Significance. If the derivations are rigorous and the stability assumptions hold, the work supplies a useful theoretical lens on RLVR dynamics that could inform more reliable hyperparameter selection and algorithm design for LLM post-training. The generality across policy-gradient methods and the combination of analysis with both simulation and real-model experiments are strengths.
major comments (2)
- [Abstract and Gradient Gap definition / convergence proof] Abstract and section on Gradient Gap definition and convergence proof: the derivation of the sharp step-size threshold treats gradient directions within the low- and high-reward partitions as sufficiently stable that a fixed Gap magnitude governs convergence for the entire training trajectory. Policy updates necessarily shift probability mass toward high-reward sequences, changing the sampling distribution and therefore the expected gradients inside each region. If the effective Gap rotates or shrinks appreciably after the initial steps, the threshold derived under the initial partition no longer controls later dynamics. The manuscript should either prove approximate stability of the Gap or supply empirical measurements (e.g., plots of Gap magnitude versus training step) from the GRPO experiments to substantiate the uniform application of the bound.
- [Scaling predictions with success rate] Scaling predictions paragraph: the claim that success rate stagnates strictly below 100% with a fixed learning rate follows from the threshold interacting with the Gap. It is unclear whether this prediction remains valid once the Gap itself evolves with the increasing success rate; an explicit derivation or additional simulation isolating this interaction would be needed to support the claim.
minor comments (2)
- [Experiments] The bandit and LM experiment sections would benefit from explicit pseudocode or formulas showing how the Gradient Gap is estimated from sampled responses in practice.
- [Notation] Notation for the Gradient Gap and related quantities should be checked for consistency between the theoretical sections and the experimental figures.
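As a rough illustration of what the first minor comment asks for, the sketch below estimates a Gradient Gap from sampled responses in a tabular softmax bandit. It is not the authors' estimator. It assumes the Gap is the difference between the mean score function over successful samples and over failed samples, and the helper names (`exact_gap`, `sampled_gap`) and the toy logits and rewards are hypothetical. It also checks the sampled estimate against the exact value and against the identity that the exact policy gradient equals p(1 - p) times the Gap under this definition.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def exact_gap(theta, rewards):
    """Exact Gap for a tabular softmax policy (assumed definition).

    For this policy, grad log pi(o) = one_hot(o) - pi, so the conditional
    means over the success and failure sets have closed forms.
    """
    pi = softmax(theta)
    pos = rewards == 1
    p = pi[pos].sum()  # success rate; must lie strictly between 0 and 1
    g_pos = np.where(pos, pi / p, 0.0) - pi
    g_neg = np.where(~pos, pi / (1.0 - p), 0.0) - pi
    return g_pos - g_neg, p


def sampled_gap(theta, rewards, n, rng):
    """Monte Carlo estimate: mean score on successes minus mean on failures."""
    pi = softmax(theta)
    o = rng.choice(len(pi), size=n, p=pi)
    scores = np.eye(len(pi))[o] - pi  # one score vector per sample, shape (n, K)
    r = rewards[o]
    if r.min() == r.max():
        raise ValueError("need at least one success and one failure in the batch")
    return scores[r == 1].mean(axis=0) - scores[r == 0].mean(axis=0), r.mean()


rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1, 0.0])  # toy logits over four responses
rewards = np.array([1, 0, 0, 1])         # binary verifiable rewards

gap, p = exact_gap(theta, rewards)
gap_hat, p_hat = sampled_gap(theta, rewards, n=200_000, rng=rng)

# Exact gradient of J(theta) = sum_o pi_o * r_o for a tabular softmax policy.
grad_J = softmax(theta) * (rewards - p)

print("max |exact - sampled| Gap entry:", np.abs(gap - gap_hat).max())
print("identity grad J = p(1-p) * Gap holds:", np.allclose(grad_J, p * (1 - p) * gap))
```

For a language model the score vectors would be per-response gradients of the log-likelihood, which are far larger than the toy vectors here, so a practical estimator would likely need projections or per-layer norms; that extension is outside this sketch.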
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the stability assumptions underlying our theoretical results. We respond to each major comment below and commit to revisions that strengthen the empirical support without altering the core claims.
- Referee: [Abstract and Gradient Gap definition / convergence proof] Abstract and section on Gradient Gap definition and convergence proof: the derivation of the sharp step-size threshold treats gradient directions within the low- and high-reward partitions as sufficiently stable that a fixed Gap magnitude governs convergence for the entire training trajectory. Policy updates necessarily shift probability mass toward high-reward sequences, changing the sampling distribution and therefore the expected gradients inside each region. If the effective Gap rotates or shrinks appreciably after the initial steps, the threshold derived under the initial partition no longer controls later dynamics. The manuscript should either prove approximate stability of the Gap or supply empirical measurements (e.g., plots of Gap magnitude versus training step) from the GRPO experiments to substantiate the uniform application of the bound.
Authors: We agree that the derivation relies on the Gradient Gap remaining directionally dominant for the bound to apply uniformly. The analysis is formulated with respect to the current partition at each step, but the referee correctly notes that distribution shifts could affect later dynamics. Rather than claiming a full proof of global stability, we will add empirical measurements in the revision: specifically, plots of Gap magnitude and direction versus training step from the Qwen2.5-Math-7B GRPO runs. These will show that the Gap stays positive and does not rotate or shrink enough to invalidate the threshold during the observed convergence/collapse regimes. A short discussion of the conditions (binary verifiable rewards and moderate success rates) under which approximate stability holds will also be included. revision: yes
- Referee: [Scaling predictions with success rate] Scaling predictions paragraph: the claim that success rate stagnates strictly below 100% with a fixed learning rate follows from the threshold interacting with the Gap. It is unclear whether this prediction remains valid once the Gap itself evolves with the increasing success rate; an explicit derivation or additional simulation isolating this interaction would be needed to support the claim.
Authors: The stagnation claim is derived by noting that higher success rates reduce the effective Gap magnitude (fewer low-reward trajectories contribute to the contrast), which lowers the critical step-size threshold. A fixed learning rate that was initially safe then exceeds the new threshold, producing misalignment. We acknowledge that an explicit derivation of this feedback loop was only sketched. In the revision we will supply a short appendix derivation showing how Gap magnitude scales with success rate under the binary reward model, together with additional controlled bandit simulations that vary success rate while holding other factors fixed to isolate the stagnation effect. revision: yes
Circularity Check
No significant circularity: derivation of Gradient Gap and step-size threshold is self-contained
The paper defines the Gradient Gap as a formalization of directional improvement between low- and high-reward response regions, then derives the step-size threshold and scaling predictions directly from policy-gradient update analysis and stability assumptions on those regions. These steps rely on standard RL bounds rather than redefining the Gap or threshold in terms of the same fitted quantities or self-citations. The claims are presented as holding for arbitrary policy-gradient methods with external validation via bandit simulations and LM experiments, keeping the central derivation independent of its conclusions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The response space admits a stable partition into low-reward and high-reward regions whose directional gradient remains consistent enough for the threshold derivation to hold throughout training.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions... derive a sharp step-size threshold based on the magnitude of the Gradient Gap"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: echoes
  echoes: This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "partition the response space $O$ into two sets... $O_q^+$ and $O_q^-$... $g_q^+(\pi_\theta) - g_q^-(\pi_\theta)$ is the Gradient Gap"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.