On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Joe Suk; Yaqi Duan

arXiv: 2510.08539 · v4 · submitted 2025-10-09 · 💻 cs.LG · cs.AI · cs.IT · math.IT · math.OC · stat.ML

Pith reviewed 2026-05-18 08:30 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.IT · math.IT · math.OC · stat.ML

keywords: RLVR · Gradient Gap · step size threshold · policy gradient · convergence · REINFORCE · GRPO · language model post-training

The pith

RLVR convergence requires policy updates to align with the Gradient Gap or training collapses above a sharp step-size threshold

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a theoretical foundation for Reinforcement Learning with Verifiable Rewards by examining optimization at full-response and token levels. It defines the Gradient Gap as the directional improvement from low-reward to high-reward response regions. The central result proves that update directions must align with this gap for convergence to occur. A precise step-size threshold follows directly from the gap magnitude: steps below it produce convergence while steps above cause performance to collapse. The analysis accounts for observed effects such as improved stability from length normalization and the possibility of success rates plateauing below 100 percent under fixed learning rates, and it applies to standard policy-gradient methods including REINFORCE and GRPO.

Core claim

We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below 100 percent. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches.

What carries the argument

The Gradient Gap, a quantity that formalizes the direction of improvement from low-reward to high-reward regions of the response space
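
To make the definition concrete, it can be written out as follows. The partition $\mathcal{O}^+_q$, $\mathcal{O}^-_q$ and the difference $g^+_q - g^-_q$ come from the paper passage quoted further down this page; reading $g^{\pm}_q$ as conditional expectations of the score function is an assumption made here, and $J_q$ (the success rate on prompt $q$) is notation introduced only for illustration.

```latex
\begin{align*}
J_q(\pi_\theta) &= \Pr_{o \sim \pi_\theta(\cdot \mid q)}\bigl[o \in \mathcal{O}^+_q\bigr], \\
g^{\pm}_q(\pi_\theta) &= \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}
  \bigl[\nabla_\theta \log \pi_\theta(o \mid q) \,\big|\, o \in \mathcal{O}^{\pm}_q\bigr], \\
\text{Gradient Gap} &= g^+_q(\pi_\theta) - g^-_q(\pi_\theta).
\end{align*}
% The score function has zero mean, so
%   J_q g^+_q + (1 - J_q) g^-_q = 0.
% With a binary reward, \nabla_\theta J_q = J_q g^+_q, and therefore
\begin{equation*}
\nabla_\theta J_q(\pi_\theta) = J_q (1 - J_q)\,\bigl(g^+_q(\pi_\theta) - g^-_q(\pi_\theta)\bigr).
\end{equation*}
```

Under these assumed definitions the exact policy gradient is a rescaled Gradient Gap, which is one way to see why the gap is a natural reference direction for any policy-gradient update.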

If this is right

Updates that align with the Gradient Gap and use a step size below the gap-derived threshold converge; step sizes above the threshold cause collapse.
The critical step size shrinks as response length grows and as the success rate rises.
Length normalization stabilizes training by offsetting the length-dependent scaling of the threshold.
Fixed learning rates allow success rates to plateau strictly below 100 percent.
The same alignment and threshold conditions govern any policy-gradient method such as REINFORCE or GRPO.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

An adaptive scheduler that shrinks the step size as the Gradient Gap evolves could keep training inside the convergent regime throughout.
Token-level gap measurements might enable targeted interventions on specific response segments rather than whole trajectories.
The scaling predictions could be tested directly on models of varying sizes to check whether the same length and success-rate dependencies appear at larger scales.

Load-bearing premise

The response space can be partitioned into low-reward and high-reward regions whose gradient directions remain stable enough for the derived threshold to govern the entire training trajectory.

What would settle it

Measure the Gradient Gap magnitude at an early training checkpoint, then run controlled training with a step size set just above the predicted threshold and check whether performance collapses as claimed or continues improving.
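
The protocol above can be scaffolded on a toy problem before spending compute on a language model. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses a tabular softmax bandit with exact gradients, all function names are hypothetical, and `PREDICTED_THRESHOLD` is a placeholder that would have to come from the paper's threshold formula (not reproduced on this page). The toy is not claimed to exhibit the collapse; it only shows the shape of a sweep at fixed multiples of a predicted threshold.

```python
import numpy as np

def softmax(theta):
    z = theta - np.max(theta)          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def success_rate(theta, rewards):
    # Probability mass the policy puts on rewarded (high-reward) arms.
    return float(softmax(theta) @ rewards)

def policy_gradient(theta, rewards):
    # Exact gradient of the success rate for a tabular softmax policy:
    # grad J = sum_a r_a * pi_a * (e_a - pi) = pi * r - J * pi.
    pi = softmax(theta)
    return pi * rewards - (pi @ rewards) * pi

def train(theta0, rewards, step_size, n_steps):
    theta = np.array(theta0, dtype=float)
    curve = [success_rate(theta, rewards)]
    for _ in range(n_steps):
        theta = theta + step_size * policy_gradient(theta, rewards)
        curve.append(success_rate(theta, rewards))
    return np.array(curve)

def sweep(theta0, rewards, predicted_threshold, multipliers, n_steps=500):
    # One training run per multiple of the predicted threshold.
    return {m: train(theta0, rewards, m * predicted_threshold, n_steps)
            for m in multipliers}

rewards = np.array([1.0, 0.0, 0.0, 0.0])   # one rewarded arm out of four
theta0 = np.zeros(4)                        # uniform initial policy
PREDICTED_THRESHOLD = 1.0                   # placeholder, NOT derived from the paper

curves = sweep(theta0, rewards, PREDICTED_THRESHOLD, multipliers=(0.5, 0.9, 1.1, 2.0))
for m, curve in curves.items():
    print(f"step = {m:.1f} x threshold: success rate {curve[0]:.3f} -> {curve[-1]:.3f}")
```

In a real test, `train` would be replaced by the actual REINFORCE or GRPO loop, and the success-rate curves just below and just above the threshold would be compared.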

Figures

Figures reproduced from arXiv: 2510.08539 by Joe Suk, Yaqi Duan.

**Figure 3.** MAB Experiments.

F.2 Further Elaboration on Contextual Bandit Experiments. Our theoretical analysis so far has focused on convergence for a single prompt $q$. A natural question is: how does the theory extend to the case of multiple prompts or questions? To illustrate this, we consider a contextual bandit simulation. In this setup, each iteration $k$ draws a random context $x_k$ (equivalent to a prompt $q_k$ in our f…

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives a Gradient Gap and explicit step-size threshold for RLVR convergence that explains some heuristics, but the stability of gradient directions as the policy shifts is the key assumption to verify.

The one thing to know is that this paper defines a Gradient Gap between low- and high-reward responses and uses it to get an explicit step-size threshold below which RLVR training converges and above which it collapses. That threshold also scales with length and success rate in ways that match some common heuristics like length normalization. They prove alignment with the gap is needed for convergence and show the result holds for any policy-gradient method, including REINFORCE and GRPO. The predictions about success rate stagnating below 100% with a fixed learning rate follow directly from the scaling. The bandit simulations isolate the effect cleanly, and the GRPO runs on Qwen2.5-Math-7B tie it to current post-training practice. This is the part that could actually change how people set step sizes.

The softer spot is whether the low- and high-reward gradient directions remain stable enough for the fixed threshold to control the entire trajectory. As the policy puts more mass on high-reward sequences, the sampling distribution changes, which can alter the expected gradients inside each region. If the gap rotates or shrinks by more than a small amount after the first updates, the bound derived from the initial partition may not govern later stages. The paper reports overall validation, but direct measurements of gap evolution would strengthen the claim that the threshold applies uniformly.

This is aimed at researchers working on reliable RLVR for reasoning models. Anyone tuning GRPO or similar methods will get concrete scaling rules and an explanation for observed instabilities. It deserves a serious referee because the core derivation connects to standard policy-gradient analysis and the experiments are specific enough to check the predictions against.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Gradient Gap as a quantity formalizing the directional difference between gradients in low-reward and high-reward regions of the response space in RLVR. It claims to prove that policy update alignment with this Gradient Gap is required for convergence and derives a sharp step-size threshold based on the Gap's magnitude: training converges below the threshold and collapses above it. The theory further predicts how the critical step size scales with response length and success rate (explaining length normalization), shows that fixed learning rates can cause success rates to stagnate below 100%, and applies to general policy-gradient methods including REINFORCE and GRPO. Predictions are validated via controlled bandit simulations and post-training experiments on Qwen2.5-Math-7B with GRPO.

Significance. If the derivations are rigorous and the stability assumptions hold, the work supplies a useful theoretical lens on RLVR dynamics that could inform more reliable hyperparameter selection and algorithm design for LLM post-training. The generality across policy-gradient methods and the combination of analysis with both simulation and real-model experiments are strengths.

major comments (2)

[Abstract and Gradient Gap definition / convergence proof] Abstract and section on Gradient Gap definition and convergence proof: the derivation of the sharp step-size threshold treats gradient directions within the low- and high-reward partitions as sufficiently stable that a fixed Gap magnitude governs convergence for the entire training trajectory. Policy updates necessarily shift probability mass toward high-reward sequences, changing the sampling distribution and therefore the expected gradients inside each region. If the effective Gap rotates or shrinks appreciably after the initial steps, the threshold derived under the initial partition no longer controls later dynamics. The manuscript should either prove approximate stability of the Gap or supply empirical measurements (e.g., plots of Gap magnitude versus training step) from the GRPO experiments to substantiate the uniform application of the bound.
[Scaling predictions with success rate] Scaling predictions paragraph: the claim that success rate stagnates strictly below 100% with a fixed learning rate follows from the threshold interacting with the Gap. It is unclear whether this prediction remains valid once the Gap itself evolves with the increasing success rate; an explicit derivation or additional simulation isolating this interaction would be needed to support the claim.

minor comments (2)

[Experiments] The bandit and LM experiment sections would benefit from explicit pseudocode or formulas showing how the Gradient Gap is estimated from sampled responses in practice.
[Notation] Notation for the Gradient Gap and related quantities should be checked for consistency between the theoretical sections and the experimental figures.
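
As an illustration of the kind of estimator the first minor comment asks for, here is a minimal Monte Carlo sketch. It is not taken from the paper: it assumes a tabular softmax policy over a handful of responses, takes the Gradient Gap to be the difference of conditional mean score functions over rewarded and unrewarded samples, and uses hypothetical names throughout. The final check relies on the identity grad J = J(1 - J)(g+ - g-), which holds for binary rewards because the score function has zero mean.

```python
import numpy as np

def softmax(theta):
    z = theta - np.max(theta)          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def estimate_gradient_gap(theta, rewards, n_samples, rng):
    """Monte Carlo estimate of g+ - g- for a tabular softmax policy.

    rewards[a] in {0, 1} is the verifiable reward of response a
    (toy setup assumed for illustration only).
    """
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=n_samples, p=pi)
    # Score function of a softmax policy: grad log pi(a) = e_a - pi.
    scores = np.eye(len(theta))[actions] - pi
    success = rewards[actions] == 1
    if success.all() or not success.any():
        raise ValueError("batch needs at least one success and one failure")
    g_plus = scores[success].mean(axis=0)     # mean score over rewarded samples
    g_minus = scores[~success].mean(axis=0)   # mean score over unrewarded samples
    return g_plus - g_minus, float(success.mean())

theta = np.array([0.5, 0.0, -0.5, 0.2])
rewards = np.array([1, 0, 0, 1])
rng = np.random.default_rng(0)

gap_hat, j_hat = estimate_gradient_gap(theta, rewards, n_samples=200_000, rng=rng)

# Exact policy gradient of the success rate, for comparison.
pi = softmax(theta)
exact_grad = pi * rewards - (pi @ rewards) * pi

print("estimated gap:           ", np.round(gap_hat, 3))
print("J(1-J) * estimated gap:  ", np.round(j_hat * (1 - j_hat) * gap_hat, 3))
print("exact policy gradient:   ", np.round(exact_grad, 3))
```

For a language model the one-hot score would be replaced by per-response gradients of the log-likelihood, which is far more expensive; the grouping by reward is the part that carries over.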

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the stability assumptions underlying our theoretical results. We respond to each major comment below and commit to revisions that strengthen the empirical support without altering the core claims.

Referee: [Abstract and Gradient Gap definition / convergence proof] Abstract and section on Gradient Gap definition and convergence proof: the derivation of the sharp step-size threshold treats gradient directions within the low- and high-reward partitions as sufficiently stable that a fixed Gap magnitude governs convergence for the entire training trajectory. Policy updates necessarily shift probability mass toward high-reward sequences, changing the sampling distribution and therefore the expected gradients inside each region. If the effective Gap rotates or shrinks appreciably after the initial steps, the threshold derived under the initial partition no longer controls later dynamics. The manuscript should either prove approximate stability of the Gap or supply empirical measurements (e.g., plots of Gap magnitude versus training step) from the GRPO experiments to substantiate the uniform application of the bound.

Authors: We agree that the derivation relies on the Gradient Gap remaining directionally dominant for the bound to apply uniformly. The analysis is formulated with respect to the current partition at each step, but the referee correctly notes that distribution shifts could affect later dynamics. Rather than claiming a full proof of global stability, we will add empirical measurements in the revision: specifically, plots of Gap magnitude and direction versus training step from the Qwen2.5-Math-7B GRPO runs. These will show that the Gap stays positive and does not rotate or shrink enough to invalidate the threshold during the observed convergence/collapse regimes. A short discussion of the conditions (binary verifiable rewards and moderate success rates) under which approximate stability holds will also be included.

Revision: yes

Referee: [Scaling predictions with success rate] Scaling predictions paragraph: the claim that success rate stagnates strictly below 100% with a fixed learning rate follows from the threshold interacting with the Gap. It is unclear whether this prediction remains valid once the Gap itself evolves with the increasing success rate; an explicit derivation or additional simulation isolating this interaction would be needed to support the claim.

Authors: The stagnation claim is derived by noting that higher success rates reduce the effective Gap magnitude (fewer low-reward trajectories contribute to the contrast), which lowers the critical step-size threshold. A fixed learning rate that was initially safe then exceeds the new threshold, producing misalignment. We acknowledge that an explicit derivation of this feedback loop was only sketched. In the revision we will supply a short appendix derivation showing how Gap magnitude scales with success rate under the binary reward model, together with additional controlled bandit simulations that vary success rate while holding other factors fixed to isolate the stagnation effect.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation of Gradient Gap and step-size threshold is self-contained

The paper defines the Gradient Gap as a formalization of directional improvement between low- and high-reward response regions, then derives the step-size threshold and scaling predictions directly from policy-gradient update analysis and stability assumptions on those regions. These steps rely on standard RL bounds rather than redefining the Gap or threshold in terms of the same fitted quantities or self-citations. The claims are presented as holding for arbitrary policy-gradient methods with external validation via bandit simulations and LM experiments, keeping the central derivation independent of its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the existence of a well-defined Gradient Gap that remains directionally stable across training steps and on the validity of the policy-gradient update rule for binary rewards; no free parameters or invented physical entities are stated in the abstract.

axioms (1)

Domain assumption: The response space admits a stable partition into low-reward and high-reward regions whose directional gradient remains consistent enough for the threshold derivation to hold throughout training.
Invoked in the definition of the Gradient Gap and the convergence proof (abstract).

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions... derive a sharp step-size threshold based on the magnitude of the Gradient Gap"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes
Paper passage: "partition the response space $O$ into two sets... $O^+_q$ and $O^-_q$... $g^+_q(\pi_\theta) - g^-_q(\pi_\theta)$ is the Gradient Gap"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
echoes: The paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
