pith. machine review for the scientific record.

arxiv: 2605.06755 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Gradient Extrapolation-Based Policy Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords policy optimization · reinforcement learning · large language models · mathematical reasoning · gradient extrapolation · GRPO · lookahead · pass@1 accuracy

The pith

GXPO approximates longer local lookahead in policy updates with only three backward passes by extrapolating gradient changes after two fast optimizer steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gradient Extrapolation-Based Policy Optimization (GXPO) as a plug-compatible rule for GRPO-style reinforcement learning when training large language models on reasoning tasks. Standard GRPO updates use only the current step, while true multi-step lookahead improves the direction but demands many new backward passes and rollouts. GXPO reuses the existing batch of rollouts, rewards, and advantages, runs two fast optimizer steps to observe how gradients shift, predicts a virtual K-step position, moves the policy partway there, and finishes with a corrective gradient step at the new location. When the predicted signal becomes unstable, an automatic check reverts to ordinary GRPO. On math-reasoning benchmarks with Qwen2.5 and Llama models, this yields higher average pass@1 scores and measurable speedups while holding the active-phase cost fixed at three backward passes.
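
To make the step structure concrete, here is a minimal sketch of one active step in PyTorch-style Python. The helper names (grpo_loss_fn, stability_ok), the locally linear gradient model, and the partway-move weighting alpha are illustrative assumptions; the abstract fixes the three-backward-pass structure but not the exact extrapolation rule or stability test.

```python
import torch

def stability_ok(g0, delta_g, max_ratio=1.0):
    # Assumed instability test: the per-step gradient change should stay small
    # relative to the gradient itself, otherwise extrapolation is unreliable.
    return delta_g.norm() <= max_ratio * g0.norm()

def gxpo_active_step(theta, grpo_loss_fn, batch, lr=1e-6, K=5, alpha=0.5):
    """One GXPO-style active step: three backward passes on one fixed batch."""
    def grad_at(p):
        # Backward pass reusing the same rollouts/rewards/advantages
        # (no new rollouts or reward computation at lookahead points).
        p = p.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(grpo_loss_fn(p, batch), p)
        return g

    g0 = grad_at(theta)                   # backward pass 1
    theta1 = theta - lr * g0              # fast optimizer step 1
    g1 = grad_at(theta1)                  # backward pass 2
    theta2 = theta1 - lr * g1             # fast optimizer step 2
    delta_g = g1 - g0                     # observed per-step gradient change

    if not stability_ok(g0, delta_g):
        return theta1                     # revert to the standard GRPO update

    # Virtual K-step endpoint under a locally linear gradient model
    # g_k ≈ g0 + k·delta_g, whose summed K-step displacement is
    # -lr · (K·g0 + K(K-1)/2 · delta_g).
    lookahead = theta - lr * (K * g0 + 0.5 * K * (K - 1) * delta_g)

    theta_mid = theta2 + alpha * (lookahead - theta2)  # move partway toward it
    g_mid = grad_at(theta_mid)            # backward pass 3: true gradient there
    return theta_mid - lr * g_mid         # corrective update
```

The fallback branch returns the plain single-step update, matching the abstract's description of reverting to standard single-pass GRPO when the lookahead signal is unstable.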

Core claim

GXPO approximates a longer local lookahead using only three backward passes during an active phase. It takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position, all while reusing the same batch of rollouts, rewards, advantages, and GRPO loss. It automatically switches back to standard GRPO when the lookahead signal becomes unstable, and a plain-gradient-descent surrogate analysis explains when the extrapolation is exact and where its local errors come from.

What carries the argument

The gradient extrapolation step that observes changes after two fast optimizer steps to construct a predicted K-step policy position for virtual lookahead without new rollouts or reward computation.

Load-bearing premise

That the observed gradient change after two fast optimizer steps provides a sufficiently accurate local linear or low-order extrapolation of the policy trajectory over K steps, and that the automatic stability check reliably detects when this approximation breaks without missing useful updates.
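
The surrogate analysis itself is not reproduced on this page, but the shape of the premise can be sketched in the plain-gradient-descent setting the abstract names. The quadratic loss below is an assumption of this sketch, chosen because it makes the gradient exactly affine along the trajectory; it is not claimed to match the paper's derivation.

```latex
% Quadratic surrogate: the gradient is affine in \theta.
L(\theta) = \tfrac{1}{2}\,(\theta - \theta^\star)^\top A\,(\theta - \theta^\star),
\qquad g(\theta) = A\,(\theta - \theta^\star)

% One GD step \theta_{t+1} = \theta_t - \eta\, g_t maps gradients linearly:
g_{t+1} = (I - \eta A)\, g_t,
\qquad \Delta g := g_1 - g_0 = -\eta A\, g_0

% Exact case: if g_0 is an eigenvector of A with eigenvalue \lambda, the two
% observed gradients determine the entire K-step trajectory,
g_k = (1 - \eta\lambda)^k\, g_0,
\qquad \theta_K = \theta_0 - \eta \sum_{k=0}^{K-1} g_k

% The linear model g_k \approx g_0 + k\,\Delta g is the first-order truncation
% of this geometric sequence; its local error grows like k^2 (\eta\lambda)^2,
% which is the "where its local errors come from" part of the claim.
```

Whether the measured gradient change on the clipped, advantage-weighted GRPO loss behaves this tamely is precisely what the stability check must police.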

What would settle it

A direct comparison on a small model where actual K-step lookahead trajectories are computed and shown to deviate substantially from GXPO's extrapolated point on the same rollouts even when the stability check passes.
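
Such a test is cheap to run at small scale. The hypothetical sketch below (helper names continue the conventions of the earlier code block and do not come from the paper) computes the relative gap between a true K-step gradient-descent trajectory and the two-gradient extrapolation on one fixed batch:

```python
import torch

def lookahead_deviation(theta, loss_fn, batch, lr=1e-6, K=5):
    """Relative gap between a true K-step GD trajectory and the endpoint
    extrapolated from the first two gradients, on one fixed batch."""
    def grad_at(p):
        p = p.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(loss_fn(p, batch), p)
        return g

    # True K-step trajectory: K backward passes, affordable on a small model.
    p, grads = theta.clone(), []
    for _ in range(K):
        g = grad_at(p)
        grads.append(g)
        p = p - lr * g

    # Extrapolated endpoint built from the first two gradients only.
    g0, g1 = grads[0], grads[1]
    delta_g = g1 - g0
    extrapolated = theta - lr * (K * g0 + 0.5 * K * (K - 1) * delta_g)

    # Normalize by how far the true trajectory actually moved.
    return ((p - extrapolated).norm() / (p - theta).norm()).item()
```

Logging this ratio next to the stability check's pass/fail decision over a training run would reveal whether large deviations slip past the check, which is the failure mode at issue here.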

Figures

Figures reproduced from arXiv: 2605.06755 by Aranya Saha, Ismam Nur Swapnil, Mohammad Ariful Haque, Ser-Nam Lim, Tanvir Ahmed Khan.

Figure 1: Overall GXPO training framework. Each active step performs three backward passes: two … [image: figures/full_fig_p002_1.png]
Figure 2: Pass@16 accuracy versus training steps across … [image: figures/full_fig_p009_2.png]
Figure 3: Training efficiency across GRPO, GXPO, and SFPO, with results reported up to 300 … [image: figures/full_fig_p009_3.png]
Figure 4: GXPO ablations on Math-500 with Qwen2.5-1.5B. Left: peak Pass@16 versus time-to-peak … [image: figures/full_fig_p023_4.png]
Figure 5: Pass@16 (EMA) versus backward passes across … [image: figures/full_fig_p023_5.png]
Figure 6: Pass@16 (EMA) for k = 5 under τ ∈ {0.7, 1, 1.5, 2} versus training steps (left), wall-clock time (center), and backward passes (right). Larger τ achieves higher accuracy across all views. [image: figures/full_fig_p023_6.png]
Figure 7: Mean response length (in tokens) versus training steps for … [image: figures/full_fig_p024_7.png]
Figure 8: GXPO diagnostic metrics versus training steps for … [image: figures/full_fig_p024_8.png]
Figure 9: Retention ratio versus training steps across … [image: figures/full_fig_p024_9.png]
Original abstract

Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by +1.65 to +5.00 points over GRPO and by +0.14 to +1.28 points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to 4.00x step speedup, 2.33x wall-clock speedup, and 1.33x backward-pass speedup in reaching GRPO's peak accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible update rule for GRPO-style RL on LLM reasoning tasks. GXPO approximates a K-step local lookahead by performing two fast optimizer steps on the current batch of rollouts/rewards/advantages/GRPO loss, measuring the resulting gradient change, and extrapolating a virtual policy position; it then applies a corrective update and falls back to standard single-pass GRPO when an automatic stability check detects instability. A plain-gradient-descent surrogate analysis is supplied to characterize when the extrapolation is exact and where local errors arise. Experiments on Qwen2.5 and Llama math-reasoning benchmarks report pass@1 gains of +1.65 to +5.00 over GRPO and +0.14 to +1.28 over the strongest SFPO baseline, with the active phase fixed at three backward passes and speedups up to 4.00x in steps, 2.33x wall-clock, and 1.33x backward passes.

Significance. If the extrapolation rule transfers reliably from the plain-GD surrogate to the actual GRPO objective, the method would supply a low-overhead mechanism for incorporating limited multi-step lookahead without new rollouts or reward evaluations. The reported accuracy gains combined with fixed three-backward-pass cost and the speedups to reach peak accuracy would constitute a practical advance for RL-based reasoning training, provided the stability check and extrapolation remain accurate across model scales and task distributions.

major comments (2)
  1. [Surrogate analysis and method description] The surrogate analysis is stated to cover only plain gradient descent and to identify conditions under which the two-step gradient difference exactly predicts the K-step trajectory. However, the deployed loss is the GRPO objective (advantage-weighted log-probability terms, clipping, and any KL or entropy regularizers). Because the derivation assumes an unconstrained quadratic or smooth GD flow, the measured gradient difference on the composite GRPO loss need not obey the same linear or low-order extrapolation; this directly affects whether the virtual lookahead point and the stability check are reliable. (See abstract description of the surrogate and the method overview.)
  2. [Experiments] The abstract reports average pass@1 improvements but supplies no error bars, number of random seeds, ablation results on the stability-check threshold or extrapolation horizon K, or statistical significance tests. Without these, it is impossible to determine whether the +1.65 to +5.00 point gains over GRPO are robust or whether they could be explained by variance in the base GRPO runs. (A minimal sketch of such a significance test appears after the minor comments below.)
minor comments (3)
  1. [Method overview] The high-level description of the three-backward-pass procedure would benefit from explicit pseudocode or a numbered algorithmic listing that distinguishes the two fast steps, the extrapolation computation, the corrective update, and the stability check.
  2. [Abstract and method] Notation for the extrapolated policy position, the gradient-difference vector, and the stability metric should be introduced once and used consistently; currently the abstract leaves several quantities implicit.
  3. [Method overview] The claim of 'plug-compatible' with GRPO should be accompanied by a short statement of which GRPO hyperparameters (clipping threshold, KL coefficient, etc.) remain unchanged under GXPO.
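
On major comment 2, the requested robustness evidence is mechanically simple to produce once per-seed scores exist. Below is a minimal sketch of a paired sign-flip permutation test on per-seed mean pass@1 differences; the seed count and all score values in the usage comment are invented placeholders, not numbers from the paper.

```python
import numpy as np

def paired_permutation_test(gxpo_scores, grpo_scores, n_perm=100_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-seed score gaps."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(gxpo_scores) - np.asarray(grpo_scores)  # paired by seed
    observed = diffs.mean()
    # Under the null of no method effect, each seed's sign is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float((np.abs(null) >= abs(observed)).mean())

# Hypothetical usage with five seeds (all numbers invented for illustration):
# p_value = paired_permutation_test([43.1, 44.0, 42.7, 43.6, 43.9],
#                                   [41.2, 42.5, 41.9, 42.0, 41.6])
```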

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our work. Below we provide point-by-point responses to the major comments and indicate the revisions we intend to make.

Point-by-point responses
  1. Referee: [Surrogate analysis and method description] The surrogate analysis is stated to cover only plain gradient descent and to identify conditions under which the two-step gradient difference exactly predicts the K-step trajectory. However, the deployed loss is the GRPO objective (advantage-weighted log-probability terms, clipping, and any KL or entropy regularizers). Because the derivation assumes an unconstrained quadratic or smooth GD flow, the measured gradient difference on the composite GRPO loss need not obey the same linear or low-order extrapolation; this directly affects whether the virtual lookahead point and the stability check are reliable. (See abstract description of the surrogate and the method overview.)

    Authors: We clarify that the surrogate analysis under plain gradient descent is provided to offer theoretical intuition regarding the conditions for exact extrapolation and the origins of approximation errors in a controlled setting. The actual GXPO implementation operates on the GRPO loss and includes a stability check to revert to standard updates when the extrapolated signal is deemed unreliable. We agree that further analysis bridging the surrogate to the full GRPO objective would be beneficial. In the revised manuscript, we will add a dedicated subsection discussing the limitations of the surrogate and providing empirical evidence from our training runs on the frequency and impact of fallback to GRPO. revision: partial

  2. Referee: [Experiments] The abstract reports average pass@1 improvements but supplies no error bars, number of random seeds, ablation results on the stability-check threshold or extrapolation horizon K, or statistical significance tests. Without these, it is impossible to determine whether the +1.65 to +5.00 point gains over GRPO are robust or whether they could be explained by variance in the base GRPO runs.

    Authors: The referee correctly identifies the lack of statistical details in the current presentation. To address this, we will revise the experimental section to include results from multiple random seeds, error bars, ablations on the stability-check threshold and the value of K, and appropriate statistical significance tests. These additions will help demonstrate the robustness of the observed improvements across the Qwen2.5 and Llama models. revision: yes

Circularity Check

0 steps flagged

No significant circularity; surrogate analysis is explanatory and results are empirical.

full rationale

The paper introduces GXPO as a practical algorithm that reuses existing rollouts and GRPO loss computations to approximate a K-step lookahead via two fast optimizer steps and a stability check. The provided surrogate analysis is explicitly described as explanatory for the plain-GD case and does not define or derive the GRPO-specific update rule; the reported gains (+1.65 to +5.00 pass@1 over GRPO) are presented as experimental measurements on Qwen2.5 and Llama models rather than quantities obtained by fitting or renaming the same inputs. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no prediction reduces by construction to a fitted parameter or prior result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper builds on the existing GRPO framework and standard policy-gradient assumptions. No new free parameters, axioms, or invented entities are explicitly introduced in the abstract; the extrapolation rule is presented as a derived heuristic whose validity is checked via a surrogate analysis.

pith-pipeline@v0.9.0 · 5617 in / 1319 out tokens · 46981 ms · 2026-05-11T02:01:30.958602+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 12 internal anchors

  1. [1]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  2. [2]

    AlpaGasus: Training a better Alpaca with fewer data

    Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., and Jin, H. AlpaGasus: Training a better Alpaca with fewer data. In ICLR, 2024

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  4. [4]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

  5. [5]

    Stable reinforcement learning for efficient reasoning

    Dai, M., Liu, S., and Si, Q. Stable reinforcement learning for efficient reasoning. arXiv preprint arXiv:2505.18086, 2025

  6. [6]

    Policy gradient with tree expansion

    Dalal, G., Hallak, A., Thoppe, G., Mannor, S., and Chechik, G. Policy gradient with tree expansion. In ICML, 2025

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  9. [9]

    Concise reasoning via reinforcement learning

    Fatemi, M., Rafiee, B., Tang, M., and Talamadupula, K. Concise reasoning via reinforcement learning. arXiv preprint arXiv:2504.05185, 2025

  10. [10]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  11. [11]

    History rhymes: Accelerating LLM reinforcement learning with RhymeRL

    He, J., Li, T., Feng, E., Du, D., Liu, Q., Liu, T., Xia, Y., and Chen, H. History rhymes: Accelerating LLM reinforcement learning with RhymeRL. arXiv preprint arXiv:2508.18588, 2025

  12. [12]

    Measuring mathematical problem solving with the MATH dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021

  13. [13]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  14. [14]

    Efficient memory management for large language model serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), pages 611--626, 2023. doi:10.1145/3600006.3613165

  15. [15]

    Solving quantitative reasoning problems with language models

    Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. In NeurIPS, 2022

  16. [16]

    LIMR: Less is more for RL scaling

    Li, X., Zou, H., and Liu, P. LIMR: Less is more for RL scaling. arXiv preprint arXiv:2502.11886, 2025

  17. [17]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  18. [18]

    Decoupled weight decay regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019

  19. [19]

    AMC Problems and Solutions

    Art of Problem Solving. AMC Problems and Solutions. https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions, 2024

  20. [20]

    Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

    Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Technical blog, 2024. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

  21. [21]

    Reinforcement learning with verifiable rewards: GRPO's effective loss, dynamics, and success amplification

    Mroueh, Y. Reinforcement learning with verifiable rewards: GRPO's effective loss, dynamics, and success amplification. arXiv preprint arXiv:2503.06639, 2025

  22. [22]

    Revisiting group relative policy optimization: Insights into on-policy and off-policy training

    Mroueh, Y., Dupuis, N., Belgodere, B., Nitsure, A., Rigotti, M., Greenewald, K., Navratil, J., Ross, J., and Rios, J. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257, 2025

  23. [23]

    s1: Simple test-time scaling

    Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Li, F.-F., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  24. [24]

    Policy mirror descent with lookahead

    Protopapas, K. and Barakat, A. Policy mirror descent with lookahead. In NeurIPS, 2024

  25. [25]

    Qwen2.5 Technical Report

    Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  26. [26]

    Planning and learning with adaptive lookahead

    Rosenberg, A., Hallak, A., Mannor, S., Chechik, G., and Dalal, G. Planning and learning with adaptive lookahead. In AAAI, 2023

  27. [27]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  29. [29]

    Mitigating think-answer mismatch in LLM reasoning through noise-aware advantage reweighting

    Shen, S., Shen, P., Zhao, W., and Zhu, D. Mitigating think-answer mismatch in LLM reasoning through noise-aware advantage reweighting. arXiv preprint arXiv:2508.05928, 2025

  30. [30]

    HybridFlow: A flexible and efficient RLHF framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25), 2025. doi:10.1145/3689031.3696075

  31. [31]

    Learning off-policy with online planning

    Sikchi, H., Zhou, W., and Held, D. Learning off-policy with online planning. In CoRL, 2021

  32. [32]

    Policy gradient methods for reinforcement learning with function approximation

    Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 1999

  33. [33]

    Slow-Fast Policy Optimization: Reposition-before-update for LLM reasoning

    Wang, Z., Wang, Z., Fu, J., Qu, X., Cheng, Q., Tang, S., Zhang, M., and Huo, X. Slow-Fast Policy Optimization: Reposition-before-update for LLM reasoning. In ICLR, 2026

  34. [34]

    Reinforcement learning for reasoning in large language models with one training example

    Wang, Y., Yang, Q., Zeng, Z., Ren, L., Liu, L., Peng, B., Cheng, H., He, X., Wang, K., Gao, J., et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025

  35. [35]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025

  36. [36]

    Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3--4):229--256, 1992

  37. [37]

    LESS: Selecting influential data for targeted instruction tuning

    Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. LESS: Selecting influential data for targeted instruction tuning. In Proceedings of the 41st International Conference on Machine Learning, PMLR 235:54104--54132, 2024

  38. [38]

    A minimalist approach to LLM reasoning: From rejection sampling to reinforce

    Xiong, W., Yao, J., Xu, Y., Pang, B., Wang, L., Sahoo, D., Li, J., Jiang, N., Zhang, T., Xiong, C., and Dong, H. A minimalist approach to LLM reasoning: From rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025

  39. [39]

    Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning

    Xu, Y. E., Savani, Y., Fang, F., and Kolter, J. Z. Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning. Transactions on Machine Learning Research, 2026

  40. [40]

    LIMO: Less is more for reasoning

    Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. LIMO: Less is more for reasoning. In COLM, 2025

  41. [41]

    DAPO: An open-source LLM reinforcement learning system at scale

    Yu, Q., Zhang, Z., Zhu, R., et al. DAPO: An open-source LLM reinforcement learning system at scale. In NeurIPS, 2025

  42. [42]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yue, Y., Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., Wei, X., Yu, X., Liu, G., Liu, J., Liu, L., Lin, H., Lin, Z., Ma, B., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhang, R., Liu, X., Wang, M., Wu, Y., and Yan, L. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2...

  43. [43]

    Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. In NeurIPS, 2019

  44. [44]

    Towards understanding why lookahead generalizes better than SGD and beyond

    Zhou, P., Yan, H., Yuan, X., Feng, J., and Yan, S. Towards understanding why lookahead generalizes better than SGD and beyond. In NeurIPS, 2021

  45. [45]

    Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts

    Zheng, H., Zhou, Y., Bartoldson, B. R., Kailkhura, B., Lai, F., Zhao, J., and Chen, B. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025

  46. [46]

    TTRL: Test-Time Reinforcement Learning

    Zuo, Y., Zhang, K., Sheng, L., Qu, S., Cui, G., Zhu, X., Li, H., Zhang, Y., Long, X., Hua, E., Qi, B., Sun, Y., Ma, Z., Yuan, L., Ding, N., and Zhou, B. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025