pith. machine review for the scientific record.

arxiv: 2605.06755 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Gradient Extrapolation-Based Policy Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords policy optimization · reinforcement learning · large language models · mathematical reasoning · gradient extrapolation · GRPO · lookahead · pass@1 accuracy

The pith

GXPO approximates longer local lookahead in policy updates with only three backward passes by extrapolating gradient changes after two fast optimizer steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gradient Extrapolation-Based Policy Optimization (GXPO) as a plug-compatible rule for GRPO-style reinforcement learning when training large language models on reasoning tasks. Standard GRPO updates use only the current step, while true multi-step lookahead improves the direction but demands many new backward passes and rollouts. GXPO reuses the existing batch of rollouts, rewards, and advantages, runs two fast optimizer steps to observe how gradients shift, predicts a virtual K-step position, moves the policy partway there, and finishes with a corrective gradient step at the new location. When the predicted signal becomes unstable, an automatic check reverts to ordinary GRPO. On math-reasoning benchmarks with Qwen2.5 and Llama models, this yields higher average pass@1 scores and measurable speedups while holding the active-phase cost fixed at three backward passes.
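
To make the step structure concrete, here is a minimal sketch of one active step in PyTorch-style Python. The helper names (grpo_loss_fn, stability_ok), the locally linear gradient model, and the partway-move weighting alpha are illustrative assumptions; the abstract fixes the three-backward-pass structure but not the exact extrapolation rule or stability test.

```python
import torch

def stability_ok(g0, delta_g, max_ratio=1.0):
    # Assumed instability test: the per-step gradient change should stay small
    # relative to the gradient itself, otherwise extrapolation is unreliable.
    return delta_g.norm() <= max_ratio * g0.norm()

def gxpo_active_step(theta, grpo_loss_fn, batch, lr=1e-6, K=5, alpha=0.5):
    """One GXPO-style active step: three backward passes on one fixed batch."""
    def grad_at(p):
        # Backward pass reusing the same rollouts/rewards/advantages
        # (no new rollouts or reward computation at lookahead points).
        p = p.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(grpo_loss_fn(p, batch), p)
        return g

    g0 = grad_at(theta)                   # backward pass 1
    theta1 = theta - lr * g0              # fast optimizer step 1
    g1 = grad_at(theta1)                  # backward pass 2
    theta2 = theta1 - lr * g1             # fast optimizer step 2
    delta_g = g1 - g0                     # observed per-step gradient change

    if not stability_ok(g0, delta_g):
        return theta1                     # revert to the standard GRPO update

    # Virtual K-step endpoint under a locally linear gradient model
    # g_k ≈ g0 + k·delta_g, whose summed K-step displacement is
    # -lr · (K·g0 + K(K-1)/2 · delta_g).
    lookahead = theta - lr * (K * g0 + 0.5 * K * (K - 1) * delta_g)

    theta_mid = theta2 + alpha * (lookahead - theta2)  # move partway toward it
    g_mid = grad_at(theta_mid)            # backward pass 3: true gradient there
    return theta_mid - lr * g_mid         # corrective update
```

The fallback branch returns the plain single-step update, matching the abstract's description of reverting to standard single-pass GRPO when the lookahead signal is unstable.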

Core claim

GXPO approximates a longer local lookahead using only three backward passes during an active phase. It takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position, all while reusing the same batch of rollouts, rewards, advantages, and GRPO loss. It automatically switches back to standard GRPO when the lookahead signal becomes unstable, and a plain-gradient-descent surrogate analysis explains when the extrapolation is exact and where its local errors come from.

What carries the argument

The gradient extrapolation step that observes changes after two fast optimizer steps to construct a predicted K-step policy position for virtual lookahead without new rollouts or reward computation.

Load-bearing premise

That the observed gradient change after two fast optimizer steps provides a sufficiently accurate local linear or low-order extrapolation of the policy trajectory over K steps, and that the automatic stability check reliably detects when this approximation breaks without missing useful updates.
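
The surrogate analysis itself is not reproduced on this page, but the shape of the premise can be sketched in the plain-gradient-descent setting the abstract names. The quadratic loss below is an assumption of this sketch, chosen because it makes the gradient exactly affine along the trajectory; it is not claimed to match the paper's derivation.

```latex
% Quadratic surrogate: the gradient is affine in \theta.
L(\theta) = \tfrac{1}{2}\,(\theta - \theta^\star)^\top A\,(\theta - \theta^\star),
\qquad g(\theta) = A\,(\theta - \theta^\star)

% One GD step \theta_{t+1} = \theta_t - \eta\, g_t maps gradients linearly:
g_{t+1} = (I - \eta A)\, g_t,
\qquad \Delta g := g_1 - g_0 = -\eta A\, g_0

% Exact case: if g_0 is an eigenvector of A with eigenvalue \lambda, the two
% observed gradients determine the entire K-step trajectory,
g_k = (1 - \eta\lambda)^k\, g_0,
\qquad \theta_K = \theta_0 - \eta \sum_{k=0}^{K-1} g_k

% The linear model g_k \approx g_0 + k\,\Delta g is the first-order truncation
% of this geometric sequence; its local error grows like k^2 (\eta\lambda)^2,
% which is the "where its local errors come from" part of the claim.
```

Whether the measured gradient change on the clipped, advantage-weighted GRPO loss behaves this tamely is precisely what the stability check must police.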

What would settle it

A direct comparison on a small model where actual K-step lookahead trajectories are computed and shown to deviate substantially from GXPO's extrapolated point on the same rollouts even when the stability check passes.
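
Such a test is cheap to run at small scale. The hypothetical sketch below (helper names continue the conventions of the earlier code block and do not come from the paper) computes the relative gap between a true K-step gradient-descent trajectory and the two-gradient extrapolation on one fixed batch:

```python
import torch

def lookahead_deviation(theta, loss_fn, batch, lr=1e-6, K=5):
    """Relative gap between a true K-step GD trajectory and the endpoint
    extrapolated from the first two gradients, on one fixed batch."""
    def grad_at(p):
        p = p.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(loss_fn(p, batch), p)
        return g

    # True K-step trajectory: K backward passes, affordable on a small model.
    p, grads = theta.clone(), []
    for _ in range(K):
        g = grad_at(p)
        grads.append(g)
        p = p - lr * g

    # Extrapolated endpoint built from the first two gradients only.
    g0, g1 = grads[0], grads[1]
    delta_g = g1 - g0
    extrapolated = theta - lr * (K * g0 + 0.5 * K * (K - 1) * delta_g)

    # Normalize by how far the true trajectory actually moved.
    return ((p - extrapolated).norm() / (p - theta).norm()).item()
```

Logging this ratio next to the stability check's pass/fail decision over a training run would reveal whether large deviations slip past the check, which is the failure mode at issue here.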

Figures

Figures reproduced from arXiv: 2605.06755 by Aranya Saha, Ismam Nur Swapnil, Mohammad Ariful Haque, Ser-Nam Lim, Tanvir Ahmed Khan.

Figure 1: Overall GXPO training framework. Each active step performs three backward passes: two … [image: figures/full_fig_p002_1.png]
Figure 2: Pass@16 accuracy versus training steps across … [image: figures/full_fig_p009_2.png]
Figure 3: Training efficiency across GRPO, GXPO, and SFPO, with results reported up to 300 … [image: figures/full_fig_p009_3.png]
Figure 4: GXPO ablations on Math-500 with Qwen2.5-1.5B. Left: peak Pass@16 versus time-to-peak … [image: figures/full_fig_p023_4.png]
Figure 5: Pass@16 (EMA) versus backward passes across … [image: figures/full_fig_p023_5.png]
Figure 6: Pass@16 (EMA) for k = 5 under τ ∈ {0.7, 1, 1.5, 2} versus training steps (left), wall-clock time (center), and backward passes (right). Larger τ achieves higher accuracy across all views. [image: figures/full_fig_p023_6.png]
Figure 7: Mean response length (in tokens) versus training steps for … [image: figures/full_fig_p024_7.png]
Figure 8: GXPO diagnostic metrics versus training steps for … [image: figures/full_fig_p024_8.png]
Figure 9: Retention ratio versus training steps across … [image: figures/full_fig_p024_9.png]
Original abstract

Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by +1.65 to +5.00 points over GRPO and by +0.14 to +1.28 points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to 4.00x step speedup, 2.33x wall-clock speedup, and 1.33x backward-pass speedup in reaching GRPO's peak accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible update rule for GRPO-style RL on LLM reasoning tasks. GXPO approximates a K-step local lookahead by performing two fast optimizer steps on the current batch of rollouts/rewards/advantages/GRPO loss, measuring the resulting gradient change, and extrapolating a virtual policy position; it then applies a corrective update and falls back to standard single-pass GRPO when an automatic stability check detects instability. A plain-gradient-descent surrogate analysis is supplied to characterize when the extrapolation is exact and where local errors arise. Experiments on Qwen2.5 and Llama math-reasoning benchmarks report pass@1 gains of +1.65 to +5.00 over GRPO and +0.14 to +1.28 over the strongest SFPO baseline, with the active phase fixed at three backward passes and speedups up to 4.00x in steps, 2.33x wall-clock, and 1.33x backward passes.

Significance. If the extrapolation rule transfers reliably from the plain-GD surrogate to the actual GRPO objective, the method would supply a low-overhead mechanism for incorporating limited multi-step lookahead without new rollouts or reward evaluations. The reported accuracy gains combined with fixed three-backward-pass cost and the speedups to reach peak accuracy would constitute a practical advance for RL-based reasoning training, provided the stability check and extrapolation remain accurate across model scales and task distributions.

major comments (2)
  1. [Surrogate analysis and method description] The surrogate analysis is stated to cover only plain gradient descent and to identify conditions under which the two-step gradient difference exactly predicts the K-step trajectory. However, the deployed loss is the GRPO objective (advantage-weighted log-probability terms, clipping, and any KL or entropy regularizers). Because the derivation assumes an unconstrained quadratic or smooth GD flow, the measured gradient difference on the composite GRPO loss need not obey the same linear or low-order extrapolation; this directly affects whether the virtual lookahead point and the stability check are reliable. (See abstract description of the surrogate and the method overview.)
  2. [Experiments] The abstract reports average pass@1 improvements but supplies no error bars, number of random seeds, ablation results on the stability-check threshold or extrapolation horizon K, or statistical significance tests. Without these, it is impossible to determine whether the +1.65 to +5.00 point gains over GRPO are robust or whether they could be explained by variance in the base GRPO runs. (A minimal sketch of such a significance test appears after the minor comments below.)
minor comments (3)
  1. [Method overview] The high-level description of the three-backward-pass procedure would benefit from explicit pseudocode or a numbered algorithmic listing that distinguishes the two fast steps, the extrapolation computation, the corrective update, and the stability check.
  2. [Abstract and method] Notation for the extrapolated policy position, the gradient-difference vector, and the stability metric should be introduced once and used consistently; currently the abstract leaves several quantities implicit.
  3. [Method overview] The claim of 'plug-compatible' with GRPO should be accompanied by a short statement of which GRPO hyperparameters (clipping threshold, KL coefficient, etc.) remain unchanged under GXPO.
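
On major comment 2, the requested robustness evidence is mechanically simple to produce once per-seed scores exist. Below is a minimal sketch of a paired sign-flip permutation test on per-seed mean pass@1 differences; the seed count and all score values in the usage comment are invented placeholders, not numbers from the paper.

```python
import numpy as np

def paired_permutation_test(gxpo_scores, grpo_scores, n_perm=100_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-seed score gaps."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(gxpo_scores) - np.asarray(grpo_scores)  # paired by seed
    observed = diffs.mean()
    # Under the null of no method effect, each seed's sign is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float((np.abs(null) >= abs(observed)).mean())

# Hypothetical usage with five seeds (all numbers invented for illustration):
# p_value = paired_permutation_test([43.1, 44.0, 42.7, 43.6, 43.9],
#                                   [41.2, 42.5, 41.9, 42.0, 41.6])
```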

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our work. Below we provide point-by-point responses to the major comments and indicate the revisions we intend to make.

Point-by-point responses
  1. Referee: [Surrogate analysis and method description] The surrogate analysis is stated to cover only plain gradient descent and to identify conditions under which the two-step gradient difference exactly predicts the K-step trajectory. However, the deployed loss is the GRPO objective (advantage-weighted log-probability terms, clipping, and any KL or entropy regularizers). Because the derivation assumes an unconstrained quadratic or smooth GD flow, the measured gradient difference on the composite GRPO loss need not obey the same linear or low-order extrapolation; this directly affects whether the virtual lookahead point and the stability check are reliable. (See abstract description of the surrogate and the method overview.)

    Authors: We clarify that the surrogate analysis under plain gradient descent is provided to offer theoretical intuition regarding the conditions for exact extrapolation and the origins of approximation errors in a controlled setting. The actual GXPO implementation operates on the GRPO loss and includes a stability check to revert to standard updates when the extrapolated signal is deemed unreliable. We agree that further analysis bridging the surrogate to the full GRPO objective would be beneficial. In the revised manuscript, we will add a dedicated subsection discussing the limitations of the surrogate and providing empirical evidence from our training runs on the frequency and impact of fallback to GRPO. revision: partial

  2. Referee: [Experiments] The abstract reports average pass@1 improvements but supplies no error bars, number of random seeds, ablation results on the stability-check threshold or extrapolation horizon K, or statistical significance tests. Without these, it is impossible to determine whether the +1.65 to +5.00 point gains over GRPO are robust or whether they could be explained by variance in the base GRPO runs.

    Authors: The referee correctly identifies the lack of statistical details in the current presentation. To address this, we will revise the experimental section to include results from multiple random seeds, error bars, ablations on the stability-check threshold and the value of K, and appropriate statistical significance tests. These additions will help demonstrate the robustness of the observed improvements across the Qwen2.5 and Llama models. revision: yes

Circularity Check

0 steps flagged

No significant circularity; surrogate analysis is explanatory and results are empirical.

full rationale

The paper introduces GXPO as a practical algorithm that reuses existing rollouts and GRPO loss computations to approximate a K-step lookahead via two fast optimizer steps and a stability check. The provided surrogate analysis is explicitly described as explanatory for the plain-GD case and does not define or derive the GRPO-specific update rule; the reported gains (+1.65 to +5.00 pass@1 over GRPO) are presented as experimental measurements on Qwen2.5 and Llama models rather than quantities obtained by fitting or renaming the same inputs. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no prediction reduces by construction to a fitted parameter or prior result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper builds on the existing GRPO framework and standard policy-gradient assumptions. No new free parameters, axioms, or invented entities are explicitly introduced in the abstract; the extrapolation rule is presented as a derived heuristic whose validity is checked via a surrogate analysis.

pith-pipeline@v0.9.0 · 5617 in / 1319 out tokens · 46981 ms · 2026-05-11T02:01:30.958602+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 12 internal anchors

  1. [1]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  2. [2]

    AlpaGasus: Training a better Alpaca with fewer data

    Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., and Jin, H. AlpaGasus: Training a better Alpaca with fewer data. In ICLR, 2024

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  4. [4]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

  5. [5]

    Stable reinforcement learning for efficient reasoning

    Dai, M., Liu, S., and Si, Q. Stable reinforcement learning for efficient reasoning. arXiv preprint arXiv:2505.18086, 2025

  6. [6]

    Policy gradient with tree expansion

    Dalal, G., Hallak, A., Thoppe, G., Mannor, S., and Chechik, G. Policy gradient with tree expansion. In ICML, 2025

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  9. [9]

    Concise reasoning via reinforcement learning

    Fatemi, M., Rafiee, B., Tang, M., and Talamadupula, K. Concise reasoning via reinforcement learning. arXiv preprint arXiv:2504.05185, 2025

  10. [10]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  11. [11]

    History rhymes: Accelerating LLM reinforcement learning with RhymeRL

    He, J., Li, T., Feng, E., Du, D., Liu, Q., Liu, T., Xia, Y., and Chen, H. History rhymes: Accelerating LLM reinforcement learning with RhymeRL. arXiv preprint arXiv:2508.18588, 2025

  12. [12]

    Measuring mathematical problem solving with the MATH dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021

  13. [13]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  14. [14]

    Efficient memory management for large language model serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), pages 611--626, 2023. doi:10.1145/3600006.3613165

  15. [15]

    Solving quantitative reasoning problems with language models

    Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. In NeurIPS, 2022

  16. [16]

    LIMR: Less is more for RL scaling

    Li, X., Zou, H., and Liu, P. LIMR: Less is more for RL scaling. arXiv preprint arXiv:2502.11886, 2025

  17. [17]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  18. [18]

    Decoupled weight decay regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019

  19. [19]

    AMC Problems and Solutions

    Art of Problem Solving. AMC Problems and Solutions. https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions, 2024

  20. [20]

    Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

    Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Technical blog, 2024. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

  21. [21]

    Reinforcement learning with verifiable rewards: GRPO's effective loss, dynamics, and success amplification

    Mroueh, Y. Reinforcement learning with verifiable rewards: GRPO's effective loss, dynamics, and success amplification. arXiv preprint arXiv:2503.06639, 2025

  22. [22]

    Revisiting group relative policy optimization: Insights into on-policy and off-policy training

    Mroueh, Y., Dupuis, N., Belgodere, B., Nitsure, A., Rigotti, M., Greenewald, K., Navratil, J., Ross, J., and Rios, J. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257, 2025

  23. [23]

    s1: Simple test-time scaling

    Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Li, F.-F., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  24. [24]

    Policy mirror descent with lookahead

    Protopapas, K. and Barakat, A. Policy mirror descent with lookahead. In NeurIPS, 2024

  25. [25]

    Qwen2.5 Technical Report

    Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  26. [26]

    Planning and learning with adaptive lookahead

    Rosenberg, A., Hallak, A., Mannor, S., Chechik, G., and Dalal, G. Planning and learning with adaptive lookahead. In AAAI, 2023

  27. [27]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  29. [29]

    Mitigating think-answer mismatch in LLM reasoning through noise-aware advantage reweighting

    Shen, S., Shen, P., Zhao, W., and Zhu, D. Mitigating think-answer mismatch in LLM reasoning through noise-aware advantage reweighting. arXiv preprint arXiv:2508.05928, 2025

  30. [30]

    HybridFlow: A flexible and efficient RLHF framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25), 2025. doi:10.1145/3689031.3696075

  31. [31]

    Learning off-policy with online planning

    Sikchi, H., Zhou, W., and Held, D. Learning off-policy with online planning. In CoRL, 2021

  32. [32]

    Policy gradient methods for reinforcement learning with function approximation

    Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 1999

  33. [33]

    Slow-Fast Policy Optimization: Reposition-before-update for LLM reasoning

    Wang, Z., Wang, Z., Fu, J., Qu, X., Cheng, Q., Tang, S., Zhang, M., and Huo, X. Slow-Fast Policy Optimization: Reposition-before-update for LLM reasoning. In ICLR, 2026

  34. [34]

    Reinforcement learning for reasoning in large language models with one training example

    Wang, Y., Yang, Q., Zeng, Z., Ren, L., Liu, L., Peng, B., Cheng, H., He, X., Wang, K., Gao, J., et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025

  35. [35]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., Bian, J., and Yang, M. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025

  36. [36]

    Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3--4):229--256, 1992

  37. [37]

    LESS: Selecting influential data for targeted instruction tuning

    Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. LESS: Selecting influential data for targeted instruction tuning. In Proceedings of the 41st International Conference on Machine Learning, PMLR 235:54104--54132, 2024

  38. [38]

    A minimalist approach to LLM reasoning: From rejection sampling to reinforce

    Xiong, W., Yao, J., Xu, Y., Pang, B., Wang, L., Sahoo, D., Li, J., Jiang, N., Zhang, T., Xiong, C., and Dong, H. A minimalist approach to LLM reasoning: From rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025

  39. [39]

    Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning

    Xu, Y. E., Savani, Y., Fang, F., and Kolter, J. Z. Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning. Transactions on Machine Learning Research, 2026

  40. [40]

    LIMO: Less is more for reasoning

    Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. LIMO: Less is more for reasoning. In COLM, 2025

  41. [41]

    DAPO: An open-source LLM reinforcement learning system at scale

    Yu, Q., Zhang, Z., Zhu, R., et al. DAPO: An open-source LLM reinforcement learning system at scale. In NeurIPS, 2025

  42. [42]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yue, Y., Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., Wei, X., Yu, X., Liu, G., Liu, J., Liu, L., Lin, H., Lin, Z., Ma, B., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhang, R., Liu, X., Wang, M., Wu, Y., and Yan, L. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2...

  43. [43]

    Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. In NeurIPS, 2019

  44. [44]

    Towards understanding why lookahead generalizes better than SGD and beyond

    Zhou, P., Yan, H., Yuan, X., Feng, J., and Yan, S. Towards understanding why lookahead generalizes better than SGD and beyond. In NeurIPS, 2021

  45. [45]

    Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts

    Zheng, H., Zhou, Y., Bartoldson, B. R., Kailkhura, B., Lai, F., Zhao, J., and Chen, B. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025

  46. [46]

    TTRL: Test-Time Reinforcement Learning

    Zuo, Y., Zhang, K., Sheng, L., Qu, S., Cui, G., Zhu, X., Li, H., Zhang, Y., Long, X., Hua, E., Qi, B., Sun, Y., Ma, Z., Yuan, L., Ding, N., and Zhou, B. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025