pith. machine review for the scientific record.

arxiv: 2510.13786 · v1 · submitted 2025-10-15 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links


The Art of Scaling Reinforcement Learning Compute for LLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 16:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · scaling laws · large language models · compute scaling · sigmoidal curves · RL efficiency · asymptotic performance · predictive models
0 comments

The pith

RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a large-scale study exceeding 400,000 GPU-hours to develop a framework for scaling reinforcement learning in large language models. It fits sigmoidal curves to model how performance improves with compute and ablates various design choices to determine their impact on the ultimate performance limit versus the speed of reaching it. The work finds that stable recipes produce reliable trajectories, allowing predictions of performance at scales an order of magnitude larger, as demonstrated with a run reaching 100,000 GPU-hours. This matters because it gives RL the kind of forecasting power long available for pre-training, helping practitioners allocate compute more effectively.

Core claim

We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe that not all recipes yield similar asymptotic performance, that details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and that stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. We propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours.

What carries the argument

Sigmoidal compute-performance curves fitted to RL training data after systematic ablations of design choices to separate effects on performance asymptote from efficiency.
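The asymptote-versus-efficiency split can be sketched with a toy fit. Everything below is invented for illustration (synthetic "recipes", made-up parameter values, generic `scipy` least-squares fitting); the paper's exact functional form, data, and fitting procedure are not reproduced here:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, A, k, x0):
    """Performance vs. log10(compute): A = asymptote, x0 = efficiency midpoint."""
    return A / (1.0 + np.exp(-k * (log_c - x0)))

# Synthetic compute grid (log10 GPU-hours) and two invented "recipes" that
# share an asymptote (0.80) but differ in efficiency (midpoints 2.8 vs. 3.3).
log_c = np.linspace(2.0, 4.0, 12)
rng = np.random.default_rng(0)
y_fast = sigmoid(log_c, 0.80, 2.5, 2.8) + rng.normal(0.0, 0.005, log_c.size)
y_slow = sigmoid(log_c, 0.80, 2.5, 3.3) + rng.normal(0.0, 0.005, log_c.size)

p_fast, _ = curve_fit(sigmoid, log_c, y_fast, p0=[0.8, 2.0, 3.0])
p_slow, _ = curve_fit(sigmoid, log_c, y_slow, p0=[0.8, 2.0, 3.0])

print(f"asymptote A:  fast={p_fast[0]:.3f}  slow={p_slow[0]:.3f}")  # near-identical
print(f"midpoint x0:  fast={p_fast[2]:.3f}  slow={p_slow[2]:.3f}")  # clearly apart
```

On the paper's reading, ablated design choices mostly move `x0` (how fast the curve rises) while leaving `A` (where it tops out) essentially fixed, as in this synthetic case.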

If this is right

  • Not every RL recipe reaches the same final performance level no matter how much compute is used.
  • Many implementation details affect only the compute required to approach the performance limit.
  • Predictive curves from small runs allow testing new ideas without full-scale experiments.
  • The ScaleRL recipe offers a stable baseline for large RL training runs.
  • Validation performance predictions can guide decisions on whether to continue scaling a given setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could let teams test dozens of algorithmic variants at small scale and only fully scale the most promising ones.
  • If the pattern holds for other post-training techniques, similar curves might organize supervised fine-tuning and preference tuning as well.
  • Early detection of poor scaling could save substantial compute by abandoning inefficient recipes sooner.
  • The framework opens the door to automated search over RL hyperparameters guided by predicted scaling efficiency.

Load-bearing premise

The sigmoidal functional form fitted on smaller runs will continue to accurately describe performance when compute is scaled up by a factor of ten or more.

What would settle it

Fit a sigmoid to performance data from runs using less than 10,000 GPU-hours for a stable recipe, then run the same recipe at 100,000 GPU-hours and check whether the actual result matches the extrapolated prediction within a small error margin.
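A minimal version of that settling test, with synthetic data standing in for real training curves (the "true" parameters and noise level here are invented, and the real check would compare against an actual 100,000 GPU-hour run rather than a simulated one):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, A, k, x0):
    return A / (1.0 + np.exp(-k * (log_c - x0)))

true = dict(A=0.75, k=2.0, x0=3.2)   # invented "stable recipe" ground truth
rng = np.random.default_rng(1)

# Fit only on runs up to 10,000 GPU-hours (log10 <= 4).
log_c_small = np.linspace(2.5, 4.0, 10)
y_small = sigmoid(log_c_small, **true) + rng.normal(0.0, 0.004, log_c_small.size)
popt, _ = curve_fit(sigmoid, log_c_small, y_small, p0=[0.7, 1.5, 3.0])

# Extrapolate an order of magnitude out, to 100,000 GPU-hours (log10 = 5).
pred = sigmoid(5.0, *popt)
actual = sigmoid(5.0, **true)  # stands in for the real 100k GPU-hour result
print(f"predicted={pred:.3f}  observed={actual:.3f}  gap={abs(pred - actual):.3f}")
```

The test passes if the gap stays within a pre-registered error margin; on real data, a large gap at 100,000 GPU-hours would falsify the load-bearing premise.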

read the original abstract

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports a large-scale empirical study (>400,000 GPU-hours) of RL training for LLMs. It fits sigmoidal compute-performance curves, ablates design choices (loss aggregation, normalization, curriculum, off-policy algorithms), concludes that many choices affect only compute efficiency and not asymptotic performance, proposes a ScaleRL recipe, and validates the approach by successfully extrapolating from smaller runs to predict performance on one 100,000 GPU-hour RL training run.

Significance. If the central claims hold, the work would establish the first systematic, predictive framework for RL scaling in LLMs, analogous to established pre-training scaling laws. The study size, the distinction between efficiency and asymptotic effects, and the large-scale validation run are concrete strengths that could guide more efficient compute allocation and recipe design.

major comments (2)
  1. [Abstract / Validation Experiment] Abstract and validation section: the extrapolation claim rests on a single successful 100,000 GPU-hour run. Because the sigmoidal parameters and the conclusion that ablated factors do not shift the asymptote were fitted on smaller-scale data, evidence that the same asymptotes and functional form continue to hold at an order-of-magnitude larger scale for multiple independent recipes is required to support the generality of the scaling trajectories.
  2. [Ablation and Curve-Fitting Sections] Ablation and curve-fitting sections: the paper reports that design choices modulate efficiency but not asymptote, yet provides no details on the exact sigmoidal functional form, fitting procedure, error bars, number of independent runs per data point, or whether any extrapolations were performed on strictly held-out data. Without these, the robustness of the predictability claim cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract would benefit from a concise statement of the precise sigmoidal equation and the criteria used to declare a recipe 'stable and scalable.'

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript while being transparent about the scope of our study.

read point-by-point responses
  1. Referee: [Abstract / Validation Experiment] Abstract and validation section: the extrapolation claim rests on a single successful 100,000 GPU-hour run. Because the sigmoidal parameters and the conclusion that ablated factors do not shift the asymptote were fitted on smaller-scale data, evidence that the same asymptotes and functional form continue to hold at an order-of-magnitude larger scale for multiple independent recipes is required to support the generality of the scaling trajectories.

    Authors: We agree that validation on a single 100,000 GPU-hour run provides only initial evidence rather than comprehensive proof of generality across recipes. This run was performed using the ScaleRL recipe and was predicted in advance from parameters fitted exclusively on smaller-scale data; the close agreement between the extrapolated curve and observed performance supports that the sigmoidal form and asymptote remain stable at this scale. We acknowledge the limitation that additional independent large-scale runs for other recipes would be needed for stronger claims. In the revised manuscript we will update the abstract and validation section to explicitly frame this as a single successful extrapolation test, add a limitations paragraph discussing the need for future multi-recipe validation at scale, and clarify that the efficiency-vs-asymptote distinction is evidenced by the smaller-scale ablations where asymptotes were consistently unchanged. We cannot conduct further 100k-scale runs within the current study. revision: partial

  2. Referee: [Ablation and Curve-Fitting Sections] Ablation and curve-fitting sections: the paper reports that design choices modulate efficiency but not asymptote, yet provides no details on the exact sigmoidal functional form, fitting procedure, error bars, number of independent runs per data point, or whether any extrapolations were performed on strictly held-out data. Without these, the robustness of the predictability claim cannot be assessed.

    Authors: We fully agree that these methodological details are required to evaluate the robustness of the predictability claims. The revised manuscript will include a dedicated methods subsection (and appendix) specifying: the exact sigmoidal form used (P(C) = A / (1 + exp(-k*(log10(C) - x0)))), the nonlinear least-squares fitting procedure, error bars computed as standard deviation across independent random seeds, the number of runs per compute level (typically 3-5), and explicit confirmation that the 100,000 GPU-hour extrapolation used parameters fitted only on data up to approximately 10,000 GPU-hours, treating the large run as held-out validation. These additions will directly address the concern and improve reproducibility. revision: yes

standing simulated objections not resolved
  • Conducting multiple independent 100,000 GPU-hour RL training runs across different recipes to further validate generality of the scaling trajectories, as this would require compute resources substantially exceeding the 400,000 GPU-hours already expended.

Circularity Check

0 steps flagged

No significant circularity in empirical scaling study

full rationale

The paper reports an empirical investigation that fits sigmoidal curves to observed compute-performance data from multiple RL runs and validates extrapolation on one held-out 100,000 GPU-hour experiment. No closed-form derivation, uniqueness theorem, or self-citation chain is invoked that would reduce the central claims to tautology or fitted inputs by construction. The scaling trajectories are presented as observed patterns supported by direct large-scale measurement rather than as predictions forced by the fitting procedure itself. This constitutes a standard data-driven methodology whose validity rests on the external large-scale run rather than on any self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on fitting sigmoidal curves to empirical RL performance data; the main added elements are the fitted parameters of those curves and the assumption that the functional form remains valid across scales.

free parameters (1)
  • sigmoidal curve parameters (asymptote, scale, midpoint)
    Fitted separately for each recipe to the observed compute-performance data points.
axioms (1)
  • domain assumption: Performance versus compute follows a sigmoidal functional form
    Invoked to enable fitting and extrapolation; justified by observed data shapes in the study.
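For reference, the sigmoidal form quoted in the simulated rebuttal can be written out with its asymptote made explicit (the paper's actual parameterization may differ):

```latex
P(C) = \frac{A}{1 + e^{-k\left(\log_{10} C - x_0\right)}},
\qquad
\lim_{C \to \infty} P(C) = A
```

Here \(A\) is the asymptotic performance, \(x_0\) the midpoint in \(\log_{10}\) GPU-hours (where \(P = A/2\)), and \(k\) the steepness; on this reading, design choices that shift \(x_0\) or \(k\) change compute efficiency, while only shifts in \(A\) move the performance ceiling.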

pith-pipeline@v0.9.0 · 5566 in / 1326 out tokens · 46602 ms · 2026-05-16T16:24:18.978810+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  2. TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...

  3. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  4. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  5. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  6. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  7. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  8. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  9. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

  10. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  11. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  12. Scaling Self-Play with Self-Guidance

    cs.LG 2026-04 unverdicted novelty 6.0

    SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

  13. Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

    cs.LG 2026-04 unverdicted novelty 6.0

    Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...

  14. Target Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.

  15. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

  16. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  17. Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

    cs.AI 2026-04 unverdicted novelty 5.0

    Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.

  18. Beyond Distribution Sharpening: The Importance of Task Rewards

    cs.LG 2026-04 unverdicted novelty 5.0

    Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.

  19. Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

    cs.LG 2026-02 unverdicted novelty 5.0

    A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.

  20. Continued AI Scaling Requires Repeated Efficiency Doublings

    cs.LG 2026-03 unverdicted novelty 3.0

    Continued AI scaling remains feasible only if efficiency doublings recur repeatedly to keep logical compute affordable.
