Recognition: 3 theorem links
The Art of Scaling Reinforcement Learning Compute for LLMs
Pith reviewed 2026-05-16 16:24 UTC · model grok-4.3
The pith
RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe that not all recipes yield similar asymptotic performance, that details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and that stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. We propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours.
What carries the argument
Sigmoidal compute-performance curves fitted to RL training data, combined with systematic ablations of design choices, to separate effects on the performance asymptote from effects on compute efficiency.
If this is right
- Not every RL recipe reaches the same final performance level no matter how much compute is used.
- Many implementation details affect only the compute required to approach the performance limit.
- Predictive curves from small runs allow testing new ideas without full-scale experiments.
- The ScaleRL recipe offers a stable baseline for large RL training runs.
- Validation performance predictions can guide decisions on whether to continue scaling a given setup.
Where Pith is reading between the lines
- This approach could let teams test dozens of algorithmic variants at small scale and only fully scale the most promising ones.
- If the pattern holds for other post-training techniques, similar curves might organize supervised fine-tuning and preference tuning as well.
- Early detection of poor scaling could save substantial compute by abandoning inefficient recipes sooner.
- The framework opens the door to automated search over RL hyperparameters guided by predicted scaling efficiency.
Load-bearing premise
The sigmoidal functional form fitted on smaller runs will continue to accurately describe performance when compute is scaled up by a factor of ten or more.
What would settle it
Fit a sigmoid to performance data from runs using less than 10,000 GPU-hours for a stable recipe, then run the same recipe at 100,000 GPU-hours and check whether the actual result matches the extrapolated prediction within a small error margin.
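As a concrete illustration of this test, here is a minimal sketch that fits the logistic-in-log-compute form quoted later in the simulated rebuttal to small-scale points and extrapolates it; the compute levels, pass rates, and initial guesses are placeholder values, not data from the paper.

```python
# Hypothetical extrapolation check: fit the sigmoid on sub-10k GPU-hour runs,
# then compare its prediction at 100k GPU-hours against the actual large run.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(compute_gpu_hours, A, k, x0):
    """Logistic in log10(compute): A is the asymptote, k the slope, x0 the midpoint."""
    return A / (1.0 + np.exp(-k * (np.log10(compute_gpu_hours) - x0)))

# Placeholder measurements from runs below 10,000 GPU-hours (substitute real eval scores).
compute = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0])
pass_rate = np.array([0.17, 0.24, 0.33, 0.41, 0.49, 0.54])

# Nonlinear least-squares fit using only the small-scale points.
popt, _ = curve_fit(sigmoid, compute, pass_rate, p0=[0.7, 1.5, 3.0], maxfev=10000)
A, k, x0 = popt

predicted_at_100k = sigmoid(1e5, A, k, x0)
print(f"fitted asymptote A = {A:.3f}, predicted pass rate at 100k GPU-hours = {predicted_at_100k:.3f}")

# The premise holds if the measured 100,000 GPU-hour result lands within a
# pre-registered error margin of predicted_at_100k.
```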
Original abstract
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a large-scale empirical study (>400,000 GPU-hours) of RL training for LLMs. It fits sigmoidal compute-performance curves, ablates design choices (loss aggregation, normalization, curriculum, off-policy algorithms), concludes that many choices affect only compute efficiency and not asymptotic performance, proposes a ScaleRL recipe, and validates the approach by successfully extrapolating from smaller runs to predict performance on one 100,000 GPU-hour RL training run.
Significance. If the central claims hold, the work would establish the first systematic, predictive framework for RL scaling in LLMs, analogous to established pre-training scaling laws. The study size, the distinction between efficiency and asymptotic effects, and the large-scale validation run are concrete strengths that could guide more efficient compute allocation and recipe design.
major comments (2)
- [Abstract / Validation Experiment] Abstract and validation section: the extrapolation claim rests on a single successful 100,000 GPU-hour run. Because the sigmoidal parameters and the conclusion that ablated factors do not shift the asymptote were fitted on smaller-scale data, evidence that the same asymptotes and functional form continue to hold at an order-of-magnitude larger scale for multiple independent recipes is required to support the generality of the scaling trajectories.
- [Ablation and Curve-Fitting Sections] Ablation and curve-fitting sections: the paper reports that design choices modulate efficiency but not asymptote, yet provides no details on the exact sigmoidal functional form, fitting procedure, error bars, number of independent runs per data point, or whether any extrapolations were performed on strictly held-out data. Without these, the robustness of the predictability claim cannot be assessed.
minor comments (1)
- [Abstract] The abstract would benefit from a concise statement of the precise sigmoidal equation and the criteria used to declare a recipe 'stable and scalable.'
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript while being transparent about the scope of our study.
Point-by-point responses
- Referee: [Abstract / Validation Experiment] Abstract and validation section: the extrapolation claim rests on a single successful 100,000 GPU-hour run. Because the sigmoidal parameters and the conclusion that ablated factors do not shift the asymptote were fitted on smaller-scale data, evidence that the same asymptotes and functional form continue to hold at an order-of-magnitude larger scale for multiple independent recipes is required to support the generality of the scaling trajectories.
Authors: We agree that validation on a single 100,000 GPU-hour run provides only initial evidence rather than comprehensive proof of generality across recipes. This run was performed using the ScaleRL recipe and was predicted in advance from parameters fitted exclusively on smaller-scale data; the close agreement between the extrapolated curve and observed performance supports that the sigmoidal form and asymptote remain stable at this scale. We acknowledge the limitation that additional independent large-scale runs for other recipes would be needed for stronger claims. In the revised manuscript we will update the abstract and validation section to explicitly frame this as a single successful extrapolation test, add a limitations paragraph discussing the need for future multi-recipe validation at scale, and clarify that the efficiency-vs-asymptote distinction is evidenced by the smaller-scale ablations where asymptotes were consistently unchanged. We cannot conduct further 100k-scale runs within the current study. revision: partial
- Referee: [Ablation and Curve-Fitting Sections] Ablation and curve-fitting sections: the paper reports that design choices modulate efficiency but not asymptote, yet provides no details on the exact sigmoidal functional form, fitting procedure, error bars, number of independent runs per data point, or whether any extrapolations were performed on strictly held-out data. Without these, the robustness of the predictability claim cannot be assessed.
Authors: We fully agree that these methodological details are required to evaluate the robustness of the predictability claims. The revised manuscript will include a dedicated methods subsection (and appendix) specifying: the exact sigmoidal form used (P(C) = A / (1 + exp(-k*(log10(C) - x0)))), the nonlinear least-squares fitting procedure, error bars computed as standard deviation across independent random seeds, the number of runs per compute level (typically 3-5), and explicit confirmation that the 100,000 GPU-hour extrapolation used parameters fitted only on data up to approximately 10,000 GPU-hours, treating the large run as held-out validation. These additions will directly address the concern and improve reproducibility. revision: yes
- Declined follow-up experiment: conducting multiple independent 100,000 GPU-hour RL training runs across different recipes to further validate the generality of the scaling trajectories, as this would require compute resources substantially exceeding the 400,000 GPU-hours already expended.
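A small sketch of the aggregation and weighted fit described in the response above, under the same assumptions (3-5 seeds per compute level, standard deviation as the error bar); the scores below are placeholders, not measurements from the paper.

```python
# Hypothetical per-compute-level aggregation: mean and standard deviation over seeds,
# then a nonlinear least-squares fit weighted by those error bars.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(compute_gpu_hours, A, k, x0):
    return A / (1.0 + np.exp(-k * (np.log10(compute_gpu_hours) - x0)))

# compute level (GPU-hours) -> per-seed validation scores (placeholder values).
runs_by_compute = {
    1000.0: [0.31, 0.33, 0.34],
    2000.0: [0.40, 0.42, 0.41, 0.43],
    4000.0: [0.48, 0.47, 0.50],
    8000.0: [0.54, 0.55, 0.53],
}
levels = sorted(runs_by_compute)
compute = np.array(levels)
mean = np.array([np.mean(runs_by_compute[c]) for c in levels])
std = np.array([np.std(runs_by_compute[c], ddof=1) for c in levels])

# sigma passes the per-point error bars into the fit; absolute_sigma keeps them in score
# units, so the parameter covariance reflects the seed-to-seed spread.
popt, pcov = curve_fit(sigmoid, compute, mean, p0=[0.65, 1.8, 3.0],
                       sigma=std, absolute_sigma=True, maxfev=10000)
asymptote, asymptote_err = popt[0], np.sqrt(np.diag(pcov))[0]
print(f"asymptote = {asymptote:.3f} +/- {asymptote_err:.3f}")
```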
Circularity Check
No significant circularity in empirical scaling study
Full rationale
The paper reports an empirical investigation that fits sigmoidal curves to observed compute-performance data from multiple RL runs and validates extrapolation on one held-out 100,000 GPU-hour experiment. No closed-form derivation, uniqueness theorem, or self-citation chain is invoked that would reduce the central claims to tautology or fitted inputs by construction. The scaling trajectories are presented as observed patterns supported by direct large-scale measurement rather than as predictions forced by the fitting procedure itself. This constitutes a standard data-driven methodology whose validity rests on the external large-scale run rather than on any self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- sigmoidal curve parameters (asymptote, scale, midpoint)
axioms (1)
- domain assumption: Performance versus compute follows a sigmoidal functional form.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency."
- IndisputableMonolith.Foundation.PhiForcing.phi_equation (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs."
- IndisputableMonolith.Foundation.LedgerForcing.conservation_from_balance (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
  TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
  Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- Cost-Aware Learning
  Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
- Scaling Self-Play with Self-Guidance
  SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
- Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
  Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...
- Target Policy Optimization
  TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
  Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
  Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
  Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
- Beyond Distribution Sharpening: The Importance of Task Rewards
  Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
  A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.
- Continued AI Scaling Requires Repeated Efficiency Doublings
  Continued AI scaling remains feasible only if efficiency doublings recur repeatedly to keep logical compute affordable.
Reference graph
Works this paper leans on
- [1] AoPS. AIME problem set 1983-2025, 2025. URL: https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions.
- [2] Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, et al. CWM: An open-weights LLM for research on code generation with world models. arXiv preprint arXiv:2510.02387.
- [3] GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Ch...
- [4] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081): 633-638, 2025.
- [5] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [6] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [7] Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: An efficient RLHF algorithm with robustness to both prompt and reward models, 2025. URL: https://arxiv.org/abs/2501.03262.
- [8] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [9] Kimi Team: Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [10] Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models, 2025. URL: https://arxiv.org/abs/2505.24864.
- [11] Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, et al. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, 2017.
- [12] MiniMax-M1: Scaling test-time compute efficiently with lightning attention. URL: https://arxiv.org/abs/2506.13585.
- [13] Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous RLHF: Faster and more efficient off-policy RL for language models. arXiv preprint arXiv:2410.18252, 2024.
- [14] OpenAI. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [15] David Owen. How predictable is language model benchmark performance? arXiv preprint arXiv:2401.04757, 2024.
- [16] Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral. arXiv preprint arXiv:2506.10910, 2025.
- [17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [18] ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, et al. Seed1.5-Thinking: Advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914, 2025.
- [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [20] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- [21] Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte Carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451, 2024.
- [22] Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient RL framework secretly brings you off-policy RL training. URL: https://fengyao.notion.site/off-policy-rl. Accessed through a social media reference.
- [23] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [24] Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What's behind PPO's collapse in long-CoT? Value optimization holds the secret. arXiv preprint arXiv:2503.01491, 2025.
- [25] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, Tiantian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
- [26] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction, 2025. URL: https://arxiv.org/abs/2408.15240.
- [27] ...or Monte Carlo Tree Search (MCTS) (Xie et al., 2024). The earliest widely referenced RLVR (verifiable-reward) algorithm underlying this wave of reasoning development is Group Relative Policy Optimization (GRPO), introduced in Shao et al. (2024). GRPO is a critic-free, group-relative policy gradient with PPO-style clipping that replaces a learned value b...
- [28] ...adapts PPO (Schulman et al., 2017) for LLM fine-tuning with verifiable rewards. For a given prompt x, the old policy π_gen(θ_old) generates G candidate completions {y_i}_{i=1..G}, each assigned a scalar reward r_i. To emphasize relative quality within the group, rewards are normalized as Â_i = (r_i − mean({r_j}_{j=1..G})) / (std({r_j}_{j=1..G}) + ε) (eq. 5). Each completion y_i of length |y... (a sketch of this normalization, together with the clipping in [29], follows the reference list)
- [29] ...extends GRPO with two key modifications. First, it replaces symmetric clipping with asymmetric clipping, using distinct thresholds for upward and downward deviations: clip_asym(ρ, a) = clip(ρ, 1 − ε−, 1 + ε+), where ε− and ε+ are hyper-parameters. Second, DAPO changes the aggregation scheme to operate at the prompt level. For a given prompt x ∼ D, the old polic...
- [30] ...with ε = 1e-15, weight decay of 0.01 (default in AdamW), and a linear warmup of 100 steps. The lower ε is to avoid gradient clipping (epsilon underflow) (Wortsman et al., 2023). We use automated checkers like SymPy (Meurer et al., ...
- [31] ...or Math-Verify for assessing the correctness of the final answer for math problems after stripping out the thinking trace (<think>...</think>). We use a custom code execution environment for coding problems involving unit tests and desired outputs. We used 80 Nvidia GB200 GPUs for a single run, with a compute budget ranging from 3.5-4K GPU-hours for es...
- [32] A.10 Controlling generation length: One common concern in reasoning RL is to control exploding generation lengths, which harms both training efficiency and stability (Appendix A.15). We consider two approaches: (a) interruptions, used in works like GLM-4.1V (GLM-V Team et al., 2025) and Qwen3 (Yang et al., ...
- [33] ...and (b) length penalties, used in works like DAPO (Yu et al., 2025), Kimi (Kimi Team et al., 2025b), Magistral (Rastogi et al., 2025), and MiniMax-M1 (MiniMax et al., 2025). Interruptions forcibly stop generation by appending a marker phrase such as "Okay, time is up. Let me stop thinking and formulate a final answer</think>", signaling the model to termina...
- [34] A.14 Downstream performance: In Figures 1, 9, 10b, and 18, we report a representative set of downstream evaluation curves. These include ScaleRL runs with batch sizes {512, 768, 2048}, a long-context training run with 32k generation length, the large-model (Scout) training run, a multi-task run (math + code), and different numbers of generations per prompt (with...
- [35] ...performance on math+code run, (c) AIME-24 performance on math+code run. Overall, we suggest practitioners monitor truncation rates closely. Our findings indicate that high truncation rates are a reliable warning signal of instability, while larger models, higher generation budgets, and careful design choices (as in ScaleRL) substantially mitigate this risk...
- [36] ...was regarding dynamically filling in the batch. Specifically, DAPO drops zero-variance prompts and samples more prompts until the batch is full. In our codebase, this was not efficient because for the PPO-off-policy algorithm, we had generators pre-decide that each generator will generate rollouts for #prompts/#generators. Therefore, if a specific generator had more...
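The excerpts at [28] and [29] spell out GRPO's group-relative advantage normalization and DAPO's asymmetric clipping. Below is a minimal sketch of both operations on toy numbers; the function names, the ε thresholds, and the example rewards and ratios are illustrative assumptions, not values or code from the paper.

```python
# Minimal sketch of GRPO group-relative advantages (eq. (5) in excerpt [28])
# and DAPO-style asymmetric clipping (excerpt [29]). Toy values only.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std, as in GRPO."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clip_asym(ratio, eps_low=0.2, eps_high=0.28):
    """Clip the importance ratio with distinct lower and upper thresholds, as in DAPO."""
    return np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)

# One prompt, G = 4 sampled completions with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Toy per-completion importance ratios pi_theta / pi_old.
ratios = np.array([1.5, 0.6, 1.1, 0.9])

# PPO-style pessimistic surrogate: take the smaller of the unclipped and clipped terms.
surrogate = np.minimum(ratios * advantages, clip_asym(ratios) * advantages)
print(advantages)   # roughly [ 1. -1. -1.  1.] up to the eps term
print(surrogate)
```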
discussion (0)