arxiv: 2605.17570 · v1 · pith:YJIDXWRMnew · submitted 2026-05-17 · 💻 cs.LG · cs.CL

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

Pith reviewed 2026-05-20 13:34 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords GRPO · Mu-GRPO · off-policy RL · LLM reinforcement learning · rollout staleness · math reasoning · efficient training · verifiable rewards

The pith

Mu-GRPO lets GRPO-style training tolerate much staler rollout data from large sequential stages, matching standard performance while cutting wall-clock time by roughly half.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how far off-policy GRPO-style algorithms can be pushed in large language model reinforcement learning with verifiable rewards. It introduces Mu-GRPO, which restructures training into a few large sequential generation-optimization stages instead of frequent switching. This raises rollout staleness but cuts system overhead. To keep learning stable, the method applies relaxed clipping to retain useful gradients from old data and a negative-advantage veto to block harmful suffix updates. Experiments across five models and multiple math reasoning benchmarks show the approach matches or exceeds standard GRPO while delivering around 2x wall-clock speedup.

Core claim

GRPO-style algorithms can operate effectively under substantially higher rollout staleness than the low-staleness regime typically used. Mu-GRPO achieves this by scheduling training into a small number of large sequential stages that separate generation and optimization, then stabilizes the process with relaxed clipping that keeps stale gradients and negative-advantage veto that discards destabilizing updates on negative-advantage responses.
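To make the clipping trade-off concrete, here is a minimal sketch of the PPO-style per-token clipped surrogate that GRPO builds on, with the clipping interval as a parameter. The intervals [0.8, 1.2] and [0, 5] are the ones named in the figure captions reproduced on this page; the token data and the helper names (`clipped_surrogate`, `has_gradient`, `kept_fraction`) are hypothetical, and this is not the authors' implementation.

```python
def clipped_surrogate(ratio, advantage, lo, hi):
    """PPO-style per-token objective: min(r * A, clip(r, lo, hi) * A)."""
    clipped = min(max(ratio, lo), hi)
    return min(ratio * advantage, clipped * advantage)


def has_gradient(ratio, advantage, lo, hi):
    """True if the token still contributes a policy gradient.

    The gradient vanishes when the clipped branch is selected:
    A > 0 with r above hi, or A < 0 with r below lo.
    """
    if advantage > 0:
        return ratio <= hi
    if advantage < 0:
        return ratio >= lo
    return False


def kept_fraction(tokens, lo, hi):
    """Fraction of (ratio, advantage) pairs that keep a gradient."""
    kept = sum(has_gradient(r, a, lo, hi) for r, a in tokens)
    return kept / len(tokens)


# Hypothetical importance ratios, spread out as one might expect under stale rollouts.
tokens = [(0.5, -1.0), (0.9, 1.0), (1.1, -1.0), (1.6, 1.0), (3.0, 1.0), (0.1, -1.0)]

standard = kept_fraction(tokens, 0.8, 1.2)  # standard interval
relaxed = kept_fraction(tokens, 0.0, 5.0)   # relaxed interval
print(f"standard keeps {standard:.2f}, relaxed keeps {relaxed:.2f}")
```

On this toy data the standard interval discards most tokens while the relaxed one keeps all of them, which is the sense in which relaxed clipping "keeps stale gradients". It also shows why a separate safeguard is needed: with a lower bound of 0, negative-advantage tokens are never clipped.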

What carries the argument

Mu-GRPO framework with its four-stage sequential schedule, relaxed clipping, and negative-advantage veto that together enable high-staleness rollouts while preserving optimization stability.

If this is right

  • Wall-clock training time drops by a factor of about two across tested models and benchmarks.
  • The same performance level is reached without needing frequent rollout-optimization switches.
  • Stale rollouts can supply the majority of training data without collapsing learning.
  • The approach applies directly to existing GRPO pipelines on math reasoning tasks.
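The reduction in rollout-optimization switches is simple arithmetic. The sketch below uses the numbers from the Figure 4 caption reproduced on this page (4096 model updates, µ = 4 for standard GRPO, µ = 1024 for µ-GRPO). The function names are hypothetical, and the staleness count assumes staleness is measured in optimizer updates since the rollout policy was frozen, which is an assumption rather than a definition taken from the paper.

```python
def rollout_refreshes(total_updates, mu):
    """Number of generation phases when rollouts are refreshed every `mu` updates."""
    if total_updates % mu != 0:
        raise ValueError("total_updates must be a multiple of mu")
    return total_updates // mu


def max_staleness(mu):
    """Updates already applied when the last batch of a stage is consumed (assumed definition)."""
    return mu - 1


total_updates = 4096
for name, mu in [("GRPO", 4), ("mu-GRPO", 1024)]:
    print(name, "refreshes:", rollout_refreshes(total_updates, mu),
          "max staleness:", max_staleness(mu))
```

Each refresh implies a weight synchronization between the trainer and the generation engine, so going from 1024 refreshes to 4 removes most of that fixed cost while raising worst-case staleness by the same factor.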

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar stage-based scheduling could reduce overhead in other on-policy RL methods that currently require tight synchronization.
  • The tolerance for staleness might allow larger effective batch sizes or longer training horizons in compute-limited environments.
  • If the stabilization techniques generalize, they could support training on even older data collected from previous model versions.

Load-bearing premise

Relaxed clipping plus negative-advantage veto will keep optimization stable and unbiased even when all rollout data comes from the high-staleness regime of the four-stage schedule.
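One way to picture the veto is as a per-response loss mask. The sketch below is only an illustration: this page does not state how the paper defines a trigger token, so the rule used here (the first token whose importance ratio exceeds a threshold `kappa`) and the name `nav_mask` are placeholders, not the authors' method. What it does reflect from the source is the scope of the veto: only negative-advantage responses are affected, and only the suffix after the trigger is masked.

```python
def nav_mask(ratios, advantage, kappa):
    """Return a 0/1 loss mask for one response.

    Assumed rule (hypothetical): in a negative-advantage response, the first
    token with importance ratio > kappa is the trigger, and every token after
    it is masked out. Positive-advantage responses are left untouched.
    """
    mask = [1] * len(ratios)
    if advantage >= 0:
        return mask
    for t, r in enumerate(ratios):
        if r > kappa:
            # Veto the post-trigger suffix; the trigger token itself is kept.
            for s in range(t + 1, len(ratios)):
                mask[s] = 0
            break
    return mask


ratios = [1.0, 1.1, 6.0, 0.9, 1.3]  # hypothetical per-token importance ratios
print(nav_mask(ratios, advantage=-1.0, kappa=5.0))
print(nav_mask(ratios, advantage=+1.0, kappa=5.0))
```

The mask would multiply the per-token loss before averaging, so vetoed tokens contribute no gradient. Whether such a mask biases the update is exactly the question the load-bearing premise raises.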

What would settle it

A controlled run on the same math benchmarks where rollout staleness is increased to the Mu-GRPO level but without the relaxed clipping and veto, showing clear performance drop or training divergence.

Figures

Figures reproduced from arXiv: 2605.17570 by Chen Wei, Minghao Tian, Yunfei Xie.

Figure 1
Figure 1: Comparing GRPO and µ-GRPO. Average accuracy across five math benchmarks over wall-clock time. µ-GRPO uses four large rollout–optimization stages, inducing high rollout staleness while reducing rollout–training switching overhead. It reaches GRPO’s performance with a 2.2× wall-clock speedup on DeepSeek-7B [22]. Both methods are trained for 4096 updates. In this paper, we show that GRPO-style algorithms ca…
Figure 2
Figure 2: Clipping creates a high-staleness dilemma. At µ = 1024, [0.8, 1.2] bound clips many more tokens and plateaus at lower accuracy (blue), while relaxed clipping recovers early gains but later collapses (yellow). Results are on Qwen2.5-Math-7B. Under low staleness, the behavior policy β and current policy πθ remain close, so the standard [0.8, 1.2] interval clips only outlying updates. Under high staleness,…
Figure 3
Figure 3: Localizing the harmful updates. Under relaxed clipping [0, 5], masking only trigger tokens T still collapses, while masking the non-trigger suffix Hκ or broader scopes keeps both accuracy and negative-advantage importance ratios stable. Results are on Qwen2.5-Math-7B. … useful gradient signal. However, this signal is not continuously safe to use: after the initial improvement phase, the relaxed-clipping run…
Figure 4
Figure 4: System-level execution of standard GRPO and µ-GRPO. Over 4096 model updates, standard GRPO with µ = 4 requires 1024 rollout refreshes and weight synchronizations, while µ-GRPO with µ = 1024 requires only four. Large-stage rollout generation reduces synchronization overhead and avoids GPU memory contention during generation.
[Plot residue: Average Reward (%) versus Model Updates, by stage]
Figure 5
Figure 5: Training-set reward dynamics on DeepSeek-7B. Bold lines are smoothed. Multi-Stage µ-GRPO. Both standard GRPO and µ-GRPO can be viewed as repeated generation–optimization cycles, as illustrated in …
Figure 1
Figure 1: For GRPO, we set ϵ = 0.2 and omit the KL loss following prior work [31, 34]. For M2PO, we use the recommended second-moment threshold of 0.04. We use a fully aligned wall-clock measurement protocol on identical hardware configurations with 4×H200 GPUs, including both rollout generation and policy optimization. For evaluation, we use five widely adopted math reasoning benchmarks: AMC23/24 [1], AIME24/25 [2]…
Figure 6
Figure 6: NAV-vetoed token fraction during µ-GRPO training on DeepSeek-7B. To complement the averages in …
Original abstract

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
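For readers new to GRPO, the "group relative" part refers to how advantages are computed: several responses are sampled for the same prompt and each response's reward is normalized against the group. The sketch below follows the commonly used mean and standard deviation normalization; the use of the population standard deviation and the `eps` stabilizer are implementation assumptions that vary between codebases, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical verifiable rewards (1 = correct answer, 0 = incorrect) for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A group where every response gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Responses below the group mean receive negative advantages; these are the responses the negative-advantage veto described in the abstract acts on.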

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Mu-GRPO, a GRPO-style RL framework for LLMs that organizes training into a small number (e.g. four) of large sequential generation-optimization stages. This induces high rollout staleness to reduce switching overhead. Stabilization is achieved via relaxed clipping (to preserve useful stale gradients) and negative-advantage veto (to remove destabilizing post-trigger suffix updates). The authors report that Mu-GRPO matches or exceeds standard GRPO on multiple math reasoning benchmarks across five language models while delivering approximately 2x wall-clock speedup.

Significance. If the stabilization techniques prove reliable, the work would establish that GRPO can operate effectively under substantially higher staleness than previously assumed, yielding a meaningfully better performance-efficiency trade-off for RLVR. The multi-model, multi-benchmark evaluation is a positive feature. However, the absence of ablations, diagnostics, and statistical reporting on the key stabilization components limits the strength of the central claim.

major comments (3)
  1. [Methods] Methods section (description of Mu-GRPO and the four-stage schedule): the claim that relaxed clipping together with negative-advantage veto reliably stabilizes optimization under high-staleness rollouts is load-bearing, yet the manuscript provides no ablations isolating each component, no gradient-norm or policy-divergence measurements, and no statistics on advantage distributions or rollout staleness to confirm the techniques control off-policyness effects without systematic bias.
  2. [Experiments] Experiments / Results tables: benchmark scores are reported as matching or exceeding GRPO without error bars, confidence intervals, or statistical significance tests; exact hyper-parameter tables are also absent. This makes it impossible to assess whether the reported 2x speedup and performance parity are robust or could be affected by post-hoc benchmark selection.
  3. [Methods] Stabilization subsection: the negative-advantage veto is asserted to remove only destabilizing post-trigger suffix updates, but without direct measurements of how the veto interacts with the sequential schedule or any sensitivity analysis, the risk of introducing bias or missing instability in the high-staleness regime remains unaddressed.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'post-trigger suffix updates' is used without a brief definition or illustrative example; adding one sentence of clarification would improve accessibility.
  2. [Experiments] Figure or table captions: ensure all reported speedups explicitly state the baseline (standard GRPO with what batching/parallelism) to allow direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below. We agree that additional ablations, statistical reporting, and diagnostics will strengthen the manuscript and will incorporate them in the revision.

Point-by-point responses
  1. Referee: [Methods] Methods section (description of Mu-GRPO and the four-stage schedule): the claim that relaxed clipping together with negative-advantage veto reliably stabilizes optimization under high-staleness rollouts is load-bearing, yet the manuscript provides no ablations isolating each component, no gradient-norm or policy-divergence measurements, and no statistics on advantage distributions or rollout staleness to confirm the techniques control off-policyness effects without systematic bias.

    Authors: We acknowledge that isolating the individual contributions of relaxed clipping and negative-advantage veto through dedicated ablations would provide stronger support for the stabilization claim. In the revised manuscript we will add an ablation study that disables each component in turn while keeping the four-stage schedule fixed, and report the resulting training curves, final benchmark scores, and stability indicators. We will also include plots of gradient norms and approximate policy divergence (KL) across training steps, together with histograms and summary statistics of advantage values and measured rollout staleness (token-age distribution) for both Mu-GRPO and the baseline. These additions will directly address whether the techniques control off-policy effects without introducing systematic bias. revision: yes

  2. Referee: [Experiments] Experiments / Results tables: benchmark scores are reported as matching or exceeding GRPO without error bars, confidence intervals, or statistical significance tests; exact hyper-parameter tables are also absent. This makes it impossible to assess whether the reported 2x speedup and performance parity are robust or could be affected by post-hoc benchmark selection.

    Authors: We agree that the current presentation lacks the statistical detail needed to evaluate robustness. In the revision we will rerun the main experiments with at least three independent seeds per model-benchmark pair, add error bars and 95% confidence intervals to all tables, and include paired statistical significance tests (e.g., Wilcoxon or t-tests) between Mu-GRPO and GRPO. A complete hyper-parameter table listing all generation, optimization, and scheduling values will be placed in the appendix. We will also explicitly state that the five models and math-reasoning benchmarks were selected prior to experimentation following the protocol used in prior RLVR literature, thereby ruling out post-hoc selection. revision: yes

  3. Referee: [Methods] Stabilization subsection: the negative-advantage veto is asserted to remove only destabilizing post-trigger suffix updates, but without direct measurements of how the veto interacts with the sequential schedule or any sensitivity analysis, the risk of introducing bias or missing instability in the high-staleness regime remains unaddressed.

    Authors: We appreciate the referee's emphasis on direct validation of the veto mechanism. In the revised version we will add a dedicated analysis subsection that reports (i) the fraction of tokens vetoed per stage as a function of the sequential schedule, (ii) a sensitivity sweep over the veto threshold showing its effect on both final performance and training stability metrics, and (iii) a comparison of advantage distributions before and after veto application. These measurements will quantify how the veto interacts with the staged schedule and will allow readers to assess any residual risk of bias or undetected instability. revision: yes
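The statistical reporting promised in the second response can be sketched with the standard library alone. The example below computes a 95% confidence interval over three seeds and a paired t statistic between two methods. All scores are made-up placeholders, not results from the paper; the critical value 4.303 is the two-sided 95% Student-t quantile for 2 degrees of freedom, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(values, t_crit):
    """Two-sided confidence interval for the mean using a supplied t critical value."""
    centre = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width


def paired_t_statistic(a, b):
    """t statistic for paired samples (same seeds or benchmarks under two methods)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Placeholder accuracies for three seeds; NOT taken from the paper.
method_a = [50.0, 52.0, 54.0]
method_b = [49.0, 50.0, 53.0]

T_CRIT_DF2 = 4.303  # two-sided 95% quantile of Student's t with 2 degrees of freedom
low, high = mean_ci(method_a, T_CRIT_DF2)
t_stat = paired_t_statistic(method_a, method_b)
print(f"95% CI: ({low:.2f}, {high:.2f}), paired t = {t_stat:.2f}")
```

With only three seeds the interval is wide, which is worth keeping in mind when reading claims of parity between two training methods.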

Circularity Check

0 steps flagged

No circularity: empirical validation on external benchmarks with independent scoring

full rationale

The paper proposes Mu-GRPO as an algorithmic organization into sequential generation-optimization stages combined with relaxed clipping and negative-advantage veto to tolerate higher rollout staleness. No equations or derivations are presented that reduce the reported performance or speedup claims to quantities defined by fitted constants, self-referential definitions, or prior self-citations within the paper. Results are measured on external math reasoning benchmarks whose evaluation is independent of the training procedure and fitted values. The central claim rests on end-to-end empirical matching rather than any load-bearing step that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about how clipping and veto interact with stale rollouts; the only explicit free parameter mentioned is the small number of stages (example value four).

free parameters (1)
  • number of sequential stages
    Example value of four is given to induce the desired high-staleness regime while limiting rollout-optimization switches.
axioms (2)
  • domain assumption: Relaxed clipping preserves useful gradients from stale rollouts
    Invoked to keep learning stable when data is generated many steps earlier.
  • domain assumption: Negative-advantage veto removes destabilizing suffix updates
    Applied to responses whose advantage becomes negative after a trigger point.

pith-pipeline@v0.9.0 · 5733 in / 1431 out tokens · 76649 ms · 2026-05-20T13:34:18.945425+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 11 internal anchors

  1. [1]

    AMC problems and solutions, 2024

    Art of Problem Solving. AMC problems and solutions, 2024

  2. [2]

    AIME problems and solutions, 2025

    Art of Problem Solving. AIME problems and solutions, 2025

  3. [3]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv:2506.13585, 2025

  4. [4]

    AReaL: A large-scale asynchronous reinforcement learning system for language reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, et al. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. Advances in Neural Information Processing Systems, 38:36256–36282, 2026

  5. [5]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 2025

  6. [6]

    AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training

    Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training. arXiv:2507.01663, 2025

  7. [7]

    History rhymes: Accelerating LLM reinforcement learning with RhymeRL

    Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating LLM reinforcement learning with RhymeRL. arXiv:2508.18588, 2025

  8. [8]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021

  9. [9]

    Open R1: A fully open reproduction of deepseek-r1, January 2025

    Hugging Face. Open R1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

  10. [10]

    Qwen2.5-coder technical report, 2024

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report, 2024

  11. [11]

    Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

  12. [12]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  13. [13]

    Competition-level code generation with alphacode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

  14. [14]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, volume 2024, pages 39578–39601, 2024

  15. [15]

    When speed kills stability: Demystifying RL collapse from the training-inference mismatch, September 2025

    Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Zhuo Jiang. When speed kills stability: Demystifying RL collapse from the training-inference mismatch, September 2025. URL https://richardli.xyz/rl-collapse

  16. [16]

    The Llama 3 Herd of Models

    Llama 3 Team. The Llama 3 herd of models. arXiv:2407.21783, 2024

  17. [17]

    Asynchronous RLHF: Faster and more efficient off-policy RL for language models

    Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous RLHF: Faster and more efficient off-policy RL for language models. In International Conference on Learning Representations, volume 2025, pages 4003–4029, 2025

  18. [18]

    OpenAI o1 System Card

    OpenAI. OpenAI o1 system card. arXiv:2412.16720, 2024

  19. [19]

    Defeating the training-inference mismatch via FP16

    Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via FP16. arXiv:2510.26788, 2025

  20. [20]

    Tapered off-policy REINFORCE: Stable and efficient reinforcement learning for LLMs

    Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alexandre Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, and Sam Work. Tapered off-policy REINFORCE: Stable and efficient reinforcement learning for LLMs. arXiv:2503.14286, 2025

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017

  22. [22]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024

  23. [23]

    HybridFlow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In EuroSys, 2025

  24. [24]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053, 2019

  25. [25]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Jiayi Su, Zeyu Chen, et al. KLEAR: Gradient-preserving clipping for efficient policy optimization. arXiv:2506.01939, 2025

  26. [26]

    Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

    Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=uwUkETPIJN

  27. [27]

    Reinforcement learning for reasoning in large language models with one training example

    Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example. In The Thirty-ninth Annual Conference on Neural Information Processing Systems,

  28. [28]

    URL https://openreview.net/forum?id=IBrRNLr6JA

  29. [29]

    BAPO: Stabilizing off-policy reinforcement learning for LLMs via balanced policy optimization with adaptive clipping

    Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Xun Deng, Zhihao Zhang, Honglin Guo, Zhikai Lei, Miao Zheng, Guoteng Wang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, and Xuanjing Huang. BAPO: Stabilizing off-policy reinforcement learning for LLMs via balanced policy optimization with adaptive clipping....

  30. [30]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv:2409.12122, 2024

  31. [31]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388, 2025

  32. [32]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  33. [33]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv:2304.11277, 2023

  34. [34]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv:2507.18071, 2025

  35. [35]

    Prosperity before collapse: How far can off-policy RL reach with stale data on LLMs?

    Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy RL reach with stale data on LLMs? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=IIgl5MWelz

  36. [36]

    Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts

    Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=x5lITYXmW2

  37. [37]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024

  38. [38]

    StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation

    Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation. arXiv:2504.15930, 2025

  39. [39]

    slime: An LLM post-training framework for RL scaling, 2025

    Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. slime: An LLM post-training framework for RL scaling, 2025