Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

Aram Galstyan; Beidi Chen; Haizhong Zheng; Ranajoy Sadhukhan; Sai Muralidhar Jayanthi; Saket Dingliwal; Souvik Kundu; Yang Zhou; Zhaofeng Sun; Zhuoming Chen

arxiv: 2606.08446 · v1 · pith:HNTXAGV5new · submitted 2026-06-07 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 19:09 UTC · model grok-4.3


keywords: sparse rollout · actor-policy mismatch · reinforcement learning · long-context LLMs · RLVR · dynamic sparsity · Qwen3 · DistillSparse


The pith

Keeping the lower tail of per-token actor-policy mismatch above a threshold keeps sparse rollouts stable and yields up to 2.4x rollout speedup in long-context RL for language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that rollout collapse under aggressive sparsity is caused by the worst-case tokens rather than average degradation, and that holding the lower tail of sparse-to-dense mismatch above a fixed level during generation prevents collapse while allowing substantial compute savings. A sympathetic reader would care because RL with verifiable rewards on long chain-of-thought traces is dominated by the cost of generating full trajectories, so any reliable way to sparsify that step multiplies the number of training steps possible per GPU hour. The authors test this by introducing a dynamic sparsity schedule that monitors the tail statistic in real time and adjusts sparsity on the fly, then pair it with a cost model that picks the most aggressive schedule still meeting the threshold. The resulting speedups are measured directly on three sizes of Qwen3 thinking models and shown to hold when the same threshold is reused on a larger model and on a coding task.
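To make the "lower tail" idea concrete, here is a minimal Python sketch. It rests on assumptions that are not taken from the paper: the per-token score `token_agreement` is taken to be min(1, p_sparse / p_dense) for the sampled token, and the tail is read off as an empirical 5% quantile. The paper's actual mismatch definition and quantile may differ.

```python
import math


def token_agreement(logp_sparse: float, logp_dense: float) -> float:
    """Hypothetical per-token agreement score in (0, 1].

    Assumption (not from the paper): min(1, p_sparse / p_dense),
    computed from the log-probabilities of the sampled token.
    """
    return min(1.0, math.exp(logp_sparse - logp_dense))


def lower_tail(scores: list[float], q: float = 0.05) -> float:
    """Empirical q-quantile of the scores (nearest-rank method)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Toy trajectory: 95 tokens match the dense policy, 5 diverge sharply.
logp_dense = [-0.5] * 100
logp_sparse = [-0.5] * 95 + [-3.5] * 5
scores = [token_agreement(s, d) for s, d in zip(logp_sparse, logp_dense)]

tail_score = lower_tail(scores, q=0.05)
threshold = 0.5  # illustrative value only; the paper's threshold is not given here
tail_ok = tail_score >= threshold
```

In this toy trajectory most tokens score 1.0, yet the 5% quantile sits near exp(-3), so a tail-based check flags the rollout even though almost every token agrees.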

Core claim

Sparse rollout collapse is not driven by uniform degradation across tokens; most sparse tokens align with their dense counterparts even under aggressive sparsity. Training remains stable if the lower tail of the per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. A dynamic sparsity schedule that keeps this tail statistic constant, combined with a cost model that maximizes speedup subject to the threshold, produces 2.2x, 2.4x, and 2.0x rollout speedups on Qwen3-1.7B, 4B, and 8B while preserving task performance. The same threshold generalizes to Qwen3-14B and to coding RL, and a lightweight LoRA distillation step (DistillSparse) allows even higher spars

What carries the argument

The lower tail of the per-token sparse-to-dense actor-policy mismatch, held constant by a dynamic sparsity schedule that adjusts sparsity level during generation to meet a fixed threshold.
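One way to picture a schedule that holds a tail statistic near a target is a simple feedback rule on the sparse KV budget. The sketch below is an illustration under stated assumptions, not the paper's schedule: `adjust_budget`, its step size, its dead band, and the `toy_tail` stand-in for a measured statistic are all hypothetical, and how the statistic is obtained during generation is not specified in this summary.

```python
def adjust_budget(budget: int, tail_stat: float, threshold: float,
                  step: float = 0.25, slack: float = 0.05,
                  min_budget: int = 256, max_budget: int = 16384) -> int:
    """Hypothetical controller: return the next sparse KV budget (keys attended)."""
    if tail_stat < threshold:
        proposed = budget * (1 + step)   # tail too low: attend to more keys
    elif tail_stat > threshold + slack:
        proposed = budget * (1 - step)   # clear headroom: attend to fewer keys
    else:
        proposed = budget                # inside the dead band: hold
    return int(min(max_budget, max(min_budget, proposed)))


def toy_tail(budget: int, context_len: int) -> float:
    """Toy stand-in for a measured tail statistic; NOT from the paper."""
    return min(1.0, budget / context_len) ** 0.5


budget = 1024
history = []
for context_len in range(1024, 16384 + 1, 1024):
    tail = toy_tail(budget, context_len)
    budget = adjust_budget(budget, tail, threshold=0.8)
    history.append((context_len, round(tail, 3), budget))
```

In this toy run the budget grows as the context lengthens, which is the qualitative behavior a tail-constant schedule would need. A multiplicative rule with a dead band is chosen only because it is easy to read.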

If this is right

Rollout generation for Qwen3-1.7B, 4B, and 8B achieves 2.2x, 2.4x, and 2.0x speedups under stable training.
The same mismatch threshold transfers to Qwen3-14B and to a coding RL domain without adjustment.
LoRA-based distillation on sparse rollouts permits more aggressive sparsity while still satisfying the mismatch threshold.
A cost model can be used to select the sparsity schedule that maximizes speedup subject only to the mismatch constraint.
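The last point amounts to a constrained selection: among candidate schedules, keep those whose tail statistic meets the threshold and take the fastest. A minimal sketch follows, assuming a hypothetical `Schedule` record and invented numbers. The paper's cost model and candidate set are not described in this summary.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Schedule:
    name: str
    est_speedup: float  # predicted by a (hypothetical) cost model
    est_tail: float     # estimated lower-tail agreement statistic


def pick_schedule(candidates: Sequence[Schedule],
                  threshold: float) -> Optional[Schedule]:
    """Return the fastest schedule whose tail statistic meets the threshold."""
    feasible = [c for c in candidates if c.est_tail >= threshold]
    if not feasible:
        return None  # nothing is safe at this threshold: fall back to dense
    return max(feasible, key=lambda c: c.est_speedup)


# Invented numbers, for illustration only.
candidates = [
    Schedule("conservative", est_speedup=1.3, est_tail=0.97),
    Schedule("medium", est_speedup=2.1, est_tail=0.92),
    Schedule("aggressive", est_speedup=3.0, est_tail=0.60),
]
best = pick_schedule(candidates, threshold=0.9)
```

Here the "aggressive" schedule is rejected despite its higher predicted speedup because its tail statistic falls below the threshold.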

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The mismatch-tail diagnostic could be applied to other attention or generation approximations beyond the paper's specific sparse method.
If the threshold generalizes further, it would allow RL training runs on models too large for dense rollouts on current hardware.
DistillSparse suggests a general pattern in which light distillation can relax the stability constraint for other efficiency techniques.
The approach opens the possibility of running many more RL iterations within the same compute budget, potentially improving final model capability on long-horizon tasks.

Load-bearing premise

The lower tail of the per-token mismatch distribution is the primary driver of rollout collapse, and a single fixed threshold value will keep training stable across model scales and RL domains without retuning.

What would settle it

Training a new model or RL task with the reported threshold enforced yet still observing rollout collapse or performance drop would falsify the central claim.

Original abstract

Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long COT, making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with dense even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sparrow's main contribution is a dynamic sparsity schedule that holds the lower tail of sparse-to-dense mismatch fixed, delivering reported 2.0x–2.4x rollout speedups on Qwen3 models, but the fixed-threshold generalization claim rests on narrow checks within one model family.


The central idea is that sparse rollout collapse in long-context RL is driven by the lower tail of per-token actor-policy mismatch rather than average degradation. The authors keep that tail statistic above a fixed threshold with a dynamic schedule, then pick the schedule that maximizes speedup under the constraint. This yields the claimed 2.2x, 2.4x, and 2.0x speedups on Qwen3-1.7B, 4B, and 8B.

What stands out is the observation that most tokens still match even under aggressive sparsity, so the tail is the real limiter. They validate the schedule on three model sizes, show transfer to 14B and a coding domain, and add a LoRA distillation step (DistillSparse) that lets them push sparsity further while staying above the mismatch threshold. The cost model for choosing the schedule is a practical addition.

The soft spots are proportionate. The threshold itself is selected to maintain stability, which introduces some circularity, and the paper does not compare the tail statistic against mean or variance mismatch controls. All reported results stay inside the Qwen3 family and only two task types, so the claim that one fixed threshold generalizes without retuning is supported by limited evidence. No results appear on other architectures.

This is useful for groups already running RLVR on long-CoT models and looking for rollout speedups. Readers who care about sparse attention inside RL loops will get concrete numbers and a testable control mechanism. The work is coherent enough on its own terms to merit a serious referee, though the generalization section will need tighter ablations.

Referee Report

3 major / 2 minor

Summary. The paper claims that sparse rollout collapse in RLVR for long-CoT LLMs is driven by the lower tail of per-token actor-policy mismatch rather than uniform degradation, and that maintaining this tail above a fixed critical threshold via a dynamic sparsity schedule enables stable training. It introduces such a schedule, uses a cost model to maximize speedup under the threshold constraint, reports 2.2x, 2.4x, and 2.0x rollout speedups on Qwen3-1.7B, 4B, and 8B, shows transfer to Qwen3-14B and a coding domain, and proposes DistillSparse (LoRA distillation on sparse rollouts) to allow more aggressive sparsity while meeting the same mismatch threshold.

Significance. If the core hypothesis holds, the work offers a practical route to reduce the dominant cost of long-context RLVR without collapse, with reported speedups that could scale training of thinking models. The mismatch-tail framing and dynamic schedule are a concrete, testable contribution; the DistillSparse extension adds a secondary efficiency lever. However, the significance is tempered by the narrow empirical base (single model family, two task types) and the empirical selection of the threshold itself.

major comments (3)

[§4.2, §5.1] §4.2 and §5.1: The central hypothesis states that the lower tail (not mean or variance) of per-token mismatch is the primary driver of collapse, yet no ablation compares controlling the tail statistic versus mean mismatch or other quantiles; without this, the claim that the tail alone enables the observed stability remains unisolated.
[§5.3, Table 3] §5.3, Table 3: The mismatch threshold is selected for stability on Qwen3-1.7B/4B/8B and then applied to Qwen3-14B and coding; the paper reports successful transfer but does not tabulate the realized lower-tail mismatch values on the new settings or demonstrate that the identical numerical threshold (without retuning) was used, weakening the generalization claim.
[§4.1, Eq. (3)–(5)] §4.1, Eq. (3)–(5): The dynamic schedule is defined to keep the lower-tail mismatch statistic constant, but the derivation of the per-step sparsity level from the cost model and the threshold appears to involve an empirical fitting step; this introduces moderate circularity between the stability criterion and the schedule parameters that is not quantified.
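The first major comment turns on the fact that a mean and a lower quantile can disagree about the same rollout. The toy comparison below illustrates why the requested ablation would be informative. The two score lists and both thresholds are invented for illustration and are not taken from the paper.

```python
import statistics


def quantile(values: list[float], q: float) -> float:
    """Empirical q-quantile by index into the sorted values."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


# Trajectory A: every token mildly degraded.
uniform = [0.94] * 1000
# Trajectory B: 97% of tokens near-perfect, 3% badly mismatched.
heavy_tail = [0.99] * 970 + [0.05] * 30

MEAN_THRESHOLD = 0.90  # illustrative
TAIL_THRESHOLD = 0.50  # illustrative

report = {}
for name, scores in {"uniform": uniform, "heavy_tail": heavy_tail}.items():
    mean = statistics.fmean(scores)
    tail = quantile(scores, 0.02)
    report[name] = {
        "mean": mean,
        "tail": tail,
        "mean_rule_passes": mean >= MEAN_THRESHOLD,
        "tail_rule_passes": tail >= TAIL_THRESHOLD,
    }
```

A mean-based rule accepts both trajectories, and trajectory B even has the higher mean, while a tail-based rule rejects B. Whether that difference actually predicts collapse is the empirical question the referee raises.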

minor comments (2)

[Figure 2, §3.2] Figure 2 caption and §3.2: the definition of “per-token actor-policy mismatch” should explicitly state whether it is KL, total variation, or another divergence, and whether it is computed on log-probabilities or normalized probabilities.
[Table 1] Table 1: baseline dense rollout times are given but the hardware (GPU count, precision) and exact sequence lengths used for the 2.0x–2.4x measurements are not restated, making direct reproduction harder.
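The first minor comment matters because different divergences give different numbers for the same pair of distributions. A small sketch makes this visible, assuming two hypothetical next-token distributions over a three-token vocabulary. Which divergence the paper uses, and in which direction, is not stated in this summary.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_variation(p: list[float], q: list[float]) -> float:
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical next-token distributions under dense and sparse attention.
dense = [0.7, 0.2, 0.1]
sparse = [0.6, 0.3, 0.1]

kl_dense_sparse = kl_divergence(dense, sparse)
kl_sparse_dense = kl_divergence(sparse, dense)
tv = total_variation(dense, sparse)
```

KL is asymmetric, so `kl_dense_sparse` and `kl_sparse_dense` differ, while total variation is symmetric and bounded by 1. A threshold quoted without the divergence and its direction is therefore hard to reproduce.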

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will make revisions to strengthen the empirical support and clarity of the manuscript.


Referee: [§4.2, §5.1] §4.2 and §5.1: The central hypothesis states that the lower tail (not mean or variance) of per-token mismatch is the primary driver of collapse, yet no ablation compares controlling the tail statistic versus mean mismatch or other quantiles; without this, the claim that the tail alone enables the observed stability remains unisolated.

Authors: We agree that an ablation isolating the effect of the lower tail from other statistics such as the mean would provide stronger support for the hypothesis. Our observations in §4.2 indicate that the mean mismatch does not drop significantly while the tail does prior to collapse, but this is correlational. In the revision, we will add an ablation study comparing a tail-controlled schedule against a mean-controlled schedule to isolate the contribution of the lower tail. revision: yes

Referee: [§5.3, Table 3] §5.3, Table 3: The mismatch threshold is selected for stability on Qwen3-1.7B/4B/8B and then applied to Qwen3-14B and coding; the paper reports successful transfer but does not tabulate the realized lower-tail mismatch values on the new settings or demonstrate that the identical numerical threshold (without retuning) was used, weakening the generalization claim.

Authors: We will revise Table 3 and §5.3 to include the realized lower-tail mismatch values achieved on Qwen3-14B and the coding domain. This will explicitly show that the same numerical threshold was used without retuning and that it was maintained throughout training. revision: yes

Referee: [§4.1, Eq. (3)–(5)] §4.1, Eq. (3)–(5): The dynamic schedule is defined to keep the lower-tail mismatch statistic constant, but the derivation of the per-step sparsity level from the cost model and the threshold appears to involve an empirical fitting step; this introduces moderate circularity between the stability criterion and the schedule parameters that is not quantified.

Authors: The stability threshold is selected based on empirical stability results independent of the cost model (§5.1). The cost model is then used to determine the sparsity schedule that achieves the target threshold with maximum speedup. We will clarify this separation in §4.1 and provide more details on the fitting procedure to quantify any potential dependencies. revision: yes

Circularity Check

1 step flagged

Threshold for tail mismatch is empirically fitted for stability then used to define the schedule

specific steps

fitted input called prediction [Abstract / hypothesis paragraph]
"we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training."

The 'critical threshold' is not derived from any equation or external principle; it is the value chosen so that training remains stable. The schedule is then defined to enforce constancy at exactly that fitted value, which makes the claim that the schedule 'enables stable training' partly tautological with respect to the selection criterion.

full rationale

The paper selects a critical threshold value specifically because it maintains training stability on the evaluated models, then constructs a dynamic sparsity schedule whose explicit goal is to hold the lower-tail statistic at or above that same fitted value. The reported speedups and generalization claims therefore rest on an input that was tuned to produce the desired outcome rather than an independent first-principles derivation. No equations or external theorems are shown to derive the threshold; it is presented as an empirical choice validated post-hoc on the same model family.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on an empirical observation about mismatch distribution and a tunable threshold whose value is chosen to avoid collapse; no new physical or mathematical entities are introduced.

free parameters (1)

mismatch_threshold
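Value chosen empirically so that training remains stable. The abstract describes it as consistent across Qwen3 models, so whether any per-model or per-task tuning took place is not established here.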

axioms (1)

domain assumption: Sparse attention produces per-token outputs whose mismatch with dense attention has a lower tail that controls training stability.
Invoked to justify the dynamic schedule.



[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, volume 2024, pages 21246–21263,

2024

[2] [2]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gall´ e, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet¨Ust¨ un, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025.https://hkunlp.github.io/blog/2025/Polaris

Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025.https://hkunlp.github.io/blog/2025/Polaris. Anthropic. Claude code.https://code.claude.com/,

2025

[4] [4]

Accelerating Large Language Model Decoding with Speculative Sampling

AI coding assistant. Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023.https://arxiv.org/abs/2302.01318. Zhuoming Chen. Vortex documentation, 2025.https://infini-ai-lab.github.io/vortex torch/. Zhuoming Chen, Ranajoy Sadhukhan...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures, 2018.https://arxiv.org/abs/1802.01561. Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Ch...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

[7]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

K Han, A Gu, WD Li, F Yan, T Zhang, S Wang, A Solar-Lezama, K Sen, and I Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

[8]

Defeating Nondeterminism in

doi: 10.64434/tml.20250910. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/. Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl, 2025. https://arxiv.org/abs/2508.18588. Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, D...

2025

[9]

Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025. https://arxiv.org/abs/2510.11696

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms, 2025. https://arxiv.org/abs/2510.11696. Infini-AI-Lab. Vortex: A flexible and efficient sparse attention framework,

2025

[10]

Fast Inference from Transformers via Speculative Decoding

https://arxiv.org/abs/2211.17192. Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023a. Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinfor...

[11]

https://arxiv.org/abs/2509.23232. Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. Clusterkv: Manipulating llm kv cache in semantic space for recallable compression, 2025a. https://arxiv.org/abs/2412.03213. Hongyi Liu, Zhuoming Chen, Yang Zhou, Haizhong Zheng, and Beidi Chen. Jackpot: Optimal budgeted rejection sampling for extreme actor...

[12]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang

https://github.com/ganler/code-r1. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=1qvx610Cu7. Liyuan Liu, Feng Yao, D...

2023

[13]

OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

2026

[14]

Kinetics: Rethinking test-time scaling laws, 2025. https://arxiv.org/abs/2506.05333

Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, and Beidi Chen. Kinetics: Rethinking test-time scaling laws, 2025. https://arxiv.org/abs/2506.05333. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoni...

2025

[15]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025a. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haib...

[16]

The synergy of speculative decoding and batching in serving large language models, 2023. https://arxiv.org/abs/2310.18813

Qidong Su, Christina Giannoula, and Gennady Pekhimenko. The synergy of speculative decoding and batching in serving large language models, 2023. https://arxiv.org/abs/2310.18813. Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. Shadowkv: Kv cache in shadows for high-throughput long-context ll...

2023

[17]

Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training, 2025a. https://arxiv.org/abs/2505.24034

Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A distributed asynchronous reinforcement learning framework for efficient large-scale llm training, 2025a. https://arxiv.org/abs/2505.24034. Yongji Wu, Xueshen Liu, Haizhong ...

[18]

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

https://openreview.net/forum?id=NG7sS51zVF. Ran Yan, Youhe Jiang, and Binhang Yuan. Flash sparse attention: More efficient natively trainable sparse attention. arXiv preprint arXiv:2508.18224,

[19]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

2025

[20]

SGLang: Efficient Execution of Structured Language Model Programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024. https://arxiv.org/abs/2312.07104. Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, ...

2024

[21]

Appendix Contents. This appendix provides additional analyses, implementation details, and supporting empirical evidence for the main paper

https://arxiv.org/abs/2509.18521. Appendix Contents. This appendix provides additional analyses, implementation details, and supporting empirical evidence for the main paper. The sections are organized as follows: Appendix A: Sparse Rollout Instability and High-Reward Rollouts. We examine whether insufficient rollout quality is the primary cause of...

2025

[22]

However, the blue curve still fails to recover the training performance

The average reward curve is shown in blue, which is significantly higher than both the dense average reward and that of the original sparse rollout. However, the blue curve still fails to recover the training performance. B Extended Related Works. We divide the discussion of related work into four aspects: RL for LLMs, prior works on distribution ...

2024

[23]

and DPO (Rafailov et al., 2023), which are based on offline RL, have also been employed for human alignment. RL training systems for LLMs, such as Verl (Sheng et al., 2025b), AReal (Fu et al., 2025), TRL (von Werra et al., 2020), and OpenRLHF (Hu et al., 2024), have been developed to improve training throughput and scalability. Distribution Mismatch Corre...

2023

[24]

Prior Rollout Speedup Methods. Many recent works have been proposed to address this rollout efficiency challenge, but have several key limitations

are implemented to mitigate the numerical issue of serving systems during rollout. Prior Rollout Speedup Methods. Many recent works have been proposed to address this rollout efficiency challenge, but have several key limitations. Several recent works (Zheng et al., 2025; Piché et al., 2025; Zhou et al.,

2025

[25]

and speculative decoding (Leviathan et al., 2023; Chen et al., 2023). Although model quantization can significantly reduce the cost of loading model weights, it cannot effectively mitigate the rollout overhead for long-sequence generation, where KV-cache loading remains the primary bottleneck (Sadhukhan et al., 2025). Conversely, speculative decoding can ...

2023

[26]

Furthermore, speculative decoding introduces an additional draft model that requires extra training resources and thus complicates the whole training pipeline

in RL training because the verification process becomes compute-intensive. Furthermore, speculative decoding introduces an additional draft model that requires extra training resources and thus complicates the whole training pipeline. Sparse attention. Attention-operation cost dominates the latency of generating long-context output, a consensus shared by m...

2025

[27]

Despite robust performance in general tasks, under aggressive sparsity settings, these methods incur an unacceptable accuracy drop

or more accurate dynamic block-sparse attention (Tang et al., 2024b; Sun et al., 2024b; Liu et al., 2025a). Despite robust performance in general tasks, under aggressive sparsity settings, these methods incur an unacceptable accuracy drop. Pretrained sparse attention methods (Yuan et al., 2025a; DeepSeek-AI, 2025), on the other hand, achieve scalable resu...

2025

[28]

Experiments are run on Qwen3-4B-Instruct with generation length 16K

as the inference engine. Experiments are run on Qwen3-4B-Instruct with generation length 16K. Training is run on 2xH200 GPUs. For efficient sparse-attention rollouts, we use Vortex torch (Chen, 2025). We adopt block top-k attention with a page size of 16, and set the number of top-k pages according to the sparse KV budget. In addition, we use Flash Spar...

2025
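
As a rough illustration of the block top-k setup described in the entry above, the following sketch selects pages for a single query under a sparse KV budget. Only the page size of 16 comes from the text. Everything else is an assumption made for illustration: the function names `num_top_k_pages` and `select_pages` are hypothetical, the page score (maximum query-key dot product within a page) is one common choice and not necessarily the rule used by Vortex or by the paper, and converting the budget to a page count by floor division is a guess.

```python
import math

PAGE_SIZE = 16  # page size stated in the text


def num_top_k_pages(kv_budget_tokens, page_size=PAGE_SIZE):
    """Pages that fit in the sparse KV budget (assumed: floor division, at least 1)."""
    return max(1, kv_budget_tokens // page_size)


def select_pages(query, keys, kv_budget_tokens, page_size=PAGE_SIZE):
    """Return the sorted indices of the top-k pages for one query vector.

    query: list of floats; keys: list of key vectors, one per cached token.
    Assumed scoring rule: a page's score is the largest query-key dot
    product among the tokens it holds.
    """
    n_pages = math.ceil(len(keys) / page_size)
    scores = []
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        scores.append(max(sum(q * k for q, k in zip(query, key)) for key in page))
    k = min(num_top_k_pages(kv_budget_tokens, page_size), n_pages)
    # Rank pages by score, keep the best k, and return them in cache order.
    top = sorted(range(n_pages), key=lambda p: scores[p], reverse=True)[:k]
    return sorted(top)
```

Under these assumptions, a 16K-token cache holds 1024 pages, and a hypothetical budget of 2048 tokens would keep 128 of them, so attention reads one eighth of the KV cache per decoding step.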

[29]

As shown in Figure 10, we report the efficiency of our implementation

for LoRA adaptation. As shown in Figure 10, we report the efficiency of our implementation. When training a 4B instruct model with 16K max context length, dense rollouts account for roughly 90% of the per-epoch time. Sparse attention directly alleviates this bottleneck and accelerates rollouts by roughly 1.9×. Although the dense policy update contributes...

2025
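
The two figures quoted in the entry above (rollouts at roughly 90% of per-epoch time, rollout speedup of roughly 1.9×) can be combined with Amdahl's law to get a back-of-the-envelope bound on the per-epoch speedup. This is a sketch, not a number reported in the source: it assumes the remaining 10% of the epoch is unchanged by sparse rollout, and the symbols $f$, $s$, and $S$ are introduced here for illustration.

```latex
% f: fraction of per-epoch time spent in dense rollouts (about 0.9, from the text)
% s: rollout speedup from sparse attention (about 1.9, from the text)
% S: resulting per-epoch speedup, assuming the other (1 - f) is unaffected
S = \frac{1}{(1 - f) + \dfrac{f}{s}}
  = \frac{1}{0.1 + \dfrac{0.9}{1.9}}
  \approx \frac{1}{0.574}
  \approx 1.74
```

Any extra work added to the non-rollout portion would push the realized figure below this estimate.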