Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models
Pith reviewed 2026-06-27 19:09 UTC · model grok-4.3
The pith
Controlling the lower tail of per-token actor-policy mismatch above a threshold keeps sparse rollouts stable and yields up to 2.4x speedup in long-context RL for language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse rollout collapse is not driven by uniform degradation across tokens; most sparse tokens align with their dense counterparts even under aggressive sparsity. Training remains stable if the lower tail of the per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. A dynamic sparsity schedule that keeps this tail statistic constant, combined with a cost model that maximizes speedup subject to the threshold, produces 2.2x, 2.4x, and 2.0x rollout speedups on Qwen3-1.7B, 4B, and 8B while preserving task performance. The same threshold generalizes to Qwen3-14B and to coding RL, and a lightweight LoRA distillation step (DistillSparse) allows even higher sparsity while still meeting the same mismatch threshold.
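A minimal sketch of the kind of tail statistic the claim refers to, under assumptions this summary does not confirm: the per-token score is taken to be the dense-to-sparse probability ratio on the sampled token, clipped at 1, and the "lower tail" is taken to be an empirical quantile. The function names, the 1% quantile, and the threshold 0.5 are hypothetical.

```python
import math

def per_token_agreement(logp_dense, logp_sparse):
    """Assumed score: min(1, p_dense / p_sparse) for each sampled token."""
    return [min(1.0, math.exp(d - s)) for d, s in zip(logp_dense, logp_sparse)]

def lower_tail(scores, q=0.01):
    """Empirical lower quantile of the scores (q is a hypothetical choice)."""
    if not scores:
        raise ValueError("need at least one token")
    ordered = sorted(scores)
    rank = max(1, int(q * len(ordered)))  # 1-based rank, never below the minimum
    return ordered[rank - 1]

def is_within_threshold(logp_dense, logp_sparse, threshold=0.5, q=0.01):
    """True when the lower tail of the assumed score stays at or above the threshold."""
    return lower_tail(per_token_agreement(logp_dense, logp_sparse), q) >= threshold

# 99 tokens agree exactly; one token is far less likely under the dense policy.
dense = [-1.0] * 99 + [-6.0]
sparse = [-1.0] * 100
scores = per_token_agreement(dense, sparse)
mean_score = sum(scores) / len(scores)
tail_score = lower_tail(scores)
```

In this toy trajectory the mean score stays close to 1 while the tail score is dominated by the single divergent token, which is the situation the core claim singles out.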
What carries the argument
The lower tail of the per-token sparse-to-dense actor-policy mismatch, held constant by a dynamic sparsity schedule that adjusts sparsity level during generation to meet a fixed threshold.
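One way to picture a schedule that holds a tail statistic near a target is a simple feedback rule on the sparse KV budget. This is a sketch under stated assumptions, not the paper's schedule: the multiplicative step, tolerance band, budget limits, and the function name `adjust_kv_budget` are all hypothetical.

```python
def adjust_kv_budget(budget, tail_stat, target, step=0.1,
                     min_budget=256, max_budget=16384, tolerance=0.02):
    """Return the next sparse KV budget (tokens attended per query).

    tail_stat is the measured lower-tail statistic; higher means closer
    agreement with the dense policy.
    """
    if tail_stat < target - tolerance:
        # Tail fell below the target: attend to more tokens (less sparsity).
        budget = int(budget * (1 + step))
    elif tail_stat > target + tolerance:
        # Headroom above the target: sparsify further for speed.
        budget = int(budget * (1 - step))
    # Inside the tolerance band the budget is left unchanged.
    return max(min_budget, min(max_budget, budget))
```

The tolerance band is a design choice to avoid oscillating between two budgets when the measured statistic sits near the target.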
If this is right
- Rollout generation for Qwen3-1.7B, 4B, and 8B achieves 2.2x, 2.4x, and 2.0x speedups under stable training.
- The same mismatch threshold transfers to Qwen3-14B and to a coding RL domain without adjustment.
- LoRA-based distillation on sparse rollouts permits more aggressive sparsity while still satisfying the mismatch threshold.
- A cost model can be used to select the sparsity schedule that maximizes speedup subject only to the mismatch constraint.
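The last bullet describes a constrained selection: maximize speedup subject only to the mismatch constraint. A minimal sketch of that selection step, assuming a small set of candidate schedules whose tail statistic and rollout time have already been measured or predicted. The `Candidate` type, the `pick_schedule` function, and every number below are hypothetical and are not taken from the paper's cost model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    tail_stat: float     # measured or predicted lower-tail statistic
    rollout_time: float  # predicted rollout time, arbitrary units

def pick_schedule(candidates, threshold, dense_time):
    """Pick the fastest candidate that satisfies the mismatch threshold."""
    feasible = [c for c in candidates if c.tail_stat >= threshold]
    if not feasible:
        return None, 1.0  # no sparse schedule qualifies: stay dense
    best = min(feasible, key=lambda c: c.rollout_time)
    return best, dense_time / best.rollout_time

candidates = [
    Candidate("budget-4096", tail_stat=0.80, rollout_time=75.0),
    Candidate("budget-2048", tail_stat=0.55, rollout_time=60.0),
    Candidate("budget-1024", tail_stat=0.30, rollout_time=45.0),
]
best, speedup = pick_schedule(candidates, threshold=0.5, dense_time=90.0)
```

Here the most aggressive candidate is the fastest but violates the threshold, so the selection falls back to the next one.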
Where Pith is reading between the lines
- The mismatch-tail diagnostic could be applied to other attention or generation approximations beyond the paper's specific sparse method.
- If the threshold generalizes further, it would allow RL training runs on models too large for dense rollouts on current hardware.
- DistillSparse suggests a general pattern in which light distillation can relax the stability constraint for other efficiency techniques.
- The approach opens the possibility of running many more RL iterations within the same compute budget, potentially improving final model capability on long-horizon tasks.
Load-bearing premise
The lower tail of the per-token mismatch distribution is the primary driver of rollout collapse, and a single fixed threshold value will keep training stable across model scales and RL domains without retuning.
What would settle it
Training a new model or RL task with the reported threshold enforced yet still observing rollout collapse or performance drop would falsify the central claim.
Original abstract
Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long COT, making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with dense even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sparse rollout collapse in RLVR for long-CoT LLMs is driven by the lower tail of per-token actor-policy mismatch rather than uniform degradation, and that maintaining this tail above a fixed critical threshold via a dynamic sparsity schedule enables stable training. It introduces such a schedule, uses a cost model to maximize speedup under the threshold constraint, reports 2.2x–2.4x rollout speedups on Qwen3-1.7B/4B/8B, shows transfer to Qwen3-14B and a coding domain, and proposes DistillSparse (LoRA distillation on sparse rollouts) to allow more aggressive sparsity while meeting the same mismatch threshold.
Significance. If the core hypothesis holds, the work offers a practical route to reduce the dominant cost of long-context RLVR without collapse, with reported speedups that could scale training of thinking models. The mismatch-tail framing and dynamic schedule are a concrete, testable contribution; the DistillSparse extension adds a secondary efficiency lever. However, the significance is tempered by the narrow empirical base (single model family, two task types) and the empirical selection of the threshold itself.
major comments (3)
- [§4.2, §5.1] §4.2 and §5.1: The central hypothesis states that the lower tail (not mean or variance) of per-token mismatch is the primary driver of collapse, yet no ablation compares controlling the tail statistic versus mean mismatch or other quantiles; without this, the claim that the tail alone enables the observed stability remains unisolated.
- [§5.3, Table 3] §5.3, Table 3: The mismatch threshold is selected for stability on Qwen3-1.7B/4B/8B and then applied to Qwen3-14B and coding; the paper reports successful transfer but does not tabulate the realized lower-tail mismatch values on the new settings or demonstrate that the identical numerical threshold (without retuning) was used, weakening the generalization claim.
- [§4.1, Eq. (3)–(5)] §4.1, Eq. (3)–(5): The dynamic schedule is defined to keep the lower-tail mismatch statistic constant, but the derivation of the per-step sparsity level from the cost model and the threshold appears to involve an empirical fitting step; this introduces moderate circularity between the stability criterion and the schedule parameters that is not quantified.
minor comments (2)
- [Figure 2, §3.2] Figure 2 caption and §3.2: the definition of “per-token actor-policy mismatch” should explicitly state whether it is KL, total variation, or another divergence, and whether it is computed on log-probabilities or normalized probabilities.
- [Table 1] Table 1: baseline dense rollout times are given but the hardware (GPU count, precision) and exact sequence lengths used for the 2.2x–2.4x measurements are not restated, making direct reproduction harder.
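For reference, the candidate definitions named in the first minor comment can be written out as follows. The notation is an assumption for illustration: $\pi_{\mathrm{sparse}}$ and $\pi_{\mathrm{dense}}$ denote the sparse rollout policy and the dense policy, $s_t$ the context at step $t$, and $y_t$ the sampled token. Which of these (if any) the paper uses is exactly what the comment asks the authors to state.

```latex
\begin{align*}
\text{KL divergence:}\quad
  & D_{\mathrm{KL}}\big(\pi_{\mathrm{sparse}}(\cdot \mid s_t)\,\|\,\pi_{\mathrm{dense}}(\cdot \mid s_t)\big)
    = \sum_{v} \pi_{\mathrm{sparse}}(v \mid s_t)\,
      \log\frac{\pi_{\mathrm{sparse}}(v \mid s_t)}{\pi_{\mathrm{dense}}(v \mid s_t)} \\
\text{Total variation:}\quad
  & \tfrac{1}{2}\sum_{v}\big|\pi_{\mathrm{sparse}}(v \mid s_t) - \pi_{\mathrm{dense}}(v \mid s_t)\big| \\
\text{Sampled-token ratio:}\quad
  & \frac{\pi_{\mathrm{dense}}(y_t \mid s_t)}{\pi_{\mathrm{sparse}}(y_t \mid s_t)}
\end{align*}
```

The first two are divergences that grow as the policies disagree, whereas the review describes a statistic that must stay above a threshold. That wording is more consistent with an agreement-type score such as the ratio, though this is an inference from the phrasing rather than something the summary states.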
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will make revisions to strengthen the empirical support and clarity of the manuscript.
Point-by-point responses
- Referee: [§4.2, §5.1] §4.2 and §5.1: The central hypothesis states that the lower tail (not mean or variance) of per-token mismatch is the primary driver of collapse, yet no ablation compares controlling the tail statistic versus mean mismatch or other quantiles; without this, the claim that the tail alone enables the observed stability remains unisolated.
  Authors: We agree that an ablation isolating the effect of the lower tail from other statistics such as the mean would provide stronger support for the hypothesis. Our observations in §4.2 indicate that, prior to collapse, the tail drops while the mean mismatch does not drop significantly, but this evidence is correlational. In the revision, we will add an ablation study comparing a tail-controlled schedule against a mean-controlled schedule to isolate the contribution of the lower tail.
  Revision: yes
- Referee: [§5.3, Table 3] §5.3, Table 3: The mismatch threshold is selected for stability on Qwen3-1.7B/4B/8B and then applied to Qwen3-14B and coding; the paper reports successful transfer but does not tabulate the realized lower-tail mismatch values on the new settings or demonstrate that the identical numerical threshold (without retuning) was used, weakening the generalization claim.
  Authors: We will revise Table 3 and §5.3 to include the realized lower-tail mismatch values achieved on Qwen3-14B and the coding domain. This will explicitly show that the same numerical threshold was used without retuning and that it was maintained throughout training.
  Revision: yes
- Referee: [§4.1, Eq. (3)–(5)] §4.1, Eq. (3)–(5): The dynamic schedule is defined to keep the lower-tail mismatch statistic constant, but the derivation of the per-step sparsity level from the cost model and the threshold appears to involve an empirical fitting step; this introduces moderate circularity between the stability criterion and the schedule parameters that is not quantified.
  Authors: The stability threshold is selected based on empirical stability results independent of the cost model (§5.1). The cost model is then used to determine the sparsity schedule that achieves the target threshold with maximum speedup. We will clarify this separation in §4.1 and provide more details on the fitting procedure to quantify any potential dependencies.
  Revision: yes
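The distinction at stake in the first response, a mean that barely moves while the lower tail drops, can be illustrated with a toy calculation. This is a sketch with fabricated illustrative scores, not data from the paper; scores are assumed to lie in [0, 1] with higher meaning closer agreement with the dense policy, and `mean_and_tail` is a hypothetical helper.

```python
import statistics

def mean_and_tail(scores, q=0.01):
    """Return (mean, empirical lower q-quantile) of per-token scores."""
    ordered = sorted(scores)
    rank = max(1, int(q * len(ordered)))  # 1-based rank into the sorted scores
    return statistics.fmean(ordered), ordered[rank - 1]

healthy = [1.0] * 1000
drifting = [1.0] * 980 + [0.05] * 20  # 2% of tokens diverge sharply

mean_h, tail_h = mean_and_tail(healthy)
mean_d, tail_d = mean_and_tail(drifting)
```

The drifting trajectory keeps a mean above 0.98 while its 1% quantile falls to 0.05, so a mean-controlled schedule and a tail-controlled schedule would react very differently to it. Whether that difference explains collapse is what the proposed ablation would need to test.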
Circularity Check
Threshold for tail mismatch is empirically fitted for stability then used to define the schedule
specific steps
- fitted input called prediction [Abstract / hypothesis paragraph]
  "we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training."
  The 'critical threshold' is not derived from any equation or external principle; it is the value chosen so that training remains stable. The schedule is then defined to enforce constancy at exactly that fitted value, rendering the claim that the schedule 'enables stable training' partly tautological to the selection criterion.
full rationale
The paper selects a critical threshold value specifically because it maintains training stability on the evaluated models, then constructs a dynamic sparsity schedule whose explicit goal is to hold the lower-tail statistic at or above that same fitted value. The reported speedups and generalization claims therefore rest on an input that was tuned to produce the desired outcome rather than an independent first-principles derivation. No equations or external theorems are shown to derive the threshold; it is presented as an empirical choice validated post-hoc on the same model family.
Axiom & Free-Parameter Ledger
free parameters (1)
- mismatch_threshold
axioms (1)
- domain assumption: Sparse attention produces per-token outputs whose mismatch with dense attention has a lower tail that controls training stability.