arxiv: 2606.18967 · v1 · pith:ZHQLJ4XOnew · submitted 2026-06-17 · 💻 cs.LG

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Pith reviewed 2026-06-26 21:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords speculative decoding · reinforcement learning · LLM rollouts · self-speculative decoding · latency reduction · system-aware decoding · quantized drafter · autoregressive sampling

The pith

EfficientRollout induces a quantized drafter from the target model and uses it to cut RL rollout latency by up to 19.6 percent while keeping the output distribution unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the speedups of standard speculative decoding do not carry over directly to RL rollouts, for two reasons: the policy evolves during training, so any fixed drafter drifts out of alignment with it, and active batch sizes shrink during rollout decoding, which shifts decoding from a compute-bound to a memory-bound regime. By inducing a quantized drafter directly from the target model, the method keeps the drafter aligned with the changing high-temperature policy without any separate training. It adds a system-aware toggle that activates speculation only when parallel verification can exploit idle compute, and it adapts draft lengths based on acceptance rates. This matters because rollout generation is a dominant latency bottleneck when scaling RL post-training for language models.

Core claim

EfficientRollout induces a quantized drafter from the target model for self-speculative decoding, keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial memory-bound regimes. This produces up to 19.6 percent reduction in rollout latency and 12.7 percent in end-to-end latency over an accelerated autoregressive baseline while preserving final model quality.
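To make the "beneficial memory-bound regimes" idea concrete, here is a minimal roofline-style sketch of a toggle rule. It is not the paper's policy. The `Hardware` numbers, the function names, and the cost model (2 FLOPs per parameter per token, weights read once per forward pass, 16-bit target weights, 4-bit drafter weights, KV-cache traffic ignored) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical accelerator; the numbers used below are illustrative only."""
    peak_flops: float     # sustained FLOP/s
    mem_bandwidth: float  # sustained bytes/s


def pass_time(hw, n_params, bytes_per_param, batch_size, tokens_per_seq):
    """Roofline estimate of one forward pass: the slower of compute and weight traffic."""
    compute = 2.0 * n_params * batch_size * tokens_per_seq / hw.peak_flops
    memory = n_params * bytes_per_param / hw.mem_bandwidth
    return max(compute, memory)


def predicted_speedup(hw, n_params, batch_size, gamma, tokens_per_block,
                      target_bytes=2.0, drafter_bytes=0.5):
    """Predicted SD speedup over AR decoding for draft length gamma.

    tokens_per_block is the average number of tokens emitted per
    draft-and-verify block (assumed to play the role of block efficiency).
    """
    ar_step = pass_time(hw, n_params, target_bytes, batch_size, 1)
    drafting = gamma * pass_time(hw, n_params, drafter_bytes, batch_size, 1)
    verifying = pass_time(hw, n_params, target_bytes, batch_size, gamma + 1)
    return tokens_per_block * ar_step / (drafting + verifying)


def should_speculate(hw, n_params, batch_size, gamma, tokens_per_block):
    """Toggle rule: speculate only when the predicted speedup exceeds 1."""
    return predicted_speedup(hw, n_params, batch_size, gamma, tokens_per_block) > 1.0


hw = Hardware(peak_flops=1e15, mem_bandwidth=2e12)  # made-up device
print(should_speculate(hw, 7e9, batch_size=4, gamma=3, tokens_per_block=2.5))
print(should_speculate(hw, 7e9, batch_size=2048, gamma=3, tokens_per_block=2.5))
```

Under these made-up numbers the rule enables speculation at batch size 4, where weight traffic dominates and the 4-bit drafter is cheap, and disables it at batch size 2048, where every pass is compute-bound and drafting only adds work.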

What carries the argument

A quantized drafter induced from the target model, coordinated with a system-aware SD toggle policy and acceptance-aware draft-length adaptation.

If this is right

  • Rollout latency drops by up to 19.6 percent and end-to-end latency by up to 12.7 percent relative to accelerated autoregressive baselines.
  • The target-model distribution is exactly preserved, so final model quality after RL training stays unchanged.
  • No separate drafter pretraining or online adaptation is required because the drafter is induced from the target model itself.
  • Speculation is applied only in memory-bound regimes where active batch sizes have shrunk, avoiding overhead in compute-bound phases.
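The distribution-preservation point rests on the standard speculative-sampling acceptance rule from the fixed-model SD literature: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(p - q, 0). The sketch below illustrates that generic rule with NumPy. The function name `verify_draft` and the array layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Verify drafted tokens against the target distribution.

    draft_tokens: gamma token ids, each sampled from the drafter
    q_probs: array (gamma, vocab), drafter distribution at each position
    p_probs: array (gamma + 1, vocab), target distribution at each position
    Returns the accepted prefix plus one corrected or bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept with probability min(1, p/q); q[tok] > 0 because tok came from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # Rejected: resample from the residual distribution and stop.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # Every draft token was accepted: take one bonus token from the target.
    out.append(int(rng.choice(p_probs.shape[1], p=p_probs[-1])))
    return out


rng = np.random.default_rng(0)
p = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])  # toy target distributions
q = np.array([[0.2, 0.5, 0.3]])                   # toy drafter distribution
draft = [int(rng.choice(3, p=q[0]))]
print(verify_draft(draft, q, p, rng))
```

Sampling the first output token many times under this rule reproduces the target distribution `p[0]` regardless of how poor `q` is; a worse drafter only lowers the acceptance rate, which is why the premise below is about speed rather than correctness.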

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-speculative construction could be tested in other online training loops where the model distribution shifts continuously.
  • Lower overall wall-clock time for RL post-training might allow more frequent policy updates or larger batch sizes within fixed compute budgets.
  • Varying the quantization precision of the induced drafter offers a direct knob for trading acceptance rate against drafting speed in future implementations.
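As a rough illustration of the precision knob in the last bullet, the sketch below applies symmetric group-wise round-to-nearest (RTN) quantization to a weight matrix at several bit widths and reports the reconstruction error. The figure captions mention W4-RTN and W4-AWQ drafters, but this is a generic RTN sketch, not the paper's kernel. The function name `quantize_rtn`, the group size of 128, and the symmetric scheme are assumptions.

```python
import numpy as np


def quantize_rtn(weights, bits=4, group_size=128):
    """Symmetric group-wise round-to-nearest quantization.

    Returns the dequantized weights so the error can be inspected directly.
    Assumes the number of elements is divisible by group_size.
    """
    w = np.asarray(weights, dtype=np.float32)
    groups = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero groups
    codes = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (codes * scale).reshape(w.shape).astype(np.float32)


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)).astype(np.float32)  # stand-in weight matrix
for bits in (8, 4, 2):
    err = float(np.abs(quantize_rtn(w, bits=bits) - w).mean())
    print(f"W{bits}: mean abs error = {err:.4f}")
```

Lower bit widths shrink the weights the drafter has to read per step but increase the reconstruction error, which is the mechanism by which precision could trade drafting speed against acceptance rate.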

Load-bearing premise

The quantized drafter remains sufficiently matched to the evolving high-temperature policy distribution throughout training so that acceptance rates stay high enough to produce net speedup.

What would settle it

An experiment that measures acceptance rates on long, high-temperature rollouts from the trained policy and finds that drafting overhead exceeds the gains from parallel verification would contradict the reported latency reduction.
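The breakeven in question can be estimated with the standard analysis from the speculative decoding literature. Assuming an independent per-token acceptance rate α, a draft length γ, and a drafter step that costs a fraction c of a target step, the expected number of tokens per block is (1 - α^(γ+1)) / (1 - α) and the expected speedup is that quantity divided by (γc + 1). The sketch below encodes this formula. The function names are hypothetical, and treating verification of γ + 1 tokens as costing one target step is an assumption that only holds in the memory-bound regime.

```python
def expected_block_tokens(alpha, gamma):
    """Expected tokens emitted per draft-and-verify block (i.i.d. acceptance)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Speedup over AR decoding; cost_ratio is drafter step time / target step time."""
    return expected_block_tokens(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_draft_length(alpha, cost_ratio, max_gamma=8):
    """Draft length with the highest expected speedup; 0 means stay autoregressive."""
    return max(range(max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))


for alpha in (0.3, 0.6, 0.8, 0.9):
    g = best_draft_length(alpha, cost_ratio=0.3)
    print(alpha, g, round(expected_speedup(alpha, g, 0.3), 3))
```

For γ = 1 the formula reduces to (1 + α) / (1 + c), so speculation pays off only when α exceeds c. An acceptance rate measured below the drafter's relative cost would therefore be direct evidence against a net speedup.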

Figures

Figures reproduced from arXiv: 2606.18967 by Amir Gholami, Coleman Hooper, Donghoon Kim, Harman Singh, Hyung Il Koo, Kevin Galim, Minjae Lee, Minseo Kim, Seunghyuk Oh, Wonjun Kang.

Figure 1. Overview of EfficientRollout. EfficientRollout bridges the gap between SD for fixed-model serving and RL rollout decoding through three coordinated components: (a) per-step self-drafter refresh to track the evolving policy, (b) regime-aware toggling from AR to SD under tail-heavy rollout dynamics (see …
Figure 2. Empirical characteristics of RL rollout decoding. (a) Step-time decomposition over the first 20 training steps. (b) Single-token decode-time breakdown in RL rollout-tail phases. (c) Inverse correlation between target-policy entropy and quantized-drafter first-token acceptance over steps.
Figure 3. Validation of the roofline-based toggle boundary. Colors show the predicted speedup over the batch-size and sequence-length plane, and markers indicate whether measured SD is beneficial or harmful at the corresponding coordinates.
Figure 5. Adaptive draft-length policy and training dynamics on Qwen2.5-7B. (a) Adaptive γ reduces rollout-generation time by avoiding overly large drafts early and exploiting longer drafts later. (b) The controller raises γ as block efficiency τ improves during training. (c) EfficientRollout follows the veRL (AR) reward trajectory, indicating preserved training dynamics.
Figure 6. Block efficiency across pretrained auxiliary drafters. On DAPO-Math-17K, the evaluated pretrained auxiliary drafters achieve lower block efficiency than quantized self-drafters.
Figure 7. Extended rollout-tail analysis. (a) Cumulative request-completion curve for Llama3.1-8B …
Figure 8. Block efficiency τ over the first 50 training steps on Qwen2.5-7B with W4-RTN and W4-AWQ drafters.
Figure 9. Policy sharpening improves quantized drafter alignment. As target-policy entropy decreases, …
Figure 10. Policy sharpening improves quantized drafter alignment. As target-policy entropy …
Figure 11. Quantized self-SD with shared KV cache, target verification via rejection sampling, and …
Figure 12. Validation of the roofline-based SD toggle policy. Background colors show pre…
Figure 13. Rollout-generation time over training steps (smoothed). EfficientRollout consistently …
Figure 14. Average training reward over RL steps for three evaluated models. Across all evaluated …
Figure 15. Validation accuracy over training steps for the three evaluated models. EfficientRollout …
Figure 16. Adaptive γ schedules across models. The controller increases γ only when block efficiency τ is high enough to support a longer draft; otherwise, it keeps γ unchanged.
Figure 17. Block efficiency of learned auxiliary drafting over RL training steps. Across models, the …
Figure 18. Block efficiency on DAPO-Math-17K. Across the first 30 RL steps, evaluated learned …
Figure 19. Rollout-generation time of EAGLE3 in the NeMo RL stack. Among the evaluated …
Original abstract

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes EfficientRollout, a system-aware self-speculative decoding method for RL rollouts in LLMs. It induces a single quantized drafter from the target model (self-SD) to stay coupled to the evolving policy without separate pretraining or online adaptation, and combines this with a system-aware SD toggle policy plus acceptance-aware draft-length adaptation to speculate only in memory-bound regimes. The central empirical claim is that this yields up to 19.6% rollout latency reduction and 12.7% end-to-end latency reduction versus an accelerated autoregressive baseline while preserving final model quality.

Significance. If the reported speedups are robust, the work addresses a practical bottleneck in LLM post-training by making speculative decoding viable under shifting high-temperature policies and shrinking batch sizes without extra training overhead. The system-aware components and self-induced drafter are pragmatic contributions that could improve efficiency of RL-based reasoning and agent training pipelines.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The speedup claims rest on the precondition that the once-induced quantized drafter maintains sufficiently high acceptance rates as the target policy distribution shifts during RL training. No acceptance-rate curves versus training step or temperature are referenced; if acceptance falls below the breakeven point in later epochs, the system-aware toggle and draft-length adaptation cannot deliver the stated net latency reductions.
  2. [§3.1] §3.1 (Drafter Induction): The manuscript states the drafter is induced once from the target model and remains effective without online adaptation, yet provides no quantitative comparison of acceptance rates between the initial induction distribution and the final high-temperature policy distribution after RL. This leaves the central assumption unverified.
minor comments (2)
  1. [§4.2] Figure captions and §4.2 should explicitly state the number of random seeds, error-bar computation method, and whether the reported 19.6% / 12.7% figures are means or best-case values.
  2. [§3.3] Notation for the draft-length adaptation rule (Eq. X) should be defined before first use and cross-referenced in the system-aware toggle description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the robustness of the induced drafter. The comments correctly note the absence of explicit acceptance-rate analyses. We address each point below and will incorporate the requested quantitative evidence in revision.

  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The speedup claims rest on the precondition that the once-induced quantized drafter maintains sufficiently high acceptance rates as the target policy distribution shifts during RL training. No acceptance-rate curves versus training step or temperature are referenced; if acceptance falls below the breakeven point in later epochs, the system-aware toggle and draft-length adaptation cannot deliver the stated net latency reductions.

    Authors: We agree that acceptance-rate curves versus training step and temperature are not present and would strengthen the central claim. The reported end-to-end latency reductions and unchanged final model quality provide indirect support that the drafter remains above breakeven, because the system-aware toggle disables speculation when acceptance is insufficient. To address the concern directly, we will add acceptance-rate plots over training steps and across temperatures in the revised manuscript.
    revision: yes

  2. Referee: [§3.1] §3.1 (Drafter Induction): The manuscript states the drafter is induced once from the target model and remains effective without online adaptation, yet provides no quantitative comparison of acceptance rates between the initial induction distribution and the final high-temperature policy distribution after RL. This leaves the central assumption unverified.

    Authors: The manuscript does not contain a direct side-by-side comparison of acceptance rates at induction versus at the end of RL training. While the overall speedups without quality loss are consistent with the assumption holding, we acknowledge that an explicit quantitative comparison would better verify it. We will add this comparison (initial vs. final acceptance rates under the final high-temperature policy) to §3.1 in the revision.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical latency measurements from system implementation

The paper describes an engineering system (quantized self-drafter induced once from target, plus toggle and draft-length adaptation) and reports measured speedups (19.6% rollout, 12.7% end-to-end) against an AR baseline. No derivation chain, equations, or fitted parameters are shown that reduce a claimed prediction back to the inputs by construction. The drafter-induction step is a one-time engineering choice, not a self-definitional loop or renamed fit. Self-citations, if present, are not load-bearing for the latency numbers, which are externally falsifiable via timing benchmarks. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that a quantized self-drafter stays effective without separate pretraining and that the system-aware policy correctly identifies beneficial regimes; no free parameters or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5879 in / 1153 out tokens · 14068 ms · 2026-06-26T21:30:55.572879+00:00 · methodology

Reference graph

Works this paper leans on

76 extracted references · 19 linked inside Pith

  1. [1]

    ShareGPT_Vicuna_unfiltered

    Aeala. ShareGPT_Vicuna_unfiltered. https://huggingface.co/datasets/Aeala/ ShareGPT_Vicuna_unfiltered, 2023. Hugging Face dataset. Accessed: 2026-05-03

  2. [2]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  3. [3]

    Qwen3-8B_eagle3

    AngelSlim. Qwen3-8B_eagle3. https://huggingface.co/AngelSlim/Qwen3-8B_ eagle3, 2025. Hugging Face model. Accessed: 2026-06-06

  4. [4]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. InForty-first International Conference on Machine Learning, 2024. URL https: //openreview.net/forum?id=PEpbUobfJv. 10

  5. [5]

    Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  6. [6]

    Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

  7. [7]

    Clasp: In-context layer skip for self-speculative decoding

    Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, and Min Yang. Clasp: In-context layer skip for self-speculative decoding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31608–31618, 2025

  8. [8]

    Respec: Towards optimizing speculative decoding in reinforcement learning systems

    Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems. InNinth Conference on Machine Learning and Systems, 2026. URLhttps://openreview.net/forum?id=HhDSxs7x2R

  9. [9]

    Do NOT think that much for 2+3=? on the overthinking of long reasoning models

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

  10. [10]

    Jackpot: Optimal budgeted rejection sampling for extreme actor-policy mismatch reinforcement learning.arXiv preprint arXiv:2602.06107, 2026

    Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, and Beidi Chen. Jackpot: Optimal budgeted rejection sampling for extreme actor-policy mismatch reinforcement learning.arXiv preprint arXiv:2602.06107, 2026

  11. [11]

    Multi-head attention: Collaborate instead of concatenate.arXiv preprint arXiv:2006.16362, 2020

    Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. Multi-head attention: Collaborate instead of concatenate.arXiv preprint arXiv:2006.16362, 2020

  12. [12]

    The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

  13. [13]

    QLoRA: Efficient finetuning of quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=OUIFPHEgJU

  14. [14]

    Marlin: Mixed- precision auto-regressive parallel inference on large language models

    Elias Frantar, Roberto L Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. Marlin: Mixed- precision auto-regressive parallel inference on large language models. InProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 239–251, 2025

  15. [15]

    AREAL: A large-scale asynchronous reinforcement learning system for language reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, WANG JIASHU, Tongkai Yang, Binhang Yuan, and Yi Wu. AREAL: A large-scale asynchronous reinforcement learning system for language reasoning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URL https: //openreview.net/forum?id...

  16. [16]

    Ai and memory wall.IEEE Micro, 44(3):33–39, 2024

    Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W Mahoney, and Kurt Keutzer. Ai and memory wall.IEEE Micro, 44(3):33–39, 2024

  17. [17]

    The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  18. [18]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  19. [19]

    History rhymes: Accelerating llm reinforcement learning with rhymerl.arXiv preprint arXiv:2508.18588, 2025

    Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl.arXiv preprint arXiv:2508.18588, 2025. 11

  20. [20]

    Rest: Retrieval-based speculative decoding

    Zhenyu He, Zexuan Zhong, Tianle Cai, Jason Lee, and Di He. Rest: Retrieval-based speculative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1582–1595, 2024

  21. [21]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  22. [22]

    Taming the long-tail: Efficient reasoning rl training with adaptive drafter

    Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, and Song Han. Taming the long-tail: Efficient reasoning rl training with adaptive drafter. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1933–1948, 2026

  23. [23]

    Ash, and Akshay Krishnamurthy

    Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, and Akshay Krishnamurthy. Self-improvement in language models: The sharp- ening mechanism. InThe Thirteenth International Conference on Learning Representations,

  24. [24]

    URLhttps://openreview.net/forum?id=WJaUkwci9o

  25. [25]

    Accelerating rl post-training rollouts via system-integrated speculative decoding.arXiv preprint arXiv:2604.26779, 2026

    Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, et al. Accelerating rl post-training rollouts via system-integrated speculative decoding.arXiv preprint arXiv:2604.26779, 2026

  26. [26]

    Revisiting entropy in reinforcement learning for large reasoning models.arXiv preprint arXiv:2511.05993, 2025

    Renren Jin, Pengzhi Gao, Yuqi Ren, Zhuowen Han, Tongxuan Zhang, Wuwei Huang, Wei Liu, Jian Luan, and Deyi Xiong. Revisiting entropy in reinforcement learning for large reasoning models.arXiv preprint arXiv:2511.05993, 2025

  27. [27]

    Beyond next-token prediction: A perfor- mance characterization of diffusion versus autoregressive language models.arXiv preprint arXiv:2510.04146, 2025

    Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. Beyond next-token prediction: A perfor- mance characterization of diffusion versus autoregressive language models.arXiv preprint arXiv:2510.04146, 2025

  28. [28]

    Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025

    Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Fahad Shahbaz Khan, and Salman Khan. Llm post-training: A deep dive into reasoning large language models.arXiv preprint arXiv:2502.21321, 2025

  29. [29]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  30. [30]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  31. [31]

    QuRL: Low-precision reinforcement learning for efficient reasoning

    Yuhang Li, Reena Elangovan, Xin Dong, Priyadarshini Panda, and Brucek Khailany. QuRL: Low-precision reinforcement learning for efficient reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=eG0bpCwdKn

  32. [32]

    EAGLE: Speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=1NdN7eXyb4

  33. [33]

    EAGLE-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=4exx1hUffq. 12

  34. [34]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

  35. [35]

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  36. [36]

    Spec-rl: Accelerating on-policy reinforcement learning via speculative rollouts

    Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, and Jinsong Su. Spec-rl: Accelerating on-policy reinforcement learning via speculative rollouts. arXiv preprint arXiv:2509.23232, 2025

  37. [37]

    Speculative decoding: Performance or illusion? InNinth Conference on Machine Learning and Systems, 2026

    Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, and Alvin Cheung. Speculative decoding: Performance or illusion? InNinth Conference on Machine Learning and Systems, 2026. URL https://openreview.net/forum?id=fzkqtezFEi

  38. [38]

    Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  39. [39]

    Qwen2.5-7B-Eagle-RL

    MIT HAN Lab. Qwen2.5-7B-Eagle-RL. https://huggingface.co/mit-han-lab/Qwen2. 5-7B-Eagle-RL, 2025. Hugging Face model. Accessed: 2026-06-08

  40. [40]

    OpenThoughts2-1M

    open-thoughts. OpenThoughts2-1M. https://huggingface.co/datasets/ open-thoughts/OpenThoughts2-1M, 2025. Hugging Face dataset. Accessed: 2026- 06-08

  41. [41]

    Lossless acceleration of large language model via adaptive n-gram parallel decoding

    Jie Ou, Yueming Chen, et al. Lossless acceleration of large language model via adaptive n-gram parallel decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 10–22, 2024

  42. [42]

    Llama-3.1-8B-Instruct-speculator.eagle3

    RedHatAI. Llama-3.1-8B-Instruct-speculator.eagle3. https://huggingface.co/ RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3 , 2025. Hugging Face model. Accessed: 2026-05-04

  43. [43]

    Qwen3-8B-speculator.eagle3

    RedHatAI. Qwen3-8B-speculator.eagle3. https://huggingface.co/RedHatAI/ Qwen3-8B-speculator.eagle3, 2025. Hugging Face model. Accessed: 2026-06-06

  44. [44]

    Qwen3-8B-Thinking-speculator.eagle3

    RedHatAI. Qwen3-8B-Thinking-speculator.eagle3. https://huggingface.co/RedHatAI/ Qwen3-8B-Thinking-speculator.eagle3 , 2026. Hugging Face model. Accessed: 2026- 06-06

  45. [45]

    Magicdec: Breaking the latency- throughput tradeoff for long context generation with speculative decoding

    Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, and Beidi Chen. Magicdec: Breaking the latency- throughput tradeoff for long context generation with speculative decoding. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/f...

  46. [46]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  47. [47]

    Beat the long tail: Distribution-aware speculative decoding for RL training

    Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, and Junxiong Wang. Beat the long tail: Distribution-aware speculative decoding for RL training. In Ninth Conference on Machine Learning and Systems, 2026. URL https://...

  48. [48]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  49. [49]

    Fast transformer decoding: One write-head is all you need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  50. [50]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  51. [51]

    Megatron-lm: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  52. [52]

    Knn-ssd: Enabling dynamic self-speculative decoding via nearest neighbor layer set optimization

    Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, and Sujian Li. Knn-ssd: Enabling dynamic self-speculative decoding via nearest neighbor layer set optimization. In Findings of the Association for Computational Linguistics: EACL 2026, pages 641–655, 2026

  53. [53]

    The n-grammys: Accelerating autoregressive inference with learning-free batched speculation

    Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, and Stefano Soatto. The n-grammys: Accelerating autoregressive inference with learning-free batched speculation. arXiv preprint arXiv:2411.03786, 2024

  54. [54]

    QUEST: Query-aware sparsity for efficient long-context LLM inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=KzACYw0MTV

  55. [55]

    Kimi k2: Open agentic intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  56. [56]

    Quantspec: Self-speculative decoding with hierarchical quantized KV cache

    Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Richard Charles Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. Quantspec: Self-speculative decoding with hierarchical quantized KV cache. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=7SHbJENgHX

  57. [57]

    Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum...

  58. [58]

    Roofline: an insightful visual performance model for multicore architectures

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009

  59. [59]

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. Findings of the Association for Computational Linguistics: ACL 2024, pages 7655–7671, 2024

  60. [60]

    SWIFT: On-the-fly self-speculative decoding for LLM inference acceleration

    Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. SWIFT: On-the-fly self-speculative decoding for LLM inference acceleration. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=EKJhH5D5wA

  61. [61]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  62. [62]

    Llm probability concentration: How alignment shrinks the generative horizon

    Chenghao Yang, Sida Li, and Ari Holtzman. Llm probability concentration: How alignment shrinks the generative horizon. arXiv preprint arXiv:2506.17871, 2025

  63. [63]

    Qwen2.5 technical report

    Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin,...

  64. [64]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  65. [65]

    Llm inference unveiled: Survey and roofline model insights

    Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363, 2024

  66. [66]

    Specattn: Co-designing sparse attention with self-speculative decoding

    Yikang Yue, Yuqi Xue, and Jian Huang. Specattn: Co-designing sparse attention with self-speculative decoding. arXiv preprint arXiv:2602.07223, 2026

  67. [67]

    EAGLE3-LLaMA3.1-Instruct-8B

    Yuhui Li. EAGLE3-LLaMA3.1-Instruct-8B. https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B, 2024. Hugging Face model. Accessed: 2026-06-06

  68. [68]

    Glm-4.5: Agentic, reasoning, and coding (arc) foundation models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

  69. [69]

    SimpleRL-zoo: Investigating and taming zero reinforcement learning for open base models in the wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun MA, and Junxian He. SimpleRL-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=vSMCBUgrQj

  70. [70]

    Draft & verify: Lossless large language model acceleration via self-speculative decoding

    Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11263–11282, 2024

  71. [71]

    Sortedrl: Accelerating rl training for llms through online length-aware scheduling

    Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, and Yang You. Sortedrl: Accelerating rl training for llms through online length-aware scheduling. arXiv preprint arXiv:2603.23414, 2026

  72. [72]

    FastGRPO: Accelerating policy optimization via concurrency-aware speculative decoding and online draft learning

    Yizhou Zhang, Ning Lv, Teng Wang, and Jisheng Dang. FastGRPO: Accelerating policy optimization via concurrency-aware speculative decoding and online draft learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=zuGt6TYYtS

  73. [73]

    Echo chamber: RL post-training amplifies behaviors learned in pretraining

    Rosie Zhao, Alexandru Meterez, Sham M. Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: RL post-training amplifies behaviors learned in pretraining. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=dp4KWuSDzj

  74. [74]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net...

  75. [75]

    Distillspec: Improving speculative decoding via knowledge distillation

    Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=rsY6J3ZaTF

  76. [76]

    April: Active partial rollouts in reinforcement learning to tame long-tail generation

    Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, et al. April: Active partial rollouts in reinforcement learning to tame long-tail generation. arXiv preprint arXiv:2509.18521, 2025

Appendix

A Extended Analysis of Shrinking-Batch Dynamics

B System Rationale for Weight-Quantized S...