Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

arxiv: 2605.14978 · v2 · pith:YREQXWZRnew · submitted 2026-05-14 · 💻 cs.CL

Jie Jiang, Xing Sun, Ruotian Chen, Jianan Su, Kaixin Shen

Pith reviewed 2026-05-19 16:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords speculative decoding · reinforcement learning · LLM inference · policy optimization · adaptive windowing · draft model · acceptance length · speedup

The pith

Window-level reinforcement learning optimizes drafters for speculative decoding, reaching acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36×.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PPOW, a reinforcement learning approach that trains the draft model in speculative decoding by optimizing entire candidate windows instead of single tokens. It defines a cost-aware speedup reward and a distribution-based proximity reward, then applies adaptive divergence-aware windowing to focus training on positions where the draft and target models disagree most. Experiments across model families and benchmarks show these changes produce longer accepted sequences and faster overall inference under a standard protocol. A reader would care because current speculative methods are limited by early mismatches in the window, and addressing that at the right granularity could make large-model serving more efficient without new hardware.
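The early-mismatch bottleneck mentioned above can be made concrete with a minimal sketch. It assumes greedy, exact-match verification (sampling-based speculative decoding instead accepts or rejects each token probabilistically), and the function name and token ids are hypothetical, not taken from the paper.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target model's greedy choices."""
    k = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch discards every later draft token
        k += 1
    return k


# Hypothetical token ids: the same single error placed early vs. late.
target_tokens = [11, 42, 7, 99, 5, 63]
early_miss = [11, 0, 7, 99, 5, 63]  # wrong at position 1
late_miss = [11, 42, 7, 99, 5, 0]   # wrong at position 5

print(accepted_prefix_length(early_miss, target_tokens))  # 1
print(accepted_prefix_length(late_miss, target_tokens))   # 5
```

Both drafts contain five correct tokens out of six, yet the early error keeps only one of them. A token-level accuracy objective scores the two drafts identically, which is the gap a window-level objective is meant to close.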

Core claim

PPOW shifts drafter optimization from token-level imitation to window-level optimization by combining a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence and achieves average acceptance lengths of 6.29-6.52 with speedups of 3.39-4.36× across multiple model families and benchmarks under a unified decoding protocol.
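The page does not give the formula for the Distribution-Based Proximity Reward named above. As one possible reading, the sketch below scores a window by how close the draft distributions stay to the target distributions, so a window can earn partial credit even when no token is accepted. The choice of `exp(-mean KL)` and the names `kl_divergence` and `proximity_reward` are assumptions made for illustration, not the paper's definition.

```python
import math
from typing import Sequence

Dist = Sequence[float]


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def proximity_reward(target_dists: Sequence[Dist], draft_dists: Sequence[Dist]) -> float:
    """Assumed form: exp(-mean per-position KL), which lies in (0, 1]."""
    if not target_dists or len(target_dists) != len(draft_dists):
        raise ValueError("need two non-empty windows of equal length")
    kls = [kl_divergence(p, q) for p, q in zip(target_dists, draft_dists)]
    return math.exp(-sum(kls) / len(kls))


# Made-up one-position windows over a two-token vocabulary.
target = [[0.1, 0.9]]
near_draft = [[0.2, 0.8]]  # slightly off
far_draft = [[0.9, 0.1]]   # confidently wrong

print(proximity_reward(target, near_draft) > proximity_reward(target, far_draft))  # True
```

Any monotone decreasing function of the divergence would serve the same illustrative purpose; the exponential is used here only because it keeps the score bounded.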

What carries the argument

PPOW, a reinforcement learning framework that replaces token-level supervised objectives with window-level rewards and adaptive selection of high-divergence windows.
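The selection of high-divergence windows can be sketched as a ranking step. The page says only that windows are prioritized by confidence-weighted draft-target divergence; the specific score used here (the drafter's top probability times KL from target to draft, averaged over positions) and the names `window_score` and `select_windows` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Dist = Sequence[float]
Window = Tuple[Sequence[Dist], Sequence[Dist]]  # (target dists, draft dists)


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def window_score(target_dists: Sequence[Dist], draft_dists: Sequence[Dist]) -> float:
    """Assumed score: mean over positions of draft confidence x KL(target || draft)."""
    scores = [max(q) * kl_divergence(p, q) for p, q in zip(target_dists, draft_dists)]
    return sum(scores) / len(scores)


def select_windows(windows: List[Window], top_n: int) -> List[int]:
    """Return the indices of the top_n highest-scoring windows."""
    ranked = sorted(range(len(windows)), key=lambda i: window_score(*windows[i]), reverse=True)
    return ranked[:top_n]


# Toy two-token vocabulary; all probabilities are made up for illustration.
agree = ([[0.9, 0.1]], [[0.9, 0.1]])            # drafter matches the target
confident_wrong = ([[0.1, 0.9]], [[0.9, 0.1]])  # drafter is confident but wrong
unsure = ([[0.1, 0.9]], [[0.5, 0.5]])           # drafter is uncertain

print(select_windows([agree, confident_wrong, unsure], top_n=2))  # [1, 2]
```

Under this assumed score, a confidently wrong drafter outranks an uncertain one, and a window where the two models already agree contributes nothing to the training batch.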

If this is right

Acceptance lengths rise to the 6.29-6.52 range on average.
Inference speedups reach 3.39-4.36× across model families under one decoding protocol.
Window-level policy optimization outperforms token-level imitation for speculative decoding.
Adaptive window selection focused on high-divergence positions improves training signal quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward and windowing design could be applied to other parallel decoding schemes that generate candidate sequences.
If the learned policy transfers across tasks, smaller draft models might achieve comparable speedups without retraining from scratch.
In production serving, the method would most help latency on inputs that contain rare or ambiguous tokens where early mismatches are common.

Load-bearing premise

The cost-aware speedup reward, distribution-based proximity reward, and adaptive divergence-aware windowing together produce training signals that reduce real end-to-end latency rather than only raising acceptance length on the tested model pairs and benchmarks.

What would settle it

Measure end-to-end latency on a held-out model family or benchmark after training with PPOW; if acceptance length rises but wall-clock latency stays the same or worsens, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.14978 by Jianan Su, Jie Jiang, Kaixin Shen, Ruotian Chen, Xing Sun.

**Figure 1.** PPOW uses a Cost-Aware Speedup Reward together with a Distribution-Based Proximity Reward. (a) The Cost-Aware Speedup Reward increases with accepted prefix length and directly encourages speculative decoding efficiency. (b) When verification is truncated early, resulting in k = 0, the Distribution-Based Proximity Reward still provides auxiliary credit if the speculative window remains close to the target-p…

**Figure 2.** Overview of PPOW. PPOW performs policy optimization at the window level for speculative decoding. Left: Adaptive windowing uses confidence-weighted draft–target divergence scores to prioritize informative training windows. Right: The drafter samples a rollout group of speculative windows for policy optimization with performance-driven rewards and KL regularization.

**Figure 3.** PPOW versus continued supervised training under matched training steps. On GSM8K with LLaMA-3.1-8B, the supervised baseline initially improves average acceptance length but later degrades, whereas PPOW continues to improve and achieves a higher final acceptance length. CST denotes continued supervised training from the EAGLE-3 checkpoint. [Plot omitted; recoverable axis label: KL Divergence (D_KL).]

Abstract

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PPOW applies window-level RL to speculative decoding and reports 3.4-4.4× speedups, but the rewards may optimize acceptance length without guaranteeing real end-to-end latency gains.

The main takeaway is that this work replaces token-level supervised training for draft models with a performance-driven RL policy that optimizes entire speculative windows. The three pieces are a cost-aware speedup reward, a distribution proximity reward, and adaptive window selection based on divergence. This setup suits the problem because a mismatch early in a window wastes the rest of the speculation. The experiments cover several model families and show acceptance lengths between 6.29 and 6.52 with speedups of 3.39× to 4.36× under consistent conditions. If the baselines are standard speculative decoding setups, this could be a practical improvement for inference speed.

A real concern is whether the cost-aware reward uses actual timing data or only a simplified cost model. If it approximates verification overhead without capturing effects such as KV-cache updates or batching, the speedups might not hold in real deployments. The abstract does not include the reward equations or mention calibration against wall-clock measurements, which leaves some uncertainty.

This paper is for the community working on efficient LLM generation. Anyone tuning speculative decoders would find the adaptive windowing and reward combination worth examining. The work engages with the practical challenges and reports specific numbers, so it should go through peer review rather than being rejected outright. The authors would benefit from feedback on making the reward definitions and evaluation more transparent.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PPOW, a reinforcement learning framework for optimizing drafter models in speculative decoding. It replaces token-level supervised objectives with window-level optimization via three components: a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and an Adaptive Divergence-Aware Windowing rule that prioritizes high-divergence windows. The authors report average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36× across model families and benchmarks under a unified decoding protocol.

Significance. If the empirical claims hold under rigorous validation, the work offers a practical advance by demonstrating that performance-driven, window-level RL can improve speculative decoding efficiency beyond standard imitation learning. The unified protocol and multi-model evaluation are positive for comparability; however, the significance hinges on whether the proposed rewards translate acceptance-length gains into verified wall-clock speedups rather than proxy improvements.

major comments (2)

[Abstract and §3, reward definitions] The central performance claim (speedups of 3.39-4.36×) rests on the Cost-Aware Speedup Reward producing policies that improve measured end-to-end latency. The abstract and reward description provide no equation or calibration detail showing that the reward directly incorporates variable verification overheads (KV-cache management, early-exit costs, or hardware batching) rather than approximating them via token counts; this leaves open the possibility that reported acceptance lengths do not guarantee the claimed latency gains.
[§4, experimental reporting] The reported acceptance lengths and speedups lack any mention of baseline implementation details, number of runs, or statistical significance testing. Without these, it is impossible to determine whether the 3.39-4.36× speedups are robust or whether the adaptive windowing rule was tuned on the evaluation data, which undermines evaluation of the unified decoding protocol results.

minor comments (2)

[§3] Notation for the three reward components and the windowing rule should be introduced with explicit equations in §3 to improve readability and allow direct comparison to prior speculative decoding work.
[Figures] Figure captions and axis labels for speedup and acceptance-length plots should explicitly state the exact baseline method and hardware setup used for each bar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the changes that will be incorporated in the revised manuscript to improve clarity and empirical rigor.

Referee: [Abstract and §3, reward definitions] The central performance claim (speedups of 3.39-4.36×) rests on the Cost-Aware Speedup Reward producing policies that improve measured end-to-end latency. The abstract and reward description provide no equation or calibration detail showing that the reward directly incorporates variable verification overheads (KV-cache management, early-exit costs, or hardware batching) rather than approximating them via token counts; this leaves open the possibility that reported acceptance lengths do not guarantee the claimed latency gains.

Authors: We thank the referee for this observation. The Cost-Aware Speedup Reward in §3 is formulated to optimize for net latency reduction by subtracting an estimated verification cost (proportional to window size and a profiled per-token overhead) from the speedup gained by accepted tokens. To address the lack of explicit detail, we will insert the precise reward equation and the calibration procedure (including hardware profiling for KV-cache and early-exit effects) into the revised §3. This will make clear that the reward is not a pure token-count proxy but incorporates measured overhead factors.
revision: yes

Referee: [§4, experimental reporting] The reported acceptance lengths and speedups lack any mention of baseline implementation details, number of runs, or statistical significance testing. Without these, it is impossible to determine whether the 3.39-4.36× speedups are robust or whether the adaptive windowing rule was tuned on the evaluation data, which undermines evaluation of the unified decoding protocol results.

Authors: We agree that these details are essential. In the revised §4 we will add: (i) full baseline implementation specifications and hyperparameter settings, (ii) results averaged over five independent runs with different random seeds together with standard deviations, and (iii) statistical significance tests (paired t-tests) comparing PPOW against baselines. We will also state explicitly that adaptive-windowing hyperparameters were selected on a disjoint validation split and never tuned on the reported test benchmarks.
revision: yes
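The paired t-test over five seeds promised in the response above could look like the following sketch. The per-seed numbers are placeholders invented for illustration and are not results from the paper; the function name `paired_t_statistic` is hypothetical, and only the standard library is used.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean(b - a) / (stdev(b - a) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [y - x for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Placeholder per-seed speedups (same seeds, same hardware); NOT from the paper.
baseline = [3.0, 3.1, 2.9, 3.2, 3.0]
candidate = [3.4, 3.6, 3.3, 3.5, 3.4]

t_value = paired_t_statistic(baseline, candidate)
T_CRITICAL_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom

print(round(t_value, 2), t_value > T_CRITICAL_DF4)
```

Pairing by seed matters here: it removes run-to-run variance shared by both methods, so a consistent gain can reach significance even with only five runs.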

Circularity Check

0 steps flagged

No circularity: empirical results presented as independent outcomes of proposed RL components

The abstract and visible description introduce PPOW as a new RL framework combining three explicitly named components (Cost-Aware Speedup Reward, Distribution-Based Proximity Reward, Adaptive Divergence-Aware Windowing) and then report measured acceptance lengths and speedups as experimental results. No equations, fitting procedures, or derivation steps are shown that would reduce the reported speedups to parameters defined inside the method itself. No self-citations, uniqueness theorems, or ansatzes are referenced in the provided text. The central claims therefore remain self-contained and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger cannot enumerate concrete free parameters, axioms, or invented entities. The method description implies that reward weights and the divergence threshold in adaptive windowing are chosen or fitted, but their exact status is not stated.

pith-pipeline@v0.9.0 · 5728 in / 1191 out tokens · 40205 ms · 2026-05-19T16:35:23.262447+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $R_{\text{speedup}} = \frac{k}{k\gamma + 1}$, where $k$ is the accepted prefix length and $\gamma$ is the relative drafter cost.
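The speedup reward quoted above is simple enough to evaluate directly. The sketch below implements that expression as written on this page; the value of `gamma` is a hypothetical relative drafter cost chosen for illustration, not a number from the paper.

```python
def speedup_reward(k: int, gamma: float) -> float:
    """R_speedup = k / (k * gamma + 1), as quoted on this page."""
    if k < 0 or gamma < 0:
        raise ValueError("k and gamma must be non-negative")
    return k / (k * gamma + 1)


gamma = 0.05  # hypothetical relative drafter cost, not a value from the paper
for k in (0, 1, 3, 6):
    print(k, round(speedup_reward(k, gamma), 3))
```

With this form the reward is zero when nothing is accepted, grows with every additional accepted token, and stays below 1/γ, so a more expensive drafter (larger γ) caps the achievable reward sooner.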

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023

[2] [2]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Hydra: Sequentially-dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

work page arXiv 2024

[5] [5]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Eagle-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7421–7432, 2024

work page 2024

[7] [7]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Learning harmonized representations for speculative sampling

Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. Learning harmonized representations for speculative sampling. arXiv preprint arXiv:2408.15766, 2024

[9]

Griffin: Effective token alignment for faster speculative decoding

Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, and Pan Zhou. Griffin: Effective token alignment for faster speculative decoding. arXiv preprint arXiv:2502.11018, 2025

[10]

Distillspec: Improving speculative decoding via knowledge distillation

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023

[11]

Online speculative decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. Online speculative decoding. arXiv preprint arXiv:2310.07177, 2023

[12]

Fastdraft: How to train your draft

Ofir Zafrir, Igor Margulis, Dorin Shteyman, Shira Guskin, and Guy Boudoukh. Fastdraft: How to train your draft. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22488–22505, 2025

[13]

Break the sequential dependency of llm inference using lookahead decoding

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024

[14]

Spectr: Fast speculative decoding via optimal transport

Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36:30222–30242, 2023

[15]

Block verification accelerates speculative decoding

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, and Ananda Theertha Suresh. Block verification accelerates speculative decoding. arXiv preprint arXiv:2403.10444, 2024

[16]

Specinfer: Accelerating large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lang...

[17]

Sequoia: Scalable, robust, and hardware-aware speculative decoding

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374, 2024

[18]

Cascade speculative drafting for even faster llm inference

Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin C Chang, and Jie Huang. Cascade speculative drafting for even faster llm inference. Advances in Neural Information Processing Systems, 37:86226–86242, 2024

[19]

Dynamic speculation lookahead accelerates speculative decoding of large language models

Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, and Roy Schwartz. Dynamic speculation lookahead accelerates speculative decoding of large language models. arXiv preprint arXiv:2405.04304, 2024

[20]

Adaptive draft-verification for efficient large language model decoding

Xukun Liu, Bowen Lei, Ruqi Zhang, and Dongkuan DK Xu. Adaptive draft-verification for efficient large language model decoding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24668–24676, 2025

[21]

Specexec: Massively parallel speculative decoding for interactive llm inference on consumer devices

Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, and Max Ryabinin. Specexec: Massively parallel speculative decoding for interactive llm inference on consumer devices. Advances in Neural Information Processing Systems, 37:16342–16368, 2024

[22]

Pearl: Parallel speculative decoding with adaptive draft length

Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. Pearl: Parallel speculative decoding with adaptive draft length. arXiv preprint arXiv:2408.11850, 2024

[23]

Reward-guided speculative decoding for efficient llm reasoning

Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient llm reasoning. arXiv preprint arXiv:2501.19324, 2025

[24]

Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, and Jinsong Su. Spec-rl: Accelerating on-policy reinforcement learning via speculative rollouts. arXiv preprint arXiv:2509.23232, 2025

[25]

Rlhf-spec: Breaking the efficiency bottleneck in rlhf training via adaptive drafting

Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, and Depei Qian. Rlhf-spec: Breaking the efficiency bottleneck in rlhf training via adaptive drafting. arXiv preprint arXiv:2512.04752, 2025

[26]

Respec: Towards optimizing speculative decoding in reinforcement learning systems

Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, and Tianwei Zhang. Respec: Towards optimizing speculative decoding in reinforcement learning systems. arXiv preprint arXiv:2510.26475, 2025

[27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[28]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[29]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

[30]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

[31]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[32]

Enhancing chat language models by scaling high-quality instructional conversations

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023

[33]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

[34]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[36]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[37]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 1797–1807, 2018

[38]

Findings of the 2014 workshop on statistical machine translation

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pages 12–58, 2014

A Optimization and Implementati...
