Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

Dogyung Yoon; Jaemin Kim; Jiho Shin; Jiwon Seo; Junyeol Lee; Sungkyun Kim

arxiv: 2509.24328 · v2 · submitted 2025-09-29 · 💻 cs.CL

Pith reviewed 2026-05-18 13:13 UTC · model grok-4.3

keywords speculative decoding · information gain · companion model · LLM inference · verification length · throughput optimization · draft model · target model

The pith

A small companion model estimates draft-target alignment to dynamically set verification lengths and cut wasted rejections in speculative decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Speculative Verification (SV), which makes speculative decoding more efficient by adding a small companion model, similar in size to the draft model, that estimates how closely the draft model's token predictions match what the larger target model would produce. SV uses information gain to decide on the fly how many tokens to verify in one parallel step, shortening the sequence when alignment looks poor and extending it when alignment looks strong. The adjustment requires no modifications to the draft or target models. A reader would care because the abstract reports speedups over standard speculative decoding of up to 2×, averaging 1.4× at batch sizes 32-80, obtained by reducing the fraction of computation spent on tokens that get rejected.

Core claim

Speculative Verification augments speculative decoding with a companion model of similar size to the draft model. The companion quantifies the distributional alignment between draft and target outputs, and SV maximizes the information gain from that estimate to select an adaptive verification length. The aim is to minimize the expected number of rejected tokens per forward pass of the target model, yielding higher net throughput than fixed-length speculative decoding or direct target-model decoding.
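To make the idea of an adaptive verification length concrete, here is a minimal sketch. It is not the paper's algorithm: the per-position acceptance probabilities, the linear cost model, and the function names `expected_tokens` and `choose_verification_length` are all assumptions introduced for illustration.

```python
from itertools import accumulate
from operator import mul


def expected_tokens(accept_probs, k):
    """Expected tokens emitted when the first k draft tokens are verified.

    A draft token at position i only counts if every earlier token was
    accepted, so its survival probability is a running product. The +1 is
    the token the target model contributes on every pass.
    """
    survival = list(accumulate(accept_probs[:k], mul))
    return 1.0 + sum(survival)


def choose_verification_length(accept_probs, base_cost, per_token_cost):
    """Pick the length k that maximizes expected tokens per unit time.

    Assumed (hypothetical) cost model: one target pass over k draft tokens
    costs base_cost + per_token_cost * k.
    """
    best_k, best_rate = 0, 1.0 / base_cost  # k = 0 is plain target decoding
    for k in range(1, len(accept_probs) + 1):
        rate = expected_tokens(accept_probs, k) / (base_cost + per_token_cost * k)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k, best_rate


# Illustrative inputs: alignment looks strong for two tokens, then drops.
k, rate = choose_verification_length([0.9, 0.8, 0.3, 0.2], base_cost=1.0, per_token_cost=0.2)
print(k, round(rate, 3))
```

With these illustrative inputs the sketch selects a length of 2: verifying the third and fourth tokens adds more cost than expected output. The per-token cost term is what makes the choice matter more at large batch sizes, where extra verified positions are no longer nearly free.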

What carries the argument

The companion model, an auxiliary network sized like the draft model, that scores the expected information gain from verifying a candidate sequence under the observed draft-target alignment.
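This page does not give the paper's exact formula. As a point of reference, the textbook definition of information gain is the reduction in the entropy of an outcome after observing a signal. The sketch below applies that definition to a hypothetical discrete companion signal (for example, whether the companion agrees with the draft token) and to made-up acceptance counts; the function names and the table layout are assumptions.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """Entropy reduction H(accept) - H(accept | signal).

    table maps each signal value to (accepted_count, rejected_count).
    """
    accepted = sum(a for a, _ in table.values())
    rejected = sum(r for _, r in table.values())
    total = accepted + rejected
    prior = entropy([accepted, rejected])
    conditional = sum(
        (a + r) / total * entropy([a, r])
        for a, r in table.values()
        if a + r > 0
    )
    return prior - conditional


# Hypothetical counts: acceptance is much more likely when the companion agrees.
counts = {"agree": (80, 20), "disagree": (30, 70)}
print(round(information_gain(counts), 3))
```

A signal that perfectly separates accepted from rejected tokens yields the full prior entropy, while a signal unrelated to acceptance yields zero, which is the failure case described under "What would settle it".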

If this is right

Verification length becomes a runtime decision instead of a fixed hyper-parameter.
The same companion can be reused across different draft-target pairs without retraining either model.
Speedup remains positive at batch sizes from 4 up to 80 on models ranging from 13B to 72B parameters.
The method is orthogonal to existing speculative-decoding variants and can be stacked on top of them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The information-gain signal could be used to decide when to switch entirely to the target model for a few steps instead of always using the draft.
If the companion is kept extremely small, the same idea might apply to edge-device inference where memory is the primary constraint.
The approach suggests that lightweight alignment estimators could replace some of the trial-and-error tuning currently done for speculative decoding hyperparameters.

Load-bearing premise

The companion model's estimate of draft-target alignment is accurate enough that choosing longer or shorter verification runs actually increases overall tokens per second rather than being erased by the companion's own inference cost.
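The premise can be phrased as a break-even condition. The following sketch assumes, pessimistically, that the companion runs serially on every step; the Figure 3 caption mentions overlapping drafting with verification, which may hide part of that cost. All names and numbers here are hypothetical.

```python
def max_companion_latency(tokens_sd, step_time_sd, tokens_sv, step_time_sv):
    """Largest serial companion latency per step at which SV still ties SD.

    tokens_*    : average tokens emitted per decoding step
    step_time_* : draft + verify time per step, excluding the companion
    Derived from tokens_sv / (step_time_sv + c) >= tokens_sd / step_time_sd.
    """
    return tokens_sv * step_time_sd / tokens_sd - step_time_sv


def net_speedup(tokens_sd, step_time_sd, tokens_sv, step_time_sv, companion_time):
    """Throughput ratio SV / SD once the companion cost is charged to SV."""
    sd_rate = tokens_sd / step_time_sd
    sv_rate = tokens_sv / (step_time_sv + companion_time)
    return sv_rate / sd_rate


# Illustrative values in milliseconds, not measurements from the paper.
budget = max_companion_latency(2.5, 50.0, 2.4, 36.0)
print(f"companion budget per step: {budget:.1f} ms")
print(f"speedup at 5 ms companion cost: {net_speedup(2.5, 50.0, 2.4, 36.0, 5.0):.2f}x")
```

In this toy setting SV emits slightly fewer tokens per step but spends less time verifying, so it wins only while the companion costs less than the computed budget.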

What would settle it

A controlled run in which the companion model's information-gain score shows no statistical correlation with the actual token-acceptance rate observed when the target model verifies the draft sequence.
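One way such a check could be run is a correlation estimate with a permutation test. The sketch below uses only the standard library; the function names are hypothetical, the data are made up, and Pearson correlation is an assumed choice (a rank correlation would be equally reasonable).

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Two-sided p-value: how often shuffled data correlates at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Made-up per-step data: companion score vs. observed acceptance rate.
scores = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
accept = [0.20, 0.35, 0.30, 0.60, 0.75, 0.80]
print(round(pearson(scores, accept), 3), round(permutation_p_value(scores, accept), 4))
```

A correlation near zero with a large p-value on real traces would be the negative result described above.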

Figures

Figures reproduced from arXiv: 2509.24328 by Dogyung Yoon, Jaemin Kim, Jiho Shin, Jiwon Seo, Junyeol Lee, Sungkyun Kim.

**Figure 2.** Example: selecting verification length in Speculative Verification.

**Figure 3.** Runtime optimizations: Data-Parallel drafting and overlapping drafting with verification.

**Figure 4.** Overall performance: three target models (displayed at top) and six tasks (above each plot).

**Figure 5.** Performance breakdown: Prefill vs. Decoding (left) and Runtime Optimizations (right).

**Figure 6.** Effectiveness of SV: (a) Impact of draft and companion model selection; (b) sensitivity to

Original abstract

LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target model. However, when speculation accuracy is low, the overhead from rejected tokens can offset the benefits, limiting SD's effectiveness, especially at large batch sizes. To address this, we propose Speculative Verification (SV), an efficient augmentation to SD that dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. SV introduces a companion model (a small auxiliary model similar in size to the draft model) to estimate the alignment between draft and target model distributions. By maximizing the information gain from quantifying this alignment, SV refines verification decisions, reducing wasted computation on rejected tokens and improving decoding efficiency. Moreover, SV requires no modifications to the draft or target models and is compatible with existing SD variants. We extensively evaluated SV on publicly available LLMs across three NLP tasks using nine combinations of draft, companion, and target models, including 13B-72B target models and three types of variations: base (no finetuning), instruction-tuned, and task fine-tuned. Across all experiments and batch sizes (4-80), SV consistently outperforms both SD and standard decoding with the target model. It improves SD performance by up to 2×, with an average speedup of 1.4× in large-batch settings (batch sizes 32-80). These results demonstrate SV's robustness, scalability, and practical utility for efficient LLM inference.
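The trade-off the abstract describes can be made concrete with the standard fixed-length analysis from the original speculative decoding work (Leviathan et al., 2023). As a simplifying assumption that is not stated on this page, suppose each draft token is accepted independently with probability $\alpha$ and $\gamma$ tokens are drafted per step. The expected number of tokens emitted per target-model pass is then a geometric sum:

```latex
$$
\mathbb{E}[\text{tokens per pass}]
  = \sum_{i=0}^{\gamma} \alpha^{i}
  = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
$$
```

With illustrative values $\gamma = 4$: $\alpha = 0.8$ gives about 3.36 tokens per pass, while $\alpha = 0.5$ gives about 1.94. In the second case the target model scores four draft tokens yet emits fewer than two tokens on average, which is the wasted verification work that an adaptive length is meant to avoid.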

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Speculative Verification (SV) as an augmentation to speculative decoding (SD) for LLMs. SV introduces a companion model (similar in size to the draft model) that estimates alignment between draft and target distributions by maximizing information gain; this estimate is used to dynamically select verification length, reducing overhead from rejected tokens. The authors report that SV outperforms both SD and standard target-model decoding across nine draft/companion/target combinations, three NLP tasks, and batch sizes 4–80, with up to 2× improvement over SD and an average 1.4× speedup in large-batch regimes (32–80), while requiring no changes to the draft or target models.

Significance. If the reported net throughput gains survive a full accounting of companion-model latency, SV would constitute a practical, model-agnostic refinement to speculative decoding that is especially valuable at high batch sizes where conventional SD degrades. The compatibility claim and the breadth of the evaluation (base, instruction-tuned, and task-fine-tuned models up to 72B) would further increase its utility for production inference pipelines.

major comments (2)

[Abstract / Experimental Results] The headline speedups (up to 2×, average 1.4× over SD) are presented without error bars, statistical tests, or an ablation that disables the information-gain scheduler while retaining the companion model. Because the companion must execute on every (or every few) steps before parallel verification, any unisolated serial cost directly affects the central net-throughput claim.
[Method] The precise computation of information gain from the companion’s alignment estimate, and the rule that maps this quantity to a chosen verification length, are described only at a high level. Without an explicit algorithm or timing breakdown, it is impossible to verify that the companion’s forward-pass cost is smaller than the compute saved by shorter rejected sequences.
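The error bars requested in the first major comment could be produced in several ways. One lightweight option, sketched here under assumptions, is a percentile bootstrap over paired runs in which SV and SD are measured on the same prompts. The function name `bootstrap_speedup_ci` and the throughput values are hypothetical and are not data from the paper.

```python
import random
from statistics import mean


def bootstrap_speedup_ci(sv_tps, sd_tps, resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-run speedup SV / SD.

    sv_tps and sd_tps are paired throughput measurements (tokens per second)
    taken under the same prompts and batch size.
    """
    if len(sv_tps) != len(sd_tps):
        raise ValueError("runs must be paired")
    ratios = [sv / sd for sv, sd in zip(sv_tps, sd_tps)]
    rng = random.Random(seed)
    n = len(ratios)
    # Resample runs with replacement and record the mean speedup each time.
    means = sorted(mean(rng.choices(ratios, k=n)) for _ in range(resamples))
    lo = means[int((alpha / 2) * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return mean(ratios), lo, hi


# Made-up paired runs for illustration only.
point, lo, hi = bootstrap_speedup_ci([140.0, 150.0, 145.0, 155.0],
                                     [100.0, 105.0, 102.0, 108.0])
print(f"speedup {point:.2f}x, 95% interval [{lo:.2f}, {hi:.2f}]")
```

Running the same procedure on an ablation that keeps the companion but fixes the verification length would show how much of the gain comes from the scheduler itself.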

minor comments (2)

[Figures] Figure captions and axis labels should explicitly state whether reported times include or exclude the companion model’s forward passes.
[Implementation Details] The manuscript should clarify whether the companion model is frozen or fine-tuned and whether its parameters are counted in the overall memory footprint.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of presentation and methodological clarity that we will address in the revision. Below we respond point by point to the major comments.

Point-by-point responses

Referee: [Abstract / Experimental Results] The headline speedups (up to 2×, average 1.4× over SD) are presented without error bars, statistical tests, or an ablation that disables the information-gain scheduler while retaining the companion model. Because the companion must execute on every (or every few) steps before parallel verification, any unisolated serial cost directly affects the central net-throughput claim.

Authors: We agree that the current presentation would benefit from greater statistical rigor and isolation of components. In the revised manuscript we will report all throughput numbers with error bars computed over multiple independent runs (different random seeds and prompt sets) and will include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing SV against SD. We will also add an explicit ablation that retains the companion model but disables the information-gain scheduler, replacing it with either a fixed verification length or a random baseline. This ablation will be presented alongside the main results so that readers can directly observe the incremental benefit of the dynamic scheduler after accounting for companion-model latency. The abstract will be updated to reflect these additional controls.

Revision: yes

Referee: [Method] The precise computation of information gain from the companion’s alignment estimate, and the rule that maps this quantity to a chosen verification length, are described only at a high level. Without an explicit algorithm or timing breakdown, it is impossible to verify that the companion’s forward-pass cost is smaller than the compute saved by shorter rejected sequences.

Authors: We acknowledge that the current description is insufficient for independent verification. In the revised Method section we will provide the exact mathematical definition of information gain used (KL divergence between the companion’s predicted alignment distribution and the empirical draft-target agreement), the closed-form expression for the alignment estimate, and the deterministic mapping function from information-gain value to verification length (including any thresholds or interpolation). We will also insert pseudocode for the full SV step. In the Experiments section we will add a dedicated timing table that reports (a) average companion forward-pass latency, (b) reduction in rejected tokens per step, and (c) net wall-clock time per token, thereby demonstrating that the overhead is more than offset by the savings. These additions will allow readers to confirm the net-throughput claim.

Revision: yes
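The simulated rebuttal mentions a KL divergence between a predicted alignment distribution and empirical agreement, without giving the formula. One possible reading, assuming both are simple accept/reject (Bernoulli) distributions, is sketched below. The function name and the clipping constant are assumptions, and this should not be taken as the paper's definition.

```python
from math import log


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats between two Bernoulli distributions.

    p : empirical probability that the target accepts a draft token
    q : acceptance probability predicted by the companion (hypothetical)
    Both are clipped away from 0 and 1 to keep the logarithms finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * log(p / q) + (1.0 - p) * log((1.0 - p) / (1.0 - q))


# Illustrative values: a well-calibrated and a poorly calibrated prediction.
print(round(bernoulli_kl(0.9, 0.85), 4))
print(round(bernoulli_kl(0.9, 0.5), 4))
```

The divergence is zero when the predicted and observed acceptance rates coincide and grows as they diverge, so smaller values would indicate a companion whose alignment estimate can be trusted when choosing a verification length.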

Circularity Check

0 steps flagged

No circularity: empirical method with independent measurements

full rationale

The paper introduces Speculative Verification as a practical augmentation to speculative decoding via an external companion model that estimates draft-target alignment and selects verification length by information-gain maximization. No equations, derivations, or self-citations reduce the reported speedups (up to 2×, average 1.4×) to fitted parameters or tautological inputs by construction. The central claims rest on wall-clock throughput measurements across nine model combinations, three tasks, and batch sizes 4-80; these are externally falsifiable empirical outcomes rather than self-referential definitions. The companion overhead is a separate engineering assumption whose net effect is directly measured in the experiments, not presupposed by the method's formulation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The approach introduces one new component (the companion model) whose effectiveness is not derived from prior literature but asserted through empirical results; no explicit free parameters or background axioms are stated in the abstract.

invented entities (1)

companion model · no independent evidence
purpose: estimate alignment between draft and target model distributions to maximize information gain for verification decisions
Described as a small auxiliary model, similar in size to the draft model, that is added without modifying the original models.

pith-pipeline@v0.9.0 · 5835 in / 1295 out tokens · 36395 ms · 2026-05-18T13:13:14.484707+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
cs.DC · 2026-02 · unverdicted · novelty 5.0

StreamServe achieves 11-18x lower latency than standard vLLM setups for LLM serving by combining disaggregated prefill-decode execution with metric-aware routing and runtime-adaptive speculative decoding.


[1] [1]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023

[2] [2]

Optimizing speculative decoding for serving large language models using goodput.arXiv preprint arXiv:2406.14066, 2024

Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. Optimizing speculative decoding for serving large language models using goodput.arXiv preprint arXiv:2406.14066, 2024

work page arXiv 2024

[3] [3]

Distillspec: Improving speculative decoding via knowledge distillation

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[4] [4]

Learning harmonized representations for speculative sampling

Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. Learning harmonized representations for speculative sampling. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[5] [5]

EAGLE: Speculative sampling requires rethinking feature uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. InForty-first International Conference on Machine Learning, 2024

work page 2024

[6] [6]

Eagle-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024

work page arXiv 2024

[7] [7]

Layerskip: Enabling early exit inference and self-speculative decoding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu. Layerskip: Enabling early exit inference and self-speculative decoding. InACL (1), pages 12622–12642, 2024

work page 2024

[8] [8]

Kangaroo: Lossless self-speculative decoding for accelerating LLMs via double early exiting

Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. Kangaroo: Lossless self-speculative decoding for accelerating LLMs via double early exiting. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[9] [9]

Lee, Deming Chen, and Tri Dao

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. InForty-first International Conference on Machine Learning, 2024

work page 2024

[10] [10]

Accelerating LLM inference with staged speculative decoding

Benjamin Frederick Spector and Christopher Re. Accelerating LLM inference with staged speculative decoding. InWorkshop on Efficient Systems for Foundation Models @ ICML2023, 2023

work page 2023

[11] [11]

Hierarchical speculative decoding with dynamic window

Shensian Syu and Hung-yi Lee. Hierarchical speculative decoding with dynamic window. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 8260–8273, 2025

work page 2025

[12] [12]

Pearl: Parallel speculative decoding with adaptive draft length

Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. Pearl: Parallel speculative decoding with adaptive draft length. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[13] [13]

Spin: Accelerating large language model inference with heterogeneous speculative models.arXiv preprint arXiv:2503.15921, 2025

Fahao Chen, Peng Li, Tom H Luan, Zhou Su, and Jing Deng. Spin: Accelerating large language model inference with heterogeneous speculative models.arXiv preprint arXiv:2503.15921, 2025

work page arXiv 2025

[14] [14]

What Does BERT Look At? An Analysis of BERT's Attention

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert’s attention.arXiv preprint arXiv:1906.04341, 2019. 10

work page internal anchor Pith review Pith/arXiv arXiv 1906

[15] [15]

Differentiation and specialization of attention heads via the refined local learning coefficient.arXiv preprint arXiv:2410.02984,

George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, and Daniel Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient.arXiv preprint arXiv:2410.02984, 2024

work page arXiv 2024

[16] [16]

Unveiling simplicities of attention: Adaptive long-context head identification.arXiv preprint arXiv:2502.09647, 2025

Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, and Kartik Ahuja. Unveiling simplicities of attention: Adaptive long-context head identification.arXiv preprint arXiv:2502.09647, 2025

work page arXiv 2025

[17] [17]

Flashinfer: Efficient and customizable attention engine for LLM inference serving

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for LLM inference serving. InEighth Conference on Machine Learning and Systems, 2025

work page 2025

[18] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[19] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[20] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[21] Mohamed Rashad. ChatGPT prompts dataset. https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts, 2023.

[22] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[23] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lang...

[24] anon8231489123. ShareGPT Vicuna unfiltered dataset. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered, 2023.

[25] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.

[26] Stefano Panzeri and Alessandro Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7(1):87, 1996.

[27] J Ross Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77–90, 1996.

[28] Tom M Mitchell. Machine learning, volume 1. McGraw-Hill New York, 1997.

[29] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

[30] Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710, 2024.

[31] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

[32] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.

[33] AMD. AMD-Llama-135m: A 135M Parameter Language Model. https://huggingface.co/amd/AMD-Llama-135m, 2024.

[34] Unsloth. Qwen2.5-0.5b. https://huggingface.co/unsloth/Qwen2.5-0.5B, 2024.

[35] Meta AI (Facebook). Layerskip llama 2 70b. https://huggingface.co/facebook/layerskip-llama2-70B, 2024.

[36] Meta AI (Facebook). Layerskip codellama 34b. https://huggingface.co/facebook/layerskip-codellama-34B, 2024.

[37] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

A Additional Experimental Data

The detailed experimental results for Section 7.5 are shown in Table 3.

Table 3: Number of Step...

[38]

benchmark, which includes Python, C++, Java, JavaScript, Rust, and Go. We constructed a balanced subset by randomly sampling tasks across all languages to ensure comprehensive coverage of different programming paradigms and syntactic structures. We also incorporated MBPP [37], which consists of approximately 1,000 crowd-sourced Python programming problem...