LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

arxiv: 2507.01449 · v3 · submitted 2025-07-02 · 💻 cs.CL

Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun

Pith reviewed 2026-05-19 06:49 UTC · model grok-4.3

classification 💻 cs.CL

keywords: speculative decoding, retrieval-based, logit speculation, draft tokens, LLM acceleration, next next token, training-free

The pith

LogitSpec uses the last token logit to speculate the next-next token and retrieves drafts for both next and next-next positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-based speculative decoding avoids draft models by fetching relevant references as candidate tokens. The challenge is that matching only the next token often produces inaccurate drafts that the target model rejects. LogitSpec solves this by using the logit of the last token to also speculate the token after next. It then retrieves references that align with both the next and next-next positions. This results in more accepted tokens and faster overall generation for large language models.

Core claim

LogitSpec generates draft tokens in two steps: (1) using the last logit to speculate the next-next token; (2) retrieving relevant references for both the next token and the next-next token. The paper reports up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.
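The two steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes greedy selection of the next token, uses the top-k entries of the same logit vector as next-next guesses, and retrieves by exact n-gram match against the running context. The names `top_k_ids`, `retrieve`, and `draft_candidates` are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the ids of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def retrieve(context: Sequence[int], key: Tuple[int, ...], span: int) -> List[int]:
    """Return up to `span` tokens following the most recent match of `key`."""
    n = len(key)
    for start in range(len(context) - n, -1, -1):
        if tuple(context[start:start + n]) == key:
            return list(context[start + n:start + n + span])
    return []


def draft_candidates(context: Sequence[int], last_logits: Sequence[float],
                     k: int = 3, span: int = 4) -> List[List[int]]:
    """Collect draft continuations that follow the (assumed greedy) next token."""
    next_token = top_k_ids(last_logits, 1)[0]  # assumption: greedy decoding
    drafts: List[List[int]] = []

    # Baseline: retrieval keyed on the next token only.
    base = retrieve(context, (next_token,), span)
    if base:
        drafts.append(base)

    # Step 1: top-k entries of the last logit act as next-next guesses.
    for guess in top_k_ids(last_logits, k):
        # Step 2: retrieval keyed on (next token, guessed next-next token).
        continuation = retrieve(context, (next_token, guess), span - 1)
        if continuation:
            drafts.append([guess] + continuation)
    return drafts
```

With `context = [5, 7, 9, 2, 5, 1, 3, 8]` and a logit vector whose three largest entries are tokens 5, 7, and 9, the next-token-only key finds the most recent `5` and proposes `[1, 3, 8]`, while the `(5, 7)` key reaches the earlier occurrence and proposes `[7, 9, 2, 5]`. The second key is how the sketch widens the retrieval range.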

What carries the argument

The two-step draft generation process that uses the last logit to speculate the next-next token before retrieving references covering both consecutive positions.

If this is right

LLM text generation runs with higher average token acceptance per step, reaching up to 3.28 mean accepted tokens.
Inference achieves up to 2.61 times speedup without any draft model or training step.
The approach is plug-and-play and can be integrated into existing LLM inference frameworks without retraining.
Retrieval-based methods gain accuracy by expanding the search to cover the token after next.
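The two headline numbers can be related by simple throughput arithmetic. The sketch below assumes that plain autoregressive decoding emits one token per forward pass, and that one speculative step (drafting plus verification) costs `step_cost_ratio` times a plain forward pass. Both function names are hypothetical. The paper's two maxima may come from different models or tasks, so the implied ratio is only indicative.

```python
def speedup_estimate(mean_accepted: float, step_cost_ratio: float) -> float:
    """Throughput gain over one-token-per-step decoding.

    mean_accepted:   tokens emitted per speculative step (assumed to include
                     the token produced by the verification pass itself).
    step_cost_ratio: cost of one speculative step divided by the cost of one
                     plain decoding step (an assumed, measured quantity).
    """
    if step_cost_ratio <= 0:
        raise ValueError("step_cost_ratio must be positive")
    return mean_accepted / step_cost_ratio


def implied_step_cost_ratio(mean_accepted: float, speedup: float) -> float:
    """Invert the relation above to get the per-step overhead it implies."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return mean_accepted / speedup


# Using the reported maxima purely as an illustration:
ratio = implied_step_cost_ratio(3.28, 2.61)
```

Under these assumptions the ratio comes out near 1.26, meaning a speculative step would cost roughly a quarter more than a plain step. The Figure 4 running-time breakdown is the place to check that against measurement.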

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The logit speculation step could be tested in non-retrieval speculative decoding setups to see if it reduces rejections there as well.
Extending the idea to speculate three tokens ahead might produce further gains if retrieval quality remains high.
Reference databases might be reorganized around logit patterns to support faster multi-step matching in long sequences.

Load-bearing premise

The logit of the last token provides useful speculation for the next-next token that leads to more accurate and relevant retrieved drafts than standard next-token-only retrieval.

What would settle it

A controlled test on the same benchmarks where adding next-next logit speculation yields no increase or a decrease in mean accepted tokens or overall speedup compared to next-token-only retrieval would disprove the benefit.
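One ingredient of such a test is a direct measurement of how often the last logit contains the true next-next token. The sketch below computes that hit rate from logged logits. It is a hypothetical harness: the function name and the input layout (`step_logits[t]` is the logit vector produced at position `t`, which normally predicts `tokens[t + 1]`) are assumptions, not taken from the paper.

```python
from typing import Sequence


def next_next_hit_rate(step_logits: Sequence[Sequence[float]],
                       tokens: Sequence[int], k: int) -> float:
    """Fraction of positions t where tokens[t + 2] is in the top-k of step_logits[t]."""
    hits = 0
    total = 0
    for t, logits in enumerate(step_logits):
        if t + 2 >= len(tokens):
            break  # no ground-truth next-next token for this position
        top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        hits += tokens[t + 2] in top_k
        total += 1
    return hits / total if total else 0.0
```

A hit rate near chance would suggest the second retrieval key mostly adds noise. A high hit rate would support the premise but would still need to be paired with an acceptance-rate comparison at equal draft budget.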

Figures

Figures reproduced from arXiv: 2507.01449 by Hao Li, Qitan Lv, Tianyu Liu, Xiao Sun, Xiaoyan Sun, Xing Gao.

**Figure 2.** Motivated observations. (a) The last logit can speculate the next next token with a…

**Figure 3.** An overview of LogitSpec. At each decoding step, LogitSpec first utilizes the top-k entries of the last logit as the speculation to the next next token. Then, LogitSpec retrieves relevant references for both the next token and the next next token. Finally, LogitSpec organizes the draft tokens into a draft tree and prepares a tree attention for parallel verification. We further conduct experiments on Spec-…

**Figure 4.** Running time breakdown of the whole decoding process on Spec-Bench with Vicuna 7B. In-depth running time analysis: to further investigate the effectiveness of LogitSpec, we conduct experiments to analyze the running time allocation within the whole decoding process. Specifically, there are five non-negligible components in LogitSpec, including (a) retrieving draft tokens: the process of retrieving referen…
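The Figure 3 caption mentions organizing drafts into a draft tree and preparing a tree attention for parallel verification. A common way to do this, sketched here as an assumption rather than as the paper's code, is to store each draft node's parent index and let every node attend only to itself and its ancestors. The function name and the `-1` root convention are hypothetical, and the input is assumed to be acyclic.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if node i may attend to node j.

    parents[i] is the index of node i's parent in the draft tree, or -1 when
    node i hangs directly off the already-verified prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking self and ancestors
            mask[i][j] = True
            j = parents[j]
    return mask


# Node 0 is a first draft token, nodes 1 and 2 are alternative
# continuations of node 0, and node 3 continues node 1.
mask = tree_attention_mask([-1, 0, 0, 1])
```

In a full implementation every draft node would also attend to the whole verified prefix. That part is omitted because it is the same for all nodes.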

Original abstract

Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and the target model then validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD aim to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner, in order to further reduce the drafting overhead and the difficulty of deployment. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant references for both the next token and the next next token. LogitSpec is training-free and plug-and-play, and can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
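The draft-then-verify loop that the abstract describes can be illustrated for the greedy-decoding case. The sketch below assumes the target model has already scored the prefix plus all draft tokens in one parallel pass, so that `target_preds[i]` is its greedy choice after the prefix plus `draft[:i]`. It does not cover the sampling-based acceptance rule, and the function name is hypothetical.

```python
from typing import List, Sequence


def accept_greedy(draft: Sequence[int], target_preds: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification step under greedy decoding.

    The longest draft prefix that matches the target's own greedy choices is
    accepted, then the target's token at the first mismatch (or after the last
    draft token) is appended, so every step emits at least one token.
    """
    if len(target_preds) != len(draft) + 1:
        raise ValueError("target_preds must have exactly one more entry than draft")
    emitted: List[int] = []
    for i, token in enumerate(draft):
        if token != target_preds[i]:
            break
        emitted.append(token)
    emitted.append(target_preds[len(emitted)])
    return emitted
```

For a draft `[7, 9, 2, 5]` and target predictions `[7, 9, 4, 1, 0]`, the step emits `[7, 9, 4]`: two accepted draft tokens plus the target's correction. Averaging the length of this list over steps gives one common definition of mean accepted tokens per decoding step. Whether the paper counts the correction token is not stated in this summary.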

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LogitSpec adds next-next token speculation from the last logit to enable joint retrieval in retrieval-based SD, with reported 2.61x speedups, but the gains rest on an assumption that lacks isolating experiments.

The main thing here is that LogitSpec takes the observation that a token's logit can carry some signal about the token after next and uses it to run a two-step retrieval: guess the next-next token, then pull references for both the immediate next token and that guessed one. This is meant to fix weak matching in pure retrieval SD without adding a draft model or any training, and they report up to 2.61x speedup plus 3.28 mean accepted tokens across benchmarks. The code release is a plus for anyone who wants to try it directly.

What works is the plug-and-play framing and the focus on practical deployment cost. It stays training-free and shows measurable latency wins on standard text generation tasks, which is the right kind of evidence for inference papers.

The soft spot is exactly the one the stress-test flags. Nothing in the write-up separates whether the next-next guess actually surfaces better references or whether the method just benefits from casting a wider net. If the speculated token is often wrong, the second retrieval step risks pulling irrelevant continuations and the net acceptance rate could fall back to what a stronger next-token-only retriever would get with the same budget. An ablation that measures speculation accuracy and compares joint retrieval against extra next-token candidates would have made the central claim much tighter. Baselines and variance details also feel light given how much the numbers are asked to carry.

This is for people already working on retrieval-based drafting or low-overhead speculative decoding in production settings. A reader who needs a simple, no-training tweak to existing pipelines could pick up the method and the numbers. It is coherent enough and grounded enough in a real deployment pain point to go to a serious referee, though it would come back asking for those ablations. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper proposes LogitSpec, a training-free extension to retrieval-based speculative decoding. It observes that the logit of the last token can speculate the next-next token, then performs two-step retrieval of reference continuations for both the next token and the speculated next-next token. The method is presented as plug-and-play and is evaluated on text-generation benchmarks, reporting up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.

Significance. If the reported speedups and acceptance rates are reproducible and exceed strong next-token-only retrieval baselines, the approach would offer a lightweight way to increase draft quality in retrieval-based speculative decoding without introducing trainable components or draft models. The training-free and plug-and-play character is a practical strength for deployment.

major comments (2)

§3 (Method): The central claim that last-token logit speculation yields useful next-next references rests on an unverified assumption. No ablation isolates the contribution of the next-next retrieval step versus standard next-token retrieval; if the speculated token is inaccurate, the second retrieval may add noise rather than signal, collapsing the net gain. A direct comparison of acceptance rates with and without the next-next branch is needed to substantiate the 2.61× and 3.28-token claims.
§4 (Experiments): The abstract and results report concrete speedups and acceptance numbers, yet the manuscript provides insufficient detail on the exact retrieval baselines, reference corpus construction, datasets, and statistical significance or variance across runs. Without these, it is impossible to assess whether the reported gains are robust or merely reflect favorable experimental conditions.

minor comments (2)

The notation for the two-step retrieval process could be clarified with a small diagram or pseudocode to make the distinction between next-token and next-next-token retrieval explicit.
A few sentences on the computational overhead of the additional retrieval step relative to standard retrieval would help readers evaluate the net efficiency benefit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the paper to incorporate the suggested improvements where appropriate.

Referee: §3 (Method): The central claim that last-token logit speculation yields useful next-next references rests on an unverified assumption. No ablation isolates the contribution of the next-next retrieval step versus standard next-token retrieval; if the speculated token is inaccurate, the second retrieval may add noise rather than signal, collapsing the net gain. A direct comparison of acceptance rates with and without the next-next branch is needed to substantiate the 2.61× and 3.28-token claims.

Authors: We agree that an explicit ablation isolating the next-next retrieval branch is necessary to substantiate the central claim. In the revised manuscript we add a new ablation experiment that directly compares acceptance rates and speedups for the full LogitSpec pipeline against a next-token-only retrieval baseline (identical retrieval settings, same reference corpus). The results confirm a consistent improvement in mean accepted tokens when the next-next branch is included, indicating that the logit-based speculation supplies useful rather than noisy references under the conditions tested. We also add a brief discussion of cases where the speculated token is inaccurate and why the net effect remains positive. revision: yes
Referee: §4 (Experiments): The abstract and results report concrete speedups and acceptance numbers, yet the manuscript provides insufficient detail on the exact retrieval baselines, reference corpus construction, datasets, and statistical significance or variance across runs. Without these, it is impossible to assess whether the reported gains are robust or merely reflect favorable experimental conditions.

Authors: We acknowledge that the experimental section lacked sufficient detail for full reproducibility and robustness assessment. In the revised version we expand §4 with: (i) precise specifications of all retrieval baselines (matching criteria, top-k values, and implementation), (ii) explicit description of reference-corpus construction (size, sources, and preprocessing steps), (iii) complete dataset details including splits and token counts, and (iv) reporting of standard deviations across three independent runs together with paired statistical significance tests. These additions directly address the concern and allow readers to evaluate the stability of the 2.61× and 3.28-token figures. revision: yes
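The variance reporting and paired test mentioned in the rebuttal can be computed with the standard library alone. The sketch below is a generic illustration, not the authors' analysis script: `summarize` and `paired_t_statistic` are hypothetical names, and inputs are assumed to be per-run scores (for example mean accepted tokens) with runs paired by seed or benchmark.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across independent runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(runs), stdev(runs)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired t statistic for per-run differences (method minus baseline)."""
    if len(method) != len(baseline):
        raise ValueError("runs must be paired one-to-one")
    diffs = [m - b for m, b in zip(method, baseline)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired runs")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))
```

With only three runs the test has two degrees of freedom, so the statistic has to be large before it is significant. Reporting the per-run differences alongside it would make the comparison easier to judge.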

Circularity Check

0 steps flagged

No circularity: empirical motivation and experimental validation remain independent of inputs

The paper motivates LogitSpec from an empirical observation that last-token logits can inform next-next token speculation, then implements a two-step retrieval process. Speedup and acceptance metrics (2.61×, 3.28 tokens) are reported from benchmark experiments rather than any fitted parameter or self-referential definition. No equations, derivations, or self-citations reduce the central claims to tautological constructions or prior author results that presuppose the target outcome. The method is presented as training-free and plug-and-play, with performance claims resting on external evaluation rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that last-token logits contain usable information about the token after the next one and that this information improves retrieval quality; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: The logit of the last token can speculate the next next token in a way that improves draft retrieval accuracy.
This observation is stated as the direct motivation for the two-step generation process.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

the logit of the last token can not only predict the next token, but also speculate the next next token... retrieving relevant reference for both the next token and the next next token

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] Qwen Team. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115

[4] Llama Team. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

[5] Marco Antonio Calijorne Soares and Fernando Silva Parreiras. A literature review on question answering techniques, paradigms and systems. Journal of King Saud University - Computer and Information Sciences, 32(6): 635--646, 2020. ISSN 1319-1578. doi:10.1016/j.jksuci.2018.08.005. URL https://www.sciencedirect.com/science/article/pii/S1...

[6] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation, 2024. URL https://arxiv.org/abs/2406.00515

[7] Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems, 2024. URL https://arxiv.org/abs/2402.18013

[8] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 1... (2023)

[9] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023. URL https://arxiv.org/abs/2302.01318

[10] Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. PEARL: Parallel speculative decoding with adaptive draft length. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=QOXrVMiHGK

[11] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test, 2025a. URL https://arxiv.org/abs/2503.01840

[12] Apoorv Saxena. Prompt lookup decoding, November 2023. URL https://github.com/apoorvumang/prompt-lookup-decoding/

[13] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

[14] Zhenyu He, Zexuan Zhong, Tianle Cai, Jason Lee, and Di He. REST: Retrieval-based speculative decoding. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1582--1595, Mexico Cit... doi:10.18653/v1/2024.naacl-long.88

[15] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

[16] Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=FbhjirzvJG

[17] Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, and Bin Cui. Clover: Regressive lightweight speculative decoding with sequential knowledge, 2024. URL https://arxiv.org/abs/2405.00263

[18] Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, and Yang You. Glide with a cape: a low-hassle method to accelerate speculative decoding. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

[19] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty, 2025b. URL https://arxiv.org/abs/2401.15077

[20] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7421--7432, Miami, Florida, USA, November 2024. Association for Computati... doi:10.18653/v1/2024.emnlp-main.422

[21] Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11263-... doi:10.18653/v1/2024.acl-long.607 (2024)

[22] Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu. Layerskip: Enabling early exit inference and self-speculative decoding, August 2024. URL https://aclanthology.org/2024.acl-long.681

[23] Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. SWIFT: On-the-fly self-speculative decoding for LLM inference acceleration. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=EKJhH5D5wA

[24] Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction, 2024. URL https://arxiv.org/abs/2404.19737

[25] Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 7655--7671, B... doi:10.18653/v1/2024.findings-acl.456

[26] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM ... doi:10.1145/3620666.3651335 (2024)

[27] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... (arXiv, 2021)

[28] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

[29] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Stefan Riezler and Yoav Goldberg, editors, Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280--290, Berlin, Germany, August 2016. Association fo... doi:10.18653/v1/k16-1028

[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library ... (arXiv, 2019)

[31] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art... (2020)

[32] NVIDIA, Péter Vingelmann, and Frank H.P. Fitzek. CUDA, release: 10.2.89, 2020. URL https://developer.nvidia.com/cuda-toolkit

[33] Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. Learning harmonized representations for speculative sampling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=T9u56s7mbk

[34] Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, and Zhongchao Shi. Coral: Learning consistent representations across multi-step training with lighter speculative drafter, 2025. URL https://arxiv.org/abs/2502.16880

[35] Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas Kohler. Judge decoding: Faster speculative sampling requires going beyond model alignment, 2025. URL https://arxiv.org/abs/2501.19309

[36] Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, and Emad Barsoum. Gumiho: A hybrid architecture to prioritize early tokens in speculative decoding, 2025c. URL https://arxiv.org/abs/2503.10135

[37] Kaixuan Huang, Xudong Guo, and Mengdi Wang. Specdec++: Boosting speculative decoding via adaptive candidate lengths, 2025. URL https://openreview.net/forum?id=NnExMNiTHw

[38] Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. Opt-tree: Speculative decoding with adaptive draft tree structure, 2024. URL https://arxiv.org/abs/2406.17276

[39] Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, and Kai Yu. Adaeagle: Optimizing speculative decoding via explicit modeling of adaptive draft structures, 2024b. URL https://arxiv.org/abs/2412.18910