LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
Pith reviewed 2026-05-19 06:49 UTC · model grok-4.3
The pith
LogitSpec uses the last token logit to speculate the next-next token and retrieves drafts for both next and next-next positions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.
What carries the argument
The two-step draft generation process that uses the last logit to speculate the next-next token before retrieving references covering both consecutive positions.
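The two-step process can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the next-next candidates are the runner-up entries of the same last logit, and that retrieval is plain n-gram matching over the tokens already in context. The names `top_k_ids`, `retrieve`, and `draft` are hypothetical.

```python
from typing import List, Sequence


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def retrieve(context: Sequence[int], key: Sequence[int], span: int) -> List[int]:
    """Tokens that follow the most recent occurrence of `key` in `context`.

    Returns at most `span` tokens, or [] when no occurrence has a continuation.
    """
    key = list(key)
    n = len(key)
    for start in range(len(context) - n, -1, -1):
        continuation = list(context[start + n:start + n + span])
        if continuation and list(context[start:start + n]) == key:
            return continuation
    return []


def draft(context: Sequence[int], last_logits: Sequence[float],
          k: int = 3, span: int = 4) -> List[List[int]]:
    """Build candidate drafts from the last logit (hypothetical helper)."""
    ranked = top_k_ids(last_logits, k + 1)
    next_tok, guesses = ranked[0], ranked[1:]
    drafts: List[List[int]] = []

    def add(candidate: List[int]) -> None:
        if candidate not in drafts:  # drop duplicate drafts
            drafts.append(candidate)

    # Retrieval keyed on the next token alone (the next-token-only baseline).
    continuation = retrieve(context, [next_tok], span)
    if continuation:
        add([next_tok] + continuation)

    # Assumption: each runner-up in the last logit is a guess for the
    # next next token, and retrieval is keyed on the (next, guess) pair.
    for guess in guesses:
        add([next_tok, guess] + retrieve(context, [next_tok, guess], span))
    return drafts


context = [5, 7, 9, 2, 5, 7, 8, 3]
logits = [0, 0, 0, 0, 0, 0, 0, 3.0, 2.0, 1.0]  # token 7 ranks first, then 8, then 9
print(draft(context, logits, k=2, span=2))  # [[7, 8, 3], [7, 9, 2, 5]]
```

In this toy run the second draft, `[7, 9, 2, 5]`, exists only because token 9 was tried as a next-next guess; keying on token 7 alone finds just the most recent match.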
If this is right
- LLM text generation accepts more tokens per decoding step on average, with a reported mean of up to 3.28.
- Inference achieves up to 2.61 times speedup without any draft model or training step.
- The approach is plug-and-play and, per the abstract, can be easily integrated into existing LLM inference frameworks.
- Retrieval-based methods gain accuracy by expanding the search to cover the token after next.
Where Pith is reading between the lines
- The logit speculation step could be tested in non-retrieval speculative decoding setups to see if it reduces rejections there as well.
- Extending the idea to speculate three tokens ahead might produce further gains if retrieval quality remains high.
- Reference databases might be reorganized around logit patterns to support faster multi-step matching in long sequences.
Load-bearing premise
The logit of the last token provides useful speculation for the next-next token that leads to more accurate and relevant retrieved drafts than standard next-token-only retrieval.
What would settle it
The benefit would be disproved by a controlled test on the same benchmarks in which adding next-next logit speculation yields no increase, or a decrease, in mean accepted tokens or overall speedup relative to next-token-only retrieval.
Original abstract
Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
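The draft-then-verify loop described in the abstract can be illustrated with a small sketch of the verification step. It assumes greedy decoding and a single linear draft, and it assumes that the target model's own token at the first mismatch (or after a fully accepted draft) counts toward the tokens emitted per step; the source does not state the exact counting convention behind "mean accepted tokens". `verify_greedy` and `mean_tokens_per_step` are hypothetical names.

```python
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target_preds: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification step.

    target_preds[i] is the target model's greedy token given the prefix plus
    draft[:i], so it must hold len(draft) + 1 entries (one parallel forward pass).
    """
    if len(target_preds) != len(draft) + 1:
        raise ValueError("target_preds must have len(draft) + 1 entries")
    emitted: List[int] = []
    for drafted, predicted in zip(draft, target_preds):
        if drafted != predicted:
            break  # first mismatch: discard the rest of the draft
        emitted.append(drafted)
    # The target's own prediction at the stopping position is always valid.
    emitted.append(target_preds[len(emitted)])
    return emitted


def mean_tokens_per_step(steps: Sequence[Sequence[int]]) -> float:
    """Average number of tokens emitted per decoding step."""
    return sum(len(step) for step in steps) / len(steps)


steps = [
    verify_greedy([7, 8, 3], [7, 8, 4, 1]),  # [7, 8, 4]: two drafts accepted, then a correction
    verify_greedy([7, 8, 3], [7, 8, 3, 1]),  # [7, 8, 3, 1]: full draft accepted plus one more
    verify_greedy([7], [2, 5]),              # [2]: draft rejected, still one token of progress
]
print(mean_tokens_per_step(steps))
```

Because every step emits at least one token that the target model itself chose, the output matches ordinary greedy decoding; a better draft only changes how many tokens each step yields.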
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LogitSpec, a training-free extension to retrieval-based speculative decoding. It observes that the logit of the last token can speculate the next-next token, then performs two-step retrieval of reference continuations for both the next token and the speculated next-next token. The method is presented as plug-and-play and is evaluated on text-generation benchmarks, reporting up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.
Significance. If the reported speedups and acceptance rates are reproducible and exceed strong next-token-only retrieval baselines, the approach would offer a lightweight way to increase draft quality in retrieval-based speculative decoding without introducing trainable components or draft models. The training-free and plug-and-play character is a practical strength for deployment.
major comments (2)
- [§3] §3 (Method): The central claim that last-token logit speculation yields useful next-next references rests on an unverified assumption. No ablation isolates the contribution of the next-next retrieval step versus standard next-token retrieval; if the speculated token is inaccurate, the second retrieval may add noise rather than signal, collapsing the net gain. A direct comparison of acceptance rates with and without the next-next branch is needed to substantiate the 2.61× and 3.28-token claims.
- [§4] §4 (Experiments): The abstract and results report concrete speedups and acceptance numbers, yet the manuscript provides insufficient detail on the exact retrieval baselines, reference corpus construction, datasets, and statistical significance or variance across runs. Without these, it is impossible to assess whether the reported gains are robust or merely reflect favorable experimental conditions.
minor comments (2)
- The notation for the two-step retrieval process could be clarified with a small diagram or pseudocode to make the distinction between next-token and next-next-token retrieval explicit.
- A few sentences on the computational overhead of the additional retrieval step relative to standard retrieval would help readers evaluate the net efficiency benefit.
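The overhead question in the last comment can be framed with a simple cost model. This is a back-of-the-envelope sketch, not a formula from the paper: it assumes every decoding step costs one verification pass plus a drafting overhead, both measured in units of one ordinary autoregressive step of the target model, and that each step emits `mean_accepted` tokens on average. The function name and parameters are hypothetical.

```python
def estimated_speedup(mean_accepted: float,
                      verify_cost: float = 1.0,
                      draft_cost: float = 0.0) -> float:
    """Estimated speedup over plain autoregressive decoding.

    Costs are relative to one autoregressive step of the target model.
    """
    if mean_accepted <= 0 or verify_cost <= 0 or draft_cost < 0:
        raise ValueError("mean_accepted and verify_cost must be positive, draft_cost non-negative")
    return mean_accepted / (verify_cost + draft_cost)


# Illustrative values only, not measurements from the paper.
print(estimated_speedup(3.0, verify_cost=1.25, draft_cost=0.25))  # 2.0
```

Under this model an extra retrieval step pays off only if the gain in mean accepted tokens outweighs the added `draft_cost`, which is the trade-off the comment asks the authors to quantify.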
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the paper to incorporate the suggested improvements where appropriate.
- Referee: [§3] §3 (Method): The central claim that last-token logit speculation yields useful next-next references rests on an unverified assumption. No ablation isolates the contribution of the next-next retrieval step versus standard next-token retrieval; if the speculated token is inaccurate, the second retrieval may add noise rather than signal, collapsing the net gain. A direct comparison of acceptance rates with and without the next-next branch is needed to substantiate the 2.61× and 3.28-token claims.
  Authors: We agree that an explicit ablation isolating the next-next retrieval branch is necessary to substantiate the central claim. In the revised manuscript we add a new ablation experiment that directly compares acceptance rates and speedups for the full LogitSpec pipeline against a next-token-only retrieval baseline (identical retrieval settings, same reference corpus). The results confirm a consistent improvement in mean accepted tokens when the next-next branch is included, indicating that the logit-based speculation supplies useful rather than noisy references under the conditions tested. We also add a brief discussion of cases where the speculated token is inaccurate and why the net effect remains positive.
  revision: yes
- Referee: [§4] §4 (Experiments): The abstract and results report concrete speedups and acceptance numbers, yet the manuscript provides insufficient detail on the exact retrieval baselines, reference corpus construction, datasets, and statistical significance or variance across runs. Without these, it is impossible to assess whether the reported gains are robust or merely reflect favorable experimental conditions.
  Authors: We acknowledge that the experimental section lacked sufficient detail for full reproducibility and robustness assessment. In the revised version we expand §4 with: (i) precise specifications of all retrieval baselines (matching criteria, top-k values, and implementation), (ii) explicit description of reference-corpus construction (size, sources, and preprocessing steps), (iii) complete dataset details including splits and token counts, and (iv) reporting of standard deviations across three independent runs together with paired statistical significance tests. These additions directly address the concern and allow readers to evaluate the stability of the 2.61× and 3.28-token figures.
  revision: yes
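The kind of reporting promised in item (iv) can be sketched with the Python standard library. This is an illustration of a standard paired t-statistic over matched runs, not the authors' analysis script; the function name is hypothetical and the numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(with_branch: Sequence[float],
                       without_branch: Sequence[float]) -> float:
    """Paired t-statistic for matched runs (same seed, same benchmark).

    Compare the result against a t distribution with len(runs) - 1 degrees
    of freedom to obtain a p-value.
    """
    if len(with_branch) != len(without_branch) or len(with_branch) < 2:
        raise ValueError("need at least two matched runs")
    diffs = [a - b for a, b in zip(with_branch, without_branch)]
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0:
        raise ValueError("differences are identical; the t-statistic is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder values for three matched runs (NOT measurements from the paper).
full_pipeline = [2.0, 3.0, 4.0]
next_token_only = [1.0, 1.5, 2.0]

print(statistics.mean(full_pipeline), statistics.stdev(full_pipeline))  # 3.0 1.0
print(round(paired_t_statistic(full_pipeline, next_token_only), 3))     # 5.196
```

With only three runs the test has two degrees of freedom, so even a large t-statistic corresponds to a fairly wide confidence interval; reporting the per-run values alongside the summary helps readers judge stability.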
Circularity Check
No circularity: empirical motivation and experimental validation remain independent of inputs
The paper motivates LogitSpec from an empirical observation that last-token logits can inform next-next token speculation, then implements a two-step retrieval process. Speedup and acceptance metrics (2.61×, 3.28 tokens) are reported from benchmark experiments rather than any fitted parameter or self-referential definition. No equations, derivations, or self-citations reduce the central claims to tautological constructions or prior author results that presuppose the target outcome. The method is presented as training-free and plug-and-play, with performance claims resting on external evaluation rather than internal redefinition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The logit of the last token can speculate the next next token in a way that improves draft retrieval accuracy.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the logit of the last token can not only predict the next token, but also speculate the next next token... retrieving relevant reference for both the next token and the next next token"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.