arxiv: 2507.01449 · v3 · submitted 2025-07-02 · 💻 cs.CL

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Pith reviewed 2026-05-19 06:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords speculative decoding · retrieval-based · logit speculation · draft tokens · LLM acceleration · next next token · training-free

The pith

LogitSpec uses the last token logit to speculate the next-next token and retrieves drafts for both next and next-next positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-based speculative decoding avoids draft models by fetching relevant references as candidate tokens. The challenge is that matching only the next token often produces inaccurate drafts that the target model rejects. LogitSpec solves this by using the logit of the last token to also speculate the token after next. It then retrieves references that align with both the next and next-next positions. This results in more accepted tokens and faster overall generation for large language models.
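
The matching paradigm described above can be illustrated with a minimal sketch. It assumes the reference is simply the token history of the current sequence and that matching means finding the most recent earlier occurrence of a short query n-gram; the function name `retrieve_draft` and its parameters are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence


def retrieve_draft(context: Sequence[int], query: Sequence[int], max_draft: int = 4) -> List[int]:
    """Return the tokens that followed the most recent earlier match of `query`.

    Assumption: the reference corpus is just `context` (the tokens seen so far).
    """
    n = len(query)
    if n == 0:
        return []
    # Scan right to left so the most recent occurrence wins. Starting at
    # len(context) - n - 1 skips a match that sits at the very end of the
    # context, because such a match has no continuation to offer.
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == list(query):
            return list(context[start + n:start + n + max_draft])
    return []  # no match: the decoder falls back to ordinary one-token decoding


history = [5, 6, 7, 8, 9, 5, 6]
draft = retrieve_draft(history, query=[5, 6], max_draft=3)
print(draft)  # [7, 8, 9]
```

When the query has no earlier occurrence the function returns an empty draft, which is the failure case the paragraph above refers to.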

Core claim

LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.

What carries the argument

The two-step draft generation process that uses the last logit to speculate the next-next token before retrieving references covering both consecutive positions.
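
A rough sketch of the two-step process may help. It is written under stated assumptions rather than taken from the authors' code: the next token is the argmax of the last logit, the next-next guesses are the following `k` highest-ranked entries of that same logit, and retrieval is plain n-gram matching over the token history. The names `top_k_ids`, `continuation_after`, and `two_step_drafts` are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def continuation_after(context: Sequence[int], query: Sequence[int], max_draft: int) -> List[int]:
    """Tokens that followed the most recent occurrence of `query` in `context`."""
    n = len(query)
    if n == 0 or max_draft <= 0:
        return []
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == list(query):
            return list(context[start + n:start + n + max_draft])
    return []


def two_step_drafts(
    context: Sequence[int], last_logits: Sequence[float], k: int = 2, max_draft: int = 3
) -> Tuple[int, List[List[int]]]:
    """Return the next token and the draft branches proposed to follow it."""
    ranked = top_k_ids(last_logits, k + 1)
    # Assumption: rank 0 is the next token, ranks 1..k are next-next guesses.
    next_token, guesses = ranked[0], ranked[1:]
    branches: List[List[int]] = []
    # Baseline branch: retrieval keyed on the next token only.
    baseline = continuation_after(context, [next_token], max_draft)
    if baseline:
        branches.append(baseline)
    # Extra branches: retrieval keyed on (next token, guessed next-next token).
    for guess in guesses:
        tail = continuation_after(context, [next_token, guess], max_draft - 1)
        branch = [guess] + tail
        if branch not in branches:
            branches.append(branch)
    return next_token, branches


history = [1, 2, 3, 4, 2, 5, 6]
logits = [0.0, 0.0, 5.0, 3.0, 0.0, 4.0, 0.0, 0.0]  # toy values over a vocabulary of 8
print(two_step_drafts(history, logits, k=2, max_draft=3))  # (2, [[5, 6], [3, 4, 2]])
```

In this toy run the second branch `[3, 4, 2]` exists only because of the next-next guess; next-token-only retrieval would have proposed `[5, 6]` alone.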

If this is right

  • LLM text generation runs with higher average token acceptance per step, reaching 3.28.
  • Inference achieves up to 2.61 times speedup without any draft model or training step.
  • The approach is plug-and-play and can be easily integrated into existing LLM inference frameworks.
  • Retrieval-based methods gain accuracy by expanding the search to cover the token after next.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The logit speculation step could be tested in non-retrieval speculative decoding setups to see if it reduces rejections there as well.
  • Extending the idea to speculate three tokens ahead might produce further gains if retrieval quality remains high.
  • Reference databases might be reorganized around logit patterns to support faster multi-step matching in long sequences.

Load-bearing premise

The logit of the last token provides useful speculation for the next-next token that leads to more accurate and relevant retrieved drafts than standard next-token-only retrieval.

What would settle it

A controlled test on the same benchmarks where adding next-next logit speculation yields no increase or a decrease in mean accepted tokens or overall speedup compared to next-token-only retrieval would disprove the benefit.
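
The comparison described above boils down to measuring mean accepted tokens with and without the next-next branches. The sketch below shows one way to compute that metric under greedy verification. It assumes the metric counts the accepted draft tokens plus the one token the target model always produces per step, which is a common convention but is not confirmed by this page. The inputs are toy values for illustration only, not results from the paper.

```python
from typing import List, Sequence, Tuple

Step = Tuple[List[List[int]], List[int]]  # (draft branches, target model's greedy tokens)


def accepted_length(branches: Sequence[Sequence[int]], target: Sequence[int]) -> int:
    """Longest draft prefix, over all branches, that matches the target tokens."""
    best = 0
    for branch in branches:
        matched = 0
        for drafted, wanted in zip(branch, target):
            if drafted != wanted:
                break
            matched += 1
        best = max(best, matched)
    return best


def mean_accepted_tokens(steps: Sequence[Step]) -> float:
    """Average tokens gained per decoding step (accepted drafts + 1 target token)."""
    totals = [accepted_length(branches, target) + 1 for branches, target in steps]
    return sum(totals) / len(totals)


# Toy ablation: identical target tokens, with and without the extra branch.
next_only = [([[5, 6]], [3, 4, 2]), ([[7]], [7, 8])]
with_next_next = [([[5, 6], [3, 4, 2]], [3, 4, 2]), ([[7]], [7, 8])]

print(mean_accepted_tokens(next_only))       # 1.5
print(mean_accepted_tokens(with_next_next))  # 3.0
```

If the second number were no higher than the first on the real benchmarks, with the same retrieval settings and reference corpus, the claimed benefit would not hold.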

Figures

Figures reproduced from arXiv: 2507.01449 by Hao Li, Qitan Lv, Tianyu Liu, Xiao Sun, Xiaoyan Sun, Xing Gao.

Figure 1: Illustration of vanilla retrieval-based SD method and our …
Figure 2: Motivated observations. (a) The last logit can speculate the next next token with a …
Figure 3: An overview of LogitSpec. At each decoding step, LogitSpec first utilizes the top-k entries of the last logit as the speculation to the next next token. Then, LogitSpec retrieves relevant references for both the next token and the next next token. Finally, LogitSpec organizes the draft tokens into a draft tree and prepares a tree attention for parallel verification.
Figure 4: Running time breakdown of the whole decoding process on Spec-Bench with Vicuna 7B.
In-depth Running Time Analysis. To further investigate the effectiveness of LogitSpec, we conduct experiments to analyze the running time allocation within the whole decoding process. Specifically, there are five non-negligible components in LogitSpec, including (a) retrieving draft tokens: the process of retrieving referen…
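
The Figure 3 caption mentions organizing draft tokens into a draft tree and preparing a tree attention for parallel verification. The sketch below shows the general technique under simple assumptions: branches that share a prefix are merged into one tree, and each draft node may attend only to itself and its ancestors. The function names and the list-based mask are illustrative choices, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def build_draft_tree(branches: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Merge branches into a tree. parents[i] == -1 means node i hangs off the root,
    where the root stands for the last verified token."""
    tokens: List[int] = []
    parents: List[int] = []
    children: Dict[Tuple[int, int], int] = {}  # (parent index, token) -> node index
    for branch in branches:
        parent = -1
        for tok in branch:
            key = (parent, tok)
            if key not in children:
                children[key] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = children[key]
    return tokens, parents


def tree_attention_mask(parents: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, enabling self and every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


tokens, parents = build_draft_tree([[5, 6], [5, 7], [3]])
print(tokens)   # [5, 6, 7, 3]
print(parents)  # [-1, 0, 0, -1]
for row in tree_attention_mask(parents):
    print(row)
```

With such a mask, all branches can be scored in a single forward pass of the target model, since tokens on different branches never see each other. In a real system the mask would also be combined with full attention to the already verified prefix.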
Original abstract

Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LogitSpec, a training-free extension to retrieval-based speculative decoding. It observes that the logit of the last token can speculate the next-next token, then performs two-step retrieval of reference continuations for both the next token and the speculated next-next token. The method is presented as plug-and-play and is evaluated on text-generation benchmarks, reporting up to 2.61× speedup and 3.28 mean accepted tokens per decoding step.

Significance. If the reported speedups and acceptance rates are reproducible and exceed strong next-token-only retrieval baselines, the approach would offer a lightweight way to increase draft quality in retrieval-based speculative decoding without introducing trainable components or draft models. The training-free and plug-and-play character is a practical strength for deployment.

major comments (2)
  1. §3 (Method): The central claim that last-token logit speculation yields useful next-next references rests on an unverified assumption. No ablation isolates the contribution of the next-next retrieval step versus standard next-token retrieval; if the speculated token is inaccurate, the second retrieval may add noise rather than signal, collapsing the net gain. A direct comparison of acceptance rates with and without the next-next branch is needed to substantiate the 2.61× and 3.28-token claims.
  2. §4 (Experiments): The abstract and results report concrete speedups and acceptance numbers, yet the manuscript provides insufficient detail on the exact retrieval baselines, reference corpus construction, datasets, and statistical significance or variance across runs. Without these, it is impossible to assess whether the reported gains are robust or merely reflect favorable experimental conditions.
minor comments (2)
  1. The notation for the two-step retrieval process could be clarified with a small diagram or pseudocode to make the distinction between next-token and next-next-token retrieval explicit.
  2. A few sentences on the computational overhead of the additional retrieval step relative to standard retrieval would help readers evaluate the net efficiency benefit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the paper to incorporate the suggested improvements where appropriate.

Point-by-point responses
  1. Referee: §3 (Method): The central claim that last-token logit speculation yields useful next-next references rests on an unverified assumption. No ablation isolates the contribution of the next-next retrieval step versus standard next-token retrieval; if the speculated token is inaccurate, the second retrieval may add noise rather than signal, collapsing the net gain. A direct comparison of acceptance rates with and without the next-next branch is needed to substantiate the 2.61× and 3.28-token claims.

    Authors: We agree that an explicit ablation isolating the next-next retrieval branch is necessary to substantiate the central claim. In the revised manuscript we add a new ablation experiment that directly compares acceptance rates and speedups for the full LogitSpec pipeline against a next-token-only retrieval baseline (identical retrieval settings, same reference corpus). The results confirm a consistent improvement in mean accepted tokens when the next-next branch is included, indicating that the logit-based speculation supplies useful rather than noisy references under the conditions tested. We also add a brief discussion of cases where the speculated token is inaccurate and why the net effect remains positive. revision: yes

  2. Referee: §4 (Experiments): The abstract and results report concrete speedups and acceptance numbers, yet the manuscript provides insufficient detail on the exact retrieval baselines, reference corpus construction, datasets, and statistical significance or variance across runs. Without these, it is impossible to assess whether the reported gains are robust or merely reflect favorable experimental conditions.

    Authors: We acknowledge that the experimental section lacked sufficient detail for full reproducibility and robustness assessment. In the revised version we expand §4 with: (i) precise specifications of all retrieval baselines (matching criteria, top-k values, and implementation), (ii) explicit description of reference-corpus construction (size, sources, and preprocessing steps), (iii) complete dataset details including splits and token counts, and (iv) reporting of standard deviations across three independent runs together with paired statistical significance tests. These additions directly address the concern and allow readers to evaluate the stability of the 2.61× and 3.28-token figures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical motivation and experimental validation remain independent of inputs

full rationale

The paper motivates LogitSpec from an empirical observation that last-token logits can inform next-next token speculation, then implements a two-step retrieval process. Speedup and acceptance metrics (2.61×, 3.28 tokens) are reported from benchmark experiments rather than any fitted parameter or self-referential definition. No equations, derivations, or self-citations reduce the central claims to tautological constructions or prior author results that presuppose the target outcome. The method is presented as training-free and plug-and-play, with performance claims resting on external evaluation rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that last-token logits contain usable information about the token after the next one and that this information improves retrieval quality; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: The logit of the last token can speculate the next next token in a way that improves draft retrieval accuracy.
    This observation is stated as the direct motivation for the two-step generation process.

pith-pipeline@v0.9.0 · 5808 in / 1323 out tokens · 61275 ms · 2026-05-19T06:49:23.341961+00:00 · methodology


