pith. sign in

arxiv: 2605.20104 · v1 · pith:DQXVLIN7new · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Pith reviewed 2026-05-20 06:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords speculative decodingdraft treepruningretrievallarge language modelsinference accelerationhybrid constructionacceptance rate
0
0 comments X

The pith

Graft recovers lost acceptance length in speculative decoding by attaching retrieved tokens into positions freed by pruning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the computational budget freed by pruning draft trees in speculative decoding can be redirected to retrieval, which then compensates for the coverage lost during pruning. By attaching highly predictive retrieved tokens into the open positions created by pruning, the method maintains high acceptance rates while reducing overall overhead. This hybrid prune-then-graft process is presented as training-free and lossless, allowing draft trees to approach the acceptance performance of dense trees without their full cost. A reader would care because it directly targets the bandwidth and latency bottlenecks that currently limit end-to-end speedups in practical LLM inference across short contexts, long contexts, and very large models.

Core claim

Graft couples pruning and retrieval as mutually reinforcing steps in a sequential prune-then-graft mechanism: pruning removes marginal branches to free budget and create topological gaps, while retrieval supplies predictive tokens that are grafted into those gaps with near-zero overhead, thereby recovering the accepted length that pruning would otherwise discard and establishing a new Pareto frontier for speculative decoding.

What carries the argument

The prune-then-graft mechanism, which uses dynamic-depth pruning to free resources and then attaches retrieved tokens into the resulting positions to restore coverage.

If this is right

  • Up to 5.41× end-to-end speedup on short-context benchmarks.
  • Up to 21.8% average speedup improvement over EAGLE-3 on the Qwen3-235B model.
  • Consistent gains across short-context, long-context, and large-scale model settings.
  • Preliminary extension possible to DFlash-style block drafting paradigms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compensation pattern could be tested in other tree-search or beam-search settings where pruning creates reusable slots for external information.
  • Dynamic retrieval sources might further improve the method if the grafted tokens are drawn from domain-specific corpora rather than general ones.
  • The approach suggests that future draft-tree methods could treat pruning not as a pure reduction but as an opening for targeted augmentation.

Load-bearing premise

Retrieved tokens can be attached into pruned positions with near-zero overhead while remaining predictive enough to avoid new verification failures or hidden costs.

What would settle it

A measurement showing that the combined latency of retrieval and attachment plus any drop in acceptance rate exceeds the latency savings from pruning alone.

read the original abstract

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential `prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41$\times$ speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Graft, a training-free and lossless framework for speculative decoding that couples dynamic-depth pruning of draft trees with retrieval-based grafting. Pruning frees computational budget by removing marginal branches, after which highly predictive retrieved tokens are attached into the resulting topological gaps via a sequential prune-then-graft mechanism with claimed near-zero overhead. This is asserted to recover acceptance length lost to pruning and break the Pareto tradeoff between tree size and verification cost. Evaluations across short-context, long-context, and large-scale models (including Qwen3-235B) report up to 5.41× speedup and up to 21.8% average improvement over EAGLE-3, with a preliminary extension to DFlash-style block drafting.

Significance. If the claims hold, the work meaningfully advances practical LLM inference by resolving the VRAM/compute overheads of expansive draft trees through mutually reinforcing pruning and retrieval. The training-free, lossless character and the concrete, falsifiable speedups across diverse deployment regimes constitute clear strengths. The preliminary exploration of grafting beyond standard autoregressive trees supplies useful initial evidence for broader applicability.

major comments (1)
  1. [§3.2] §3.2 (prune-then-graft mechanism): The manuscript states that grafting retrieved tokens into pruned positions incurs near-zero overhead and fully recovers acceptance length without new verification failures, yet provides no explicit accounting of any incremental forward-pass cost during verification or any shift in the draft distribution induced by topological gap-filling. Because the reported speedups (5.41× and +21.8 % over EAGLE-3) rest directly on this assumption, a quantitative breakdown of VRAM versus compute overhead before and after grafting is required to substantiate the central claim.
minor comments (2)
  1. [Table 2] Table 2: Speedup figures are presented without error bars or the number of independent runs; adding these would strengthen the cross-method comparisons.
  2. [Figure 4] Figure 4: Axis labels and legend entries for the long-context setting are slightly compressed; increasing font size or splitting the legend would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of our work. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our central claims.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (prune-then-graft mechanism): The manuscript states that grafting retrieved tokens into pruned positions incurs near-zero overhead and fully recovers acceptance length without new verification failures, yet provides no explicit accounting of any incremental forward-pass cost during verification or any shift in the draft distribution induced by topological gap-filling. Because the reported speedups (5.41× and +21.8 % over EAGLE-3) rest directly on this assumption, a quantitative breakdown of VRAM versus compute overhead before and after grafting is required to substantiate the central claim.

    Authors: We agree that an explicit quantitative breakdown would improve clarity and substantiation. In the current manuscript, the near-zero overhead is justified by the sequential design: pruning first releases budget (in terms of reduced tree width and verification cost), which is then reallocated to grafting without introducing extra model calls—the grafted tokens are inserted into the existing tree topology and verified in the same batched forward pass as the pruned tree. This preserves the original verification logic and acceptance criteria, with no new failures introduced because grafted positions are treated identically to original branches. Retrieval selection is based on high-probability matches that align with the model's predictive distribution, minimizing any shift. To directly address the request, we will revise §3.2 to include a dedicated quantitative analysis with tables reporting VRAM usage, additional compute (FLOPs and latency), and acceptance-length recovery before versus after grafting, measured on the same hardware and models as the main experiments (including Qwen3-235B). revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is self-contained with external empirical validation

full rationale

The paper presents Graft as a training-free, lossless prune-then-graft mechanism for speculative decoding trees, with speedups measured directly against independent baselines such as EAGLE-3. No equations, fitted parameters, or self-citations are shown in the provided text that reduce the central claims or reported speedups to tautological inputs by construction. The derivation relies on the described algorithmic steps and external comparisons rather than self-referential definitions or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method is framed as training-free with no new postulated quantities.

pith-pipeline@v0.9.0 · 5841 in / 1143 out tokens · 50632 ms · 2026-05-20T06:54:41.182236+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 20 internal anchors

  1. [1]

    LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31...

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  3. [3]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads.arXiv preprint arXiv:2401.10774,

  4. [4]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318,

  5. [5]

    Rapid: Long-context inference with retrieval-augmented speculative decoding.arXiv preprint arXiv:2502.20330,

    Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, and Michael Qizhe Shieh. Rapid: Long-context inference with retrieval-augmented speculative decoding.arXiv preprint arXiv:2502.20330,

  6. [6]

    arXiv preprint arXiv:2602.06036 , year=

    Jian Chen, Yesheng Liang, and Zhijian Liu. DFlash: Block Diffusion for Flash Speculative Decoding.arXiv preprint arXiv:2602.06036,

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

  8. [8]

    PACT: Parameterized Clipping Activation for Quantized Neural Networks

    URLhttps://lmsys.org/blog/2023-03-30-vicuna/. Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks.arXiv preprint arXiv:1805.06085,

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  10. [10]

    Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057,

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057,

  11. [11]

    Better & Faster Large Language Models via Multi-token Prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737,

  12. [12]

    URLhttps://arxiv.org/abs/2407.21783. 17 Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  13. [13]

    Rest: Retrieval-based speculative decoding

    Zhenyu He, Zexuan Zhong, Tianle Cai, Jason Lee, and Di He. Rest: Retrieval-based speculative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1582–1595,

  14. [14]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

  15. [15]

    ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

    Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, and Mingcheng Wan. Echo: Elastic speculative decoding with sparse gating for high-concurrency scenarios.arXiv preprint arXiv:2604.09603,

  16. [16]

    Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

    Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, and Bita Rouhani. Accelerating rl post-training rollouts via system-integrated speculative decoding.arXi...

  17. [17]

    SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning

    Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, and Huan Li. SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.),Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 72...

  18. [18]

    ISBN 979-8-89176-332-6

    Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.366. URL https://aclanthology.org/2025.emnlp-main.366/. Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, and Huan Li. See the forest for the trees: Loosely speculative decoding via visual-semantic guidance for efficient inference of vi...

  19. [19]

    See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

    URLhttps://arxiv.org/abs/2604.05650. Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, and Cong Wang. Parallelvlm: Lossless video-llm acceleration with visual alignment aware parallel speculative decoding.arXiv preprint arXiv:2603.19610,

  20. [20]

    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024a. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024b. Yuhui Li, Fangyun Wei, Chao Z...

  21. [21]

    Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts.arXiv preprint arXiv:2509.23232, 2025a

    Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, and Jinsong Su. Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts.arXiv preprint arXiv:2509.23232, 2025a. Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, and Winston Hu. Parallel speculative decoding with adaptive draft...

  22. [22]

    LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

    Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, and Xiao Sun. Logitspec: Accelerating retrieval-based speculative decoding via next next token speculation.arXiv preprint arXiv:2507.01449, 2025b. Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, and Xiaoyan Sun. Talon: Confidence-aware speculative decoding with adaptive token trees.arXiv preprint arXiv:2601.07353, 2026a...

  23. [23]

    Dynamic speculation lookahead accelerates speculative decoding of large language models.arXiv preprint arXiv:2405.04304,

    Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, and Roy Schwartz. Dynamic speculation lookahead accelerates speculative decoding of large language models.arXiv preprint arXiv:2405.04304,

  24. [24]

    Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Lang...

  25. [25]

    Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

    Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online context learning for fast synchronous llm reinforcement learning. arXiv preprint arXiv:2511.14617,

  26. [26]

    Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding.arXiv preprint arXiv:2408.11049,

    Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, and Beidi Chen. Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding.arXiv preprint arXiv:2408.11049,

  27. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    URLhttps://github.com/apoorvumang/ prompt-lookup-decoding/. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  28. [28]

    SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

    Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, and Cong Wang. Speculative decoding via hybrid drafting and rollback-aware branch parallelism.arXiv preprint arXiv:2506.01979,

  29. [29]

    Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

    Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, and Cong Wang. Double: Breaking the acceleration limit via double retrieval speculative parallelism.arXiv preprint arXiv:2601.05524,

  30. [30]

    KNN- SSD: Enabling dynamic self-speculative decoding via nearest neighbor layer set optimization

    Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, and Sujian Li. KNN- SSD: Enabling dynamic self-speculative decoding via nearest neighbor layer set optimization. In Vera Demberg, Kentaro Inui, and Lluís Marquez (eds.),Findings of the Association for Computational Linguistics: EACL 2026, pp. 641–655, Rabat, Morocco, March

  31. [31]

    ISBN 979-8-89176-386-9

    Association for Computational Linguistics. ISBN 979-8-89176-386-9. doi: 10.18653/v1/2026.findings-eacl.31. URLhttps://aclanthology.org/2026. findings-eacl.31/. Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models.Advances in Neural Information Processing Systems, 31,

  32. [32]

    Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding.arXiv preprint arXiv:2404.11912,

    Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding.arXiv preprint arXiv:2404.11912,

  33. [33]

    arXiv preprint arXiv:2512.02337 , year=

    Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, and Kun Xia. Specpv: Improving self-speculative decoding for long-context generation via partial verification.arXiv preprint arXiv:2512.02337,

  34. [34]

    Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models

    19 Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alp...

  35. [35]

    Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122,

    Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122,

  36. [36]

    Tokenswift: Lossless acceleration of ultra long sequence generation.arXiv preprint arXiv:2502.18890,

    Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, and Zilong Zheng. Tokenswift: Lossless acceleration of ultra long sequence generation.arXiv preprint arXiv:2502.18890,

  37. [37]

    Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851,

  38. [38]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al

    URLhttps://arxiv.org/abs/2410.06916. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. PenghuiYang, CunxiaoDu, FengzhuoZhang, HaonanWang, TianyuPang, ChaoDu, andBoAn. Longspec: Long- context lossless speculative decodi...

  39. [39]

    Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

    URLhttps://arxiv.org/abs/2604.05546. Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, and Kai Yu. Adaeagle: Optimizing speculative decoding via explicit modeling of adaptive draft structures.arXiv preprint arXiv:2412.18910, 2024b. Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, and Zhaopeng Tu. Draft model knows when...

  40. [40]

    20 Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding Algorithm 1GraftGeneration 1:Input:prefix𝑥 1:𝑡, target model𝑝, tree drafter𝑞, budget𝐾max, adjacency matrixM 2:EnsureMis allocated on GPU; initialize empty rows if needed 3:Precompute retrieval-template caches for pruning stages𝑠∈ {𝑑 0, 𝑑1, 𝑑5} 4:Run target prefill on𝑥 1:𝑡 and ...

  41. [41]

    Speedup Condition with Grafted Retrieval Let𝐴 0 =𝐴(T 0),𝐴 𝜋 =𝐴(T 𝜋), and𝐴 𝑔 =𝐴(T 𝑔)

    B.3. Speedup Condition with Grafted Retrieval Let𝐴 0 =𝐴(T 0),𝐴 𝜋 =𝐴(T 𝜋), and𝐴 𝑔 =𝐴(T 𝑔). Let Γ𝑔 =𝐴 𝑔 −𝐴 𝜋 (24) be the accepted-length gain brought by grafted retrieval. Since|T𝑔 | ≤ |T 0|, the verification cost of the hybrid tree is no larger than that of the dense tree under the same packing strategy. The hybrid tree improves over pure pruning when (𝐴 𝜋...

  42. [42]

    The source code of this project will be made available at a later time. C.1. Data Configurations We evaluateGraft under three complementary settings: short-context generation, long-context generation, and batched high-concurrency decoding. For short-context evaluation, we follow the standard EAGLE-3 and Spec-Bench setup, covering code generation, mathemat...

  43. [43]

    Vicuna 68M corresponds to the lightweight draft module used by the Vicuna-13B tree-draft pair

    Table 10 lists the key model architecture configurations used in our experiments. Vicuna 68M corresponds to the lightweight draft module used by the Vicuna-13B tree-draft pair. 25 Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding Table 8 | Short-context target and draft model checkpoints.We follow the public EAGLE3-compatible ch...

  44. [44]

    and EAGLE-3 (Li et al., 2025). EAGLE-3 is our primary static-tree instantiation because it provides a strong public tree drafter;Graft changes how the candidate budget is allocated rather than depending on an EAGLE-specific verification rule.Dynamic-tree methodsinclude DDD (Brown et al.,

  45. [45]

    and ECHO (Hu et al., 2026), which reduce wasted drafting through adaptive tree construction. When a baseline was originally implemented for a weaker or older tree backbone, we adapt the scheduling policy to the strongest available public setting whenever possible, so that the comparison focuses on tree construction rather than draft-model strength. For sh...

  46. [46]

    ParallelVLM extends lossless acceleration to video-LLMs by considering visual-alignment-aware parallel speculative decoding (Kong et al., 2026)

    further 30 Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding extends this line of work by introducing vision-aware loose verification for Video-LLMs. ParallelVLM extends lossless acceleration to video-LLMs by considering visual-alignment-aware parallel speculative decoding (Kong et al., 2026). These works suggest that future spe...