Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Pith reviewed 2026-05-20 06:54 UTC · model grok-4.3
The pith
Graft recovers lost acceptance length in speculative decoding by attaching retrieved tokens into positions freed by pruning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Graft couples pruning and retrieval as mutually reinforcing steps in a sequential prune-then-graft mechanism: pruning removes marginal branches to free budget and create topological gaps, while retrieval supplies predictive tokens that are grafted into those gaps with near-zero overhead, thereby recovering the accepted length that pruning would otherwise discard and establishing a new Pareto frontier for speculative decoding.
What carries the argument
The prune-then-graft mechanism, which uses dynamic-depth pruning to free resources and then attaches retrieved tokens into the resulting positions to restore coverage.
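As a rough illustration of the prune-then-graft idea described above, the Python sketch below keeps only draft nodes whose path confidence clears a threshold, then fills the freed slots with retrieved token paths until the budget is reached. The names `Node`, `prune_then_graft`, `threshold`, and `k_max`, along with the toy scores, are assumptions made for illustration. The paper's calibrated pruning checkpoints and retrieval caches are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    tokens: tuple   # token path from the tree root to this node
    score: float    # hypothetical cumulative drafter confidence of the path


def prune_then_graft(draft_nodes, retrieved_paths, k_max, threshold):
    """Toy prune-then-graft: returns (kept draft nodes, grafted token paths)."""
    # Prune: drop low-confidence nodes. The score is assumed to be cumulative
    # (a child never scores above its parent) and draft_nodes is assumed to be
    # listed parents-first, so the surviving nodes still form a connected tree.
    kept = [n for n in draft_nodes if n.score >= threshold][:k_max]
    present = {n.tokens for n in kept}
    free = k_max - len(kept)

    # Graft: add retrieved paths prefix by prefix into the freed slots,
    # reusing any prefix that is already in the tree.
    grafted = []
    for path in retrieved_paths:
        for i in range(1, len(path) + 1):
            prefix = tuple(path[:i])
            if prefix in present:
                continue
            if free == 0:
                return kept, grafted
            grafted.append(prefix)
            present.add(prefix)
            free -= 1
    return kept, grafted


draft = [Node((1,), 0.9), Node((1, 2), 0.6), Node((3,), 0.1), Node((3, 4), 0.05)]
kept, grafted = prune_then_graft(draft, [(1, 2, 7), (5, 6)], k_max=4, threshold=0.5)
print([n.tokens for n in kept], grafted)
```

In this toy call, pruning keeps two draft nodes and frees two slots, which are then filled by one extension of an existing branch and one new root-level branch, so the total stays at the budget of four.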
If this is right
- Up to 5.41× end-to-end speedup on short-context benchmarks.
- Up to 21.8% average speedup improvement over EAGLE-3 on the Qwen3-235B model.
- Consistent gains across short-context, long-context, and large-scale model settings.
- A preliminary extension to DFlash-style block drafting paradigms appears possible.
Where Pith is reading between the lines
- The same compensation pattern could be tested in other tree-search or beam-search settings where pruning creates reusable slots for external information.
- Dynamic retrieval sources might further improve the method if the grafted tokens are drawn from domain-specific corpora rather than general ones.
- The approach suggests that future draft-tree methods could treat pruning not as a pure reduction but as an opening for targeted augmentation.
Load-bearing premise
Retrieved tokens can be attached into pruned positions with near-zero overhead while remaining predictive enough to avoid new verification failures or hidden costs.
What would settle it
A measurement showing that the combined latency of retrieval and attachment plus any drop in acceptance rate exceeds the latency savings from pruning alone.
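One way to make this test concrete is a break-even check on throughput per decoding cycle. The sketch below assumes a simple additive latency model (draft time plus verification time plus retrieval time) and compares tokens emitted per millisecond with and without grafting. The function names and all numbers are hypothetical and are not measurements from the paper.

```python
def tokens_per_ms(tokens_per_cycle, draft_ms, verify_ms, retrieval_ms=0.0):
    """Throughput of one draft-then-verify cycle under an additive latency model."""
    return tokens_per_cycle / (draft_ms + verify_ms + retrieval_ms)


def grafting_pays_off(pruned, grafted):
    """True if the grafted configuration beats pruning alone in tokens per ms."""
    return tokens_per_ms(**grafted) > tokens_per_ms(**pruned)


# Made-up numbers, for illustration only.
pruned = dict(tokens_per_cycle=4.0, draft_ms=6.0, verify_ms=20.0)
cheap_graft = dict(tokens_per_cycle=4.6, draft_ms=6.0, verify_ms=20.5, retrieval_ms=0.5)
costly_graft = dict(tokens_per_cycle=4.1, draft_ms=6.0, verify_ms=21.0, retrieval_ms=2.0)

print(grafting_pays_off(pruned, cheap_graft))
print(grafting_pays_off(pruned, costly_graft))
```

Under these assumed numbers the first comparison favors grafting and the second does not. The second case is the kind of measurement that would count against the load-bearing premise.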
Original abstract
Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential 'prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
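To make the "accepted length" and "lossless" notions in the abstract concrete, here is a minimal sketch of greedy draft verification. It assumes greedy decoding and a toy stand-in for the target model, and it checks paths one token at a time for readability, whereas real systems score all tree nodes in one batched pass. `accepted_length`, `best_accepted`, and `toy_target` are hypothetical names, not the paper's API.

```python
def accepted_length(draft_path, target_next_token):
    """Number of leading draft tokens that match the target model's greedy choice."""
    prefix = []
    for tok in draft_path:
        if target_next_token(tuple(prefix)) != tok:
            break
        prefix.append(tok)
    return len(prefix)


def best_accepted(paths, target_next_token):
    """Longest accepted prefix length over all root-to-leaf paths of a draft tree."""
    return max(accepted_length(p, target_next_token) for p in paths)


# Toy stand-in for the target model: it always continues a fixed sequence.
TARGET = (5, 6, 7, 8)
toy_target = lambda prefix: TARGET[len(prefix)]

pruned_tree = [(5, 6, 9)]
grafted_tree = pruned_tree + [(5, 6, 7)]   # one retrieved branch added
print(best_accepted(pruned_tree, toy_target), best_accepted(grafted_tree, toy_target))
```

Because only tokens that the target model would have chosen are accepted, the output is unchanged. The grafted branch in this toy case simply lets one more token be accepted in the same cycle.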
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Graft, a training-free and lossless framework for speculative decoding that couples dynamic-depth pruning of draft trees with retrieval-based grafting. Pruning frees computational budget by removing marginal branches, after which highly predictive retrieved tokens are attached into the resulting topological gaps via a sequential prune-then-graft mechanism with claimed near-zero overhead. This is asserted to recover acceptance length lost to pruning and break the Pareto tradeoff between tree size and verification cost. Evaluations across short-context, long-context, and large-scale models (including Qwen3-235B) report up to 5.41× speedup and up to 21.8% average improvement over EAGLE-3, with a preliminary extension to DFlash-style block drafting.
Significance. If the claims hold, the work meaningfully advances practical LLM inference by resolving the VRAM/compute overheads of expansive draft trees through mutually reinforcing pruning and retrieval. The training-free, lossless character and the concrete, falsifiable speedups across diverse deployment regimes constitute clear strengths. The preliminary exploration of grafting beyond standard autoregressive trees supplies useful initial evidence for broader applicability.
major comments (1)
- [§3.2] §3.2 (prune-then-graft mechanism): The manuscript states that grafting retrieved tokens into pruned positions incurs near-zero overhead and fully recovers acceptance length without new verification failures, yet provides no explicit accounting of any incremental forward-pass cost during verification or any shift in the draft distribution induced by topological gap-filling. Because the reported speedups (5.41× and +21.8 % over EAGLE-3) rest directly on this assumption, a quantitative breakdown of VRAM versus compute overhead before and after grafting is required to substantiate the central claim.
minor comments (2)
- [Table 2] Table 2: Speedup figures are presented without error bars or the number of independent runs; adding these would strengthen the cross-method comparisons.
- [Figure 4] Figure 4: Axis labels and legend entries for the long-context setting are slightly compressed; increasing font size or splitting the legend would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of our work. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our central claims.
Point-by-point responses
-
Referee: [§3.2] §3.2 (prune-then-graft mechanism): The manuscript states that grafting retrieved tokens into pruned positions incurs near-zero overhead and fully recovers acceptance length without new verification failures, yet provides no explicit accounting of any incremental forward-pass cost during verification or any shift in the draft distribution induced by topological gap-filling. Because the reported speedups (5.41× and +21.8 % over EAGLE-3) rest directly on this assumption, a quantitative breakdown of VRAM versus compute overhead before and after grafting is required to substantiate the central claim.
Authors: We agree that an explicit quantitative breakdown would improve clarity and substantiation. In the current manuscript, the near-zero overhead is justified by the sequential design: pruning first releases budget (in terms of reduced tree width and verification cost), which is then reallocated to grafting without introducing extra model calls. The grafted tokens are inserted into the existing tree topology and verified in the same batched forward pass as the pruned tree. This preserves the original verification logic and acceptance criteria, with no new failures introduced because grafted positions are treated identically to original branches. Retrieval selection is based on high-probability matches that align with the model's predictive distribution, minimizing any shift. To directly address the request, we will revise §3.2 to include a dedicated quantitative analysis with tables reporting VRAM usage, additional compute (FLOPs and latency), and acceptance-length recovery before versus after grafting, measured on the same hardware and models as the main experiments (including Qwen3-235B).
Revision: yes
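The rebuttal's point that grafted tokens are verified in the same batched forward pass can be pictured with the ancestor-only attention mask commonly used for tree-based speculative decoding. The sketch below builds such a mask from parent pointers in plain Python. The `parents` encoding and the function name are assumptions for illustration, and the paper's actual adjacency-matrix layout may differ.

```python
def tree_attention_mask(parents):
    """Build mask[i][j] = True when node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 if the node hangs
    directly off the verified prefix. Each node sees itself and its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Nodes 0..2 come from the pruned draft tree; node 3 is a grafted token
# attached under node 1 (a hypothetical layout).
parents = [-1, 0, 0, 1]
mask = tree_attention_mask(parents)
for row in mask:
    print(row)
```

In this representation a grafted node is just one more row and column of the mask, which is consistent with the claim that it needs no extra model call. It still adds one position to the verification batch, which is the cost the referee asks the authors to quantify.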
Circularity Check
No significant circularity; method is self-contained with external empirical validation
Full rationale
The paper presents Graft as a training-free, lossless prune-then-graft mechanism for speculative decoding trees, with speedups measured directly against independent baselines such as EAGLE-3. No equations, fitted parameters, or self-citations are shown in the provided text that reduce the central claims or reported speedups to tautological inputs by construction. The derivation relies on the described algorithmic steps and external comparisons rather than self-referential definitions or imported uniqueness theorems.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Graft first performs confidence-based pruning through calibrated pruning checkpoints... then grafts retrieval-based branches into the released slots... T_s = T_draft_s ∪ G_ret_s, |T_s| = K_max
-
IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Graft establishes a new Pareto frontier... up to 5.41× speedup... improves average speedup over EAGLE-3 by up to 21.8%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31...
doi:10.18653/v1/2024.acl-long.172, 2024
-
[2]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,
-
[3]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774,
-
[4]
Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318,
-
[5]
Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, and Michael Qizhe Shieh. Rapid: Long-context inference with retrieval-augmented speculative decoding. arXiv preprint arXiv:2502.20330,
-
[6]
Jian Chen, Yesheng Liang, and Zhijian Liu. DFlash: Block Diffusion for Flash Speculative Decoding. arXiv preprint arXiv:2602.06036,
-
[7]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,
-
[8]
PACT: Parameterized Clipping Activation for Quantized Neural Networks
URL https://lmsys.org/blog/2023-03-30-vicuna/. Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085,
-
[9]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
-
[10]
Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057,
-
[11]
Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737,
-
[12]
URL https://arxiv.org/abs/2407.21783. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
-
[13]
Rest: Retrieval-based speculative decoding
Zhenyu He, Zexuan Zhong, Tianle Cai, Jason Lee, and Di He. Rest: Retrieval-based speculative decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1582–1595,
-
[14]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,
-
[15]
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, and Mingcheng Wan. Echo: Elastic speculative decoding with sparse gating for high-concurrency scenarios. arXiv preprint arXiv:2604.09603,
-
[16]
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, and Bita Rouhani. Accelerating rl post-training rollouts via system-integrated speculative decoding.arXi...
-
[17]
SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning
Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, and Huan Li. SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 72...
-
[18]
Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.366. URL https://aclanthology.org/2025.emnlp-main.366/. Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, and Huan Li. See the forest for the trees: Loosely speculative decoding via visual-semantic guidance for efficient inference of vi...
-
[19]
URL https://arxiv.org/abs/2604.05650. Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, and Cong Wang. Parallelvlm: Lossless video-llm acceleration with visual alignment aware parallel speculative decoding. arXiv preprint arXiv:2603.19610,
-
[20]
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024a. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858, 2024b. Yuhui Li, Fangyun Wei, Chao Z...
-
[21]
Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, and Jinsong Su. Spec-rl: Accelerating on-policy reinforcement learning with speculative rollouts. arXiv preprint arXiv:2509.23232, 2025a. Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, and Winston Hu. Parallel speculative decoding with adaptive draft...
-
[22]
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, and Xiao Sun. Logitspec: Accelerating retrieval-based speculative decoding via next next token speculation. arXiv preprint arXiv:2507.01449, 2025b. Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, and Xiaoyan Sun. Talon: Confidence-aware speculative decoding with adaptive token trees. arXiv preprint arXiv:2601.07353, 2026a...
-
[23]
Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, and Roy Schwartz. Dynamic speculation lookahead accelerates speculative decoding of large language models. arXiv preprint arXiv:2405.04304,
-
[24]
Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lang...
-
[25]
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, and Mingxing Zhang. Seer: Online context learning for fast synchronous llm reinforcement learning. arXiv preprint arXiv:2511.14617,
-
[26]
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, and Beidi Chen. Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. arXiv preprint arXiv:2408.11049,
-
[27]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
URL https://github.com/apoorvumang/prompt-lookup-decoding/. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[28]
SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, and Cong Wang. Speculative decoding via hybrid drafting and rollback-aware branch parallelism. arXiv preprint arXiv:2506.01979,
-
[29]
Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, and Cong Wang. Double: Breaking the acceleration limit via double retrieval speculative parallelism. arXiv preprint arXiv:2601.05524,
-
[30]
KNN-SSD: Enabling dynamic self-speculative decoding via nearest neighbor layer set optimization
Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, and Sujian Li. KNN-SSD: Enabling dynamic self-speculative decoding via nearest neighbor layer set optimization. In Vera Demberg, Kentaro Inui, and Lluís Marquez (eds.), Findings of the Association for Computational Linguistics: EACL 2026, pp. 641–655, Rabat, Morocco, March
-
[31]
Association for Computational Linguistics. ISBN 979-8-89176-386-9. doi: 10.18653/v1/2026.findings-eacl.31. URL https://aclanthology.org/2026.findings-eacl.31/. Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31,
-
[32]
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912,
-
[33]
Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, and Kun Xia. Specpv: Improving self-speculative decoding for long-context generation via partial verification. arXiv preprint arXiv:2512.02337,
-
[34]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alp...
-
[35]
Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122,
-
[36]
Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, and Zilong Zheng. Tokenswift: Lossless acceleration of ultra long sequence generation. arXiv preprint arXiv:2502.18890,
-
[37]
Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851,
-
[38]
URL https://arxiv.org/abs/2410.06916. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, and Bo An. Longspec: Long-context lossless speculative decodi...
-
[39]
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
URL https://arxiv.org/abs/2604.05546. Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, and Kai Yu. Adaeagle: Optimizing speculative decoding via explicit modeling of adaptive draft structures. arXiv preprint arXiv:2412.18910, 2024b. Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, and Zhaopeng Tu. Draft model knows when...
-
[40]
Algorithm 1: Graft Generation
1: Input: prefix $x_{1:t}$, target model $p$, tree drafter $q$, budget $K_{\max}$, adjacency matrix $M$
2: Ensure $M$ is allocated on GPU; initialize empty rows if needed
3: Precompute retrieval-template caches for pruning stages $s \in \{d_0, d_1, d_5\}$
4: Run target prefill on $x_{1:t}$ and ...
-
[41]
B.3. Speedup Condition with Grafted Retrieval
Let $A_0 = A(\mathcal{T}_0)$, $A_\pi = A(\mathcal{T}_\pi)$, and $A_g = A(\mathcal{T}_g)$. Let
$$\Gamma_g = A_g - A_\pi \tag{24}$$
be the accepted-length gain brought by grafted retrieval. Since $|\mathcal{T}_g| \le |\mathcal{T}_0|$, the verification cost of the hybrid tree is no larger than that of the dense tree under the same packing strategy. The hybrid tree improves over pure pruning when ($A_\pi$...
-
[42]
The source code of this project will be made available at a later time.
C.1. Data Configurations
We evaluate Graft under three complementary settings: short-context generation, long-context generation, and batched high-concurrency decoding. For short-context evaluation, we follow the standard EAGLE-3 and Spec-Bench setup, covering code generation, mathemat...
-
[43]
Table 10 lists the key model architecture configurations used in our experiments. Vicuna 68M corresponds to the lightweight draft module used by the Vicuna-13B tree-draft pair.
Table 8 | Short-context target and draft model checkpoints. We follow the public EAGLE3-compatible ch...
-
[44]
and EAGLE-3 (Li et al., 2025). EAGLE-3 is our primary static-tree instantiation because it provides a strong public tree drafter; Graft changes how the candidate budget is allocated rather than depending on an EAGLE-specific verification rule. Dynamic-tree methods include DDD (Brown et al.,
-
[45]
and ECHO (Hu et al., 2026), which reduce wasted drafting through adaptive tree construction. When a baseline was originally implemented for a weaker or older tree backbone, we adapt the scheduling policy to the strongest available public setting whenever possible, so that the comparison focuses on tree construction rather than draft-model strength. For sh...
-
[46]
further extends this line of work by introducing vision-aware loose verification for Video-LLMs. ParallelVLM extends lossless acceleration to video-LLMs by considering visual-alignment-aware parallel speculative decoding (Kong et al., 2026). These works suggest that future spe...