pith. sign in

arxiv: 2606.18394 · v3 · pith:XY7LTV3Cnew · submitted 2026-06-16 · 💻 cs.CL

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Pith reviewed 2026-06-27 00:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords speculative decodinglarge language modelsdraft headtree speculative decodinginference accelerationparallel draftingcausal conditioningLLM serving
0
0 comments X

The pith

JetSpec trains a causal parallel draft head on fused target hidden states to generate branch-conditioned trees whose scores match the target LLM's autoregressive factorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding accelerates LLMs by drafting tokens for parallel verification, but prior methods hit a ceiling: autoregressive drafters keep causality but scale poorly with tree depth, while bidirectional drafters run fast but produce inconsistent trees. JetSpec resolves the dilemma by training one draft head that conditions each branch on prior choices yet computes everything in a single forward pass. The resulting trees convert extra draft budget into longer accepted prefixes instead of wasted verification. This yields measured speedups of 9.64x on math tasks and 4.58x on chat workloads when the target model stays frozen.

Core claim

JetSpec is a head-based speculative decoding framework that trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This alignment lets larger draft budgets become longer accepted sequences and higher end-to-end speedups, outperforming both bidirectional-head and tree-based baselines on math, coding, and chat benchmarks across dense and MoE Qwen3 models.

What carries the argument

The causal parallel draft head trained over fused hidden states from the frozen target model, which produces branch-wise causally conditioned candidate trees in one forward pass.

If this is right

  • Larger draft budgets translate into longer accepted prefixes rather than wasted verification steps.
  • The same frozen target model can be paired with the draft head on both dense and MoE architectures without retraining the base weights.
  • End-to-end latency improves on H100 GPUs for both short math problems and longer conversational sequences.
  • Integration with production serving engines such as vLLM further reduces latency under realistic multi-request loads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion-and-causal-head pattern could be tested on sequence models outside language, such as time-series or protein generators.
  • If the alignment property holds at larger scales, the draft head might be reused across multiple target models of similar architecture.
  • The method implies that explicit tree consistency can be restored without paying the full cost of sequential autoregressive drafting.

Load-bearing premise

The scores assigned by the draft head to its candidate trees will match the probabilities the frozen target model would compute under its own autoregressive factorization.

What would settle it

A controlled run in which acceptance length stops rising or falls as the draft budget is increased, or in which measured wall-clock latency fails to drop despite higher theoretical acceptance.

Figures

Figures reproduced from arXiv: 2606.18394 by Bojun Wang, Daxin Jiang, Haoran Yuan, Hao Zhang, Lanxiang Hu, Peng Zhao, Tajana Rosing, Yibo Zhu, Yujie Zhao, Yulun Wu, Yu-Yang Qian, Zhaoxiang Feng.

Figure 1
Figure 1. Figure 1: End-to-end decoding speedup over standard autoregressive decoding on H100 GPUs across [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Expected speculative decoding speedup scales as a function of draft length [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: JetSpec design overview. JetSpec extracts fused hidden features from the frozen target model and conditions a causal-parallel draft head to generate high-quality candidate trees in one forward pass. Why Causality Matters. A key requirement for parallel tree drafting is that each node distribution should be conditioned on its own branch prefix. Consider a draft tree rooted at the original prefix x. Each nod… view at source ↗
Figure 4
Figure 4. Figure 4: Tree-quality failure mode at MATH-500 prompt #0, decode step 0. Both heads draft from the same prefix (last token “We”). The causal head’s rank-1 branch (“ are told that”) is faithful: target joint Σ log p ≈ Σ log r, so tree verification walks 6 tokens along it. The diffusion head’s rank-1 branch (“ given told that”) is incoherent (target joint Σ log p = −63.32 nats, i.e. probability ≈e −63) because its br… view at source ↗
Figure 5
Figure 5. Figure 5: Causal attention mask used for training with multiple sampled blocks. Each query can [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Each sampled block includes an anchor position and multiple future token positions. The [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 2
Figure 2. Figure 2: Lower is better. L\N 1 2 4 8 16 32 64 128 256 512 128 15.098 6.810 3.396 1.683 0.846 0.422 0.211 0.106 0.053 0.034 256 15.042 6.735 3.329 1.663 0.837 0.419 0.210 0.105 0.052 0.034 512 15.083 6.710 3.327 1.679 0.843 0.421 0.208 0.105 0.052 0.034 1024 15.029 6.720 3.327 1.690 0.845 0.420 0.209 0.104 0.054 0.035 2048 15.031 6.664 3.347 1.668 0.831 0.415 0.211 0.110 0.064 0.036 4096 18.295 8.796 4.411 2.233 1.… view at source ↗
read the original abstract

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces JetSpec, a speculative decoding framework that trains a single causal parallel draft head on fused hidden states from a frozen target LLM. This produces tree-structured candidate sequences whose per-token scores are intended to align with the target's autoregressive factorization, allowing larger draft budgets to translate into higher acceptance lengths without the inconsistency problems of bidirectional drafters or the depth-dependent cost of autoregressive drafters. Empirical results on Qwen3 dense and MoE models across math, coding, and chat benchmarks report consistent outperformance over bidirectional-head and tree-based baselines, with peak speedups of 9.64× on MATH-500 and 4.58× on conversational workloads on H100 GPUs, plus further gains when integrated with vLLM.

Significance. If the claimed alignment between the parallel draft head and the target factorization holds under verification and the reported speedups prove robust to implementation details, the work would meaningfully advance the scaling behavior of speculative decoding by removing a long-standing causality-efficiency trade-off. The open release of code and models strengthens the potential for follow-on work.

major comments (3)
  1. [Abstract, §3] Abstract and §3: The central claim that the trained causal parallel draft head 'produces candidate trees whose scores align with the target model's autoregressive factorization' is load-bearing for correct tree acceptance and the reported speedups, yet no loss term, auxiliary objective, or derivation is provided that enforces exact equivalence rather than approximate matching via standard training; any systematic mismatch would invalidate the verification step.
  2. [§4] §4 and experimental sections: The manuscript reports large speedups (e.g., 9.64× on MATH-500) but the provided text contains no error bars, statistical significance tests, or ablation studies isolating the contribution of the alignment property versus other implementation choices; without these, it is impossible to assess whether the scaling-ceiling claim is supported by the data.
  3. [§3.2] §3.2: The description of how fused hidden states are used to produce branch-wise causal conditioning does not include an explicit statement or proof that the resulting tree scores remain consistent with the target's left-to-right factorization when the draft tree contains multiple branches; this consistency is required to avoid the inconsistency problem the paper attributes to bidirectional methods.
minor comments (2)
  1. [§3] Notation for the draft head output (e.g., the precise definition of the per-position logits used for tree scoring) is introduced without a dedicated equation number, making cross-references in later sections difficult to follow.
  2. [Figure 2] Figure captions for the tree-construction diagrams do not explicitly label which nodes are accepted versus rejected under the verification procedure, reducing clarity when comparing acceptance lengths across methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thorough review and constructive suggestions. We will revise the manuscript to address the concerns regarding the theoretical justification of the alignment, the experimental rigor, and the consistency proof. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3: The central claim that the trained causal parallel draft head 'produces candidate trees whose scores align with the target model's autoregressive factorization' is load-bearing for correct tree acceptance and the reported speedups, yet no loss term, auxiliary objective, or derivation is provided that enforces exact equivalence rather than approximate matching via standard training; any systematic mismatch would invalidate the verification step.

    Authors: We thank the referee for highlighting this important point. The alignment is intended to arise from the causal training of the draft head on fused hidden states, which conditions each branch on preceding tokens in a manner consistent with autoregressive factorization. However, we agree that the manuscript lacks an explicit derivation or specialized loss term to guarantee exact equivalence. In the revision, we will add a detailed explanation of the training objective and a proof sketch showing how the branch-consistent scores maintain consistency with the target's left-to-right probabilities. This will be incorporated as an addition to Section 3. revision: yes

  2. Referee: [§4] §4 and experimental sections: The manuscript reports large speedups (e.g., 9.64× on MATH-500) but the provided text contains no error bars, statistical significance tests, or ablation studies isolating the contribution of the alignment property versus other implementation choices; without these, it is impossible to assess whether the scaling-ceiling claim is supported by the data.

    Authors: We acknowledge the absence of error bars, significance tests, and targeted ablations in the current version. To address this, we will rerun the experiments with multiple random seeds to report means and standard deviations, include p-values for comparisons, and add ablations that disable the causal alignment mechanism to isolate its contribution. These will be added to Section 4 and the experimental results. revision: yes

  3. Referee: [§3.2] §3.2: The description of how fused hidden states are used to produce branch-wise causal conditioning does not include an explicit statement or proof that the resulting tree scores remain consistent with the target's left-to-right factorization when the draft tree contains multiple branches; this consistency is required to avoid the inconsistency problem the paper attributes to bidirectional methods.

    Authors: We agree that an explicit statement and proof of consistency for multi-branch trees would clarify the method. The branch-wise causal conditioning is designed such that each path in the tree is conditioned only on its own prefix, preserving the autoregressive property. We will revise Section 3.2 to include a formal statement and a short proof demonstrating that the joint probability of any path equals the product of conditional probabilities under the target model, thereby ensuring consistency across branches. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training method with external validation

full rationale

The paper presents JetSpec as a training-based method where a causal parallel draft head is trained on fused hidden states from a frozen target model to produce trees whose scores align with autoregressive factorization. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to its inputs by construction. Performance claims rest on benchmark comparisons rather than internal self-referential quantities. The alignment is described as an outcome of training rather than an enforced identity or fitted parameter renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters, axioms, or invented entities with precision; the method relies on standard assumptions that a frozen target model can supply useful hidden states and that training can align draft scores to autoregressive factorization.

axioms (1)
  • domain assumption The target model remains frozen while training the draft head on its hidden states.
    Explicitly referenced in the abstract description of the training procedure.

pith-pipeline@v0.9.1-grok · 5849 in / 1306 out tokens · 41866 ms · 2026-06-27T00:40:04.395228+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 14 linked inside Pith

  1. [1]

    On-policy distillation of language models: Learning from self- generated mistakes, 2024

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self- generated mistakes, 2024

  2. [2]

    Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026

    StepFun AI. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026

  3. [3]

    Le, and Charles Sutton

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  4. [4]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  5. [5]

    Code Alpaca: An instruction-following LLaMA model trained on code generation instructions.https://github.com/sahil280114/codealpaca, 2023

    Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model trained on code generation instructions.https://github.com/sahil280114/codealpaca, 2023

  6. [6]

    Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

  7. [7]

    Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

    Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

  8. [8]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  9. [9]

    Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  11. [11]

    DeepSeek-V4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

    DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

  12. [12]

    Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?, 2025

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-ho...

  13. [13]

    LayerSkip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. LayerSkip: Enabling early exit inference and self-speculative decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual M...

  14. [14]

    Break the sequential dependency of LLM inference using lookahead decoding

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 14060–14079. PMLR, 2024

  15. [15]

    Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Syn- naeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

  16. [16]

    Direct alignment of draft model for speculative decoding with chat-fine-tuned llms.arXiv preprint arXiv:2403.00858, 2024

    Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. Direct alignment of draft model for speculative decoding with chat-fine-tuned llms.arXiv preprint arXiv:2403.00858, 2024

  17. [17]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations, 2024

  18. [18]

    Lee, and Di He

    Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He. REST: Retrieval-based speculative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1582–1595, 2024

  19. [19]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021

  20. [20]

    Fast and accurate causal parallel decoding using jacobi forcing.arXiv preprint arXiv:2512.14681, 2025

    Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, and Hao Zhang. Fast and accurate causal parallel decoding using jacobi forcing.arXiv preprint arXiv:2512.14681, 2025

  21. [21]

    SAM decoding: Speculative decoding via suffix automaton

    Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, and Jing Zhang. SAM decoding: Speculative decoding via suffix automaton. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 121...

  22. [22]

    Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

  23. [23]

    CLLMs: Consistency large language models

    Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. CLLMs: Consistency large language models. InProceedings of the 41st International Conference on Machine Learning, 2024

  24. [24]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 19274–19286. PMLR, 2023

  25. [25]

    Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024

  26. [26]

    Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

  27. [27]

    Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

  28. [28]

    Yihao Liang, Ze Wang, Hao Chen, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Emad Barsoum, Zicheng Liu, and Niraj K. Jha. CD4LM: Consistency distillation and adaptive decoding for diffusion language models.arXiv preprint arXiv:2601.02236, 2026

  29. [29]

    Online speculative decoding

    Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. Online speculative decoding. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 31131–31146. PMLR, 2024. 12

  30. [30]

    ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

    Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational L...

  31. [31]

    Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification. InProceedings of the...

  32. [32]

    Nemotron post-training dataset v2

    NVIDIA. Nemotron post-training dataset v2. https://huggingface.co/datasets/ nvidia/Nemotron-Post-Training-Dataset-v2, 2025. Dataset

  33. [33]

    Suffixdecoding: Extreme speculative decoding for emerging ai applications.arXiv preprint arXiv:2411.04975, 2026

    Gabriele Oliaro, Zhihao Jia, Daniel Campos, and Aurick Qiao. Suffixdecoding: Extreme speculative decoding for emerging ai applications.arXiv preprint arXiv:2411.04975, 2026

  34. [34]

    Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

    Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InF orty-second International Conference on Machine Learning, 2025

  35. [35]

    d3LLM: Ultra-fast diffusion LLM using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026

    Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3LLM: Ultra-fast diffusion LLM using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026

  36. [36]

    Accelerating speculative decoding with block diffusion draft trees.arXiv preprint arXiv:2604.12989, 2026

    Liran Ringel and Yaniv Romano. Accelerating speculative decoding with block diffusion draft trees.arXiv preprint arXiv:2604.12989, 2026

  37. [37]

    Prompt lookup decoding

    Apoorv Saxena. Prompt lookup decoding. https://github.com/apoorvumang/ prompt-lookup-decoding, 2023. GitHub repository

  38. [38]

    Pld+: Accelerating llm inference by leveraging language model artifacts.arXiv preprint arXiv:2412.01447, 2024

    Shwetha Somasundaram, Anirudh Phukan, and Apoorv Saxena. Pld+: Accelerating llm inference by leveraging language model artifacts.arXiv preprint arXiv:2412.01447, 2024

  39. [39]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022

  40. [40]

    Fast-dllm v2: Efficient block-diffusion llm

    Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm. arXiv preprint arXiv:2509.26328, 2025

  41. [41]

    Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

    LLM-Core Xiaomi. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

  42. [42]

    Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  43. [43]

    Narasimhan

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R. Narasimhan. τ-bench: A bench- mark for tool-agent-user interaction in real-world domains. InThe Thirteenth International Conference on Learning Representations, 2025. 13

  44. [44]

    d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025

  45. [45]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  46. [46]

    Distillspec: Improving speculative decoding via knowledge distillation

    Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Ros- tamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. InInternational Conference on Learning Representations, 2024

  47. [47]

    We”). The causal head’s rank-1 branch (“ are told that

    Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, XWang, Lin Sun, Xiangzheng Zhang, and Sujian Li. Chain-of-thought matters: Improving long- context language models with reasoning path supervision. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computatio...