JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Bojun Wang; Daxin Jiang; Haoran Yuan; Hao Zhang; Lanxiang Hu; Peng Zhao; Tajana Rosing; Yibo Zhu; Yujie Zhao; Yulun Wu

arxiv: 2606.18394 · v3 · pith:XY7LTV3Cnew · submitted 2026-06-16 · 💻 cs.CL

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Peng Zhao, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

Pith reviewed 2026-06-27 00:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords speculative decoding · large language models · draft head · tree speculative decoding · inference acceleration · parallel drafting · causal conditioning · LLM serving

The pith

JetSpec trains a causal parallel draft head on fused target hidden states to generate branch-conditioned trees whose scores match the target LLM's autoregressive factorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding accelerates LLMs by drafting tokens for parallel verification, but prior methods hit a ceiling: autoregressive drafters keep causality but their drafting cost grows with tree depth, while bidirectional drafters run in one pass but produce mutually inconsistent trees. JetSpec resolves the dilemma by training one draft head that conditions each branch on its own prior choices yet computes everything in a single forward pass. The resulting trees convert extra draft budget into longer accepted prefixes instead of wasted verification. With the target model frozen, the paper reports speedups of up to 9.64x on MATH-500 and 4.58x on open-ended conversational workloads on H100 GPUs.

Core claim

JetSpec is a head-based speculative decoding framework that trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This alignment lets larger draft budgets become longer accepted sequences and higher end-to-end speedups, outperforming both bidirectional-head and tree-based baselines on math, coding, and chat benchmarks across dense and MoE Qwen3 models.

What carries the argument

The causal parallel draft head trained over fused hidden states from the frozen target model, which produces branch-wise causally conditioned candidate trees in one forward pass.
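Branch-wise causal conditioning can be pictured as an attention mask in which every tree node sees only itself and its own ancestors. The following is a minimal sketch of that idea, not the paper's implementation: the function name `tree_attention_mask`, the parent-array encoding, and the toy tree are all assumptions made for illustration.

```python
import numpy as np


def tree_attention_mask(parents):
    """Return a boolean mask M where M[i, j] is True iff node j is node i
    itself or one of its ancestors in the draft tree.

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the shared prefix. Assumes parents[i] < i, i.e. nodes
    are listed in topological order (an assumption of this sketch).
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i, p in enumerate(parents):
        mask[i, i] = True          # every node attends to itself
        if p >= 0:
            mask[i] |= mask[p]     # inherit everything the parent can see
    return mask


# Hypothetical 5-node tree with two branches:
#   0 -> 1 -> 3
#   0 -> 2 -> 4
parents = [-1, 0, 0, 1, 2]
mask = tree_attention_mask(parents)
print(mask.astype(int))
```

Under a mask like this, node 3 never sees node 2, so its distribution depends only on the branch it belongs to, while all five nodes can still be scored in the same forward pass.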

If this is right

Larger draft budgets translate into longer accepted prefixes rather than wasted verification steps.
The same frozen target model can be paired with the draft head on both dense and MoE architectures without retraining the base weights.
End-to-end latency improves on H100 GPUs for both short math problems and longer conversational sequences.
Integration with production serving engines such as vLLM further reduces latency under realistic multi-request loads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion-and-causal-head pattern could be tested on sequence models outside language, such as time-series or protein generators.
If the alignment property holds at larger scales, the draft head might be reused across multiple target models of similar architecture.
The method implies that explicit tree consistency can be restored without paying the full cost of sequential autoregressive drafting.

Load-bearing premise

The scores assigned by the draft head to its candidate trees will match the probabilities the frozen target model would compute under its own autoregressive factorization.
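Written out, the premise compares factorizations of a path of depth d below the prefix. This is a sketch in the spirit of the Figure 4 caption, which uses p for the target and r for the draft head; the symbols x, y_i, d, and the per-position marginal r_i are assumptions introduced here and may not match the paper's notation.

```latex
\begin{align*}
\log p(y_{1:d} \mid x) &= \sum_{i=1}^{d} \log p(y_i \mid x, y_{<i})
  && \text{(target, autoregressive)} \\
\log r(y_{1:d} \mid x) &= \sum_{i=1}^{d} \log r(y_i \mid x, y_{<i})
  && \text{(branch-conditioned draft score)} \\
\log \tilde{r}(y_{1:d} \mid x) &= \sum_{i=1}^{d} \log r_i(y_i \mid x)
  && \text{(branch-agnostic marginals)}
\end{align*}
```

The premise is that the second line tracks the first closely on the paths the tree actually contains. The third line drops the dependence on y_{<i}, which is how individually plausible tokens can combine into a path the target considers very unlikely.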

What would settle it

A controlled run in which acceptance length stops rising or falls as the draft budget is increased, or in which measured wall-clock latency fails to drop despite higher theoretical acceptance.
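A simplified acceptance model that is common in the speculative decoding literature shows what such a run would be compared against. It is not the paper's analysis: it assumes a single chain of k drafted tokens, each accepted independently with probability alpha, and it ignores any growth in verification cost. The values of `ALPHA` and `COST_PER_PASS` are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step for a length-k chain draft.

    Assumes each drafted token is accepted independently with probability
    alpha and that the target model always contributes one extra token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding when drafting costs `draft_cost`
    target-model forward passes per step."""
    return expected_tokens_per_step(alpha, k) / (1.0 + draft_cost)


ALPHA = 0.8           # hypothetical per-token acceptance rate
COST_PER_PASS = 0.05  # hypothetical cost of one draft pass vs. one target pass

for k in (2, 4, 8, 16, 32, 64):
    # Sequential drafter: one draft pass per drafted token.
    sequential = modeled_speedup(ALPHA, k, COST_PER_PASS * k)
    # One-pass drafter: all k tokens come from a single draft pass.
    one_pass = modeled_speedup(ALPHA, k, COST_PER_PASS)
    print(f"k={k:3d}  sequential={sequential:.2f}  one-pass={one_pass:.2f}")
```

In this model the expected accepted length saturates at 1 / (1 - alpha), so the sequential curve peaks and then declines as drafting cost keeps growing, while the one-pass curve only flattens. A measured acceptance length that flattens well below what the acceptance rate would allow, or wall-clock gains that lag the modeled ones, would point to the failure the paragraph above describes.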

Figures

Figures reproduced from arXiv: 2606.18394 by Bojun Wang, Daxin Jiang, Haoran Yuan, Hao Zhang, Lanxiang Hu, Peng Zhao, Tajana Rosing, Yibo Zhu, Yujie Zhao, Yulun Wu, Yu-Yang Qian, Zhaoxiang Feng.

**Figure 2.** Expected speculative decoding speedup scales as a function of draft length

**Figure 3.** JetSpec design overview. JetSpec extracts fused hidden features from the frozen target model and conditions a causal-parallel draft head to generate high-quality candidate trees in one forward pass.

Why Causality Matters. A key requirement for parallel tree drafting is that each node distribution should be conditioned on its own branch prefix. Consider a draft tree rooted at the original prefix x. Each nod…

**Figure 4.** Tree-quality failure mode at MATH-500 prompt #0, decode step 0. Both heads draft from the same prefix (last token “We”). The causal head’s rank-1 branch (“ are told that”) is faithful: target joint Σ log p ≈ Σ log r, so tree verification walks 6 tokens along it. The diffusion head’s rank-1 branch (“ given told that”) is incoherent (target joint Σ log p = −63.32 nats, i.e. probability ≈ e^−63) because its br…

**Figure 5.** Causal attention mask used for training with multiple sampled blocks. Each query can

**Figure 6.** Each sampled block includes an anchor position and multiple future token positions. The

**Figure 2.** Lower is better.

| L \ N | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 15.098 | 6.810 | 3.396 | 1.683 | 0.846 | 0.422 | 0.211 | 0.106 | 0.053 | 0.034 |
| 256 | 15.042 | 6.735 | 3.329 | 1.663 | 0.837 | 0.419 | 0.210 | 0.105 | 0.052 | 0.034 |
| 512 | 15.083 | 6.710 | 3.327 | 1.679 | 0.843 | 0.421 | 0.208 | 0.105 | 0.052 | 0.034 |
| 1024 | 15.029 | 6.720 | 3.327 | 1.690 | 0.845 | 0.420 | 0.209 | 0.104 | 0.054 | 0.035 |
| 2048 | 15.031 | 6.664 | 3.347 | 1.668 | 0.831 | 0.415 | 0.211 | 0.110 | 0.064 | 0.036 |
| 4096 | 18.295 | 8.796 | 4.411 | 2.233 | 1.… | | | | | |

Original abstract

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
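The tree verification step that the abstract refers to can be sketched for the greedy (temperature 0) case. This is a toy illustration under stated assumptions, not JetSpec's or vLLM's code: `longest_accepted_path`, the dictionary-based tree encoding, and the example tokens are hypothetical, and `target_next` stands in for the target model's argmax predictions obtained from one verification pass over the whole tree.

```python
def longest_accepted_path(children, tokens, target_next, root=0):
    """Greedily walk a draft tree and return the tokens to emit this step.

    children:    node id -> list of child node ids
    tokens:      node id -> drafted token at that node (root has none)
    target_next: node id -> token the target model predicts after the
                 path ending at that node (hypothetical verification output)
    """
    emitted = []
    node = root
    while True:
        wanted = target_next[node]
        # Look for a drafted child that matches the target's choice.
        match = next((c for c in children.get(node, []) if tokens[c] == wanted), None)
        emitted.append(wanted)  # the target's token is always kept
        if match is None:
            return emitted      # no drafted child matched: stop here
        node = match            # token was drafted correctly: go deeper


# Hypothetical tree rooted at node 0 (the existing prefix):
#   0 -> 1 (" are")   -> 3 (" told")
#   0 -> 2 (" given") -> 4 (" that")
children = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}
tokens = {1: " are", 2: " given", 3: " told", 4: " that"}
target_next = {0: " are", 1: " told", 2: " that", 3: " that", 4: "."}

print(longest_accepted_path(children, tokens, target_next))
```

Here two drafted tokens are accepted and the target supplies a third, so one verification pass yields three tokens. A branch whose tokens are individually plausible but do not follow from their own parent is never reached by the walk, which is the wasted budget the abstract attributes to branch-agnostic drafters.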

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JetSpec trains a causal parallel draft head on fused states to get consistent trees in one forward pass, which looks like a workable fix for the causality-efficiency trade-off in speculative decoding.

JetSpec claims to break the scaling limit in speculative decoding by training a draft head that stays causal across branches while still drafting everything in one pass. The core move is fusing hidden states from the frozen target model and training the head to output scores that line up with the target's autoregressive factorization, so larger draft trees actually convert into longer accepted sequences instead of wasting budget on inconsistent branches.

What stands out is the explicit attempt to solve the stated dilemma: autoregressive tree drafters keep causality but scale poorly with depth, while bidirectional heads are cheap but produce mutually inconsistent candidates. JetSpec's one-forward approach with branch-wise conditioning is a direct response, and the reported speedups (up to 9.64x on MATH-500, 4.58x on chat) plus vLLM integration suggest the method delivers measurable gains on dense and MoE models.

The main soft spot is the alignment assumption. The method needs the trained head's tree scores to match the target's factorization closely enough that verification doesn't reject valid paths or accept invalid ones. The description relies on training to achieve this rather than a derivation or loss term that enforces exact equivalence, so any residual mismatch would undercut the acceptance logic and the claimed speedups. The abstract gives benchmark numbers but no ablations, error bars, or direct checks on how well the scores align, which leaves the central claim harder to evaluate.
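A direct alignment check of the kind asked for here could be as simple as comparing summed log-scores along each drafted path. The sketch below is an assumption about what such a diagnostic might look like, not something the paper reports; `path_alignment_gap` and the numbers are hypothetical.

```python
import math


def path_alignment_gap(draft_logps, target_logps):
    """Compare draft and target scores along one root-to-leaf path.

    Both arguments are per-token log-probabilities in nats, in path order.
    Returns (sum of draft log-probs, sum of target log-probs, their difference).
    """
    if len(draft_logps) != len(target_logps):
        raise ValueError("paths must have the same number of tokens")
    draft_total = math.fsum(draft_logps)
    target_total = math.fsum(target_logps)
    return draft_total, target_total, draft_total - target_total


# Hypothetical per-token log-probabilities for a 3-token path.
draft_total, target_total, gap = path_alignment_gap(
    [-0.10, -0.20, -0.30],
    [-0.10, -0.25, -0.30],
)
print(f"draft={draft_total:.2f}  target={target_total:.2f}  gap={gap:.2f}")
```

Reporting the distribution of this gap over many paths and tree depths would show whether alignment holds in general or only on top-ranked branches.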

This is for people working on LLM serving and inference optimization. The idea is concrete, the problem is practical, and the code release helps. It deserves peer review so the alignment and experimental details can be checked properly.

Referee Report

3 major / 2 minor

Summary. The paper introduces JetSpec, a speculative decoding framework that trains a single causal parallel draft head on fused hidden states from a frozen target LLM. This produces tree-structured candidate sequences whose per-token scores are intended to align with the target's autoregressive factorization, allowing larger draft budgets to translate into higher acceptance lengths without the inconsistency problems of bidirectional drafters or the depth-dependent cost of autoregressive drafters. Empirical results on Qwen3 dense and MoE models across math, coding, and chat benchmarks report consistent outperformance over bidirectional-head and tree-based baselines, with peak speedups of 9.64× on MATH-500 and 4.58× on conversational workloads on H100 GPUs, plus further gains when integrated with vLLM.

Significance. If the claimed alignment between the parallel draft head and the target factorization holds under verification and the reported speedups prove robust to implementation details, the work would meaningfully advance the scaling behavior of speculative decoding by removing a long-standing causality-efficiency trade-off. The open release of code and models strengthens the potential for follow-on work.

major comments (3)

[Abstract, §3] Abstract and §3: The central claim that the trained causal parallel draft head 'produces candidate trees whose scores align with the target model's autoregressive factorization' is load-bearing for correct tree acceptance and the reported speedups, yet no loss term, auxiliary objective, or derivation is provided that enforces exact equivalence rather than approximate matching via standard training; any systematic mismatch would invalidate the verification step.
[§4] §4 and experimental sections: The manuscript reports large speedups (e.g., 9.64× on MATH-500) but the provided text contains no error bars, statistical significance tests, or ablation studies isolating the contribution of the alignment property versus other implementation choices; without these, it is impossible to assess whether the scaling-ceiling claim is supported by the data.
[§3.2] §3.2: The description of how fused hidden states are used to produce branch-wise causal conditioning does not include an explicit statement or proof that the resulting tree scores remain consistent with the target's left-to-right factorization when the draft tree contains multiple branches; this consistency is required to avoid the inconsistency problem the paper attributes to bidirectional methods.

minor comments (2)

[§3] Notation for the draft head output (e.g., the precise definition of the per-position logits used for tree scoring) is introduced without a dedicated equation number, making cross-references in later sections difficult to follow.
[Figure 2] Figure captions for the tree-construction diagrams do not explicitly label which nodes are accepted versus rejected under the verification procedure, reducing clarity when comparing acceptance lengths across methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thorough review and constructive suggestions. We will revise the manuscript to address the concerns regarding the theoretical justification of the alignment, the experimental rigor, and the consistency proof. Below we respond to each major comment.

Referee: [Abstract, §3] Abstract and §3: The central claim that the trained causal parallel draft head 'produces candidate trees whose scores align with the target model's autoregressive factorization' is load-bearing for correct tree acceptance and the reported speedups, yet no loss term, auxiliary objective, or derivation is provided that enforces exact equivalence rather than approximate matching via standard training; any systematic mismatch would invalidate the verification step.

Authors: We thank the referee for highlighting this important point. The alignment is intended to arise from the causal training of the draft head on fused hidden states, which conditions each branch on preceding tokens in a manner consistent with autoregressive factorization. However, we agree that the manuscript lacks an explicit derivation or specialized loss term to guarantee exact equivalence. In the revision, we will add a detailed explanation of the training objective and a proof sketch showing how the branch-consistent scores maintain consistency with the target's left-to-right probabilities. This will be incorporated as an addition to Section 3.
Revision: yes
Referee: [§4] §4 and experimental sections: The manuscript reports large speedups (e.g., 9.64× on MATH-500) but the provided text contains no error bars, statistical significance tests, or ablation studies isolating the contribution of the alignment property versus other implementation choices; without these, it is impossible to assess whether the scaling-ceiling claim is supported by the data.

Authors: We acknowledge the absence of error bars, significance tests, and targeted ablations in the current version. To address this, we will rerun the experiments with multiple random seeds to report means and standard deviations, include p-values for comparisons, and add ablations that disable the causal alignment mechanism to isolate its contribution. These will be added to Section 4 and the experimental results.
Revision: yes
Referee: [§3.2] §3.2: The description of how fused hidden states are used to produce branch-wise causal conditioning does not include an explicit statement or proof that the resulting tree scores remain consistent with the target's left-to-right factorization when the draft tree contains multiple branches; this consistency is required to avoid the inconsistency problem the paper attributes to bidirectional methods.

Authors: We agree that an explicit statement and proof of consistency for multi-branch trees would clarify the method. The branch-wise causal conditioning is designed such that each path in the tree is conditioned only on its own prefix, preserving the autoregressive property. We will revise Section 3.2 to include a formal statement and a short proof demonstrating that the joint probability of any path equals the product of conditional probabilities under the target model, thereby ensuring consistency across branches.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training method with external validation

The paper presents JetSpec as a training-based method where a causal parallel draft head is trained on fused hidden states from a frozen target model to produce trees whose scores align with autoregressive factorization. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to its inputs by construction. Performance claims rest on benchmark comparisons rather than internal self-referential quantities. The alignment is described as an outcome of training rather than an enforced identity or fitted parameter renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters, axioms, or invented entities with precision; the method relies on standard assumptions that a frozen target model can supply useful hidden states and that training can align draft scores to autoregressive factorization.

axioms (1)

Domain assumption: The target model remains frozen while training the draft head on its hidden states. Explicitly referenced in the abstract description of the training procedure.

pith-pipeline@v0.9.1-grok · 5849 in / 1306 out tokens · 41866 ms · 2026-06-27T00:40:04.395228+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 14 linked inside Pith

[1]

On-policy distillation of language models: Learning from self- generated mistakes, 2024

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self- generated mistakes, 2024

2024
[2]

Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026

StepFun AI. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026

arXiv 2026
[3]

Le, and Charles Sutton

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

Pith/arXiv arXiv 2021
[4]

Lee, Deming Chen, and Tri Dao

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

Pith/arXiv arXiv 2024
[5]

Code Alpaca: An instruction-following LLaMA model trained on code generation instructions.https://github.com/sahil280114/codealpaca, 2023

Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model trained on code generation instructions.https://github.com/sahil280114/codealpaca, 2023

2023
[6]

Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

Pith/arXiv arXiv 2023
[7]

Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

Pith/arXiv arXiv 2026
[8]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

Pith/arXiv arXiv 2021
[9]

Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[10]

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[11]

DeepSeek-V4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

arXiv 2026
[12]

Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?, 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-ho...

2025
[13]

LayerSkip: Enabling early exit inference and self-speculative decoding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. LayerSkip: Enabling early exit inference and self-speculative decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual M...

2024
[14]

Break the sequential dependency of LLM inference using lookahead decoding

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 14060–14079. PMLR, 2024

2024
[15]

Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Syn- naeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737, 2024

Pith/arXiv arXiv 2024
[16]

Direct alignment of draft model for speculative decoding with chat-fine-tuned llms.arXiv preprint arXiv:2403.00858, 2024

Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. Direct alignment of draft model for speculative decoding with chat-fine-tuned llms.arXiv preprint arXiv:2403.00858, 2024

arXiv 2024
[17]

MiniLLM: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations, 2024

2024
[18]

Lee, and Di He

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He. REST: Retrieval-based speculative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1582–1595, 2024

2024
[19]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021

2021
[20]

Fast and accurate causal parallel decoding using jacobi forcing.arXiv preprint arXiv:2512.14681, 2025

Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, and Hao Zhang. Fast and accurate causal parallel decoding using jacobi forcing.arXiv preprint arXiv:2512.14681, 2025

arXiv 2025
[21]

SAM decoding: Speculative decoding via suffix automaton

Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, and Jing Zhang. SAM decoding: Speculative decoding via suffix automaton. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 121...

2025
[22]

Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

Pith/arXiv arXiv 2024
[23]

CLLMs: Consistency large language models

Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. CLLMs: Consistency large language models. InProceedings of the 41st International Conference on Machine Learning, 2024

2024
[24]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 19274–19286. PMLR, 2023

2023
[25]

Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees.arXiv preprint arXiv:2406.16858, 2024

arXiv 2024
[26]

Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

Pith/arXiv arXiv 2024
[27]

Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

Pith/arXiv arXiv 2025
[28]

Yihao Liang, Ze Wang, Hao Chen, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Emad Barsoum, Zicheng Liu, and Niraj K. Jha. CD4LM: Consistency distillation and adaptive decoding for diffusion language models.arXiv preprint arXiv:2601.02236, 2026

arXiv 2026
[29]

Online speculative decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. Online speculative decoding. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 31131–31146. PMLR, 2024. 12

2024
[30]

ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational L...

2025
[31]

Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification. InProceedings of the...

2024
[32]

Nemotron post-training dataset v2

NVIDIA. Nemotron post-training dataset v2. https://huggingface.co/datasets/ nvidia/Nemotron-Post-Training-Dataset-v2, 2025. Dataset

2025
[33]

Suffixdecoding: Extreme speculative decoding for emerging ai applications.arXiv preprint arXiv:2411.04975, 2026

Gabriele Oliaro, Zhihao Jia, Daniel Campos, and Aurick Qiao. Suffixdecoding: Extreme speculative decoding for emerging ai applications.arXiv preprint arXiv:2411.04975, 2026

arXiv 2026
[34]

Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InF orty-second International Conference on Machine Learning, 2025

2025
[35]

d3LLM: Ultra-fast diffusion LLM using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026

Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3LLM: Ultra-fast diffusion LLM using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026

arXiv 2026
[36]

Accelerating speculative decoding with block diffusion draft trees.arXiv preprint arXiv:2604.12989, 2026

Liran Ringel and Yaniv Romano. Accelerating speculative decoding with block diffusion draft trees.arXiv preprint arXiv:2604.12989, 2026

Pith/arXiv arXiv 2026
[37]

Prompt lookup decoding

Apoorv Saxena. Prompt lookup decoding. https://github.com/apoorvumang/ prompt-lookup-decoding, 2023. GitHub repository

2023
[38]

Pld+: Accelerating llm inference by leveraging language model artifacts.arXiv preprint arXiv:2412.01447, 2024

Shwetha Somasundaram, Anirudh Phukan, and Apoorv Saxena. Pld+: Accelerating llm inference by leveraging language model artifacts.arXiv preprint arXiv:2412.01447, 2024

arXiv 2024
[39]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022

2022
[40]

Fast-dllm v2: Efficient block-diffusion llm

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm. arXiv preprint arXiv:2509.26328, 2025

arXiv 2025
[41]

Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

LLM-Core Xiaomi. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

Pith/arXiv arXiv 2026
[42]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

Pith/arXiv arXiv 2025
[43]

Narasimhan

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik R. Narasimhan. τ-bench: A bench- mark for tool-agent-user interaction in real-world domains. InThe Thirteenth International Conference on Learning Representations, 2025. 13

2025
[44]

d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025

Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216, 2025

[45] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[46] Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation. In International Conference on Learning Representations, 2024

[47] Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, XWang, Lin Sun, Xiangzheng Zhang, and Sujian Li. Chain-of-thought matters: Improving long-context language models with reasoning path supervision. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computatio...

