arxiv: 2605.07243 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Weijie Shi , Qiang Xu , Fan Deng , Yaguang Wu , Jiarun Liu , Yehong Xu , Hao Chen , Jia Zhu

show 4 more authors

Jiajie Xu Xiangjun Huang Jian Yang Xiaofang Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords speculative decodingLLM inferenceblock-iterative draftingdynamic tree draftingpath dependencerank headcost-aware adaptationvalid-prefix mask

0 comments

The pith

SpecBlock accelerates LLM inference by generating blocks of dependent tokens iteratively with hidden-state inheritance and dynamic branching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a block-iterative drafter that produces K dependent positions per forward pass and grows the draft tree by repeating these block expansions. Within each block a layer-wise shift carries the prior position's hidden state into every decoder layer, while across blocks selective inheritance lets new blocks start from any prior position to extend valid paths. A co-trained rank head replaces fixed top-k selection by allocating branching factors per position according to predicted acceptance, and a valid-prefix mask drops loss on later positions once an earlier one fails. These elements together aim to retain the accuracy benefits of path dependence while lowering the frequency of drafter calls. Experiments report 8-13 percent mean speedup gains over prior autoregressive drafters at roughly half their drafting cost, with an online cost-aware bandit extending the advantage further.

Core claim

SpecBlock defines a block as K dependent token predictions produced by one drafter forward. Path dependence is maintained inside the block by injecting the previous position's hidden state into every layer and across blocks by allowing each new block to inherit the hidden state from any accepted position in the prior block. A rank head predicts per-position branching to allocate verifier budget where acceptance is likeliest, and a valid-prefix mask ensures the training loss only penalizes prefixes that could actually arise at inference time. A deployment bandit then uses free verifier signals to decide whether to update the drafter parameters only when the expected throughput gain exceeds re

What carries the argument

The block-iterative drafter with layer-wise hidden-state shift inside blocks and selective inheritance across blocks, which carries path dependence while reducing drafter call frequency.

If this is right

Mean inference speedup rises 8-13 percent over EAGLE-3 while drafting cost drops to 44-52 percent of the baseline.
Cost-aware online adaptation using verifier feedback widens the speedup lead to 11-19 percent.
Verifier compute is spent more efficiently because the rank head allocates branching only where acceptance probability is high.
Training and inference remain aligned because the valid-prefix mask excludes loss on positions that could never be reached by a valid prefix.
The drafter's contribution to per-iteration latency shrinks because multiple dependent positions are produced per call.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inheritance pattern could be applied to other multi-step generation tasks that currently suffer from dependence loss in parallel predictors.
If the rank head generalizes, it offers a route to replace hand-tuned tree widths in any speculative system.
The bandit adaptation mechanism suggests a broader class of low-overhead online tuning that could be tested on non-speculative inference pipelines.
Testing whether the block size K can be learned or scheduled dynamically would reveal whether the current fixed-block design leaves further efficiency on the table.

Load-bearing premise

Layer-wise hidden-state shifts and selective inheritance will preserve enough path dependence to keep acceptance rates high enough to offset the added mechanisms without the rank head or mask creating training-inference mismatches.

What would settle it

Measure acceptance rates and end-to-end speedup on a model whose hidden states shift rapidly across layers; if rates fall enough that the extra drafting mechanisms no longer reduce total latency, the claim is false.

Figures

Figures reproduced from arXiv: 2605.07243 by Fan Deng, Hao Chen, Jiajie Xu, Jian Yang, Jiarun Liu, Jia Zhu, Qiang Xu, Weijie Shi, Xiangjun Huang, Xiaofang Zhou, Yaguang Wu, Yehong Xu.

**Figure 2.** Figure 2: SpecBlock drafter architecture and block-iterative drafting. The first block (middle) fuses [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Attention pattern across one cross-block iteration with [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Cost-aware adaptation scheduling. The cost-aware bandit ingests the verifier signal at every [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Acceptance rate diagnostics averaged across benchmarks. (a) Per-position [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Case study response with per-token acceptance shading. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Per-iteration draft tree for the prompt Write a Python function to compute the Fibonacci sequence. EAGLE-3 grows depth-by-depth at one drafter forward per depth; each of the seven forwards is shown in a different color, fwd 1 through fwd 7. SpecBlock reaches a comparable accepted prefix in only two forwards, with block-1 and block-2 shown in two different colors. Tokens marked ✓ are on the verifier-walked … view at source ↗

read the original abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpecBlock adds block iterations, layer-wise state shifts, and selective inheritance to speculative decoding, claiming 8-13% better speedups than EAGLE-3 at roughly half the drafting cost.

read the letter

SpecBlock tries to split the difference between autoregressive drafters like EAGLE-3, which keep path dependence but cost multiple forward passes, and parallel ones that are cheap but lose dependence. It does this by generating K positions per drafter call in a block, then repeating the process while carrying hidden states forward. Within a block a layer-wise shift feeds the prior position's state into every layer; across blocks it allows selective inheritance from any prior position in the last block. A co-trained rank head picks variable branching factors instead of fixed top-k, and a valid-prefix mask drops loss on positions after an early error so training matches what inference actually produces. A cost-aware bandit at runtime decides when to update the drafter based on real verifier feedback. The reported outcome is an 8-13% mean speedup lift over EAGLE-3 at 44-52% of its drafting cost, rising to 11-19% with the bandit. Those numbers are the concrete engineering win if they hold. The mechanisms are clearly described and the motivation is straightforward. The main soft spot is that the abstract gives the headline gains without datasets, ablations, error bars, or any breakdown of acceptance rates per block position, so it is impossible to check whether the state shifts and inheritance actually preserve enough dependence to justify the added pieces. The stress-test worry about training-inference mismatch on the mask or rank head is reasonable until the full results are examined. This is for people who already work on LLM inference optimization and want a practical tweak rather than a new paradigm. If the full paper supplies solid ablations and reproducible numbers, it deserves peer review; the idea is narrow but the cost reduction is the kind of thing practitioners care about.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SpecBlock, a block-iterative speculative decoding framework for accelerating LLM inference. It generates K dependent positions per drafter forward pass (a 'block'), grows the draft tree iteratively, uses layer-wise hidden-state shifts within blocks and selective inheritance across blocks to maintain path dependence, employs a co-trained rank head for dynamic branching, and a valid-prefix mask to align training with inference. A cost-aware bandit adapts the drafter using verifier feedback. The key empirical claim is an 8-13% mean speedup improvement over EAGLE-3 at 44-52% of the drafting cost, further enhanced to 11-19% with adaptation.

Significance. If the results hold under rigorous testing, SpecBlock represents a meaningful advance in speculative decoding by bridging the gap between path-dependent autoregressive drafters and efficient parallel ones. The mechanisms for preserving dependence and the adaptive component could lead to more efficient inference systems, particularly in resource-constrained settings. The paper's emphasis on reducing drafting cost while improving speedup is practically significant.

major comments (3)

[Abstract and Experiments] Abstract and Experiments section: The performance claims (8-13% speedup at 44-52% cost, extending to 11-19% with adaptation) are presented without reference to specific datasets, model architectures, number of trials, error bars, or ablation studies. This makes it impossible to verify the robustness of the central empirical result and whether the block-iterative design indeed offsets the mechanisms' overhead.
[Method (block-iterative design)] Method section on block-iterative design: The claim that layer-wise hidden-state shifts within blocks and selective inheritance across blocks preserve sufficient path dependence to maintain high acceptance rates is central but lacks supporting analysis or ablations. If dependence is lost, acceptance rates could drop, negating the speedup gains even at reduced drafting cost. A concrete test, such as measuring per-position acceptance rates or comparing to ablated versions, is needed.
[Training and Inference Alignment] Training and Inference Alignment subsection: The valid-prefix mask and co-trained rank head are intended to prevent mismatches, but the manuscript should provide evidence (e.g., loss curves or acceptance rate comparisons) that no residual training-inference discrepancy remains, as this could silently degrade the reported improvements.

minor comments (2)

[Notation] Clarify the definition of block size K and how it interacts with the rank head parameters early in the paper.
[Figures] Ensure that any speedup vs. cost plots include confidence intervals and label the baselines clearly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve clarity, verifiability, and empirical support.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: The performance claims (8-13% speedup at 44-52% cost, extending to 11-19% with adaptation) are presented without reference to specific datasets, model architectures, number of trials, error bars, or ablation studies. This makes it impossible to verify the robustness of the central empirical result and whether the block-iterative design indeed offsets the mechanisms' overhead.

Authors: We agree that the abstract is a high-level summary and does not include experimental details. The Experiments section evaluates on standard benchmarks (MT-Bench, HumanEval, GSM8K) with Llama-2-7B and Llama-3-8B models, averaging over multiple random seeds. To address the concern directly, we will revise the abstract to reference the key datasets and models, and expand the Experiments section with error bars, explicit trial counts, and ablation studies that isolate the contribution of the block-iterative design versus its overhead. revision: yes
Referee: [Method (block-iterative design)] Method section on block-iterative design: The claim that layer-wise hidden-state shifts within blocks and selective inheritance across blocks preserve sufficient path dependence to maintain high acceptance rates is central but lacks supporting analysis or ablations. If dependence is lost, acceptance rates could drop, negating the speedup gains even at reduced drafting cost. A concrete test, such as measuring per-position acceptance rates or comparing to ablated versions, is needed.

Authors: The mechanisms are described in the Method section, but we acknowledge the absence of targeted empirical validation. In revision we will add (i) ablations that disable layer-wise shifts and selective inheritance, reporting resulting acceptance rates and speedups, and (ii) per-position acceptance-rate plots across draft depths. These will demonstrate that path dependence is preserved and that the observed cost reduction does not degrade acceptance. revision: yes
Referee: [Training and Inference Alignment] Training and Inference Alignment subsection: The valid-prefix mask and co-trained rank head are intended to prevent mismatches, but the manuscript should provide evidence (e.g., loss curves or acceptance rate comparisons) that no residual training-inference discrepancy remains, as this could silently degrade the reported improvements.

Authors: We will augment the subsection with direct evidence of alignment: training loss curves comparing masked versus unmasked objectives, and side-by-side acceptance-rate measurements between the trained drafter and its inference-time behavior. These additions will confirm that the valid-prefix mask and rank head eliminate residual discrepancies. revision: yes

Circularity Check

0 steps flagged

SpecBlock presents a novel block-iterative architecture with independent mechanisms that do not reduce to fitted inputs or self-citations by construction.

full rationale

The paper's core contribution is a new drafter design combining within-block layer-wise hidden-state shifts, across-block selective inheritance, a co-trained rank head, and a valid-prefix mask. These are described as explicit constructions to address limitations of prior autoregressive and parallel drafters. No equations, predictions, or experimental claims in the provided text reduce by definition to quantities fitted from the authors' own prior work or to self-citation chains. The speedup results are empirical comparisons against EAGLE-3 and other baselines, not derivations forced by the inputs. This is a standard non-circular case where the method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The abstract introduces several design choices whose concrete values and training details are not specified, so the ledger is necessarily incomplete.

free parameters (2)

block_size_K
Number of dependent positions generated per drafter forward pass; chosen to balance dependence and cost.
rank_head_parameters
Learned parameters of the co-trained head that allocates per-position branching factors.

pith-pipeline@v0.9.0 · 5656 in / 1250 out tokens · 37099 ms · 2026-05-11T02:24:01.144351+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Each drafter forward produces K dependent positions... layer-wise shift carries the previous position’s hidden state... rank head... valid-prefix mask
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 6 internal anchors

[1]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[2]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review arXiv 2023
[3]

Specinfer: Accelerating large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. InProceedings of the 29th ACM I...

work page 2024
[4]

Spectr: Fast speculative decoding via optimal transport

Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. volume 36, pages 30222–30242, 2023

work page 2023
[5]

Distillspec: Improving speculative decoding via knowledge distillation.arXiv preprint arXiv:2310.08461,

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Ros- tamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation.arXiv preprint arXiv:2310.08461, 2023

work page arXiv 2023
[6]

Sequoia: Scalable and robust speculative decoding

Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable and robust speculative decoding. volume 37, pages 129531–129563, 2024

work page 2024
[7]

Eagle: Speculative sampling requires rethinking feature uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. 2024

work page 2024
[8]

Eagle-2: Faster inference of language models with dynamic draft trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. pages 7421–7432, 2024

work page 2024
[9]

arXiv preprint arXiv:2503.01840 (2025) 5 16 Z

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page arXiv 2025
[10]

Medusa: Simple llm inference acceleration framework with multiple decoding heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. 2024

work page 2024
[11]

Learning har- monized representations for speculative sampling.arXiv preprint arXiv:2408.15766,

Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. Learning harmonized represen- tations for speculative sampling.arXiv preprint arXiv:2408.15766, 2024

work page arXiv 2024
[12]

Draft& verify: Lossless large language model acceleration via self-speculative decoding

Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft& verify: Lossless large language model acceleration via self-speculative decoding. pages 11263–11282, 2024

work page 2024
[13]

Kanga- roo: Lossless self-speculative decoding via double early exiting,

Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, and Yunhe Wang. Kangaroo: Lossless self-speculative decoding via double early exiting.arXiv preprint arXiv:2404.18911, 2024

work page arXiv 2024
[14]

Rest: Retrieval-based speculative decoding

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason Lee, and Di He. Rest: Retrieval-based speculative decoding. pages 1582–1595, 2024

work page 2024
[15]

Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057,

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

work page arXiv 2024
[16]

Blockwise parallel decoding for deep autoregressive models

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. volume 31, 2018. 10

work page 2018
[17]

Bita: Bi-directional tuning for lossless acceleration in large language models.Expert Systems with Applications, 279:127305, 2025

Feng Lin, Hanling Yi, Yifan Yang, Hongbin Li, Xiaotian Yu, Guangming Lu, and Rong Xiao. Bita: Bi-directional tuning for lossless acceleration in large language models.Expert Systems with Applications, 279:127305, 2025

work page 2025
[18]

Paral- lelspec: Parallel drafter for efficient speculative decoding.arXiv preprint arXiv:2410.05589, 2024

Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, and Dong Yu. Paral- lelspec: Parallel drafter for efficient speculative decoding.arXiv preprint arXiv:2410.05589, 2024

work page arXiv 2024
[19]

Pard: Accelerating llm inference with low-cost parallel draft model adaptation.arXiv preprint arXiv:2504.18583, 2025

Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, and Emad Barsoum. Pard: Accelerating llm inference with low-cost parallel draft model adaptation.arXiv preprint arXiv:2504.18583, 2025

work page arXiv 2025
[20]

Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, and Chen Tian. Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

work page arXiv 2026
[21]

Hydra: Sequentially-dependent draft heads for medusa decoding

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

work page arXiv 2024
[22]

Fasteagle: Cascaded drafting for accelerating speculative decoding

Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, and Pengju Ren. Fasteagle: Cascaded drafting for accelerating speculative decoding. pages 4111–4115, 2026

work page 2026
[23]

Falcon: Faster and parallel inference of large language models through enhanced semi-autoregressive drafting and custom-designed decoding tree

Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, and Feng Ji. Falcon: Faster and parallel inference of large language models through enhanced semi-autoregressive drafting and custom-designed decoding tree. 39(22):23933–23941, 2025

work page 2025
[24]

Pearl: Parallel speculative decoding with adaptive draft length.arXiv preprint arXiv:2408.11850,

Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. Pearl: Parallel speculative decoding with adaptive draft length.arXiv preprint arXiv:2408.11850, 2024

work page arXiv 2024
[25]

Exploring and improving drafts in blockwise parallel decoding.arXiv preprint arXiv:2404.09221, 2024

Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Adrian Benton, and Michael Riley. Exploring and improving drafts in blockwise parallel decoding.arXiv preprint arXiv:2404.09221, 2024

work page arXiv 2024
[26]

Set block decoding is a language model inference accelerator.arXiv preprint arXiv:2509.04185,

Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, and Yaron Lipman. Set block decoding is a language model inference accelerator.arXiv preprint arXiv:2509.04185, 2025

work page arXiv 2025
[27]

Recurrent drafter for fast speculative decoding in large language models

Yunfei Cheng, Aonan Zhang, Xuanyu Zhang, Chong Wang, and Yi Wang. Recurrent drafter for fast speculative decoding in large language models.arXiv preprint arXiv:2403.09919, 2024

work page arXiv 2024
[28]

C2t: A classifier-based tree construction method in speculative decoding.arXiv preprint arXiv:2502.13652, 2025

Feiye Huo, Jianchao Tan, Kefeng Zhang, Xunliang Cai, and Shengli Sun. C2t: A classifier-based tree construction method in speculative decoding.arXiv preprint arXiv:2502.13652, 2025

work page arXiv 2025
[29]

Opt-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025

Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. Opt-tree: Speculative decoding with adaptive draft tree structure.Transactions of the Association for Computational Linguistics, 13:188–199, 2025

work page 2025
[30]

Dyspec: Faster speculative decoding with dynamic token tree structure.World Wide Web, 28(3):36, 2025

Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, and Lei Zou. Dyspec: Faster speculative decoding with dynamic token tree structure.World Wide Web, 28(3):36, 2025

work page 2025
[31]

Talon: Confidence-aware speculative decoding with adaptive token trees.arXiv preprint arXiv:2601.07353,

Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, and Xiaoyan Sun. Talon: Confidence-aware speculative decoding with adaptive token trees.arXiv preprint arXiv:2601.07353, 2026

work page arXiv 2026
[32]

Banditspec: Adaptive speculative decoding via bandit algorithms.arXiv preprint arXiv:2505.15141, 2025

Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent YF Tan, and Zhuoran Yang. Banditspec: Adaptive speculative decoding via bandit algorithms.arXiv preprint arXiv:2505.15141, 2025

work page arXiv 2025
[33]

SpecDec++: Boosting speculative decoding via adaptive candidate lengths

Kaixuan Huang, Xudong Guo, and Mengdi Wang. SpecDec++: Boosting speculative decoding via adaptive candidate lengths. InConference on Language Modeling, 2024

work page 2024
[34]

Online speculative decoding

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. Online speculative decoding. 2023

work page 2023
[35]

Draft, verify, and improve: Toward training-aware speculative decoding.arXiv preprint arXiv:2510.05421, 2025

Shrenik Bhansali and Larry Heck. Draft, verify, and improve: Toward training-aware speculative decoding.arXiv preprint arXiv:2510.05421, 2025. 11

work page arXiv 2025
[36]

Tide: Temporal incremental draft engine for self-improving llm inference.arXiv preprint arXiv:2602.05145, 2026

Jiyoung Park, Hankyu Jang, Changseok Song, and Wookeun Jung. Tide: Temporal incremental draft engine for self-improving llm inference.arXiv preprint arXiv:2602.05145, 2026

work page arXiv 2026
[37]

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, et al. When rl meets adaptive speculative training: A unified training-serving system.arXiv preprint arXiv:2602.06932, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Enhancing chat language models by scaling high-quality instructional conversations

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023

work page 2023
[41]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 2023

work page 2023
[42]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

Stanford alpaca: An instruction-following llama model, 2023

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

work page 2023
[45]

Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

work page 2019
[46]

‘ python def fibonacci _recursive (n ):

Tom Kocmi, Rachel Bawden, Ondˇrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, et al. Find- ings of the 2022 conference on machine translation (wmt22). InProceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, 2022. 12 A Implementation Details A.1 Drafter ...

work page 2022
[47]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects 27 Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country ...

work page