BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Hao Zhang; Mingqiao Mo; Xiangxiang Chu; Xin Xiao; Yiming Hu; Yong Wang

arxiv: 2606.31315 · v1 · pith:U66AMIDEnew · submitted 2026-06-30 · 💻 cs.CL

Pith reviewed 2026-07-01 05:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords: speculative decoding, diffusion models, block size adaptation, policy learning, inference acceleration, prefilling representation, instance-adaptive decoding

The pith

A learned policy selects a per-sample block size from the prefilling representation to adapt diffusion-based speculative decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a fixed block size is suboptimal: the best size varies across inputs, yet it clusters locally around the training block size. This structure turns block-size choice into a low-dimensional prediction task that can be solved from the prefilling representation. BlockPilot therefore trains a lightweight policy that predicts a block size once after prefilling and reuses it for the rest of decoding. The approach is described as plug-and-play, adding negligible cost while raising acceptance length and overall speedup.

Core claim

Block size selection is cast as an instance-adaptive policy-learning problem whose solution is a single forward pass on the prefilling representation that yields the per-sample block size for the subsequent diffusion-based speculative steps.
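
The one-shot decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mean pooling, the two-layer network shape, the training block size of 16, the radius of 3, and every function name (`mean_pool`, `predict_block_size`, `init_params`) are hypothetical choices made for the example, and the weights are random rather than trained.

```python
import random


def mean_pool(hidden_states):
    """Average per-token prefill vectors into one feature vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]


def linear(x, weight, bias):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def predict_block_size(hidden_states, params, train_block=16, radius=3):
    """One forward pass: prefill features -> block size in [B - r, B + r]."""
    x = mean_pool(hidden_states)
    h = [max(0.0, v) for v in linear(x, params["w1"], params["b1"])]  # ReLU
    logits = linear(h, params["w2"], params["b2"])  # one logit per candidate size
    best = max(range(len(logits)), key=logits.__getitem__)
    return train_block - radius + best


def init_params(dim, hidden, n_classes, seed=0):
    """Random (untrained) weights, only so the sketch runs end to end."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": rand(hidden, dim), "b1": [0.0] * hidden,
        "w2": rand(n_classes, hidden), "b2": [0.0] * n_classes,
    }


# Hypothetical prefill output: 5 tokens with 8-dimensional hidden states.
rng = random.Random(1)
prefill = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
params = init_params(dim=8, hidden=4, n_classes=7)  # 7 candidates: B-3 .. B+3
block = predict_block_size(prefill, params)
```

The prediction runs once per request, so the returned `block` would be held fixed for all subsequent draft-and-verify steps of that request.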

What carries the argument

An instance-adaptive decision mechanism that predicts the block size from the prefilling representation in a single pass.

If this is right

Optimal block sizes concentrate locally around the training value, shrinking the decision space.
Prediction occurs only once after prefilling, preserving the original parallelism of diffusion steps.
The same prefilling representation already computed by the target model supplies the features, adding no extra passes.
The paper reports an acceptance length of 5.92 and a 4.20× speedup on Qwen3-4B at temperature T = 1.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same local-structure observation could be tested on non-diffusion speculative methods to see whether block-size adaptation transfers.
If the policy generalizes across model families, a single trained predictor might serve multiple target models without retraining.
Combining the adaptive block choice with orthogonal speed-ups such as quantization or early-exit would be a direct next measurement.

Load-bearing premise

The optimal block size for any given input can be recovered reliably from the prefilling representation alone.

What would settle it

Measure acceptance length on held-out samples when the policy is replaced by a random or constant block size; if the gap disappears, the adaptive prediction is not carrying the gain.
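
The comparison proposed above can be sketched as a small script. All numbers below are toy values invented for illustration, not results from the paper, and the names `acceptance`, `policy_choice`, and `mean_acceptance` are hypothetical.

```python
import random


def mean_acceptance(table, choose):
    """Average acceptance length when choose(i) picks the block size for sample i."""
    return sum(row[choose(i)] for i, row in enumerate(table)) / len(table)


# acceptance[i][b]: acceptance length of held-out sample i at block size b
# (toy numbers, hypothetical).
acceptance = [
    {14: 4.1, 16: 5.0, 18: 4.6},
    {14: 5.3, 16: 4.8, 18: 4.2},
    {14: 3.9, 16: 4.4, 18: 5.1},
]
policy_choice = [16, 14, 18]  # hypothetical per-sample policy outputs
sizes = sorted(acceptance[0])
rng = random.Random(0)

policy_score = mean_acceptance(acceptance, lambda i: policy_choice[i])
# Best single block size, tuned on the same samples (an optimistic baseline).
constant_score = max(mean_acceptance(acceptance, lambda i, b=b: b) for b in sizes)
random_score = mean_acceptance(acceptance, lambda i: rng.choice(sizes))

gap = policy_score - constant_score
print(f"policy={policy_score:.3f} constant={constant_score:.3f} "
      f"random={random_score:.3f} gap={gap:.3f}")
```

If `gap` stays near zero on real held-out data, the adaptive prediction would not be what carries the gain.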

Figures

Figures reproduced from arXiv: 2606.31315 by Hao Zhang, Mingqiao Mo, Xiangxiang Chu, Xin Xiao, Yiming Hu, Yong Wang.

**Figure 1.** Diffusion-based speculative decoding with a dLLM draft model. The dLLM proposes a block of tokens in parallel, while the target LLM verifies the block and accepts the longest consistent prefix.

Large Language Models (LLMs) [36, 40, 11] have achieved remarkable performance across a wide range of tasks [1, 14], demonstrating strong capabilities in reasoning, code generation, and open-ended dialogue. Despit…
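
The verification step named in the Figure 1 caption can be illustrated with a short sketch. It assumes greedy (exact-match) verification, which is a simplification: at temperature T = 1 a lossless verifier would normally use a sampling-based acceptance rule instead. The function name `accept_longest_prefix` is hypothetical.

```python
def accept_longest_prefix(draft_tokens, target_tokens):
    """Keep draft tokens up to (not including) the first disagreement with the target."""
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # everything after the first mismatch is discarded
        accepted.append(drafted)
    return accepted


# A block of 4 drafted tokens; the target disagrees at the third position.
accepted = accept_longest_prefix([5, 9, 2, 7], [5, 9, 4, 7])
print(len(accepted))
```

Many speculative decoding implementations also emit the target model's own token at the first mismatch, so each verification step makes progress even when nothing is accepted.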

**Figure 2.** Speedup comparison across models under temperature T = … The data labels for our method are highlighted. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

**Figure 3.** Analysis of optimal block size B∗. (a) Matching and mismatching proportions across datasets. (b-c) Distribution patterns demonstrating strong locality, where the range [B − 3, B + 3] covers the optimal size for nearly all samples.

2.2 Key Findings

To systematically analyze the impact of block size on speculative decoding performance, we perform an exhaustive sweep over candidate block sizes on multiple re…
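
The sweep and the locality measurement described in the Figure 3 caption can be sketched as below. The scores are toy values, the score itself stands in for whatever metric the sweep optimizes (assumed here to be something like acceptance length or speedup), and the training block size of 16 is an assumption for the example.

```python
def oracle_block_size(score_by_block):
    """Return the block size with the highest measured score for one sample."""
    return max(score_by_block, key=score_by_block.get)


def locality_coverage(oracle_sizes, train_block, radius=3):
    """Fraction of samples whose oracle size lies in [B - radius, B + radius]."""
    inside = sum(1 for b in oracle_sizes if abs(b - train_block) <= radius)
    return inside / len(oracle_sizes)


# One dict per sample: block size -> score from the sweep (toy numbers).
sweeps = [
    {13: 3.1, 16: 3.8, 19: 3.5},
    {13: 4.0, 16: 3.7, 19: 3.2},
    {13: 2.9, 16: 3.0, 19: 3.4},
    {8: 3.9, 16: 3.6, 19: 3.1},
]
oracles = [oracle_block_size(s) for s in sweeps]
coverage = locality_coverage(oracles, train_block=16)
print(oracles, coverage)
```

A coverage close to 1 is what would justify restricting the policy's output space to a handful of candidates around the training block size.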

**Figure 4.** Overview of the BlockPilot inference pipeline. Given an input sequence, the target LLM … [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

Original abstract

Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, which are then verified by the target model, enabling lossless acceleration. Recently, diffusion-based speculative decoding further improves parallelism by generating multiple tokens per forward pass via block-level diffusion, achieving state-of-the-art (SOTA) performance. However, existing methods adopt a fixed inference block size and assume a uniform optimal decoding strategy across all inputs. In this paper, we show that this assumption is suboptimal, as the optimal block size varies across samples and plays a critical role in speculative decoding performance. Moreover, these values exhibit a clear local structure, concentrating around the training block size, which reduces the problem to a low-dimensional and structured decision space. Based on these insights, we propose BlockPilot, a sample-adaptive policy that predicts the optimal block size from the prefilling representation. Specifically, we formulate block size selection as a lightweight policy learning problem and propose an instance-adaptive decision mechanism that predicts the optimal block size based on the representation of the prefilling stage. The prediction is performed only once after prefilling, allowing for seamless integration. Extensive experiments demonstrate that our method is plug-and-play, introduces minimal overhead, and consistently improves efficiency, achieving an acceptance length of 5.92 and a 4.20$\times$ speedup on Qwen3-4B under temperature $T=1$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BlockPilot adds a one-shot policy to pick block size from the prefilling representation in diffusion speculative decoding, but the abstract gives no training or experimental details and the fixed-per-sample assumption looks fragile.

The main thing to know is that the authors train a lightweight policy to predict the diffusion block size once after prefilling, based on the claim that optimal sizes vary by sample yet cluster locally around the training value. They report an acceptance length of 5.92 and a 4.20× speedup on Qwen3-4B at temperature 1, positioning the method as plug-and-play with little overhead.

What the paper does is identify a real inefficiency in fixed-block diffusion speculative decoding and reduce the choice to a low-dimensional prediction problem. That framing is new within the cited prior work and could be a useful lever for people already tuning these systems.

The soft spots are straightforward. The abstract states the performance numbers but supplies nothing on the policy training procedure, loss, data splits, baselines, or significance tests, so the central empirical claim cannot be evaluated. A second concern is the stability assumption: locking in one block size after prefilling assumes the optimum stays stable for the whole sequence. If token difficulty or growing context shifts the best size mid-generation, the single prediction loses its advantage over a well-chosen fixed baseline. The paper asserts local structure, but without within-sequence ablations or variation plots that claim stays untested.

This is aimed at researchers working on speculative decoding and LLM inference speed. A reader already following that literature might pick up the policy idea, but the missing experimental protocol makes it hard to judge whether the gains are reliable.

I would send it for peer review because the practical goal is clear and the technique is simple enough to check, though it will need heavy revision to include training details, ablations on the stability assumption, and proper controls.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes BlockPilot, an instance-adaptive policy for diffusion-based speculative decoding. It observes that optimal block sizes vary across samples yet exhibit local structure around the training block size, reducing the decision to a low-dimensional prediction problem. The method trains a lightweight policy to select the block size once from the prefilling-stage representation and applies the fixed choice for the entire generation, claiming plug-and-play integration, minimal overhead, an acceptance length of 5.92, and 4.20× speedup on Qwen3-4B at temperature 1.

Significance. If the empirical gains are robustly supported and the per-sample constancy assumption holds, the approach offers a low-overhead way to improve existing diffusion speculative decoders without retraining the draft or target models.

major comments (2)

[Method and Experiments sections] The central adaptive benefit rests on the claim that optimal block size is stable within each sample. The manuscript should include a direct measurement of intra-sequence variation in optimal block size (e.g., token-by-token oracle block sizes) to test whether a single post-prefill prediction captures the structure or merely approximates a fixed-block baseline; without this, the reported gains may not exceed what a well-tuned static block size already achieves.
[§3 (Policy Learning)] The abstract and method description state that the policy is trained on observed data, yet no information is supplied on the training procedure, loss, data exclusion rules, or how the policy avoids simply memorizing the training block size distribution. This information is required to assess whether the 4.20× speedup is an artifact of the fitting process rather than a genuine generalization.
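
The intra-sequence measurement requested in the first major comment could be summarized per generation as in the sketch below. The per-step oracle sizes are toy values, and the helper name `intra_sequence_stats` is hypothetical; how the per-step oracle is obtained is left open here.

```python
from statistics import pvariance


def intra_sequence_stats(oracle_sizes):
    """Summarize how much the per-step oracle block size moves within one generation."""
    return {
        "variance": pvariance(oracle_sizes),
        "range": max(oracle_sizes) - min(oracle_sizes),
    }


# Oracle block size at successive decoding steps of two generations (toy numbers).
per_step_oracle = {
    "stable": [16, 16, 17, 16],
    "drifting": [13, 15, 18, 19],
}
stats = {name: intra_sequence_stats(sizes) for name, sizes in per_step_oracle.items()}
for name, summary in stats.items():
    print(name, summary)
```

A low variance and range across most generations would support a single post-prefill prediction; large values would suggest re-predicting during decoding.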

minor comments (1)

Clarify the exact architecture and input features of the policy network; the current description leaves the dimensionality and training data size unspecified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: [Method and Experiments sections] The central adaptive benefit rests on the claim that optimal block size is stable within each sample. The manuscript should include a direct measurement of intra-sequence variation in optimal block size (e.g., token-by-token oracle block sizes) to test whether a single post-prefill prediction captures the structure or merely approximates a fixed-block baseline; without this, the reported gains may not exceed what a well-tuned static block size already achieves.

Authors: We appreciate the referee's emphasis on validating the intra-sequence stability assumption. Our analysis in the manuscript shows that optimal block sizes vary across samples yet concentrate locally around the training block size, which motivates the single post-prefill prediction. To directly test whether this captures per-sample structure or approximates a static baseline, we will add a new experiment in the revised manuscript. This will compute token-by-token oracle block sizes on held-out sequences and report intra-sequence variation statistics (e.g., variance and range within each generation). The results will clarify the extent of stability and whether the reported gains exceed those of a well-tuned fixed block size. revision: yes
Referee: [§3 (Policy Learning)] The abstract and method description state that the policy is trained on observed data, yet no information is supplied on the training procedure, loss, data exclusion rules, or how the policy avoids simply memorizing the training block size distribution. This information is required to assess whether the 4.20× speedup is an artifact of the fitting process rather than a genuine generalization.

Authors: We acknowledge the omission of training details in the submitted version. In the revised manuscript, we will expand §3 with the following: the policy (a lightweight 2-layer MLP) is trained via supervised classification on oracle-derived optimal block size labels collected from a separate training prompt set; the loss is cross-entropy with label smoothing; data exclusion ensures no overlap with validation or test sets; and memorization is mitigated by model capacity limits, dropout (p=0.1), L2 regularization, and early stopping on a held-out validation split. These additions will demonstrate that the policy learns generalizable patterns rather than fitting the training distribution. revision: yes
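
The loss named in the response above, cross-entropy with label smoothing, can be written out in a few lines. This is a generic sketch of that standard loss, not code from the paper: the smoothing value of 0.1 and the seven-class setup are assumptions for illustration, and the smoothing convention (uniform mass spread over all classes) is the one commonly used by deep learning frameworks.

```python
import math


def smoothed_cross_entropy(logits, label, smoothing=0.1):
    """Cross-entropy against a one-hot target mixed with a uniform distribution."""
    k = len(logits)
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_z for v in logits]
    # Target: (1 - smoothing) on the true class plus smoothing / k on every class.
    target = [smoothing / k + (1.0 - smoothing) * (i == label) for i in range(k)]
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Seven candidate block sizes; the oracle label is class 3 (toy logits).
confident = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
plain_loss = smoothed_cross_entropy(confident, label=3, smoothing=0.0)
smooth_loss = smoothed_cross_entropy(confident, label=3, smoothing=0.1)
print(plain_loss, smooth_loss)
```

Smoothing penalizes over-confident predictions, which fits a setting where neighbouring block sizes often perform almost as well as the oracle label.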

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical policy learning

The paper observes (empirically) that optimal block sizes vary across samples and cluster locally around training values, then trains a lightweight policy to predict block size once from the prefilling representation. This is standard supervised learning on data-derived targets; the reported acceptance length and speedup are measured outcomes, not quantities forced by definition or self-citation. No load-bearing step reduces by construction to its own inputs, and no self-citation chains or uniqueness theorems are invoked. The approach remains falsifiable on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that prefilling representations are predictive of optimal block size and on standard supervised policy-learning assumptions; no invented entities or additional free parameters beyond ordinary model weights are stated in the abstract.

axioms (1)

Domain assumption: The prefilling representation contains sufficient information to predict the optimal block size for a given input.
This premise is required for the one-shot prediction step to succeed; it is invoked when the paper states that the prediction is performed only once after prefilling.

pith-pipeline@v0.9.1-grok · 5790 in / 1283 out tokens · 38907 ms · 2026-07-01T05:57:09.035896+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 24 canonical work pages · 16 internal anchors

[1] Yadagiri Annepaka and Partha Pakray. Large language models: a survey of their development, capabilities, and applications. Knowledge and Information Systems, 67(3):2967–3022, 2025.
[2] Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models, 2025. URL https://arxiv.org/abs/2503.09573
[3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[4] Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Smooth loss functions for deep top-k classification. arXiv preprint arXiv:1802.07595, 2018.
[5] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. URL https://arxiv.org/abs/2401.10774
[6] Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report. arXiv preprint arXiv:2603.00729, 2026.
[7] Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding. arXiv preprint arXiv:2602.06036, 2026.
[8] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024.
[9] Mark Chen. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[10] Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. Sdar: A synergistic diffusion-autoregression paradigm for scalable sequence generation, 2025. URL https://arxiv.org/abs/2510.06303
[11] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
[12] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[14] Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea preprints, 1(3):1–26, 2023.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, and Yakun Sophia Shao. Speed: Speculative pipelined execution for efficient decoding. In Enhancing LLM Performance: Efficacy, Fine-Tuning, and Inference Techniques, pages 19–32. Springer, 2025.
[17] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
[18] Hector J Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. KR, 2012(13th):3, 2012.
[19] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, pages 19274–19286, 2023.
[20] Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, and Jun Wang. Diffuspec: Unlocking diffusion language models for speculative decoding, 2025. URL https://arxiv.org/abs/2510.02358
[21] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees, 2024. URL https://arxiv.org/abs/2406.16858
[22] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test, 2025. URL https://arxiv.org/abs/2503.01840
[23] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty, 2025. URL https://arxiv.org/abs/2401.15077
[24] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[25] Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, and Pavlo Molchanov. Tidar: Think in diffusion, talk in autoregression, 2025. URL https://arxiv.org/abs/2511.08923
[26] MAA. American Invitational Mathematics Examination - AIME, 2025. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime
[27] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models, 2025. URL https://arxiv.org/abs/2502.09992
[28] Lorenzo Papa, Paolo Russo, Irene Amerini, and Luping Zhou. A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):7682–7700, 2024.
[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[30] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95, 2011.
[31] Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, and Mehrdad Farajtabar. Your llm knows the future: Uncovering its multi-token prediction potential, 2025. URL https://arxiv.org/abs/2507.11851
[32] Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, and Ferdinando Fioretto. Specdiff-2: Scaling diffusion drafter alignment for faster speculative decoding, 2025. URL https://arxiv.org/abs/2511.00606
[33] Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodolà. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12336–12355, 2023.
[34] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.
[35] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
[36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[37] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 2023.
[38] Thomas Wolf. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2020.
[39] Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm. URL https://arxiv.org/abs/2509.26328
[40] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[41] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

A Theoretical Analysis of Sample-Adaptive Block Size Selection
A.1 Acceptance Length as a...

[1] [1]

Large language models: a survey of their development, capabilities, and applications.Knowledge and Information Systems, 67(3):2967–3022, 2025

Yadagiri Annepaka and Partha Pakray. Large language models: a survey of their development, capabilities, and applications.Knowledge and Information Systems, 67(3):2967–3022, 2025

2025

[2] [2]

Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov

Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models, 2025. URL https://arxiv.org/abs/2503. 09573

2025

[3] [3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Smooth Loss Functions for Deep Top-k Classification

Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Smooth loss functions for deep top-k classification.arXiv preprint arXiv:1802.07595, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

Lee, Deming Chen, and Tri Dao

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads,

[6] [6]

URLhttps://arxiv.org/abs/2401.10774

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Qwen3-Coder-Next Technical Report

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

2024

[10] [10]

Evaluating Large Language Models Trained on Code

Mark Chen. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

Sdar: A synergistic diffusion- autoregression paradigm for scalable sequence generation, 2025

Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. Sdar: A synergistic diffusion- autoregression paradigm for scalable sequence generation, 2025. URL https://arxiv.org/ abs/2510.06303

work page arXiv 2025

[12] [12]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023

2023

[13] [13]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea preprints, 1(3):1–26, 2023

[16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[17] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, and Yakun Sophia Shao. Speed: Speculative pipelined execution for efficient decoding. In Enhancing LLM Performance: Efficacy, Fine-Tuning, and Inference Techniques, pages 19–32. Springer, 2025

[18] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

[19] Hector J Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. KR, 2012(13th):3, 2012

[20] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, pages 19274–19286, 2023

[21] Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, and Jun Wang. Diffuspec: Unlocking diffusion language models for speculative decoding, 2025. URL https://arxiv.org/abs/2510.02358

[22] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees, 2024. URL https://arxiv.org/abs/2406.16858

[23] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test, 2025. URL https://arxiv.org/abs/2503.01840

[24] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty, 2025. URL https://arxiv.org/abs/2401.15077

[25] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023

[26] Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, and Pavlo Molchanov. Tidar: Think in diffusion, talk in autoregression, 2025. URL https://arxiv.org/abs/2511.08923

[27] MAA. American Invitational Mathematics Examination - AIME, 2025. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime

[28] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models, 2025. URL https://arxiv.org/abs/2502.09992

[29] Lorenzo Papa, Paolo Russo, Irene Amerini, and Luping Zhou. A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking. IEEE transactions on pattern analysis and machine intelligence, 46(12):7682–7700, 2024

[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

[31] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011

[32] Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, and Mehrdad Farajtabar. Your llm knows the future: Uncovering its multi-token prediction potential, 2025. URL https://arxiv.org/abs/2507.11851

[33] Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, and Ferdinando Fioretto. Specdiff-2: Scaling diffusion drafter alignment for faster speculative decoding, 2025. URL https://arxiv.org/abs/2511.00606

[34] Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodolà. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12336–12355, 2023

[35] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018

[36] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022

[37] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[38] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 2023

[39] Thomas Wolf. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2020

[40] Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm. URL https://arxiv.org/abs/2509.26328

[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[43] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

A Theoretical Analysis of Sample-Adaptive Block Size Selection

A.1 Acceptance Length as a...