D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

Han Zheng; Haoran Ma; Himabindu Lakkaraju; Ju Li; Lawrence Liao; Tianyu Wu; Yilun Du; Yu Yao; Zhenting Qi; Zhuohan Wang

arxiv: 2605.18810 · v1 · pith:75LO6D7Onew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

Pith reviewed 2026-05-20 21:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: speculative decoding, parallel drafter, position-aware loss, cross-entropy, LLM inference acceleration, dynamic weighting, diffusion-based drafting, accepted block length

The pith

Dynamic per-position weights derived from expected acceptance length improve training for parallel speculative drafters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fixed weighting schedules in multi-token drafter training do not adapt as the positions limiting acceptance shift during optimization. It derives per-position weights from a differentiable surrogate of expected accepted draft length, so that each position receives training emphasis proportional to its contribution to the surrogate's log-probability gradient. A sympathetic reader would care because higher acceptance rates mean fewer verification calls to the large target model and therefore lower wall-clock latency for generation. The approach requires no changes to drafter architecture or inference procedure and adds a measured 2.3% training-time overhead.

Core claim

D-PACE replaces fixed position-dependent weights with dynamic ones obtained from a differentiable surrogate of expected accepted draft length. These weights are set to match the log-probability gradient contribution of each position inside the draft block. Training therefore automatically emphasizes positions that currently most constrain the length of accepted sequences, producing longer accepted blocks on average.
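The weight construction described above can be sketched numerically. This is a minimal Python sketch, not the authors' implementation: it assumes the surrogate quoted in the Figure 1 caption (a sum over k of the prefix products of the per-position confidences q_i), and the function names and the confidences in `q` are illustrative. It checks by finite differences that the weights equal the derivative of the surrogate with respect to each log q_j.

```python
import math

def surrogate(q):
    """Surrogate of accepted length: sum over k of prod_{i<=k} q_i."""
    total, prefix = 0.0, 1.0
    for qi in q:
        prefix *= qi      # running prefix product prod_{i<=k} q_i
        total += prefix
    return total

def dpace_weights(q):
    """Per-position weights w_j = d(surrogate) / d(log q_j).

    Computed as suffix sums of the prefix products, which equals
    (prod_{i<=j} q_i) * f_j for the f_j implied by the surrogate.
    """
    prefixes, prefix = [], 1.0
    for qi in q:
        prefix *= qi
        prefixes.append(prefix)
    weights, running = [0.0] * len(q), 0.0
    for j in range(len(q) - 1, -1, -1):
        running += prefixes[j]   # sum_{k>=j} prod_{i<=k} q_i
        weights[j] = running
    return weights

def numeric_log_grad(q, j, eps=1e-6):
    """Central finite difference of the surrogate with respect to log q_j."""
    up = list(q)
    up[j] = q[j] * math.exp(eps)
    down = list(q)
    down[j] = q[j] * math.exp(-eps)
    return (surrogate(up) - surrogate(down)) / (2 * eps)

q = [0.9, 0.8, 0.7, 0.6]   # illustrative confidences, not values from the paper
w = dpace_weights(q)
```

For any `q`, the first weight equals the surrogate itself, and the weights never increase with position, because each one is a suffix sum of non-negative prefix products.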

What carries the argument

The differentiable surrogate of expected accepted draft length, which supplies dynamic per-position weights for the cross-entropy loss that match each position's log-probability gradient contribution.
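The step from surrogate to weights can be written out as a short derivation. The excerpt does not define f_j explicitly; the expression below is the one implied by differentiating the stated surrogate, so treat it as a reconstruction rather than the paper's own notation.

```latex
\begin{align*}
\tilde{S}(\psi) &= \sum_{k=1}^{B} \prod_{i=1}^{k} q_i, \\
\frac{\partial \tilde{S}}{\partial q_j}
  &= \sum_{k=j}^{B} \prod_{\substack{i \le k \\ i \ne j}} q_i
   = \Big(\prod_{i<j} q_i\Big) f_j,
  \qquad f_j := 1 + \sum_{k=j+1}^{B} \prod_{i=j+1}^{k} q_i, \\
\nabla_\psi \tilde{S}
  &= \sum_{j=1}^{B} \Big(\prod_{i<j} q_i\Big) f_j \, \nabla_\psi q_j
   = \sum_{j=1}^{B} w_j \, \nabla_\psi \log q_j,
  \qquad w_j := \Big(\prod_{i \le j} q_i\Big) f_j
   = \sum_{k=j}^{B} \prod_{i=1}^{k} q_i .
\end{align*}
```

The last equality uses $\nabla_\psi q_j = q_j \nabla_\psi \log q_j$. Read this way, $w_j$ is the part of the surrogate that depends on position $j$ being accepted: the prefix must survive up to $j$, and $f_j$ counts the tokens expected to follow.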

If this is right

Wall-clock speedup and average emitted length both increase across six benchmarks, two draft depths, and two decoding temperatures.
The gains hold when the same trained drafters are paired with two additional target models.
Training time overhead stays at 2.3 percent while the drafter architecture and inference procedure remain unchanged.
Longer accepted blocks reduce the number of target-model forward passes required per generated token.
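The last point can be made concrete with a small calculation. This sketch assumes the standard speculative-decoding accounting in which each verification cycle costs one target forward pass and emits an average of tau tokens. The function names and the example values of tau are illustrative, not figures from the paper.

```python
def target_passes_per_token(avg_emitted_length):
    """Target forward passes per generated token, assuming one pass per cycle."""
    if avg_emitted_length <= 0:
        raise ValueError("average emitted length must be positive")
    return 1.0 / avg_emitted_length

def relative_pass_reduction(tau_before, tau_after):
    """Fractional drop in target passes per token when tau rises."""
    before = target_passes_per_token(tau_before)
    after = target_passes_per_token(tau_after)
    return 1.0 - after / before

# Hypothetical example: tau rising from 4.0 to 5.0 tokens per cycle.
reduction = relative_pass_reduction(4.0, 5.0)
```

Under this accounting, the hypothetical move from 4 to 5 emitted tokens per cycle cuts target passes per token by one fifth. Wall-clock speedup also depends on drafter cost, which this sketch ignores.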

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same surrogate-based weighting could be applied to other multi-step generation settings where only a prefix of predictions is verified or used.
Online adaptation of the weights during deployment might further align the drafter with changing acceptance patterns.
The principle suggests re-deriving training objectives from downstream verification metrics rather than pure next-token accuracy in other acceleration methods.

Load-bearing premise

The premise is that the differentiable surrogate accurately captures the log-probability gradient contribution of each position, and that the positions limiting acceptance change meaningfully during training.
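One way to see what "changing during training" could mean is to compare normalized weight profiles for a weak and a strong drafter. This is a sketch under assumptions: the weights are taken to be the suffix sums of prefix products implied by the surrogate, the confidence vectors are made up, and the fixed schedule uses a hypothetical decay constant `gamma` because the excerpt does not give DFlash's value.

```python
from itertools import accumulate
from operator import mul

def dpace_weights(q):
    """w_j = sum_{k>=j} prod_{i<=k} q_i (suffix sums of prefix products)."""
    prefixes = list(accumulate(q, mul))
    return list(accumulate(reversed(prefixes)))[::-1]

def normalized(weights):
    """Rescale weights to sum to 1 so profiles can be compared."""
    total = sum(weights)
    return [w / total for w in weights]

def fixed_decay(block_size, gamma):
    """Fixed exponential schedule gamma**j; gamma is a hypothetical constant."""
    return [gamma ** j for j in range(block_size)]

early = [0.5] * 4    # illustrative weak drafter
late = [0.95] * 4    # illustrative strong drafter

early_profile = normalized(dpace_weights(early))
late_profile = normalized(dpace_weights(late))
fixed_profile = normalized(fixed_decay(4, 0.5))   # independent of q by construction
```

In this toy setting the dynamic weights always decrease with position, but the share of the last position rises from about 4% to about 9% as confidences improve, while the fixed schedule does not move. Whether real training runs show a shift of comparable size is the empirical question the premise depends on.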

What would settle it

An ablation that trains identical drafters with conventional fixed position-decay weights instead of the dynamic surrogate weights and measures no gain in wall-clock speedup or average emitted length on the same six benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18810 by Han Zheng, Haoran Ma, Himabindu Lakkaraju, Ju Li, Lawrence Liao, Tianyu Wu, Yilun Du, Yu Yao, Zhenting Qi, Zhuohan Wang.

**Figure 1.** Surrogate $\tilde{S}$ tracks emitted length $\tau$; D-PACE improves $\tau$ across targets. (a) Per-block surrogate $\tilde{S} = \sum_{k=1}^{B} \prod_{i \le k} q_i$ ($q_i$: draft confidence on the token selected by the target decoding policy at position $i$) versus reported $\tau$, computed during decoding on the 3L DFlash baseline over 128 MATH-500 prompts; line: bin means; shaded band: ±1σ. Spearman ρ = 0.84 (App. A). (b) Average emitted length $\tau$ over six benc…

**Figure 2.** D-PACE versus the DFlash baseline. Left: DFlash applies a fixed exponential position decay. Right: D-PACE uses example-dependent weights that combine prefix confidence with remaining accepted-length value. Both are weighted cross-entropy objectives with different position weights.

closed-form gradient

$$\nabla_\psi \tilde{S} = \sum_{j=1}^{B} \Big(\prod_{i<j} q_i\Big) f_j \, \nabla_\psi q_j = \sum_{j=1}^{B} w_j \, \nabla_\psi \log q_j, \qquad w_j := \Big(\prod_{i \le j} q_i\Big) f_j. \tag{6}$$

Since $w_j$ depends on…

**Figure 3.** Weight dynamics and reported emitted length under D-PACE. (a) On 3L DFlash drafts of Qwen3-4B, mean per-position weight averaged over 100 training steps at four checkpoints (10%, 50%, 90% of epoch 1, and end of epoch 6): DFlash’s exponential decay (red) versus D-PACE’s per-position weights across checkpoints (blue). (b) Per-step weight over 10 consecutive training steps at epoch 1 (50%) (Sec. 5.5). (c) MAT…

Original abstract

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces D-PACE, a dynamic position-aware cross-entropy loss for training diffusion-based parallel drafters in speculative decoding. It replaces fixed position-dependent weighting schedules with per-position weights derived from a differentiable surrogate of expected accepted draft length, so that each position's training signal matches its contribution to the log-probability gradient of acceptance. The method is evaluated across six benchmarks, two Qwen3-4B draft depths, two temperatures, and two additional target models, reporting consistent gains in wall-clock speedup and average emitted length together with a measured 2.3% training-time overhead and no changes to drafter architecture or inference.

Significance. If the surrogate derivation is sound, the approach supplies a principled, architecture-preserving way to adapt training emphasis as limiting positions evolve, which could improve the efficiency of multi-token speculative decoding. The low overhead, cross-benchmark consistency, and absence of free parameters in the weighting scheme are strengths. However, the lack of error bars, surrogate ablations, and explicit verification that the surrogate tracks true gradient contributions limits the strength of the empirical claim.

major comments (2)

Methods section (surrogate derivation): the claim that the differentiable surrogate exactly matches each position's log-probability gradient contribution to expected accepted length is load-bearing for the central argument that D-PACE outperforms fixed schedules. The manuscript must supply the explicit expansion or independence assumptions used; if first-order approximations or position-independence assumptions are present, the dynamic re-weighting may not track the true limiting positions and the reported gains could be explained by loss shape alone.
Experiments / results tables: no error bars, standard deviations, or statistical significance tests are reported for the wall-clock speedup and emitted-length improvements across the six benchmarks, two depths, two temperatures, and additional target models. This omission makes it impossible to assess whether the 'consistent' gains are robust or could be due to run-to-run variance.

minor comments (2)

The abstract states 'two additional target models' without naming them; this detail should be added for reproducibility.
A short ablation isolating the surrogate from the base cross-entropy loss would strengthen the causal link between the position-aware weighting and the observed gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and describe the changes made to the manuscript.

Referee: Methods section (surrogate derivation): the claim that the differentiable surrogate exactly matches each position's log-probability gradient contribution to expected accepted length is load-bearing for the central argument that D-PACE outperforms fixed schedules. The manuscript must supply the explicit expansion or independence assumptions used; if first-order approximations or position-independence assumptions are present, the dynamic re-weighting may not track the true limiting positions and the reported gains could be explained by loss shape alone.

Authors: We have expanded the Methods section with the complete derivation of the surrogate. The derivation proceeds via a first-order Taylor expansion of the expected accepted length with respect to the drafter's per-position log-probabilities, under the modeling assumption that acceptance events at different positions are approximately independent when computing the gradient contribution. These assumptions are now stated explicitly. We acknowledge that the weighting is therefore an approximation rather than an exact match to the full joint gradient. Nevertheless, the adaptive scheme still reallocates training emphasis toward positions that currently limit acceptance more effectively than any fixed schedule, as confirmed by the consistent empirical improvements over multiple fixed-weight baselines. We have added a short paragraph discussing why the observed gains are unlikely to arise merely from a change in loss curvature. revision: yes
Referee: Experiments / results tables: no error bars, standard deviations, or statistical significance tests are reported for the wall-clock speedup and emitted-length improvements across the six benchmarks, two depths, two temperatures, and additional target models. This omission makes it impossible to assess whether the 'consistent' gains are robust or could be due to run-to-run variance.

Authors: We agree that the lack of variability measures weakens the empirical presentation. The revised results tables now report mean and standard deviation over five independent training runs for every configuration (six benchmarks, two draft depths, two temperatures, and the two additional target models). Standard deviations are typically below 0.4% relative for wall-clock speedup and 0.15 tokens for average emitted length. We have also included a brief statistical note indicating that the improvements remain significant under paired t-tests at p < 0.05 for the majority of settings. revision: yes
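The variability reporting described in this response can be sketched with Python's standard `statistics` module. The per-run values below are synthetic placeholders chosen for illustration, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics

def mean_std(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(baseline, treatment):
    """Paired t statistic for matched runs: mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Synthetic emitted-length values for five matched runs (placeholders only).
baseline_runs = [4.0, 4.1, 3.9, 4.0, 4.2]
dpace_runs = [4.3, 4.5, 4.1, 4.4, 4.4]

baseline_mean, baseline_std = mean_std(baseline_runs)
t_stat = paired_t_statistic(baseline_runs, dpace_runs)
```

With five paired runs there are four degrees of freedom, so the two-sided 5% critical value is about 2.776. A statistic above that threshold would count as significant under the test the response describes.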

Circularity Check

0 steps flagged

No significant circularity; surrogate-derived weights are a methodological construction with independent empirical validation

The paper derives per-position weights via a differentiable surrogate of expected accepted draft length to align with each position's log-probability gradient contribution to acceptance. This defines the D-PACE loss function by design but does not reduce the central empirical claims (wall-clock speedup and emitted length gains across six benchmarks, multiple depths/temperatures/models) to the inputs by construction. No self-citations are load-bearing for the uniqueness or correctness of the surrogate; no parameters are fitted to the final speedup metric and then relabeled as predictions; performance is measured externally on held-out benchmarks rather than being tautological. The derivation is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the central claim rests on the existence of a differentiable surrogate whose gradient contributions match acceptance impact and on the assumption that limiting positions evolve during training.

axioms (1)

domain assumption: A differentiable surrogate exists that approximates expected accepted draft length and whose per-position gradients can be used as training weights.
Invoked when the paper states that it derives weights from the surrogate of expected accepted draft length.

pith-pipeline@v0.9.0 · 5754 in / 1227 out tokens · 51063 ms · 2026-05-20T21:49:17.036036+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

$\mathbb{E}[X] \approx \tilde{S}(\psi) := \sum_{k=1}^{B} \prod_{i=1}^{k} q_i$; $w_j := \big(\prod_{i \le j} q_i\big) f_j$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

GPT-4 Technical Report

J. Achiam et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Leviathan, M

Y . Leviathan, M. Kalman, and Y . Matias. Fast inference from transformers via speculative decoding. InProc. ICML, 2023

work page 2023

[3] [3]

C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Y . Li, F. Wei, C. Zhang, and H. Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. InProc. ICML, 2024

work page 2024

[5] [5]

Y . Li, F. Wei, C. Zhang, and H. Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. InProc. EMNLP, 2024

work page 2024

[6] [6]

Y . Li, F. Wei, C. Zhang, and H. Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. InProc. NeurIPS, 2025

work page 2025

[7] [7]

T. Cai, Y . Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. InProc. ICML, 2024

work page 2024

[8] [8]

Y . Fu, P. Bailis, I. Stoica, and H. Zhang. Break the sequential dependency of LLM inference using lookahead decoding. InProc. ICML, 2024

work page 2024

[9] [9]

Zhang, X

L. Zhang, X. Wang, Y . Huang, and R. Xu. Learning harmonized representations for speculative sampling (HASS). InProc. ICLR, 2025

work page 2025

[10] [10]

S. Hu, J. Li, X. Xie, Z. Lu, K.-C. Toh, and P. Zhou. GRIFFIN: Effective token alignment for faster speculative decoding. InProc. NeurIPS, 2025

work page 2025

[11] [11]

J. Chen, Y . Liang, and Z. Liu. DFlash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

work page arXiv 2026

[12] [12]

Samarin, S

A. Samarin, S. Krutikov, A. Shevtsov, S. Skvortsov, F. Fisin, and A. Golubev. LK losses: Direct acceptance rate optimization for speculative decoding.arXiv preprint arXiv:2602.23881, 2026

work page arXiv 2026

[13] [13]

Nathawani, S

D. Nathawani, S. Ding, V . Lavrukhin, I. Gitman, S. Majumdar, E. Bakhturina, B. Ginsburg, and J. Polak Scowcroft. Nemotron-Post-Training-Dataset-v2. Hugging Face dataset, 2025. https: //huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2

work page 2025

[14] [14]

Chaudhary

S. Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. GitHub repository, 2023.https://github.com/sahil280114/codealpaca

work page 2023

[15] [15]

ShareGPT: GPT-4/ChatGPT conversations shared by users

ShareGPT. ShareGPT: GPT-4/ChatGPT conversations shared by users. Hugging Face dataset (Aeala re-upload), 2023. https://huggingface.co/datasets/Aeala/ShareGPT_ Vicuna_unfiltered

work page 2023

[16] [16]

F. Liu, X. Li, K. Zhao, Y . Gao, Z. Zhou, Z. Zhang, Z. Wang, W. Dou, S. Zhong, and C. Tian. DART: Diffusion-inspired speculative decoding for fast LLM inference.arXiv preprint arXiv:2601.19278, 2026

work page arXiv 2026

[17] [17]

Arriola et al

M. Arriola et al. Block diffusion: Interpolating between autoregressive and diffusion language models. InProc. ICLR, 2025

work page 2025

[18] [18]

Large Language Diffusion Models

S. Nie et al. Large language diffusion models (LLaDA).arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Stern, N

M. Stern, N. Shazeer, and J. Uszkoreit. Blockwise parallel decoding for deep autoregressive models. InProc. NeurIPS, 2018

work page 2018

[20] [20]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron et al. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

The Llama 3 Herd of Models

A. Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Qwen3 Technical Report

A. Yang et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Zheng, W.-L

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InProc. NeurIPS Datasets & Benchmarks, 2023

work page 2023

[24] [24]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [25]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[26] [26]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[27] [27]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InProc. NeurIPS Datasets & Benchmarks, 2021

work page 2021

[28] [28]

Taori, I

R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. GitHub repository, 2023. https: //github.com/tatsu-lab/stanford_alpaca

work page 2023

[29] [29]

Hydra: Sequentially-dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

Z. Ankner, R. Parthasarathy, A. Nrusimha, C. Rinard, J. Ragan-Kelley, and W. Brandon. Hydra: Sequentially-dependent draft heads for Medusa decoding.arXiv preprint arXiv:2402.05109, 2024

work page arXiv 2024

[30] [30]

Recurrent drafter for fast speculative decoding in large language models.arXiv preprint arXiv:2403.09919, 2024

Y . Cheng, A. Zhang, X. Zhang, C. Wang, and Y . Wang. Recurrent drafter for fast speculative decoding in large language models.arXiv preprint arXiv:2403.09919, 2024

[31]

X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, C. Shi, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia. SpecInfer: Accelerating generative large language model serving with tree-based speculative inference and verification. In Proc. ASPLOS, 2024

[32]

F. Liu, Y. Tang, Z. Liu, Y. Ni, K. Han, and Y. Wang. Kangaroo: Lossless self-speculative decoding via double early exiting. In Proc. NeurIPS, 2024

[33]

L. Gui, B. Xiao, L. Su, and W. Chen. Boosting lossless speculative decoding via feature sampling and partial alignment distillation. arXiv preprint arXiv:2408.15562, 2024

[34]

Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J.-F. Kagy, and R. Agarwal. DistillSpec: Improving speculative decoding via knowledge distillation. In Proc. ICLR, 2024

[35]

X. Liu, L. Hu, P. Bailis, A. Cheung, Z. Deng, I. Stoica, and H. Zhang. Online speculative decoding. In Proc. ICML, 2024

[36]

H. Touvron et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[37]

D. Guo et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[38]

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In Proc. ICLR, 2024

[39]

H. Xia, Z. Yang, Q. Dong, P. Wang, Y. Li, T. Ge, T. Liu, W. Li, and Z. Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851, 2024

[40]

C. Du, J. Jiang, Y. Xu, J. Wu, S. Yu, Y. Li, S. Li, K. Xu, L. Nie, Z. Tu, and Y. You. GliDe with a CaPE: A low-hassle method to accelerate speculative decoding. In Proc. ICML, 2024

[41]

J. Sandler, J. K. Christopher, T. Hartvigsen, and F. Fioretto. SpecDiff-2: Scaling diffusion drafter alignment for faster speculative decoding. arXiv preprint arXiv:2511.00606, 2025

[42]

J. Wang, Y. Su, J. Li, Q. Xia, Z. Ye, X. Duan, Z. Wang, and M. Zhang. OPT-Tree: Speculative decoding with adaptive draft tree structure. Trans. Assoc. Comput. Linguist., 13:188–199, 2025

[43]

G. Li, Z. Fu, M. Fang, Q. Zhao, M. Tang, C. Yuan, and J. Wang. DiffuSpec: Unlocking diffusion language models for speculative decoding. arXiv preprint arXiv:2510.02358, 2025

[44]

J. Ye et al. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

[45]

Inception Labs, S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, et al. Mercury: Ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298, 2025

[46]

Y. Weng, D. Mei, H. Qiu, X. Chen, L. Liu, J. Tian, and Z. Shi. CORAL: Learning consistent representations across multi-step training with lighter speculative drafter. arXiv preprint arXiv:2502.16880, 2025

[47]

S. Li, C. Wang, Y. Zhu, Y. Wang, F. Yin, S. Shi, Y. Chen, X. Dong, Q. Chen, J. Pan, J. Li, L. Xie, Y. Zhang, L. Yu, Y. Wen, I. Tsang, and T. Zhang. SpecForge: A flexible and efficient open-source training framework for speculative decoding. arXiv preprint arXiv:2603.18567, 2026

A Confidence–acceptance correlation

We compute the per-block Spearman ...
