PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Chen Chen; Chen Ma; Hui-Ling Zhen; Mingxuan Yuan; Renxi Liu; Shengyin Sun; Weizhe Lin; Xianzhi Yu; XinQi Li; Yiming Li

arxiv: 2605.15609 · v1 · pith:EJO5EWJSnew · submitted 2026-05-15 · 💻 cs.CL

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Shengyin Sun , Yiming Li , Renxi Liu , Xinqi Li , Hui-Ling Zhen , Weizhe Lin , Chen Chen , Xianzhi Yu

show 2 more authors

Mingxuan Yuan Chen Ma

This is my paper

Pith reviewed 2026-05-20 19:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords diffusion large language modelsspeculative decodinginference accelerationparallel token generationadaptive unmaskinghierarchical verificationmasked denoising

0 comments

The pith

Parallel Speculative Decoding lets diffusion LLMs unmask more tokens per step and collapse steps via confidence-guided drafts while matching greedy accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion large language models can be made faster by combining two efficiency levers in one training-free procedure. dLLMs normally reveal tokens gradually across many denoising passes because each pass only refines predictions. PSD reads the model's own scores to pick which positions to reveal right away and to sketch several layers of possible future tokens at once. A single batched check then accepts the longest still-consistent sketch. The result is substantially more tokens produced for each model call without retraining or quality loss, which matters because repeated passes remain the dominant cost in these models.

Core claim

Using only the token probabilities from a single forward pass, PSD applies an adaptive policy to choose which masked positions to unmask and simultaneously assembles multi-depth speculative drafts; a subsequent batched verification applies hierarchical acceptance to retain the deepest draft that stays consistent with the updated predictions, thereby raising the number of tokens advanced per forward pass.

What carries the argument

Parallel Speculative Decoding framework, which jointly uses adaptive unmasking from scores and multi-depth speculative drafts with hierarchical verification to advance both spatial and temporal efficiency.

If this is right

Produces up to 5.5 times more tokens per forward pass on reasoning and code generation tasks.
Keeps generation accuracy comparable to greedy decoding across the evaluated models.
Requires no additional training or changes to the underlying dLLM weights.
Improves efficiency in both the number of tokens revealed per step and the number of steps collapsed per verification call.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same score-driven selection and draft construction could be tested on other iterative masked-generation methods outside the diffusion family.
If the approach scales, designers of future dLLMs might reduce the total number of denoising iterations built into the model itself.
Pairing the method with existing quantization or caching tricks could produce further speed gains on hardware with limited memory bandwidth.

Load-bearing premise

That the scores from one forward pass are reliable enough to choose unmask positions and build speculative drafts that later verification can accept without introducing errors that cannot be fixed.

What would settle it

Apply PSD to any of the three tested dLLMs on a reasoning or code benchmark and observe whether final sequence accuracy falls below the level achieved by standard greedy decoding at equivalent total compute.

Figures

Figures reproduced from arXiv: 2605.15609 by Chen Chen, Chen Ma, Hui-Ling Zhen, Mingxuan Yuan, Renxi Liu, Shengyin Sun, Weizhe Lin, Xianzhi Yu, XinQi Li, Yiming Li.

**Figure 2.** Figure 2: Accuracy vs. speedup on Dream-v0-Base-7B across 27 parameter configurations of different [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy vs. speedup on LLaDA 1.5 across 27 parameter configurations of different [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Accuracy vs. speedup on openPangu-7B-Diffusion-Base across 27 parameter configurations [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Precision@K of the undecoded candidate positions selected at step [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Contribution profiles of parallel decoding and speculative decoding over normalized block [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PSD gives diffusion LLMs a training-free boost by using one forward pass for both adaptive unmasking and multi-depth speculative drafts, but the joint prediction nature of dLLMs makes the confidence scores a potential weak link.

read the letter

PSD combines adaptive unmasking with multi-depth speculative drafts and hierarchical verification to cut down on denoising steps in diffusion LLMs. The main gain comes from using one forward pass's confidence scores to decide what to unmask and what to speculate on, then verifying in batch. This is new because it adapts speculative decoding ideas to the parallel prediction setup of dLLMs rather than just copying autoregressive methods. The paper shows results on three models for reasoning and code tasks, with up to 5.5 times more tokens per pass and accuracy matching greedy decoding. That part looks solid on the surface. The experiments report favorable trade-offs, which is useful for anyone trying to deploy these models faster. They avoid extra training, which keeps it practical. The soft spot is in the reliance on single-pass confidence scores. Diffusion models predict all positions together, so a token that looks confident early on can shift once neighbors get unmasked. The hierarchical acceptance might not catch every inconsistency if the verification pass doesn't re-score everything thoroughly. The abstract does not give variance across runs or detailed baseline comparisons, so the robustness is not fully clear yet. This work is aimed at researchers focused on inference acceleration for diffusion-based language models. A reader working on similar efficiency tricks would find the method and the reported speedups worth looking at. I would send it to peer review. The core idea is worth checking with more detailed experiments.

Referee Report

2 major / 2 minor

Summary. The paper proposes Parallel Speculative Decoding (PSD), a training-free framework for diffusion LLMs that uses confidence scores from a single forward pass to adaptively select unmasking positions and construct multi-depth speculative drafts, followed by a batched hierarchical verification step that accepts the deepest consistent draft. This jointly accelerates inference spatially (multiple unmaskings per step) and temporally (collapsing denoising steps). Experiments on three dLLMs for reasoning and code generation tasks report up to 5.5× tokens per forward pass with accuracy comparable to greedy decoding.

Significance. If the central efficiency claim holds under rigorous validation, PSD could meaningfully advance practical deployment of dLLMs by improving the speed-quality Pareto frontier without any retraining or auxiliary models. The training-free design that reuses existing model outputs for both policy decisions and verification is a clear strength, as is the evaluation across multiple models and task types. However, the significance depends on whether single-pass confidence scores can reliably handle token interdependencies in joint denoising.

major comments (2)

[§3] §3 (PSD Framework): The central claim of up to 5.5× tokens per forward pass with greedy-level accuracy rests on the assumption that confidence scores from one forward pass can safely drive both the adaptive unmasking policy and multi-depth draft construction. Because dLLMs denoise all masked positions jointly, a high-confidence token selected in the initial pass can become inconsistent once neighboring masks are updated; the subsequent hierarchical verification (which re-uses the same model) may accept erroneous drafts if its consistency check does not fully re-denoise the sequence. This interdependence is load-bearing for the quality-preservation guarantee.
[§4] §4 (Experiments): The reported favorable trade-offs and 5.5× efficiency gain provide no details on exact baselines used, number of runs, variance or standard deviations, statistical significance tests, or whether the configurable adaptive policy was tuned post-hoc on the test sets. Without these, it is difficult to determine whether the accuracy remains comparable to greedy decoding in a robust, reproducible manner across the three dLLMs and tasks.

minor comments (2)

[§3.3] The term 'hierarchical acceptance' would benefit from explicit pseudocode or a formal definition of the consistency check to clarify how it differs from standard speculative decoding verification.
[§4.1] Notation for 'tokens per forward pass' should be defined with an equation or clear formula in the efficiency analysis section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below in detail, providing clarifications and indicating revisions made to strengthen the presentation of the PSD framework and experimental results.

read point-by-point responses

Referee: [§3] §3 (PSD Framework): The central claim of up to 5.5× tokens per forward pass with greedy-level accuracy rests on the assumption that confidence scores from one forward pass can safely drive both the adaptive unmasking policy and multi-depth draft construction. Because dLLMs denoise all masked positions jointly, a high-confidence token selected in the initial pass can become inconsistent once neighboring masks are updated; the subsequent hierarchical verification (which re-uses the same model) may accept erroneous drafts if its consistency check does not fully re-denoise the sequence. This interdependence is load-bearing for the quality-preservation guarantee.

Authors: We appreciate the referee's emphasis on token interdependencies arising from joint denoising in dLLMs. In PSD, confidence scores from the initial forward pass inform both the adaptive unmasking policy and the construction of multi-depth speculative drafts. The subsequent batched hierarchical verification performs forward passes on the candidate drafts, which incorporate the newly unmasked tokens, thereby allowing the model to generate updated predictions that reflect changes in neighboring positions. Only the deepest draft consistent with these updated predictions is accepted. While this does not guarantee complete independence from all interdependencies without additional passes, the verification step explicitly re-evaluates consistency under the revised context. We have added a dedicated paragraph in §3 clarifying this mechanism and acknowledging the inherent limitations of single-pass decisions in joint denoising. revision: partial
Referee: [§4] §4 (Experiments): The reported favorable trade-offs and 5.5× efficiency gain provide no details on exact baselines used, number of runs, variance or standard deviations, statistical significance tests, or whether the configurable adaptive policy was tuned post-hoc on the test sets. Without these, it is difficult to determine whether the accuracy remains comparable to greedy decoding in a robust, reproducible manner across the three dLLMs and tasks.

Authors: We agree that the original manuscript omitted key reproducibility details. The primary baseline is standard greedy decoding on the same dLLMs, with additional comparisons to other inference acceleration techniques as described in the paper. All reported results are means over 3 independent runs using different random seeds, now accompanied by standard deviations in the revised tables. We applied paired t-tests to assess statistical significance of differences versus greedy decoding and report the corresponding p-values. Hyperparameters of the adaptive unmasking policy were selected exclusively on held-out validation splits for each task and model, with no post-hoc adjustment on test data. Section 4 has been updated to include these specifics along with expanded tables presenting variance and significance metrics. revision: yes

Circularity Check

0 steps flagged

No circularity in PSD derivation chain

full rationale

The paper presents a training-free algorithmic framework for parallel speculative decoding in dLLMs. It uses single-forward-pass confidence scores to drive adaptive unmasking and multi-depth draft construction, followed by batched hierarchical verification. No equations, procedures, or self-citations reduce the reported 5.5× tokens-per-pass gains to fitted parameters, self-definitional loops, or renamed known results. The central claims rest on empirical evaluation across models and tasks rather than any derivation that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the reliability of model confidence scores for guiding unmasking and draft construction, plus the assumption that batch verification can recover from any introduced inconsistencies without quality loss.

free parameters (1)

configurable adaptive unmasking policy
The policy that decides which positions to unmask based on confidence scores is described as configurable but its exact parameters or thresholds are not detailed.

axioms (1)

domain assumption Model confidence scores from a single forward pass reliably indicate which tokens can be safely unmasked or used for speculative drafts
This assumption underpins both the position selection and the construction of multi-depth drafts before verification.

pith-pipeline@v0.9.0 · 5731 in / 1349 out tokens · 47682 ms · 2026-05-20T19:02:46.908473+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 17 internal anchors

[1]

OpenAI o1 System Card

OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, et al. Openai o1 system card. InarXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. InarXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The llama 3 herd of models. InarXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Speed Always Wins: a survey on efficient architectures for large language models

Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, and Yu Cheng. Speed Always Wins: a survey on efficient architectures for large language models. InarXiv:2508.09834,

work page arXiv
[5]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InarXiv:2502.09992,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. InarXiv:2508.15487,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Efficient Diffusion Language Models: A comprehensive survey

Haokun Lin, Xinle Jia, Shaozhen Liu, Shujun Xia, Weitao Huang, Haobo Xu, Junyang Li, Yicheng Xiao, Xingrun Xing, Ziyu Guo, Renrui Zhang, Qi Li, Yichen Wu, Renzhen Wang, Xiaojuan Qi, Caifeng Shan, Hongsheng Li, and Zhenan Sun. Efficient Diffusion Language Models: A comprehensive survey. In Authorea:au.176918713.36402137,

work page arXiv
[8]

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Efficient-DLM: from autoregressive to diffusion language models, and beyond in speed. InarXiv:2512.14067, 2025a. Han Peng, Peiyu Liu, Zican Dong, Daixuan Cheng, Junyi...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Accelerating diffusion llm inference via local determinism propagation

Fanheng Kong, Jingyuan Zhang, Yahui Liu, Zirui Wu, Yu Tian, Victoria W., and Guorui Zhou. Accelerating diffusion llm inference via local determinism propagation. InarXiv:2510.07081,

work page arXiv
[10]

Lopa: Scaling dllm inference via looka- head parallel decoding.arXiv preprint arXiv:2512.16229,

Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, and Zhijie Deng. LoPA: scaling dllm inference via lookahead parallel decoding. In arXiv:2512.16229,

work page arXiv
[11]

Accelerating Large Language Model Decoding with Speculative Sampling

10 Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dLLM: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. In Proc. Int. Conf. Learning Representations, 2026a. Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, P...

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Spiffy: Multiplying diffusion llm accel- eration via lossless speculative decoding.arXiv preprint arXiv:2509.18085,

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, and Fatih Porikli. Spiffy: multiplying diffusion llm acceleration via lossless speculative decoding. InarXiv:2509.18085,

work page arXiv
[13]

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. InarXiv:2505.19223, 2025a. Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. A survey on diffusion language models. In arXiv:2...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

dinfer: An efficient inference framework for diffusion language models, 2025

Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, and Da Zheng. dInfer: an efficient inference framework for diffusion language models. InarXiv:2510.08666,

work page arXiv
[15]

Llada-moe: A sparse moe diffusion language model.arXiv preprint arXiv:2509.24389,

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, and Ji-Rong Wen. LLaDA-MoE: a sparse moe diffusion ...

work page arXiv
[16]

Suffixdecoding: Extreme speculative decoding for emerging ai applications

Gabriele Oliaro, Zhihao Jia, Daniel Campos, and Aurick Qiao. Suffixdecoding: Extreme speculative decoding for emerging ai applications. InarXiv:2411.04975,

work page arXiv
[17]

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich, Maor Ashkenazi, Carl Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, and Yonatan Geifman. SPEED-Bench: A unified and diverse benchmark for speculative decoding. InarXiv:2604.09557,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

SpecForge: A flexible and efficient open-source training framework for speculative decoding

Shenggui Li, Chao Wang, Yikai Zhu, Yubo Wang, Fan Yin, Shuai Shi, Yefei Chen, Xiaomin Dong, Qiaoling Chen, Jin Pan, Ji Li, Laixin Xie, Yineng Zhang, Lei Yu, Yonggang Wen, Ivor Tsang, and Tianwei Zhang. SpecForge: A flexible and efficient open-source training framework for speculative decoding. InarXiv:2603.18567, 2026a. 11 Yifeng Gao, Ziang Ji, Yuxuan Wan...

work page arXiv
[19]

Why diffusion language models struggle with truly parallel (non-autoregressive) decoding? InarXiv:2602.23225, 2026b

Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, and Shiwei Liu. Why diffusion language models struggle with truly parallel (non-autoregressive) decoding? InarXiv:2602.23225, 2026b. Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language model with parallel decoding. InarXiv:2505.16990,

work page arXiv
[20]

Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding, Xibei Jia, Yuting Shen, Zhenzhong Lan, Liwang Zhu, Weiping Liu, Junlin Zhou, Haisheng Liu, Zhong Xin Yu, Pengxin Luo, Donglian Qi, Yunfeng Yan, and Junbo Zhao. Parallelism and generation order in masked diffusion language models: Limits today, potential tomorrow. InarXiv:2601.15593,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. InarXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. In arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

From bits to rounds: Parallel decoding with exploration for diffusion language models

Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan, and Jiantao Jiao. From bits to rounds: Parallel decoding with exploration for diffusion language models. InarXiv:2511.21103, 2025b. Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, and Yisen Wang. Decoding large language diffusion models with foreseeing movement.CoRR, abs/2512.04135,

work page arXiv
[25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. In arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification. InarXiv:2305.09781,

work page arXiv
[27]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. InarXiv:2401.10774,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. InProc. Int. Conf. Machine Learning, 2024a. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. InProc. Conf. Empirical Methods in Natural Language Processing,...

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. InarXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

12 A Additional Experimental Setup A.1 Additional Details on Models and Benchmarks Models.We evaluate three open-source dLLMs that are comparable in scale but differ substantially in how diffusion modeling is introduced during training. This selection allows us to examine whether the empirical observations hold across different dLLM construction pipelines...

work page 2025

[1] [1]

OpenAI o1 System Card

OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, et al. Openai o1 system card. InarXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. InarXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The llama 3 herd of models. InarXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Speed Always Wins: a survey on efficient architectures for large language models

Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, and Yu Cheng. Speed Always Wins: a survey on efficient architectures for large language models. InarXiv:2508.09834,

work page arXiv

[5] [5]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InarXiv:2502.09992,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. InarXiv:2508.15487,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Efficient Diffusion Language Models: A comprehensive survey

Haokun Lin, Xinle Jia, Shaozhen Liu, Shujun Xia, Weitao Huang, Haobo Xu, Junyang Li, Yicheng Xiao, Xingrun Xing, Ziyu Guo, Renrui Zhang, Qi Li, Yichen Wu, Renzhen Wang, Xiaojuan Qi, Caifeng Shan, Hongsheng Li, and Zhenan Sun. Efficient Diffusion Language Models: A comprehensive survey. In Authorea:au.176918713.36402137,

work page arXiv

[8] [8]

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Efficient-DLM: from autoregressive to diffusion language models, and beyond in speed. InarXiv:2512.14067, 2025a. Han Peng, Peiyu Liu, Zican Dong, Daixuan Cheng, Junyi...

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Accelerating diffusion llm inference via local determinism propagation

Fanheng Kong, Jingyuan Zhang, Yahui Liu, Zirui Wu, Yu Tian, Victoria W., and Guorui Zhou. Accelerating diffusion llm inference via local determinism propagation. InarXiv:2510.07081,

work page arXiv

[10] [10]

Lopa: Scaling dllm inference via looka- head parallel decoding.arXiv preprint arXiv:2512.16229,

Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, and Zhijie Deng. LoPA: scaling dllm inference via lookahead parallel decoding. In arXiv:2512.16229,

work page arXiv

[11] [11]

Accelerating Large Language Model Decoding with Speculative Sampling

10 Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dLLM: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. In Proc. Int. Conf. Learning Representations, 2026a. Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, P...

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Spiffy: Multiplying diffusion llm accel- eration via lossless speculative decoding.arXiv preprint arXiv:2509.18085,

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, and Fatih Porikli. Spiffy: multiplying diffusion llm acceleration via lossless speculative decoding. InarXiv:2509.18085,

work page arXiv

[13] [13]

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. InarXiv:2505.19223, 2025a. Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. A survey on diffusion language models. In arXiv:2...

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

dinfer: An efficient inference framework for diffusion language models, 2025

Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, and Da Zheng. dInfer: an efficient inference framework for diffusion language models. InarXiv:2510.08666,

work page arXiv

[15] [15]

Llada-moe: A sparse moe diffusion language model.arXiv preprint arXiv:2509.24389,

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, and Ji-Rong Wen. LLaDA-MoE: a sparse moe diffusion ...

work page arXiv

[16] [16]

Suffixdecoding: Extreme speculative decoding for emerging ai applications

Gabriele Oliaro, Zhihao Jia, Daniel Campos, and Aurick Qiao. Suffixdecoding: Extreme speculative decoding for emerging ai applications. InarXiv:2411.04975,

work page arXiv

[17] [17]

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich, Maor Ashkenazi, Carl Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, and Yonatan Geifman. SPEED-Bench: A unified and diverse benchmark for speculative decoding. InarXiv:2604.09557,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

SpecForge: A flexible and efficient open-source training framework for speculative decoding

Shenggui Li, Chao Wang, Yikai Zhu, Yubo Wang, Fan Yin, Shuai Shi, Yefei Chen, Xiaomin Dong, Qiaoling Chen, Jin Pan, Ji Li, Laixin Xie, Yineng Zhang, Lei Yu, Yonggang Wen, Ivor Tsang, and Tianwei Zhang. SpecForge: A flexible and efficient open-source training framework for speculative decoding. InarXiv:2603.18567, 2026a. 11 Yifeng Gao, Ziang Ji, Yuxuan Wan...

work page arXiv

[19] [19]

Why diffusion language models struggle with truly parallel (non-autoregressive) decoding? InarXiv:2602.23225, 2026b

Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, and Shiwei Liu. Why diffusion language models struggle with truly parallel (non-autoregressive) decoding? InarXiv:2602.23225, 2026b. Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language model with parallel decoding. InarXiv:2505.16990,

work page arXiv

[20] [20]

Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding, Xibei Jia, Yuting Shen, Zhenzhong Lan, Liwang Zhu, Weiping Liu, Junlin Zhou, Haisheng Liu, Zhong Xin Yu, Pengxin Luo, Donglian Qi, Yunfeng Yan, and Junbo Zhao. Parallelism and generation order in masked diffusion language models: Limits today, potential tomorrow. InarXiv:2601.15593,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. InarXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. In arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

From bits to rounds: Parallel decoding with exploration for diffusion language models

Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan, and Jiantao Jiao. From bits to rounds: Parallel decoding with exploration for diffusion language models. InarXiv:2511.21103, 2025b. Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, and Yisen Wang. Decoding large language diffusion models with foreseeing movement.CoRR, abs/2512.04135,

work page arXiv

[25] [25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. In arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification. InarXiv:2305.09781,

work page arXiv

[27] [27]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. InarXiv:2401.10774,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. InProc. Int. Conf. Machine Learning, 2024a. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. InProc. Conf. Empirical Methods in Natural Language Processing,...

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. InarXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

12 A Additional Experimental Setup A.1 Additional Details on Models and Benchmarks Models.We evaluate three open-source dLLMs that are comparable in scale but differ substantially in how diffusion modeling is introduced during training. This selection allows us to examine whether the empirical observations hold across different dLLM construction pipelines...

work page 2025