pith. machine review for the scientific record.

arxiv: 2505.22618 · v3 · submitted 2025-05-28 · 💻 cs.CL

Recognition: 2 theorem links

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 04:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords diffusion llm · kv cache · parallel decoding · inference acceleration · non-autoregressive generation · training-free optimization · llm throughput

The pith

Diffusion LLMs can reach up to 27.6 times higher throughput by adding a reusable block-wise KV cache and decoding only high-confidence tokens in parallel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion-based large language models support parallel token generation in principle, yet they have run slower than autoregressive models because they lack a key-value cache and because simultaneous decoding breaks learned token dependencies. The paper demonstrates that a block-wise approximate KV cache can be reused across diffusion steps with only negligible quality loss, and that a simple threshold can select which tokens to decode together safely. When tested on LLaDA and Dream models across standard benchmarks, these changes produce up to 27.6 times higher throughput while keeping accuracy close to the original models. If the gains hold, diffusion LLMs become competitive for practical text generation workloads.
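The threshold rule that carries the parallel-decoding half of the claim can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name and the always-decode-at-least-one fallback are our assumptions (the referee report below notes the paper leaves the exact selection rule unspecified).

```python
import numpy as np

def confidence_parallel_decode(probs, threshold=0.9):
    """Pick which masked positions to decode in one diffusion step.

    probs: (num_masked, vocab_size) array of per-position token probabilities.
    Positions whose top probability clears the threshold are decoded together;
    if none qualify, the single most confident position is decoded so the
    sampler always makes progress (fallback is our assumption).
    """
    top_tokens = probs.argmax(axis=-1)
    top_conf = probs.max(axis=-1)
    positions = np.flatnonzero(top_conf >= threshold)
    if positions.size == 0:
        positions = np.array([int(top_conf.argmax())])
    return positions, top_tokens[positions]

# Toy step: position 0 is confident enough to decode; position 1 waits.
probs = np.array([[0.97, 0.01, 0.01, 0.005, 0.005],
                  [0.30, 0.30, 0.20, 0.10, 0.10]])
pos, toks = confidence_parallel_decode(probs, threshold=0.9)
```

The point of the gate is that low-confidence positions are deferred to a later denoising step, where more of their neighbors are already fixed, rather than being decoded under a violated conditional-independence assumption.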

Core claim

A block-wise approximate KV cache mechanism tailored for bidirectional diffusion models enables cache reuse with negligible performance drop, while a confidence-aware parallel decoding strategy selectively decodes only tokens above a fixed threshold, thereby mitigating dependency violations and preserving generation quality.

What carries the argument

Block-wise approximate KV cache combined with a confidence threshold that controls which tokens are decoded in parallel
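Why a block-wise cache buys so much can be seen with a back-of-envelope cost model. The counter below is our construction, not the paper's accounting: it assumes KV is refreshed once over the full sequence at each block boundary and that only the active block is re-projected at every denoising step within a block.

```python
def kv_projection_cost(seq_len, block_size, steps_per_block, cached):
    """Count token KV projections for one generation (toy cost model)."""
    num_blocks = seq_len // block_size
    if not cached:
        # Without a cache, every denoising step re-projects the whole sequence.
        return num_blocks * steps_per_block * seq_len
    cost = 0
    for _ in range(num_blocks):
        cost += seq_len                       # one full refresh at the block boundary
        cost += steps_per_block * block_size  # only the active block per step
    return cost

baseline = kv_projection_cost(1024, 32, 32, cached=False)   # 1,048,576
with_cache = kv_projection_cost(1024, 32, 32, cached=True)  # 65,536
speedup = baseline / with_cache                             # 16.0 under this model
```

Under these assumed settings the toy model already yields a 16× reduction in KV work; the paper's 27.6× figure additionally reflects parallel decoding, which cuts the number of steps per block.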

If this is right

  • Throughput rises by as much as 27.6 times on existing Diffusion LLM checkpoints.
  • Accuracy remains close to the base model on standard language benchmarks.
  • The speed gap between diffusion and autoregressive models is largely closed.
  • No retraining is required, so existing open-source Diffusion LLMs can be accelerated immediately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block-wise cache pattern could be tested on other bidirectional sequence models outside language.
  • Replacing the fixed threshold with a length-dependent or entropy-based rule might reduce the few remaining quality drops.
  • Hardware kernels that exploit the block structure could push the speedup beyond the reported software numbers.

Load-bearing premise

The block-wise KV cache approximation introduces only negligible error and a single fixed threshold works across benchmarks without needing per-task retuning.

What would settle it

A direct comparison on a held-out long-sequence benchmark showing either that the accelerated model loses more than a few percent accuracy or that cache reuse causes measurable cumulative drift relative to full recomputation.

read the original abstract

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Fast-dLLM, a training-free acceleration technique for diffusion-based LLMs. It proposes a block-wise approximate KV cache tailored to bidirectional diffusion attention to enable cache reuse, and a confidence-aware parallel decoding strategy that selectively decodes high-confidence tokens to avoid dependency violations under the conditional independence assumption. Experiments on LLaDA and Dream models across standard LLM benchmarks report up to 27.6× throughput improvement with minimal accuracy loss, narrowing the gap to autoregressive models.

Significance. If the central claims hold, the work would meaningfully advance practical deployment of diffusion LLMs by delivering substantial inference speedups without retraining, leveraging their inherent parallel decoding capability. The training-free design and reported empirical gains on multiple models and benchmarks constitute a concrete engineering contribution, though the absence of supporting analysis for the key approximations limits the strength of the significance assessment.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (block-wise KV cache): the claim that the block-wise approximation enables cache reuse 'with negligible performance drop' is load-bearing for the throughput results, yet the manuscript provides no error-bound analysis, dependency-handling rule for future tokens in bidirectional attention, or quantitative characterization of the approximation error.
  2. [§4] §4 (confidence-aware parallel decoding): the strategy relies on a single fixed confidence threshold, but the exact selection rule is unspecified and no sensitivity analysis or cross-benchmark validation without per-task retuning is presented, leaving the 'minimal accuracy loss' claim vulnerable to benchmark-specific tuning.
  3. [Experiments] Experimental section: the reported 27.6× throughput figures rest on the two unverified conditions above; without ablation on block size, threshold sensitivity, or error metrics for the KV approximation, it is unclear whether the gains generalize or are tied to particular benchmark choices.
minor comments (2)
  1. [Notation and §4] Clarify notation for block size, confidence threshold, and the precise condition under which a token is decoded in parallel.
  2. [Figures] Add error bars or multiple-run statistics to throughput and accuracy plots to support the 'minimal loss' claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript's claims regarding the KV cache approximation and parallel decoding strategy.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (block-wise KV cache): the claim that the block-wise approximation enables cache reuse 'with negligible performance drop' is load-bearing for the throughput results, yet the manuscript provides no error-bound analysis, dependency-handling rule for future tokens in bidirectional attention, or quantitative characterization of the approximation error.

    Authors: We agree that formal error-bound analysis and explicit dependency rules would strengthen the presentation. The block-wise approximation reuses cached keys and values for tokens within the same diffusion block while approximating cross-block interactions under the bidirectional attention pattern; future tokens are handled by a mask that prevents premature dependency violations during the denoising steps. Although we lack a closed-form error bound, the empirical results on LLaDA and Dream show accuracy drops below 1% on average across benchmarks. In revision we will add a dedicated subsection with quantitative error metrics (e.g., average attention-score deviation) and a clear statement of the dependency-handling rule. revision: partial

  2. Referee: [§4] §4 (confidence-aware parallel decoding): the strategy relies on a single fixed confidence threshold, but the exact selection rule is unspecified and no sensitivity analysis or cross-benchmark validation without per-task retuning is presented, leaving the 'minimal accuracy loss' claim vulnerable to benchmark-specific tuning.

    Authors: The threshold is fixed at a single value chosen on a validation split and applied uniformly; we will state this selection rule explicitly in the revised §4. We will also add a sensitivity study across thresholds on all reported benchmarks, confirming that accuracy remains stable without per-task retuning and thereby supporting the claim of minimal accuracy loss. revision: yes

  3. Referee: [Experiments] Experimental section: the reported 27.6× throughput figures rest on the two unverified conditions above; without ablation on block size, threshold sensitivity, or error metrics for the KV approximation, it is unclear whether the gains generalize or are tied to particular benchmark choices.

    Authors: We acknowledge that additional ablations would better demonstrate generalization. The current results already span two distinct diffusion LLMs and multiple standard benchmarks, but we will expand the experimental section with block-size ablations, threshold-sensitivity curves, and explicit KV-approximation error metrics to clarify that the reported speedups are not benchmark-specific. revision: yes
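The "average attention-score deviation" metric promised in response 1 is easy to make concrete. One plausible form (our sketch; the manuscript does not define the metric) compares attention rows computed with fresh keys against rows computed with a stale cached copy:

```python
import numpy as np

def attention_deviation(q, k_fresh, k_stale):
    """Mean absolute difference between softmax attention rows computed with
    fresh keys versus a stale cached copy (illustrative error metric)."""
    def attn_rows(k):
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        return w / w.sum(axis=-1, keepdims=True)
    return float(np.abs(attn_rows(k_fresh) - attn_rows(k_stale)).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(16, 8))
# Small key perturbation stands in for cache staleness across denoising steps.
k_stale = k + 0.01 * rng.normal(size=k.shape)
dev = attention_deviation(q, k, k_stale)  # small but nonzero
```

A metric of this shape would let the authors report approximation error directly, rather than inferring negligibility from end-task accuracy alone.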

Circularity Check

0 steps flagged

No significant circularity in the paper's engineering methods

full rationale

The paper presents a training-free acceleration approach via block-wise approximate KV cache and confidence-aware parallel decoding for diffusion LLMs. These are described as practical mechanisms whose effectiveness is demonstrated empirically on LLaDA and Dream models across benchmarks, with reported throughput gains and minimal accuracy loss. No mathematical derivation chain exists that reduces a claimed prediction or result to a fitted parameter or self-defined quantity by construction. The approximations and threshold choice are validated through experiments rather than justified via self-citation load-bearing arguments or ansatz smuggling. The central claims rest on external benchmark comparisons, making the work self-contained against those benchmarks without internal circular reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach relies on standard transformer attention mechanics and the diffusion denoising process; the only added assumptions are that block-wise KV approximation preserves sufficient signal and that token confidence correlates with correct dependency structure.

free parameters (1)
  • confidence threshold
    Value used to decide which tokens are decoded in parallel; must be chosen to balance speed and quality.
axioms (1)
  • domain assumption: Bidirectional attention in diffusion models permits block-wise KV cache reuse with only small error
    Invoked to justify the cache mechanism without showing the approximation error bound.

pith-pipeline@v0.9.0 · 5502 in / 1148 out tokens · 65862 ms · 2026-05-16T04:22:39.565036+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NPU Design for Diffusion Language Model Inference

    cs.AR 2026-01 unverdicted novelty 8.0

    Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.

  2. Support Before Frequency in Discrete Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.

  3. TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

  4. LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.

  5. Fast Byte Latent Transformer

    cs.CL 2026-05 unverdicted novelty 7.0

    BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

  6. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

  7. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.

  8. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  9. Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

    cs.CL 2026-05 unverdicted novelty 7.0

    FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

  10. DARE: Diffusion Language Model Activation Reuse for Efficient Inference

    cs.LG 2026-05 unverdicted novelty 7.0

    DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.

  11. Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 conditional novelty 7.0

    Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...

  12. NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

  13. ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    cs.LG 2026-04 unverdicted novelty 7.0

    ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

  14. Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    cs.CL 2026-02 unverdicted novelty 7.0

    Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

  15. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...

  16. How to Train Your Latent Diffusion Language Model Jointly With the Latent Space

    cs.CL 2026-05 unverdicted novelty 6.0

    Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.

  17. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  18. Consistent Diffusion Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.

  19. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  20. DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.

  21. ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

    cs.LG 2026-01 conditional novelty 6.0

    ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.

  22. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 20 Pith papers · 2 internal anchors

  1. [1]

    Block diffusion: Interpolating between autoregressive and diffusion language models

    Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models, 2025

  2. [2]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021

  3. [3]

    A continuous time framework for discrete denoising models

    Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022

  4. [4]

    Fast sampling via de-randomization for discrete diffusion models

    Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu. Fast sampling via de-randomization for discrete diffusion models. arXiv preprint arXiv:2312.09193, 2023

  5. [5]

    Discrete flow matching

    Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. arXiv preprint arXiv:2407.15595, 2024

  6. [6]

    Approximate accelerated stochastic simulation of chemically reacting systems

    Daniel T Gillespie. Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of chemical physics, 115(4):1716–1733, 2001

  7. [7]

    Scaling diffusion language models via adaptation from autoregressive models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024

  8. [8]

    Gemini diffusion

    Google DeepMind. Gemini diffusion. https://deepmind.google/models/gemini-diffusion, 2025. Accessed: 2025-05-24

  9. [9]

    The llama 3 herd of models, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. The llama 3 herd of models, 2024

  10. [10]

    Diffusionbert: Improving generative masked language models with diffusion models

    Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. arXiv preprint arXiv:2211.15029, 2022

  11. [11]

    Argmax flows and multinomial diffusion: Learning categorical distributions

    Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021

  12. [12]

    Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models, 2023

    Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models, 2023

  13. [13]

    Introducing mercury: The first commercial diffusion-based language model

    Inception Labs. Introducing mercury: The first commercial diffusion-based language model. https://www.inceptionlabs.ai/introducing-mercury, 2025. Accessed: 2025-05-24

  14. [14]

    Disk: A diffusion model for structured knowledge

    Ouail Kitouni, Niklas Nolte, James Hensman, and Bhaskar Mitra. Disk: A diffusion model for structured knowledge. arXiv preprint arXiv:2312.05253, 2023

  15. [15]

    Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019

  16. [16]

    Discrete copula diffusion

    Anji Liu, Oliver Broadrick, Mathias Niepert, and Guy Van den Broeck. Discrete copula diffusion. arXiv preprint arXiv:2410.01949, 2024

  17. [17]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023

  18. [18]

    Concrete score matching: Generalized score matching for discrete data

    Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532–34545, 2022

  19. [19]

    Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022

  20. [20]

    Scaling up masked diffusion models on text, 2025

    Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text, 2025

  21. [21]

    Large language diffusion models, 2025

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models, 2025

  22. [22]

    Your absorbing discrete diffusion secretly models the conditional distributions of clean data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024

  23. [23]

    Zero-shot text-to-image generation, 2021

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021

  24. [24]

    Diffuser: Discrete diffusion via edit-based reconstruction

    Machel Reid, Vincent J. Hellendoorn, and Graham Neubig. Diffuser: Discrete diffusion via edit-based reconstruction, 2022

  25. [25]

    High-resolution image synthesis with latent diffusion models, 2022

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022

  26. [26]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022

  27. [27]

    Simple and effective masked diffusion language models

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024

  28. [28]

    Simplified and generalized masked diffusion for discrete data

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024

  29. [29]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

  30. [30]

    Ideas in inference-time scaling can benefit generative pre-training algorithms

    Jiaming Song and Linqi Zhou. Ideas in inference-time scaling can benefit generative pre-training algorithms. arXiv preprint arXiv:2503.07154, 2025

  31. [31]

    Score-based continuous-time discrete diffusion models

    Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022

  32. [32]

    Attention Is All You Need

    Ashish Vaswani et al. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

  33. [33]

    A survey on non-autoregressive generation for neural machine translation and beyond, 2023

    Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-Yan Liu. A survey on non-autoregressive generation for neural machine translation and beyond, 2023

  34. [34]

    Energy-based diffusion language models for text generation

    Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, and Arash Vahdat. Energy-based diffusion language models for text generation. arXiv preprint arXiv:2410.21357, 2024

  35. [35]

    Diffsound: Discrete diffusion model for text-to-sound generation, 2023

    Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation, 2023

  36. [36]

    Dream 7b, 2025

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b, 2025

  37. [37]

    Diffusion language models can perform many tasks with scaling and instruction-finetuning

    Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, and Quanquan Gu. Diffusion language models can perform many tasks with scaling and instruction-finetuning. arXiv preprint arXiv:2308.12219, 2023

  38. [38]

    Llada-v: Large language diffusion models with visual instruction tuning

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

  39. [39]

    Discrete diffusion in large language and multimodal models: A survey, 2025

    Runpeng Yu, Qi Li, and Xinchao Wang. Discrete diffusion in large language and multimodal models: A survey, 2025

  40. [40]

    Dimple: Discrete diffusion multimodal large language model with parallel decoding, 2025

    Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language model with parallel decoding, 2025

  41. [41]

    Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

    Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024

  42. [42]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  43. [43]

    Diffusion-nat: Self-prompting discrete diffusion for non-autoregressive text generation, 2023

    Kun Zhou, Yifan Li, Wayne Xin Zhao, and Ji-Rong Wen. Diffusion-nat: Self-prompting discrete diffusion for non-autoregressive text generation, 2023

  44. [44]

    Llada 1.5: Variance-reduced preference optimization for large language diffusion models, 2025

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models, 2025