Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Pith reviewed 2026-05-16 04:22 UTC · model grok-4.3
The pith
Diffusion LLMs can reach up to 27.6 times higher throughput by adding a reusable block-wise KV cache and decoding only high-confidence tokens in parallel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A block-wise approximate KV cache mechanism tailored for bidirectional diffusion models enables cache reuse with negligible performance drop, while a confidence-aware parallel decoding strategy selectively decodes only tokens above a fixed threshold, thereby mitigating dependency violations and preserving generation quality.
What carries the argument
Block-wise approximate KV cache combined with a confidence threshold that controls which tokens are decoded in parallel
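The threshold rule can be illustrated with a minimal sketch. It assumes that each masked position comes with a predicted distribution over the vocabulary, and that a position is decoded when its top probability reaches the threshold. The function name `select_confident_tokens`, the value 0.9, and the fallback that decodes the single most confident position when nothing clears the threshold are assumptions of this sketch, not details confirmed on this page.

```python
from typing import Dict, List, Tuple


def select_confident_tokens(
    probs: Dict[int, List[float]],
    threshold: float = 0.9,  # hypothetical value, chosen only for illustration
) -> Dict[int, int]:
    """Map each position that is decoded in this step to its chosen token id.

    probs maps a masked position to its predicted distribution over the
    vocabulary. A position is decoded when its top probability reaches the
    threshold. If no position qualifies, the single most confident one is
    decoded so that every step makes progress (an assumption of this sketch).
    """
    if not probs:
        return {}

    # Top token and its probability for every masked position.
    best: Dict[int, Tuple[int, float]] = {}
    for pos, dist in probs.items():
        token = max(range(len(dist)), key=dist.__getitem__)
        best[pos] = (token, dist[token])

    chosen = {pos: tok for pos, (tok, conf) in best.items() if conf >= threshold}
    if not chosen:
        pos = max(best, key=lambda p: best[p][1])
        chosen = {pos: best[pos][0]}
    return chosen


# Toy distributions over a three-token vocabulary for masked positions 3, 4, 5.
step_probs = {
    3: [0.05, 0.92, 0.03],
    4: [0.40, 0.35, 0.25],
    5: [0.01, 0.01, 0.98],
}
decoded = select_confident_tokens(step_probs, threshold=0.9)
```

In this toy step, positions 3 and 5 are decoded together and position 4 stays masked for a later step.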
If this is right
- Throughput rises by as much as 27.6 times on existing Diffusion LLM checkpoints.
- Accuracy remains close to the base model on standard language benchmarks.
- The speed gap between diffusion and autoregressive models is largely closed.
- No retraining is required, so existing open-source Diffusion LLMs can be accelerated immediately.
Where Pith is reading between the lines
- The same block-wise cache pattern could be tested on other bidirectional sequence models outside language.
- Replacing the fixed threshold with a length-dependent or entropy-based rule might reduce the few remaining quality drops.
- Hardware kernels that exploit the block structure could push the speedup beyond the reported software numbers.
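The entropy-based alternative mentioned in the list above could look like the following sketch. It is speculative in the same way the bullet is: the function names and the bound `max_entropy=0.2` are hypothetical, and nothing here is taken from the paper.

```python
import math
from typing import Dict, List


def entropy(dist: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def select_low_entropy(probs: Dict[int, List[float]], max_entropy: float) -> List[int]:
    """Return the masked positions whose predictive entropy is at most max_entropy."""
    return sorted(pos for pos, dist in probs.items() if entropy(dist) <= max_entropy)


# Toy distributions over a three-token vocabulary.
example = {
    0: [0.98, 0.01, 0.01],  # sharply peaked
    1: [0.50, 0.50, 0.00],  # two-way tie
    2: [0.34, 0.33, 0.33],  # near uniform
}
picked = select_low_entropy(example, max_entropy=0.2)
```

Unlike a rule based on the top probability alone, an entropy bound looks at the whole distribution, so it separates a two-way tie from a diffuse spread that shares the same top probability.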
Load-bearing premise
The block-wise KV cache approximation introduces only negligible error and a single fixed threshold works across benchmarks without needing per-task retuning.
What would settle it
A direct comparison on a held-out long-sequence benchmark showing either that the accelerated model loses more than a few percent accuracy or that cache reuse causes measurable cumulative drift compared with full recomputation.
Original abstract
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
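The dependency problem named in the abstract can be made concrete with a toy calculation. The sketch below assumes a two-position target with only two valid word pairs, and measures how much probability mass lands on invalid pairs when each position is sampled independently from its marginal. The word pairs and names are illustrative choices for this sketch.

```python
from itertools import product
from typing import Dict, Tuple

# Toy joint distribution over two positions: only these two pairs are valid.
joint: Dict[Tuple[str, str], float] = {
    ("high", "card"): 0.5,
    ("full", "house"): 0.5,
}


def marginal(dist: Dict[Tuple[str, str], float], index: int) -> Dict[str, float]:
    """Sum the joint over the other position to get one position's marginal."""
    out: Dict[str, float] = {}
    for pair, p in dist.items():
        out[pair[index]] = out.get(pair[index], 0.0) + p
    return out


first, second = marginal(joint, 0), marginal(joint, 1)

# Decoding both positions in parallel under conditional independence
# samples from the product of the two marginals.
independent = {(a, b): first[a] * second[b] for a, b in product(first, second)}

# Mass assigned to pairs that never occur under the true joint.
invalid_mass = sum(p for pair, p in independent.items() if pair not in joint)
```

Here half of the mass falls on mixed pairs such as ("high", "house"). When each marginal is sharply peaked instead, the product of marginals puts almost all of its mass on a single pair, which is the intuition behind restricting parallel decoding to high-confidence tokens.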
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Fast-dLLM, a training-free acceleration technique for diffusion-based LLMs. It proposes a block-wise approximate KV cache tailored to bidirectional diffusion attention to enable cache reuse, and a confidence-aware parallel decoding strategy that selectively decodes high-confidence tokens to avoid dependency violations under the conditional independence assumption. Experiments on LLaDA and Dream models across standard LLM benchmarks report up to 27.6× throughput improvement with minimal accuracy loss, narrowing the gap to autoregressive models.
Significance. If the central claims hold, the work would meaningfully advance practical deployment of diffusion LLMs by delivering substantial inference speedups without retraining, leveraging their inherent parallel decoding capability. The training-free design and reported empirical gains on multiple models and benchmarks constitute a concrete engineering contribution, though the absence of supporting analysis for the key approximations limits the strength of the significance assessment.
major comments (3)
- [Abstract and §3, block-wise KV cache] The claim that the block-wise approximation enables cache reuse 'with negligible performance drop' is load-bearing for the throughput results, yet the manuscript provides no error-bound analysis, no dependency-handling rule for future tokens in bidirectional attention, and no quantitative characterization of the approximation error.
- [§4, confidence-aware parallel decoding] The strategy relies on a single fixed confidence threshold, but the exact selection rule is unspecified and no sensitivity analysis or cross-benchmark validation without per-task retuning is presented, leaving the 'minimal accuracy loss' claim vulnerable to benchmark-specific tuning.
- [Experiments] The reported 27.6× throughput figures rest on the two unverified conditions above; without ablations on block size and threshold sensitivity, or error metrics for the KV approximation, it is unclear whether the gains generalize or are tied to particular benchmark choices.
minor comments (2)
- [Notation and §4] Clarify notation for block size, confidence threshold, and the precise condition under which a token is decoded in parallel.
- [Figures] Add error bars or multiple-run statistics to throughput and accuracy plots to support the 'minimal loss' claim.
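The multiple-run statistics requested in the last comment could be reported with a few lines like the following. This is a sketch using only the Python standard library; `summarize_runs` is a hypothetical helper, and the numbers are placeholders chosen to make the arithmetic easy to check, not measurements from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(values: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(values), stdev(values)


# Placeholder values only; replace with per-run throughput or accuracy.
placeholder_runs = [10.0, 12.0, 14.0]
run_mean, run_std = summarize_runs(placeholder_runs)
```

The standard deviation then gives the half-height of an error bar around each plotted mean.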
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript's claims regarding the KV cache approximation and parallel decoding strategy.
Point-by-point responses
- Referee: [Abstract and §3, block-wise KV cache] The claim that the block-wise approximation enables cache reuse 'with negligible performance drop' is load-bearing for the throughput results, yet the manuscript provides no error-bound analysis, no dependency-handling rule for future tokens in bidirectional attention, and no quantitative characterization of the approximation error.
Authors: We agree that formal error-bound analysis and explicit dependency rules would strengthen the presentation. The block-wise approximation reuses cached keys and values for tokens within the same diffusion block while approximating cross-block interactions under the bidirectional attention pattern; future tokens are handled by a mask that prevents premature dependency violations during the denoising steps. Although we lack a closed-form error bound, the empirical results on LLaDA and Dream show accuracy drops below 1% on average across benchmarks. In revision we will add a dedicated subsection with quantitative error metrics (e.g., average attention-score deviation) and a clear statement of the dependency-handling rule. revision: partial
- Referee: [§4, confidence-aware parallel decoding] The strategy relies on a single fixed confidence threshold, but the exact selection rule is unspecified and no sensitivity analysis or cross-benchmark validation without per-task retuning is presented, leaving the 'minimal accuracy loss' claim vulnerable to benchmark-specific tuning.
Authors: The threshold is fixed at a single value chosen on a validation split and applied uniformly; we will state this selection rule explicitly in the revised §4. We will also add a sensitivity study across thresholds on all reported benchmarks, confirming that accuracy remains stable without per-task retuning and thereby supporting the claim of minimal accuracy loss. revision: yes
- Referee: [Experiments] The reported 27.6× throughput figures rest on the two unverified conditions above; without ablations on block size and threshold sensitivity, or error metrics for the KV approximation, it is unclear whether the gains generalize or are tied to particular benchmark choices.
Authors: We acknowledge that additional ablations would better demonstrate generalization. The current results already span two distinct diffusion LLMs and multiple standard benchmarks, but we will expand the experimental section with block-size ablations, threshold-sensitivity curves, and explicit KV-approximation error metrics to clarify that the reported speedups are not benchmark-specific. revision: yes
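The cache reuse discussed in the first response can be illustrated by counting work. The sketch below rests on a simplified assumption that is not confirmed by this page: within a block, the first denoising step recomputes keys and values for the whole sequence and refreshes the cache, and every later step recomputes only the tokens of the active block. The function `kv_computations` and all sizes are hypothetical, and the count covers per-token K/V computations only, so it says nothing about wall-clock speedup or approximation error.

```python
def kv_computations(
    prompt_len: int,
    gen_len: int,
    block_size: int,
    steps_per_block: int,
    use_cache: bool,
) -> int:
    """Count per-token key/value computations over a full generation."""
    if gen_len % block_size != 0:
        raise ValueError("gen_len must be a multiple of block_size")

    num_blocks = gen_len // block_size
    total_len = prompt_len + gen_len
    total = 0
    for _ in range(num_blocks):
        for step in range(steps_per_block):
            if use_cache and step > 0:
                # Reuse cached K/V outside the active block; recompute the block only.
                total += block_size
            else:
                # Full recomputation (always without a cache, and at block start with one).
                total += total_len
    return total


baseline = kv_computations(8, 8, 4, 4, use_cache=False)
cached = kv_computations(8, 8, 4, 4, use_cache=True)
```

With these toy sizes the baseline performs 128 per-token computations and the cached variant 56. The saving grows with sequence length and steps per block, which is also why the cached values become staler and why the referee asks for an error characterization.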
Circularity Check
No significant circularity in the paper's engineering methods
Full rationale
The paper presents a training-free acceleration approach via block-wise approximate KV cache and confidence-aware parallel decoding for diffusion LLMs. These are described as practical mechanisms whose effectiveness is demonstrated empirically on LLaDA and Dream models across benchmarks, with reported throughput gains and minimal accuracy loss. No mathematical derivation chain exists that reduces a claimed prediction or result to a fitted parameter or self-defined quantity by construction. The approximations and threshold choice are validated through experiments rather than justified via self-citation load-bearing arguments or ansatz smuggling. The central claims rest on external benchmark comparisons, making the work self-contained against those benchmarks without internal circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence threshold
axioms (1)
- domain assumption: Bidirectional attention in diffusion models permits block-wise KV cache reuse with only small error
Forward citations
Cited by 44 Pith papers
- NPU Design for Diffusion Language Model Inference
  Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.
- On the Reasoning Abilities of Masked Diffusion Language Models
  Masked diffusion models are equivalent to polynomially-padded PLTs, solve all CoT-augmented transformer problems, and are more efficient than CoT for regular languages due to parallelism.
- Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
  Learned Relay Representations enable masked diffusion models to propagate useful latent information across denoising steps, scaling to Fast-dLLM v2 to outperform supervised finetuning on coding tasks while cutting inf...
- Drifting Objectives for Refining Discrete Diffusion Language Models
  TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.
- Dynamic Chunking for Diffusion Language Models
  DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
- Support Before Frequency in Discrete Diffusion
  Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...
- TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
  TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
- LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
  LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
- Fast Byte Latent Transformer
  BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
  FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
- DARE: Diffusion Language Model Activation Reuse for Efficient Inference
  DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
- Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
  Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
- NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
  NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
- ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
  ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
- Flow Map Language Models: One-step Language Modeling via Continuous Denoising
  Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.
- PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
  PartDiffuser is a semi-autoregressive discrete diffusion framework that generates high-fidelity 3D meshes from point clouds by combining inter-part autoregression with intra-part parallel diffusion using a part-aware ...
- Efficient Autoregressive Inference for Transformer Probabilistic Models
  A causal autoregressive buffer enables efficient batched autoregressive sampling and joint density evaluation in set-based transformer models by caching context and attending to prior predictions.
- CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit
  CreditDecoding accelerates parallel decoding in diffusion LLMs by fusing accumulated Trace Credit with current logits to accept early-correct tokens sooner, yielding up to 5.48x speedup and accuracy gains.
- PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
  PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
- FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration
  FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.
- OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
  OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
- Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs
  Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
- Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers
  Diffusion LLMs can act as their own efficiency teachers by using revokable parallel decoding to identify reliable token orders and then distilling those orders into the model parameters for faster inference.
- Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
  TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
- How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
  Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- Consistent Diffusion Language Models
  CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
  Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
  DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.
- FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation
  FlowLM converts diffusion LMs to flow matching via fine-tuning, achieving few-step generation that rivals or beats 2000-step diffusion and saturates faster than training flow models from scratch.
- Flow Map Language Models: One-step Language Modeling via Continuous Denoising
  Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
- Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
  Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
  Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.
- FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
  FS-DFM enables 1024-token generation at perplexity parity with 1024-step baselines using only 8 steps via explicit step-budget training, reliable updates, and teacher guidance.
- Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
  Fast-dDrive is a block-diffusion VLA that reports SOTA ADE on WOD-E2E, 0.32 m L2 on nuScenes, and 12x throughput over AR baselines via section scaffolds and test-time rollout averaging.
- ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
  ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...