Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Pith reviewed 2026-05-16 22:26 UTC · model grok-4.3
The pith
AR-to-dLM conversion with block-wise attention preserves accuracy at higher generation speeds
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that successful conversion from autoregressive to diffusion language models hinges on preserving the pretrained weight distributions. This is accomplished through continuous pretraining with a block-wise attention pattern, causal across blocks to support KV caching yet bidirectional inside blocks, and a position-dependent token masking strategy that assigns higher masking probabilities to later tokens. These choices close the training-test gap and yield diffusion models that match or surpass the accuracy of the source autoregressive models while enabling faster parallel sampling.
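To make the masking idea concrete, here is a minimal PyTorch sketch of a position-dependent masking schedule, assuming a simple linear ramp over positions; the paper's exact schedule is not reproduced here and is listed as a free parameter in the ledger below, so the function name and its defaults are illustrative only.

```python
import torch

def position_dependent_mask(seq_len: int, t: float,
                            p_min: float = 0.1, p_max: float = 0.9) -> torch.Tensor:
    """Sample a Boolean mask whose per-token masking probability rises with position.

    t in [0, 1] is the diffusion noise level. Later tokens receive higher masking
    probability, mimicking the left-to-right reveal order seen at test time.
    The linear ramp is illustrative, not the paper's exact schedule.
    """
    positions = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
    ramp = p_min + (p_max - p_min) * positions   # larger for later positions
    probs = (t * ramp).clamp(0.0, 1.0)           # scale by the noise level
    return torch.bernoulli(probs).bool()         # True = token is masked

# Example: a 16-token sequence at noise level t = 0.8
mask = position_dependent_mask(16, t=0.8)
```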
What carries the argument
The block-wise attention pattern that remains causal across blocks while permitting bidirectional modeling within each block, together with position-dependent token masking during training.
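A minimal sketch of that attention pattern, assuming a Boolean allow/deny mask and treating the block size as the free parameter noted in the ledger below; the function is illustrative rather than the authors' implementation.

```python
import torch

def block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True where attention is allowed.

    Query token i may attend to key token j iff j's block index does not exceed
    i's block index: fully bidirectional inside a block, causal across blocks.
    """
    block_ids = torch.arange(seq_len) // block_size           # block index per token
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)   # [query, key] allowed?

# Example: 8 tokens with block_size=4 gives two bidirectional 4x4 blocks,
# where the second block also attends to the first (but not vice versa),
# so keys/values of finished blocks can be cached as in AR decoding.
mask = block_wise_attention_mask(8, 4)
```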
If this is right
- Block-wise attention enables KV caching in diffusion models, improving inference efficiency beyond standard bidirectional setups.
- Position-dependent masking aligns training distributions with the left-to-right bias at test time, reducing performance gaps.
- Efficient-DLM variants achieve higher task accuracy than prior diffusion language models such as Dream 7B (at a similar parameter count) and autoregressive models such as Qwen3 4B.
- Systematic comparisons of attention patterns reveal that full bidirectionality disrupts pretrained weights more than the block-wise approach.
- The framework provides scalable methods for AR-to-dLM conversion applicable to larger models.
Where Pith is reading between the lines
- If the weight preservation principle holds, similar block-wise designs could accelerate adoption of diffusion models in other sequence tasks like code generation.
- Combining this with existing AR models could lead to hybrid systems that switch between sequential and parallel generation modes.
- Further work might test whether the same principles apply when converting from diffusion back to autoregressive or to other paradigms.
- The efficiency gains suggest potential for deploying these models in resource-constrained environments where speed is critical.
Load-bearing premise
Preserving the weight distributions of the pretrained autoregressive model through block-wise attention is both necessary and sufficient to prevent major accuracy degradation in the converted diffusion model.
What would settle it
A direct comparison experiment where an AR model is converted to dLM using fully bidirectional attention from the start, without block-wise structure, and measuring if accuracy and throughput match or exceed the block-wise Efficient-DLM results.
Original abstract
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pretrained autoregressive language models can be efficiently converted to diffusion language models (dLMs) via continuous pretraining with a block-wise attention pattern (causal across blocks, bidirectional within) that preserves AR weight distributions better than fully bidirectional attention, combined with position-dependent token masking to align train/test mask distributions. This yields the Efficient-DLM family, with an 8B model reported to achieve +5.4% accuracy and 4.5x throughput over Dream 7B, and +2.7% accuracy with 2.7x throughput over Qwen3 4B, while providing studies on attention patterns and training dynamics.
Significance. If the results hold under rigorous verification, the work would be significant for bridging AR and dLM paradigms by enabling reuse of pretrained weights for parallel generation, offering practical speedups without full retraining. The focus on distribution preservation and masking alignment provides potentially reusable design principles for scalable non-autoregressive models.
major comments (2)
- [Experiments on attention patterns (systematic comparison section)] The central claim that block-wise attention preserves pretrained AR weight distributions (and that this preservation is necessary/sufficient for effective conversion) lacks isolated causal evidence. The experiments compare full training runs under different attention patterns but do not include an ablation that holds the masking schedule and other factors fixed while directly measuring weight-distribution shift (e.g., KL divergence on attention weights or layer norms) to rule out confounds such as KV-cache effects or training dynamics.
- [Performance evaluation and results] The performance claims (e.g., +5.4% accuracy and 4.5x throughput for Efficient-DLM 8B vs. Dream 7B) are presented without error bars, number of runs, statistical significance tests, or ablations controlling for model size, training compute, and baseline implementation details, undermining assessment of the win-win accuracy-efficiency result.
minor comments (1)
- [Method description] The abstract and method description would benefit from explicit notation for the position-dependent masking probability schedule (e.g., as a function of token position) to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below, agreeing where additional evidence or clarification is warranted and outlining specific revisions to strengthen the paper.
Point-by-point responses
- Referee: [Experiments on attention patterns (systematic comparison section)] The central claim that block-wise attention preserves pretrained AR weight distributions (and that this preservation is necessary/sufficient for effective conversion) lacks isolated causal evidence. The experiments compare full training runs under different attention patterns but do not include an ablation that holds the masking schedule and other factors fixed while directly measuring weight-distribution shift (e.g., KL divergence on attention weights or layer norms) to rule out confounds such as KV-cache effects or training dynamics.
Authors: We agree that a more isolated measurement of weight-distribution shift would provide stronger causal support for the role of block-wise attention. Our existing systematic comparison shows that block-wise attention yields both higher final accuracy and closer final weight similarity to the pretrained AR model than fully bidirectional attention (see Section 4.1 and Figure 3). To directly address the concern, we will add a controlled ablation in the revised manuscript: we train all variants for an identical number of steps with the same position-dependent masking schedule, then report KL divergence on attention weights (averaged across heads and layers) and L2 distance on layer-norm parameters relative to the pretrained checkpoint. This isolates the attention pattern from KV-cache and full-run dynamics. revision: yes
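For concreteness, a minimal sketch of how such shift metrics might be computed, assuming access to attention probabilities and state dicts from both checkpoints; the function names are hypothetical and not the authors' measurement code.

```python
import torch

def attention_kl(p_attn: torch.Tensor, q_attn: torch.Tensor) -> torch.Tensor:
    """KL(p || q) between attention maps of shape [heads, seq, seq], averaged
    over heads and query positions; p_attn comes from the pretrained AR model,
    q_attn from the converted dLM, both row-normalized probabilities."""
    eps = 1e-8
    kl = (p_attn * (torch.log(p_attn + eps) - torch.log(q_attn + eps))).sum(dim=-1)
    return kl.mean()

def layernorm_l2(pretrained_sd: dict, converted_sd: dict) -> torch.Tensor:
    """L2 distance between layer-norm parameters of two model state dicts."""
    total = torch.zeros(())
    for name, param in pretrained_sd.items():
        if "norm" in name.lower() and name in converted_sd:
            total = total + ((param.float() - converted_sd[name].float()) ** 2).sum()
    return torch.sqrt(total)
```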
- Referee: [Performance evaluation and results] The performance claims (e.g., +5.4% accuracy and 4.5x throughput for Efficient-DLM 8B vs. Dream 7B) are presented without error bars, number of runs, statistical significance tests, or ablations controlling for model size, training compute, and baseline implementation details, undermining assessment of the win-win accuracy-efficiency result.
Authors: We acknowledge that the main results are reported from single training runs, which is standard practice for large-scale experiments given compute limits. Throughput numbers are measured under identical hardware conditions and exhibit negligible run-to-run variance. For accuracy, we will add error bars by rerunning the final Efficient-DLM 8B and the key baselines with two additional seeds and reporting mean ± std; we will also include a brief statistical significance note using paired t-tests across the benchmark suite. On controls: model-size and compute differences are inherent to the public baselines (Dream 7B, Qwen3 4B), but we will expand the appendix with explicit details on training token count, optimizer settings, and hardware to allow direct comparison. We cannot retroactively match the total pretraining compute of the original AR models, but the continuous-pretraining setup keeps the AR-to-dLM conversion cost fixed across our ablations. revision: partial
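As an illustration of the proposed significance check, a paired t-test over per-benchmark scores could look like the sketch below; the numbers are placeholders, not reported results.

```python
from scipy import stats

# Placeholder per-benchmark accuracies (same benchmark order for both models);
# these are NOT the paper's reported numbers, only an illustration of the test.
efficient_dlm_scores = [0.712, 0.655, 0.803, 0.588, 0.741]
baseline_scores      = [0.698, 0.640, 0.795, 0.570, 0.735]

# Paired t-test: are the per-benchmark differences significantly nonzero?
t_stat, p_value = stats.ttest_rel(efficient_dlm_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```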
Circularity Check
No circularity: empirical comparisons and design choices are independently validated
full rationale
The paper's core methodology (a systematic comparison of attention patterns leading to block-wise attention that preserves AR weight distributions, plus position-dependent masking) rests on reported experimental outcomes from continuous pretraining runs and evaluations against external baselines (e.g., Dream 7B and Qwen3 4B). No equations, derivations, or fitted parameters are defined in terms of the target predictions. No self-citations serve as load-bearing justifications for uniqueness or necessity claims. The evidential chain is validated against external benchmarks and does not reduce any result to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- block size in attention pattern
- position-dependent masking probability schedule
axioms (1)
- domain assumption: Maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion
Forward citations
Cited by 4 Pith papers
- BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
  BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
- BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
  BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
- MARS: Enabling Autoregressive Models Multi-Token Generation
  MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...