Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Pith reviewed 2026-05-16 22:26 UTC · model grok-4.3
The pith
AR-to-dLM conversion with block-wise attention preserves accuracy at higher generation speeds
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that successful conversion from autoregressive to diffusion language models hinges on preserving the pretrained weight distributions. This is accomplished through continuous pretraining with a block-wise attention pattern, causal across blocks to support KV caching yet bidirectional inside blocks, and a position-dependent token masking strategy that assigns higher masking probabilities to later tokens. These choices close the training-test gap and yield diffusion models that match or surpass the accuracy of the source autoregressive models while enabling faster parallel sampling.
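To make the masking idea concrete, here is a minimal PyTorch sketch of a position-dependent masking schedule, assuming a simple linear ramp over positions; the paper's exact schedule is not reproduced here and is listed as a free parameter in the ledger below, so the function name and its defaults are illustrative only.

```python
import torch

def position_dependent_mask(seq_len: int, t: float,
                            p_min: float = 0.1, p_max: float = 0.9) -> torch.Tensor:
    """Sample a Boolean mask whose per-token masking probability rises with position.

    t in [0, 1] is the diffusion noise level. Later tokens receive higher masking
    probability, mimicking the left-to-right reveal order seen at test time.
    The linear ramp is illustrative, not the paper's exact schedule.
    """
    positions = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
    ramp = p_min + (p_max - p_min) * positions   # larger for later positions
    probs = (t * ramp).clamp(0.0, 1.0)           # scale by the noise level
    return torch.bernoulli(probs).bool()         # True = token is masked

# Example: a 16-token sequence at noise level t = 0.8
mask = position_dependent_mask(16, t=0.8)
```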
What carries the argument
The block-wise attention pattern that remains causal across blocks while permitting bidirectional modeling within each block, together with position-dependent token masking during training.
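A minimal sketch of that attention pattern, assuming a Boolean allow/deny mask and treating the block size as the free parameter noted in the ledger below; the function is illustrative rather than the authors' implementation.

```python
import torch

def block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True where attention is allowed.

    Query token i may attend to key token j iff j's block index does not exceed
    i's block index: fully bidirectional inside a block, causal across blocks.
    """
    block_ids = torch.arange(seq_len) // block_size           # block index per token
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)   # [query, key] allowed?

# Example: 8 tokens with block_size=4 gives two bidirectional 4x4 blocks,
# where the second block also attends to the first (but not vice versa),
# so keys/values of finished blocks can be cached as in AR decoding.
mask = block_wise_attention_mask(8, 4)
```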
If this is right
- Block-wise attention enables KV caching in diffusion models, improving inference efficiency beyond standard bidirectional setups.
- Position-dependent masking aligns training distributions with the left-to-right bias at test time, reducing performance gaps.
- Efficient-DLM variants achieve higher task accuracy than prior diffusion language models such as Dream 7B (at a similar parameter count) and autoregressive models such as Qwen3 4B.
- Systematic comparisons of attention patterns reveal that full bidirectionality disrupts pretrained weights more than the block-wise approach.
- The framework provides scalable methods for AR-to-dLM conversion applicable to larger models.
Where Pith is reading between the lines
- If the weight preservation principle holds, similar block-wise designs could accelerate adoption of diffusion models in other sequence tasks like code generation.
- Combining this with existing AR models could lead to hybrid systems that switch between sequential and parallel generation modes.
- Further work might test whether the same principles apply when converting from diffusion back to autoregressive or to other paradigms.
- The efficiency gains suggest potential for deploying these models in resource-constrained environments where speed is critical.
Load-bearing premise
Preserving the weight distributions of the pretrained autoregressive model through block-wise attention is both necessary and sufficient to prevent major accuracy degradation in the converted diffusion model.
What would settle it
A direct comparison experiment where an AR model is converted to dLM using fully bidirectional attention from the start, without block-wise structure, and measuring if accuracy and throughput match or exceed the block-wise Efficient-DLM results.
Original abstract
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pretrained autoregressive language models can be efficiently converted to diffusion language models (dLMs) via continuous pretraining with a block-wise attention pattern (causal across blocks, bidirectional within) that preserves AR weight distributions better than fully bidirectional attention, combined with position-dependent token masking to align train/test mask distributions. This yields the Efficient-DLM family, with an 8B model reported to achieve +5.4% accuracy and 4.5x throughput over Dream 7B, and +2.7% accuracy with 2.7x throughput over Qwen3 4B, while providing studies on attention patterns and training dynamics.
Significance. If the results hold under rigorous verification, the work would be significant for bridging AR and dLM paradigms by enabling reuse of pretrained weights for parallel generation, offering practical speedups without full retraining. The focus on distribution preservation and masking alignment provides potentially reusable design principles for scalable non-autoregressive models.
major comments (2)
- [Experiments on attention patterns (systematic comparison section)] The central claim that block-wise attention preserves pretrained AR weight distributions (and that this preservation is necessary/sufficient for effective conversion) lacks isolated causal evidence. The experiments compare full training runs under different attention patterns but do not include an ablation that holds the masking schedule and other factors fixed while directly measuring weight-distribution shift (e.g., KL divergence on attention weights or layer norms) to rule out confounds such as KV-cache effects or training dynamics.
- [Performance evaluation and results] The performance claims (e.g., +5.4% accuracy and 4.5x throughput for Efficient-DLM 8B vs. Dream 7B) are presented without error bars, number of runs, statistical significance tests, or ablations controlling for model size, training compute, and baseline implementation details, undermining assessment of the win-win accuracy-efficiency result.
minor comments (1)
- [Method description] The abstract and method description would benefit from explicit notation for the position-dependent masking probability schedule (e.g., as a function of token position) to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below, agreeing where additional evidence or clarification is warranted and outlining specific revisions to strengthen the paper.
Point-by-point responses
- Referee: [Experiments on attention patterns (systematic comparison section)] The central claim that block-wise attention preserves pretrained AR weight distributions (and that this preservation is necessary/sufficient for effective conversion) lacks isolated causal evidence. The experiments compare full training runs under different attention patterns but do not include an ablation that holds the masking schedule and other factors fixed while directly measuring weight-distribution shift (e.g., KL divergence on attention weights or layer norms) to rule out confounds such as KV-cache effects or training dynamics.
Authors: We agree that a more isolated measurement of weight-distribution shift would provide stronger causal support for the role of block-wise attention. Our existing systematic comparison shows that block-wise attention yields both higher final accuracy and closer final weight similarity to the pretrained AR model than fully bidirectional attention (see Section 4.1 and Figure 3). To directly address the concern, we will add a controlled ablation in the revised manuscript: we train all variants for an identical number of steps with the same position-dependent masking schedule, then report KL divergence on attention weights (averaged across heads and layers) and L2 distance on layer-norm parameters relative to the pretrained checkpoint. This isolates the attention pattern from KV-cache and full-run dynamics. revision: yes
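For concreteness, a minimal sketch of how such shift metrics might be computed, assuming access to attention probabilities and state dicts from both checkpoints; the function names are hypothetical and not the authors' measurement code.

```python
import torch

def attention_kl(p_attn: torch.Tensor, q_attn: torch.Tensor) -> torch.Tensor:
    """KL(p || q) between attention maps of shape [heads, seq, seq], averaged
    over heads and query positions; p_attn comes from the pretrained AR model,
    q_attn from the converted dLM, both row-normalized probabilities."""
    eps = 1e-8
    kl = (p_attn * (torch.log(p_attn + eps) - torch.log(q_attn + eps))).sum(dim=-1)
    return kl.mean()

def layernorm_l2(pretrained_sd: dict, converted_sd: dict) -> torch.Tensor:
    """L2 distance between layer-norm parameters of two model state dicts."""
    total = torch.zeros(())
    for name, param in pretrained_sd.items():
        if "norm" in name.lower() and name in converted_sd:
            total = total + ((param.float() - converted_sd[name].float()) ** 2).sum()
    return torch.sqrt(total)
```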
- Referee: [Performance evaluation and results] The performance claims (e.g., +5.4% accuracy and 4.5x throughput for Efficient-DLM 8B vs. Dream 7B) are presented without error bars, number of runs, statistical significance tests, or ablations controlling for model size, training compute, and baseline implementation details, undermining assessment of the win-win accuracy-efficiency result.
Authors: We acknowledge that the main results are reported from single training runs, which is standard practice for large-scale experiments given compute limits. Throughput numbers are measured under identical hardware conditions and exhibit negligible run-to-run variance. For accuracy, we will add error bars by rerunning the final Efficient-DLM 8B and the key baselines with two additional seeds and reporting mean ± std; we will also include a brief statistical significance note using paired t-tests across the benchmark suite. On controls: model-size and compute differences are inherent to the public baselines (Dream 7B, Qwen3 4B), but we will expand the appendix with explicit details on training token count, optimizer settings, and hardware to allow direct comparison. We cannot retroactively match the total pretraining compute of the original AR models, but the continuous-pretraining setup keeps the AR-to-dLM conversion cost fixed across our ablations. revision: partial
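As an illustration of the proposed significance check, a paired t-test over per-benchmark scores could look like the sketch below; the numbers are placeholders, not reported results.

```python
from scipy import stats

# Placeholder per-benchmark accuracies (same benchmark order for both models);
# these are NOT the paper's reported numbers, only an illustration of the test.
efficient_dlm_scores = [0.712, 0.655, 0.803, 0.588, 0.741]
baseline_scores      = [0.698, 0.640, 0.795, 0.570, 0.735]

# Paired t-test: are the per-benchmark differences significantly nonzero?
t_stat, p_value = stats.ttest_rel(efficient_dlm_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```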
Circularity Check
No circularity: empirical comparisons and design choices are independently validated
full rationale
The paper's core methodology (a systematic comparison of attention patterns leading to block-wise attention that preserves AR weight distributions, plus position-dependent masking) rests on reported experimental outcomes from continuous pretraining runs and evaluations against external baselines (e.g., Dream 7B and Qwen3 4B). No equations, derivations, or fitted parameters are defined in terms of the target predictions. No self-citations serve as load-bearing justifications for uniqueness or necessity claims. The evidential chain is validated against external benchmarks and does not reduce any result to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- block size in attention pattern
- position-dependent masking probability schedule
axioms (1)
- domain assumption: Maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion
Forward citations
Cited by 4 Pith papers
- BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning
  BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
- BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
  BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
- MARS: Enabling Autoregressive Models Multi-Token Generation
  MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...