Improved Large Language Diffusion Models
Pith reviewed 2026-06-25 21:32 UTC · model grok-4.3
The pith
An 8B masked diffusion language model trained from scratch with fully bidirectional attention remains competitive with a comparable autoregressive model (Qwen2.5-7B) on several language benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
iLLaDA keeps the masked diffusion objective through both pre-training and supervised fine-tuning, uses fully bidirectional attention throughout, scales pre-training to 12T tokens, and applies variable-length generation plus confidence-based scoring. Under these choices the base model improves by 21.6 points on BBH and 14.9 points on ARC-Challenge relative to LLaDA while remaining competitive with Qwen2.5-7B on several benchmarks. The paper takes this as evidence that fully bidirectional diffusion training from scratch is a competitive path toward strong language models.
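To make "fully bidirectional attention" concrete, the sketch below contrasts the attention pattern of a causal model with the unrestricted pattern the core claim describes. It is an illustration only: the function names are hypothetical and nothing here is taken from the iLLaDA code.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position, left or right."""
    return [[True] * n for _ in range(n)]


def visible_context(mask, i):
    """Indices that position i can read from under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    print(visible_context(causal_mask(4), 1))
    print(visible_context(bidirectional_mask(4), 1))
```

Under the causal mask a position sees only its prefix, which is what makes left-to-right factorization possible. Under the bidirectional mask there is no such ordering, so training needs a different objective, here masked diffusion.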
What carries the argument
Masked diffusion objective with fully bidirectional attention, which replaces causal factorization and allows non-autoregressive training and generation while preserving the diffusion loss throughout pre-training and fine-tuning.
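As a rough illustration of the objective, the sketch below computes a one-sample estimate of a masked diffusion loss in the form used by the original LLaDA paper: draw a masking ratio t uniformly, mask each token independently with probability t, and take the cross-entropy on masked positions weighted by 1/t. Whether iLLaDA modifies this formulation is not stated in the material above, so treat it as an assumption. `MASK`, `toy_model`, and the division by sequence length are hypothetical choices made for the example.

```python
import math
import random

MASK = -1  # hypothetical mask token id


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_model(noisy, vocab_size):
    """Stand-in predictor returning a uniform distribution at every position.

    A real model would be a Transformer without a causal mask, reading both
    left and right context of `noisy` before predicting each masked token.
    """
    return [[1.0 / vocab_size] * vocab_size for _ in noisy]


def masked_diffusion_loss(tokens, vocab_size, rng, model=toy_model, eps=1e-3):
    """One-sample estimate: (1/t) * sum over masked i of -log p(x0_i | x_t), per token."""
    t = rng.uniform(eps, 1.0)  # masking ratio; eps keeps 1/t bounded
    noisy = mask_tokens(tokens, t, rng)
    probs = model(noisy, vocab_size)
    nll = sum(-math.log(probs[i][tok])
              for i, tok in enumerate(tokens) if noisy[i] == MASK)
    return nll / (t * len(tokens))


if __name__ == "__main__":
    print(masked_diffusion_loss([0, 1, 2, 3], vocab_size=4, rng=random.Random(0)))
```

Only masked positions contribute to the loss, and the same function can be applied in pre-training and in supervised fine-tuning, which is the continuity the section describes.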
If this is right
- Maintaining the diffusion objective through supervised fine-tuning preserves the non-autoregressive training regime at scale.
- Variable-length generation reduces inference cost without reverting to autoregressive sampling.
- Confidence-based scoring provides a consistent way to evaluate multiple-choice questions under diffusion sampling.
- Broad gains on mathematical and code benchmarks follow from scaling the bidirectional diffusion recipe to 12T tokens.
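The confidence-based scoring mentioned in the list above is not specified in the abstract. One plausible reading, sketched here purely as an assumption, is to mask every token of a candidate answer, run the model once, and rank candidates by the mean log-probability the model assigns to their tokens. `token_log_probs` and `toy_log_probs` are hypothetical stand-ins for a model call, not part of any released API.

```python
import math


def score_option(prompt, option, token_log_probs):
    """Mean log-probability of the option's tokens when all of them are masked.

    `token_log_probs(prompt, n_masked)` is a hypothetical model call returning,
    for each of the n_masked answer positions, a dict token -> log-probability.
    `option` must be a non-empty list of tokens.
    """
    per_position = token_log_probs(prompt, len(option))
    total = sum(per_position[i].get(tok, -math.inf) for i, tok in enumerate(option))
    return total / len(option)  # length-normalise so longer options are not penalised


def pick_answer(prompt, options, token_log_probs):
    """Return the index of the highest-scoring option."""
    scores = [score_option(prompt, opt, token_log_probs) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def toy_log_probs(prompt, n_masked):
    """Toy stand-in model that prefers the token 'paris' at every position."""
    dist = {"paris": math.log(0.7), "rome": math.log(0.2), "oslo": math.log(0.1)}
    return [dist for _ in range(n_masked)]


if __name__ == "__main__":
    question = ["capital", "of", "france"]
    print(pick_answer(question, [["rome"], ["paris"], ["oslo"]], toy_log_probs))
```

A single forward pass per candidate keeps evaluation deterministic, which is one reason a scoring rule of this kind would be attractive under diffusion sampling. The paper's actual rule may differ.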
Where Pith is reading between the lines
- The same bidirectional diffusion setup could be tested on sequence tasks outside language, such as protein or music modeling.
- If diffusion training tolerates longer contexts more gracefully than causal attention, it might relax current context-length limits.
- Parallel sampling during generation could become a practical advantage once the objective is shown to match autoregressive quality.
Load-bearing premise
The reported gains come from the masked diffusion objective and bidirectional attention rather than from unmeasured differences in total compute, data quality, or post-training steps.
What would settle it
A side-by-side training run in which an autoregressive model receives the identical token count, data mixture, and fine-tuning protocol as iLLaDA yet still outperforms it on the same suite of benchmarks.
Original abstract
Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5-7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and code: https://github.com/ML-GSAI/LLaDA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention and the masked diffusion objective retained through pre-training (12T tokens) and SFT (25B-token corpus for 12 epochs). It introduces variable-length generation and confidence-based scoring for multiple-choice tasks, reports large gains over LLaDA (e.g., +21.6 BBH, +14.9 ARC-Challenge for the base model; +14.5 MATH, +16.5 HumanEval for the instruct model), and states competitiveness with Qwen2.5-7B, concluding that fully bidirectional diffusion training from scratch is a competitive path to strong language models. Model weights and code are released.
Significance. If the reported gains can be isolated to the masked diffusion objective and bidirectional attention, the work would establish a viable non-autoregressive scaling route that challenges the dominance of causal autoregressive training and broadens architectural options for language models. The public release of weights and code supports reproducibility and further investigation.
major comments (2)
- [Abstract] The central attribution of gains (e.g., +21.6 BBH, +14.9 ARC-Challenge) to the masked diffusion objective plus fully bidirectional attention is not secured, because no ablation holds total compute, data mixture, optimizer schedule, and post-training fixed while swapping only the objective and attention mask; the 12T pre-training tokens, 25B-token SFT corpus, and 12-epoch fine-tuning therefore remain plausible alternative drivers.
- [Abstract] Benchmark improvements are presented without error bars, multiple random seeds, or statistical tests, so the reliability of claims such as +14.5 MATH and +16.5 HumanEval cannot be assessed from the reported point estimates alone.
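Part of the second major comment can be addressed without retraining. The sketch below computes a paired percentile bootstrap confidence interval for the accuracy difference between two models scored on the same benchmark items. It is a generic statistical recipe, not something the paper reports, and the function name and defaults are hypothetical. It captures sampling noise over test items only, so it says nothing about variance across training seeds.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A) on the same items.

    correct_a, correct_b: lists of 0/1 outcomes per benchmark item, aligned by item.
    """
    assert len(correct_a) == len(correct_b) and correct_a
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each item's pair of outcomes together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
    improved = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
    print(paired_bootstrap_ci(baseline, improved))
```

If the interval for a reported gain excludes zero, the gain is unlikely to be an artifact of which test items happened to be in the benchmark. Seed-to-seed variation would still require repeated training runs.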
minor comments (1)
- [Abstract] The description of variable-length generation and confidence-based scoring is too brief to allow replication or assessment of their contribution to the reported efficiency and accuracy numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We respond to each major comment below.
Point-by-point responses
- Referee: [Abstract] The central attribution of gains (e.g., +21.6 BBH, +14.9 ARC-Challenge) to the masked diffusion objective plus fully bidirectional attention is not secured, because no ablation holds total compute, data mixture, optimizer schedule, and post-training fixed while swapping only the objective and attention mask; the 12T pre-training tokens, 25B-token SFT corpus, and 12-epoch fine-tuning therefore remain plausible alternative drivers.
  Authors: The referee correctly identifies that we lack a controlled ablation isolating the contribution of the masked diffusion objective and bidirectional attention. Such an experiment would require training additional models with identical compute, data, and schedules but different objectives, which exceeds our available resources. Our results demonstrate that scaling masked diffusion training to 12T tokens yields strong performance compared to the LLaDA baseline. We have revised the abstract to avoid over-attributing the gains exclusively to the objective and attention, instead emphasizing the overall training approach. A discussion of alternative factors has been added to the limitations section.
  Revision: partial
- Referee: [Abstract] Benchmark improvements are presented without error bars, multiple random seeds, or statistical tests, so the reliability of claims such as +14.5 MATH and +16.5 HumanEval cannot be assessed from the reported point estimates alone.
  Authors: We agree that multiple seeds and statistical tests would provide a more robust assessment of the improvements. Due to the high computational cost of training 8B models on trillions of tokens, we are unable to conduct multiple independent runs. We have added a statement in the experimental setup section noting that all results are from single training runs, consistent with practices in similar large-scale model papers.
  Revision: partial
Circularity Check
No circularity: purely empirical training and evaluation results
The manuscript reports training runs of an 8B masked diffusion model (iLLaDA) from scratch using a fixed objective and bidirectional attention, followed by direct benchmark evaluation. No equations, predictions, or first-principles derivations are present that could reduce reported scores to fitted parameters or self-citations by construction. Comparisons to LLaDA and Qwen2.5 are external reference points, not load-bearing inputs to any claimed derivation. The work is self-contained as an empirical demonstration.