Improved Large Language Diffusion Models

Chongxuan Li; Ji-Rong Wen; Qiyang Min; Shaoxuan Xu; Shen Nie; Wayne Xin Zhao; Yankai Lin; Yong Shan; Yuxuan Song; Zihao Huang

arxiv: 2606.25331 · v1 · pith:UWZ4GWDMnew · submitted 2026-06-24 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-25 21:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords large language models · diffusion models · masked diffusion · bidirectional attention · non-autoregressive generation · language model pretraining · instruction tuning

The pith

An 8B masked diffusion language model trained from scratch with fully bidirectional attention stays competitive with a 7B autoregressive model on several language benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that iLLaDA, an 8-billion-parameter model, can be trained end-to-end with a masked diffusion objective and bidirectional attention instead of the usual autoregressive factorization and causal masking. Pre-training runs to 12 trillion tokens, and supervised fine-tuning uses a 25-billion-token instruction corpus for 12 epochs, with variable-length generation added for inference efficiency. On this path the model records large gains over the earlier LLaDA diffusion model across general, math, and code tasks, and stays competitive with Qwen2.5-7B, an autoregressive baseline, on several benchmarks. A reader would care because the result indicates that the dominant autoregressive recipe is not the only route to capable language models.

Core claim

iLLaDA keeps the masked diffusion objective through both pre-training and supervised fine-tuning, uses fully bidirectional attention throughout, scales pre-training to 12T tokens, and applies variable-length generation plus confidence-based scoring. Under these choices iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge relative to LLaDA, and the model remains competitive with Qwen2.5-7B on several benchmarks. The paper takes this as evidence that fully bidirectional diffusion training from scratch is a competitive path toward strong language models.

What carries the argument

Masked diffusion objective with fully bidirectional attention, which replaces causal factorization and allows non-autoregressive training and generation while preserving the diffusion loss throughout pre-training and fine-tuning.
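The abstract does not spell out the loss, so the following is only a minimal sketch, assuming the LLaDA-style formulation of masked diffusion: draw a masking ratio t, mask each token independently with probability t, and weight the cross-entropy on masked positions by 1/t. The `MASK` id, the function names, and the `uniform_model` stand-in for the network are hypothetical and chosen for illustration.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)


def forward_mask(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(log_probs, tokens, noisy, t):
    """One-sample estimate of the loss for a single sequence at masking ratio t.

    log_probs[i][v] is the model's log-probability of vocabulary item v at
    position i, computed from the whole noisy sequence (no causal mask).
    Only masked positions contribute; the 1/t weight is the assumed
    LLaDA-style weighting.
    """
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            total -= log_probs[i][clean]
    return total / (t * len(tokens))


def uniform_model(noisy, vocab_size):
    """Stand-in for the network: a uniform prediction at every position."""
    return [[-math.log(vocab_size)] * vocab_size for _ in noisy]


# Toy usage: one training example at a randomly drawn masking ratio.
rng = random.Random(0)
tokens = [3, 1, 2, 0, 1, 3, 2, 0]
t = rng.uniform(0.05, 1.0)  # lower bound avoids dividing by a near-zero t
noisy = forward_mask(tokens, t, rng)
loss = masked_diffusion_loss(uniform_model(noisy, 4), tokens, noisy, t)
```

Because every position is predicted from the full noisy sequence, nothing in this loss requires a left-to-right factorization, which is the property the paragraph above refers to.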

If this is right

Maintaining the diffusion objective through supervised fine-tuning preserves the non-autoregressive training regime at scale.
Variable-length generation reduces inference cost without reverting to autoregressive sampling.
Confidence-based scoring provides a consistent way to evaluate multiple-choice questions under diffusion sampling.
Broad gains on mathematical and code benchmarks follow from scaling the bidirectional diffusion recipe to 12T tokens.
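The abstract names confidence-based scoring for multiple-choice evaluation without defining it. One plausible reading, offered purely as an illustration and not as the paper's procedure, is to mask each candidate answer, read off the log-probability the model assigns to each of its tokens, and pick the option with the highest length-normalised confidence. The function names and the input format below are hypothetical.

```python
import math


def candidate_confidence(token_log_probs):
    """Length-normalised confidence: mean log-probability of the answer tokens."""
    return sum(token_log_probs) / len(token_log_probs)


def pick_answer(candidates):
    """Return the option label with the highest mean token log-probability.

    candidates maps an option label to the per-token log-probabilities that
    a model assigned to that option's tokens (assumed to be given).
    """
    return max(candidates, key=lambda label: candidate_confidence(candidates[label]))


# Toy usage with made-up probabilities for three options.
options = {
    "A": [math.log(0.2), math.log(0.1)],
    "B": [math.log(0.6), math.log(0.5), math.log(0.7)],
    "C": [math.log(0.3)],
}
choice = pick_answer(options)
```

Averaging rather than summing keeps longer options from being penalised for their length alone; whether the paper normalises this way is not stated on this page.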

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bidirectional diffusion setup could be tested on sequence tasks outside language, such as protein or music modeling.
If diffusion training tolerates longer contexts more gracefully than causal attention, it might relax current context-length limits.
Parallel sampling during generation could become a practical advantage once the objective is shown to match autoregressive quality.

Load-bearing premise

The reported gains come from the masked diffusion objective and bidirectional attention rather than from unmeasured differences in total compute, data quality, or post-training steps.

What would settle it

A side-by-side training run in which an autoregressive model receives the identical token count, data mixture, and fine-tuning protocol as iLLaDA yet still outperforms it on the same suite of benchmarks.
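The controlled comparison described above can be written down as a pair of run configurations that differ only in objective and attention mask. This is a sketch of the experimental design, not a training script: the field names and the `data_mixture` label are hypothetical, while the 8B, 12T, 25B, and 12-epoch values come from the abstract.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    objective: str                 # "masked_diffusion" or "next_token"
    attention: str                 # "bidirectional" or "causal"
    params: str = "8B"
    pretrain_tokens: str = "12T"
    sft_tokens: str = "25B"
    sft_epochs: int = 12
    data_mixture: str = "shared-mixture"  # hypothetical label for the common data


def differing_fields(a, b):
    """Names of the fields on which two run configurations disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


diffusion_run = RunConfig(objective="masked_diffusion", attention="bidirectional")
# Swap only the two factors under test; everything else is held fixed.
autoregressive_run = replace(diffusion_run, objective="next_token", attention="causal")
```

A check such as `differing_fields(diffusion_run, autoregressive_run)` makes the intended isolation explicit: any field beyond objective and attention appearing in the result would be a confound.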

Original abstract

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and codes: https://github.com/ML-GSAI/LLaDA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

iLLaDA scales masked diffusion to 8B with bidirectional attention and reports competitive benchmarks, but the gains are not isolated from data volume, SFT details, or compute.

The core takeaway is that an 8B masked diffusion model trained from scratch on 12T tokens with fully bidirectional attention can match or beat prior diffusion work and stay close to Qwen2.5-7B on several benchmarks. That scaling result is the main new piece.

They keep the masked diffusion objective through both pre-training and a 25B-token SFT stage run for 12 epochs, add variable-length generation at inference, and use confidence scoring for multiple-choice tasks. The reported lifts over LLaDA (21.6 on BBH, 14.9 on ARC-Challenge for the base model; 14.5 on MATH and 16.5 on HumanEval for the instruct version) are concrete numbers, and releasing weights plus code helps.

The soft spot is exactly the one flagged in the load-bearing premise above. Nothing in the abstract or the described experiments holds total tokens, data mixture, optimizer schedule, or post-training fixed while swapping only the objective and attention mask. The 12T pre-training tokens and the long SFT run are large enough that they could drive much of the improvement. Without those controls, the claim that bidirectional diffusion itself is the competitive path rests on correlation rather than isolation.

The math and training description look standard for this line of work; no obvious circularity or invented entities. The citation pattern builds directly on earlier diffusion LM papers.

This is for groups already tracking non-autoregressive or diffusion-based language models. A reader who wants to see whether the paradigm can reach current autoregressive scales will find the numbers worth checking. It deserves peer review because the scale is large enough that referees can ask for the missing ablations and still get useful information out of the experiment.

Referee Report

2 major / 1 minor

Summary. The paper introduces iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention and the masked diffusion objective retained through pre-training (12T tokens) and SFT (25B-token corpus for 12 epochs). It uses variable-length generation for efficiency and introduces confidence-based scoring for multiple-choice tasks, reports large gains over LLaDA (e.g., +21.6 BBH, +14.9 ARC-Challenge for the base model; +14.5 MATH, +16.5 HumanEval for the instruct model), and states competitiveness with Qwen2.5-7B, concluding that fully bidirectional diffusion training from scratch is a competitive path to strong language models. Model weights and code are released.

Significance. If the reported gains can be isolated to the masked diffusion objective and bidirectional attention, the work would establish a viable non-autoregressive scaling route that challenges the dominance of causal autoregressive training and broadens architectural options for language models. The public release of weights and code supports reproducibility and further investigation.

major comments (2)

[Abstract] the central attribution of gains (e.g., +21.6 BBH, +14.9 ARC-Challenge) to the masked diffusion objective plus fully bidirectional attention is not secured, because no ablation holds total compute, data mixture, optimizer schedule, and post-training fixed while swapping only the objective and attention mask; the 12T pre-training tokens, 25B SFT corpus, and 12-epoch fine-tuning therefore remain plausible alternative drivers.
[Abstract] benchmark improvements are presented without error bars, multiple random seeds, or statistical tests, so the reliability of claims such as +14.5 MATH and +16.5 HumanEval cannot be assessed from the reported point estimates alone.
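As an illustration of the kind of error bar the second comment asks for, the sketch below computes a paired bootstrap confidence interval for the accuracy difference between two models scored on the same test items. It is a generic technique, not something the paper reports; the function name and defaults are assumptions. It captures only test-set sampling noise and says nothing about variance across training seeds.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on the same test items.

    correct_a[i] and correct_b[i] are 0/1 outcomes of the two models on item i.
    Items are resampled with replacement, keeping each pair together.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo_idx = int(round((alpha / 2) * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return diffs[lo_idx], diffs[hi_idx]
```

On a small benchmark such as HumanEval, which has 164 problems, a single problem moves accuracy by about 0.6 points, so an interval of this kind helps show how much of a reported gain could be sampling noise.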

minor comments (1)

[Abstract] the description of variable-length generation and confidence-based scoring is too brief to allow replication or assessment of their contribution to the reported efficiency and accuracy numbers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We respond to each major comment below.

Referee: [Abstract] the central attribution of gains (e.g., +21.6 BBH, +14.9 ARC-Challenge) to the masked diffusion objective plus fully bidirectional attention is not secured, because no ablation holds total compute, data mixture, optimizer schedule, and post-training fixed while swapping only the objective and attention mask; the 12T pre-training tokens, 25B SFT corpus, and 12-epoch fine-tuning therefore remain plausible alternative drivers.

Authors: The referee correctly identifies that we lack a controlled ablation isolating the contribution of the masked diffusion objective and bidirectional attention. Such an experiment would require training additional models with identical compute, data, and schedules but different objectives, which exceeds our available resources. Our results demonstrate that scaling masked diffusion training to 12T tokens yields strong performance compared to the LLaDA baseline. We have revised the abstract to avoid over-attributing the gains exclusively to the objective and attention, instead emphasizing the overall training approach. A discussion of alternative factors has been added to the limitations section.

revision: partial

Referee: [Abstract] benchmark improvements are presented without error bars, multiple random seeds, or statistical tests, so the reliability of claims such as +14.5 MATH and +16.5 HumanEval cannot be assessed from the reported point estimates alone.

Authors: We agree that multiple seeds and statistical tests would provide a more robust assessment of the improvements. Due to the high computational cost of training 8B models on trillions of tokens, we are unable to conduct multiple independent runs. We have added a statement in the experimental setup section noting that all results are from single training runs, consistent with practices in similar large-scale model papers.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical training and evaluation results

The manuscript reports training runs of an 8B masked diffusion model (iLLaDA) from scratch using a fixed objective and bidirectional attention, followed by direct benchmark evaluation. No equations, predictions, or first-principles derivations are present that could reduce reported scores to fitted parameters or self-citations by construction. Comparisons to LLaDA and Qwen2.5 are external reference points, not load-bearing inputs to any claimed derivation. The work is self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that the masked diffusion objective with bidirectional attention produces the stated benchmark gains when scaled; no free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.1-grok · 5773 in / 1111 out tokens · 16704 ms · 2026-06-25T21:32:42.357838+00:00 · methodology


[1] [1]

A survey of large language models.arXiv preprint arXiv:2303.18223, 2023

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 2023

Pith/arXiv arXiv 2023

[2] [2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advancesin neural information processing systems, 33:1877–1901, 2020

1901

[3] [3]

ChatGPT: Optimizing Language Models for Dialogue.OpenAI blog, November 2022

OpenAI. ChatGPT: Optimizing Language Models for Dialogue.OpenAI blog, November 2022. URL https: //openai.com/blog/chatgpt/

2022

[4] [4]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[5] [5]

Structured denoising diffusion models in discrete state-spaces.Advancesin Neural Information Processing Systems, 34:17981–17993, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advancesin Neural Information Processing Systems, 34:17981–17993, 2021

2021

[6] [6]

Discrete diffusion language modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023

Pith/arXiv arXiv 2023

[7] [7]

Simplified and generalized masked diffusion for discrete data.arXiv preprint arXiv:2406.04329, 2024

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data.arXiv preprint arXiv:2406.04329, 2024

arXiv 2024

[8] [8]

Simple and effective masked diffusion language models.arXiv preprint arXiv:2406.07524, 2024

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.arXiv preprint arXiv:2406.07524, 2024

arXiv 2024

[9] [9]

Your absorbing discrete diffusion secretly models the conditional distributions of clean data.arXiv preprint arXiv:2406.03736, 2024

Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data.arXiv preprint arXiv:2406.03736, 2024

Pith/arXiv arXiv 2024

[10] [10]

Large language diffusion models.Advances in Neural Information Processing Systems, 38: 50608–50646, 2026

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, 6 and Chongxuan Li. Large language diffusion models.Advances in Neural Information Processing Systems, 38: 50608–50646, 2026

2026

[11] [11]

Scaling up masked diffusion models on text.arXiv preprint arXiv:2410.18514, 2024

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text.arXiv preprint arXiv:2410.18514, 2024

arXiv 2024

[12] [12]

Beyond autoregression: Discrete diffusion for complex reasoning and planning

Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. InInternational Conference on Learning Representations, 2025

2025

[13] [13]

Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

Pith/arXiv arXiv 2025

[14] [14]

Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025

Pith/arXiv arXiv 2025

[15] [15]

Llada-o: An effective and length-adaptive omni diffusion model. arXiv preprint arXiv:2603.01068, 2026

Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, and Ji-Rong Wen. Llada-o: An effective and length-adaptive omni diffusion model. arXiv preprint arXiv:2603.01068, 2026

arXiv 2026

[16] [16]

Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

arXiv 2025

[17] [17]

Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857, 2025

Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857, 2025

arXiv 2025

[18] [18]

Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025

arXiv 2025

[19] [19]

Qwen2 technical report, 2024

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

Pith/arXiv arXiv 2024

[20] [20]

Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

Pith/arXiv arXiv 2024

[21] [21]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

2023

[22] [22]

Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

Pith/arXiv arXiv 2025

[23] [23]

BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

Pith/arXiv arXiv 2018

[24] [24]

Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

2019

[25] [25]

Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

Pith/arXiv arXiv 2020

[26] [26]

Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

2024

[27] [27]

dkv-cache: The cache for diffusion language models

Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025

arXiv 2025

[28] [28]

Attention is all you need for kv cache in diffusion llms

Quan Nguyen-Tri, Mukul Ranjan, and Zhiqiang Shen. Attention is all you need for kv cache in diffusion llms. arXiv preprint arXiv:2510.14973, 2025

arXiv 2025

[29] [29]

Entropycache: Decoded token entropy guided kv caching for diffusion language models. arXiv preprint arXiv:2603.18489, 2026

Minsoo Cheong, Donghyun Son, Woosang Lim, and Sungjoo Yoo. Entropycache: Decoded token entropy guided kv caching for diffusion language models. arXiv preprint arXiv:2603.18489, 2026

arXiv 2026

[30] [30]

Diffusion llm with native variable generation lengths: Let [eos] lead the way. arXiv preprint arXiv:2510.24605, 2025

Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, and Linfeng Zhang. Diffusion llm with native variable generation lengths: Let [eos] lead the way. arXiv preprint arXiv:2510.24605, 2025

arXiv 2025

[31] [31]

d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation. arXiv preprint arXiv:2601.07568, 2026

Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation. arXiv preprint arXiv:2601.07568, 2026

arXiv 2026

[32] [32]

Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017

[33] [33]

Llada 1.5: Variance-reduced preference optimization for large language diffusion models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223, 2025

Pith/arXiv arXiv 2025

[34] [34]

Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

Pith/arXiv arXiv 2019

[35] [35]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020

2020

[36] [36]

Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

Pith/arXiv arXiv 2018

[37] [37]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

2022

[38] [38]

Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

Pith/arXiv arXiv 2020

[39] [39]

Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

Pith/arXiv arXiv 2022

[40] [40]

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021

[41] [41]

Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

Pith/arXiv arXiv 2021

[42] [42]

Evaluating large language models trained on code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

Pith/arXiv arXiv 2021

[43] [43]

Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

Pith/arXiv arXiv 2021

[44] [44]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

2024

[45] [45]

Are we done with mmlu? arXiv preprint arXiv:2406.04127, 2024

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, and Pasquale Minervini. Are we done with mmlu? arXiv preprint arXiv:2406.04127, 2024

arXiv 2024

[46] [46]

d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025

Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025

arXiv 2025

[47] [47]

MDPO: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025

Haoyu He, Katrin Renz, Yong Cao, and Andreas Geiger. MDPO: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025

arXiv 2025

[48] [48]

Principled rl for diffusion llms emerges from a sequence-level perspective. arXiv preprint arXiv:2512.03759, 2025

Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, and Chongxuan Li. Principled rl for diffusion llms emerges from a sequence-level perspective. arXiv preprint arXiv:2512.03759, 2025

A Evaluation Details

This appendix provides additional details for the evaluations in Sec. 3. For iLLaDA-8B-Base, we use open-ended generati...

arXiv 2025