
arxiv: 2606.25331 · v1 · pith:UWZ4GWDMnew · submitted 2026-06-24 · 💻 cs.CL · cs.AI · cs.LG

Improved Large Language Diffusion Models

Pith reviewed 2026-06-25 21:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords large language models, diffusion models, masked diffusion, bidirectional attention, non-autoregressive generation, language model pretraining, instruction tuning

The pith

An 8B masked diffusion language model trained from scratch with fully bidirectional attention stays competitive with a comparably sized autoregressive model on several language benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that iLLaDA, an 8 billion parameter model, can be trained end-to-end with a masked diffusion objective and bidirectional attention instead of the usual autoregressive factorization and causal masking. Pre-training runs to 12 trillion tokens and supervised fine-tuning uses a 25 billion token instruction set for 12 epochs, with variable-length generation added for speed. On this path the model records large gains over earlier diffusion models and stays competitive with a 7B autoregressive baseline across general, math, and code tasks. A reader would care because the result indicates that the dominant autoregressive recipe is not the only route to capable language models.

Core claim

iLLaDA keeps the masked diffusion objective through both pre-training and supervised fine-tuning, uses fully bidirectional attention throughout, scales pre-training to 12T tokens, and applies variable-length generation plus confidence-based scoring. Under these choices iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge relative to LLaDA, and the model remains competitive with Qwen2.5-7B on several benchmarks. The paper takes this as evidence that fully bidirectional diffusion training from scratch is a competitive path toward strong language models.

What carries the argument

Masked diffusion objective with fully bidirectional attention, which replaces causal factorization and allows non-autoregressive training and generation while preserving the diffusion loss throughout pre-training and fine-tuning.
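
A minimal sketch of what an objective of this kind can look like, following the general form used by earlier masked diffusion language models such as LLaDA: draw a masking ratio t, mask each token independently with probability t, and take cross-entropy only on the masked positions, weighted by 1/t. The `MASK_ID` value, the function names, the per-token normalisation, and the stand-in logits are assumptions made for illustration; the abstract does not spell out iLLaDA's exact loss.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_tokens(x0, t, rng):
    """Forward process: mask each token independently with probability t."""
    is_masked = rng.random(x0.shape) < t
    xt = np.where(is_masked, MASK_ID, x0)
    return xt, is_masked


def masked_diffusion_loss(logits, x0, is_masked, t):
    """Cross-entropy on masked positions only, weighted by 1/t.

    Dividing by the total token count is an assumed normalisation.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-likelihood the model assigns to the original token at each position.
    token_ll = np.take_along_axis(log_probs, x0[..., None], axis=-1)[..., 0]
    return -(token_ll * is_masked).sum() / (t * x0.size)


# Toy usage: uniform logits stand in for a bidirectional model's output on xt.
rng = np.random.default_rng(0)
x0 = rng.integers(1, 5, size=(2, 8))   # token ids 1..4, vocabulary size 5
t = 0.5
xt, is_masked = mask_tokens(x0, t, rng)
logits = np.zeros((2, 8, 5))
loss = masked_diffusion_loss(logits, x0, is_masked, t)
```

Because every position of `xt` is visible to the model at once, there is no causal mask anywhere in this computation; that is the sense in which the attention is fully bidirectional.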

If this is right

  • Maintaining the diffusion objective through supervised fine-tuning preserves the non-autoregressive training regime at scale.
  • Variable-length generation reduces inference cost without reverting to autoregressive sampling.
  • Confidence-based scoring provides a consistent way to evaluate multiple-choice questions under diffusion sampling.
  • Broad gains on mathematical and code benchmarks follow from scaling the bidirectional diffusion recipe to 12T tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bidirectional diffusion setup could be tested on sequence tasks outside language, such as protein or music modeling.
  • If diffusion training tolerates longer contexts more gracefully than causal attention, it might relax current context-length limits.
  • Parallel sampling during generation could become a practical advantage once the objective is shown to match autoregressive quality.
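
Parallel sampling in masked models is commonly implemented as confidence-ordered iterative unmasking, in the spirit of MaskGIT. The sketch below is a generic illustration of that idea under stated assumptions, not iLLaDA's sampler: `predict_probs` is a hypothetical stand-in for a bidirectional model, `MASK_ID` is an arbitrary sentinel, and the equal-share schedule is chosen only for simplicity.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a masked position


def parallel_unmask(predict_probs, length, steps):
    """Fill `length` masked positions in at most `steps` rounds.

    Each round predicts every position at once and commits only the
    most confident predictions among the still-masked positions.
    """
    seq = np.full(length, MASK_ID)
    for step in range(steps):
        masked = np.flatnonzero(seq == MASK_ID)
        if masked.size == 0:
            break
        probs = predict_probs(seq)        # shape (length, vocab)
        tokens = probs.argmax(axis=-1)    # greedy token per position
        conf = probs.max(axis=-1)         # confidence of that token
        # Commit an equal share of the remaining positions each round.
        k = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-conf[masked])[:k]]
        seq[chosen] = tokens[chosen]
    return seq


def toy_model(seq):
    """Stand-in model: fixed pseudo-random probabilities, ignores its input."""
    p = np.random.default_rng(0).random((seq.size, 4))
    return p / p.sum(axis=-1, keepdims=True)


out = parallel_unmask(toy_model, length=6, steps=3)
```

With six positions and three rounds, two tokens are committed per round, so the model is called three times instead of six.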

Load-bearing premise

The reported gains come from the masked diffusion objective and bidirectional attention rather than from unmeasured differences in total compute, data quality, or post-training steps.

What would settle it

A side-by-side training run in which an autoregressive model receives the identical token count, data mixture, and fine-tuning protocol as iLLaDA yet still outperforms it on the same suite of benchmarks.

Original abstract

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and codes: https://github.com/ML-GSAI/LLaDA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention and the masked diffusion objective retained through pre-training (12T tokens) and SFT (25B-token corpus for 12 epochs). It introduces variable-length generation and confidence-based scoring for multiple-choice tasks, reports large gains over LLaDA (e.g., +21.6 BBH, +14.9 ARC-Challenge for the base model; +14.5 MATH, +16.5 HumanEval for the instruct model), and states competitiveness with Qwen2.5-7B, concluding that fully bidirectional diffusion training from scratch is a competitive path to strong language models. Model weights and code are released.

Significance. If the reported gains can be isolated to the masked diffusion objective and bidirectional attention, the work would establish a viable non-autoregressive scaling route that challenges the dominance of causal autoregressive training and broadens architectural options for language models. The public release of weights and code supports reproducibility and further investigation.

major comments (2)
  1. [Abstract] The central attribution of gains (e.g., +21.6 BBH, +14.9 ARC-Challenge) to the masked diffusion objective plus fully bidirectional attention is not secured, because no ablation holds total compute, data mixture, optimizer schedule, and post-training fixed while swapping only the objective and attention mask; the 12T pre-training tokens, 25B-token SFT corpus, and 12-epoch fine-tuning therefore remain plausible alternative drivers.
  2. [Abstract] Benchmark improvements are presented without error bars, multiple random seeds, or statistical tests, so the reliability of claims such as +14.5 MATH and +16.5 HumanEval cannot be assessed from the reported point estimates alone.
minor comments (1)
  1. [Abstract] The description of variable-length generation and confidence-based scoring is too brief to allow replication or assessment of their contribution to the reported efficiency and accuracy numbers.
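
One inexpensive way to attach an interval to a benchmark delta without retraining is a paired bootstrap over test examples. The sketch below assumes per-example 0/1 correctness vectors for two models on the same test set; the function name and the synthetic data are illustrative, and nothing here is taken from the paper. It captures evaluation-set sampling noise only, not variation across training seeds.

```python
import numpy as np


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    Both inputs are 0/1 arrays over the same test examples, so each
    resample draws example indices once and applies them to both models.
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, a.size, size=(n_resamples, a.size))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), lo, hi


# Synthetic correctness vectors, for illustration only.
rng = np.random.default_rng(1)
model_a = rng.random(500) < 0.60
model_b = rng.random(500) < 0.45
delta, ci_lo, ci_hi = paired_bootstrap_ci(model_a, model_b)
```

If the interval excludes zero, the gap is unlikely to be an artifact of which test items happened to be sampled; it says nothing about whether a second training run would reproduce it.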

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central attribution of gains (e.g., +21.6 BBH, +14.9 ARC-Challenge) to the masked diffusion objective plus fully bidirectional attention is not secured, because no ablation holds total compute, data mixture, optimizer schedule, and post-training fixed while swapping only the objective and attention mask; the 12T pre-training tokens, 25B-token SFT corpus, and 12-epoch fine-tuning therefore remain plausible alternative drivers.

    Authors: The referee correctly identifies that we lack a controlled ablation isolating the contribution of the masked diffusion objective and bidirectional attention. Such an experiment would require training additional models with identical compute, data, and schedules but different objectives, which exceeds our available resources. Our results demonstrate that scaling masked diffusion training to 12T tokens yields strong performance compared to the LLaDA baseline. We have revised the abstract to avoid over-attributing the gains exclusively to the objective and attention, instead emphasizing the overall training approach. A discussion of alternative factors has been added to the limitations section.
    Revision: partial

  2. Referee: [Abstract] Benchmark improvements are presented without error bars, multiple random seeds, or statistical tests, so the reliability of claims such as +14.5 MATH and +16.5 HumanEval cannot be assessed from the reported point estimates alone.

    Authors: We agree that multiple seeds and statistical tests would provide a more robust assessment of the improvements. Due to the high computational cost of training 8B models on trillions of tokens, we are unable to conduct multiple independent runs. We have added a statement in the experimental setup section noting that all results are from single training runs, consistent with practices in similar large-scale model papers.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical training and evaluation results

Full rationale

The manuscript reports training runs of an 8B masked diffusion model (iLLaDA) from scratch using a fixed objective and bidirectional attention, followed by direct benchmark evaluation. No equations, predictions, or first-principles derivations are present that could reduce reported scores to fitted parameters or self-citations by construction. Comparisons to LLaDA and Qwen2.5 are external reference points, not load-bearing inputs to any claimed derivation. The work is self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that the masked diffusion objective with bidirectional attention produces the stated benchmark gains when scaled; no free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.1-grok · 5773 in / 1111 out tokens · 16704 ms · 2026-06-25T21:32:42.357838+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 21 linked inside Pith

  1. [1]

    A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  3. [3]

    ChatGPT: Optimizing Language Models for Dialogue. OpenAI blog, November 2022

    OpenAI. ChatGPT: Optimizing Language Models for Dialogue. OpenAI blog, November 2022. URL https://openai.com/blog/chatgpt/

  4. [4]

    The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  5. [5]

    Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021

  6. [6]

    Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023

  7. [7]

    Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024

  8. [8]

    Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024

  9. [9]

    Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024

  10. [10]

    Large language diffusion models. Advances in Neural Information Processing Systems, 38: 50608–50646, 2026

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. Advances in Neural Information Processing Systems, 38: 50608–50646, 2026

  11. [11]

    Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024

    Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024

  12. [12]

    Beyond autoregression: Discrete diffusion for complex reasoning and planning

    Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. In International Conference on Learning Representations, 2025

  13. [13]

    Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

  14. [14]

    Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025

  15. [15]

    Llada-o: An effective and length-adaptive omni diffusion model. arXiv preprint arXiv:2603.01068, 2026

    Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, and Ji-Rong Wen. Llada-o: An effective and length-adaptive omni diffusion model. arXiv preprint arXiv:2603.01068, 2026

  16. [16]

    Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

    Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

  17. [17]

    Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857, 2025

    Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857, 2025

  18. [18]

    Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025

    Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, and Michael Qizhe Shieh. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276, 2025

  19. [19]

    Qwen2 technical report, 2024

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

  20. [20]

    Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  21. [21]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  22. [22]

    Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

  23. [23]

    BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  24. [24]

    Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  25. [25]

    Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  26. [26]

    Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  27. [27]

    dkv-cache: The cache for diffusion language models

    Xinyin Ma, Runpeng Yu, Gongfan Fang, and Xinchao Wang. dkv-cache: The cache for diffusion language models. arXiv preprint arXiv:2505.15781, 2025

  28. [28]

    Attention is all you need for kv cache in diffusion llms

    Quan Nguyen-Tri, Mukul Ranjan, and Zhiqiang Shen. Attention is all you need for kv cache in diffusion llms. arXiv preprint arXiv:2510.14973, 2025

  29. [29]

    Entropycache: Decoded token entropy guided kv caching for diffusion language models. arXiv preprint arXiv:2603.18489, 2026

    Minsoo Cheong, Donghyun Son, Woosang Lim, and Sungjoo Yoo. Entropycache: Decoded token entropy guided kv caching for diffusion language models. arXiv preprint arXiv:2603.18489, 2026

  30. [30]

    Diffusion llm with native variable generation lengths: Let [eos] lead the way. arXiv preprint arXiv:2510.24605, 2025

    Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, and Linfeng Zhang. Diffusion llm with native variable generation lengths: Let [eos] lead the way. arXiv preprint arXiv:2510.24605, 2025

  31. [31]

    d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation. arXiv preprint arXiv:2601.07568, 2026

    Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation. arXiv preprint arXiv:2601.07568, 2026

  32. [32]

    Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

    I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  33. [33]

    Llada 1.5: Variance-reduced preference optimization for large language diffusion models

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223, 2025

  34. [34]

    Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  35. [35]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020

  36. [36]

    Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  37. [37]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

  38. [38]

    Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  39. [39]

    Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  40. [40]

    Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  41. [41]

    Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  42. [42]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  43. [43]

    Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  44. [44]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

  45. [45]

    Are we done with mmlu? arXiv preprint arXiv:2406.04127, 2024

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, and Pasquale Minervini. Are we done with mmlu? arXiv preprint arXiv:2406.04127, 2024

  46. [46]

    d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025

  47. [47]

    MDPO: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025

    Haoyu He, Katrin Renz, Yong Cao, and Andreas Geiger. MDPO: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025

  48. [48]

    Principled rl for diffusion llms emerges from a sequence-level perspective. arXiv preprint arXiv:2512.03759, 2025

    Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, and Chongxuan Li. Principled rl for diffusion llms emerges from a sequence-level perspective. arXiv preprint arXiv:2512.03759, 2025