pith. machine review for the scientific record.

arxiv: 2605.13026 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Understanding and Accelerating the Training of Masked Diffusion Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords masked diffusion models · language modeling · training acceleration · time sampling · locality bias · diffusion models · non-autoregressive

The pith

Bell-shaped time sampling brings masked diffusion language models to target performance up to four times faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Masked diffusion models for language are promising alternatives to autoregressive models but train much more slowly. Analysis reveals that the main cause is language's locality bias, where each token's prediction depends heavily on nearby tokens. The proposed fix is to use bell-shaped time sampling during training instead of uniform sampling over time steps. This simple change allows models to reach the same validation negative log-likelihood up to four times faster on the One Billion Word Benchmark. The acceleration also appears in faster improvements on generative perplexity, zero-shot perplexity, and various downstream tasks.

Core claim

The paper establishes that masked diffusion models learn slowly because language exhibits a strong locality bias, concentrating predictive information in nearby positions. Bell-shaped time sampling, which draws diffusion time steps from a bell-shaped distribution rather than uniformly, concentrates training on the noise levels where that bias matters most. As a result, the models achieve equivalent validation negative log-likelihood up to approximately four times faster than with standard training, while also showing accelerated progress on generative and zero-shot perplexity as well as downstream performance metrics.

What carries the argument

Bell-shaped time sampling: a training strategy that draws diffusion time steps from a bell-shaped distribution to focus learning on intermediate noise levels where local dependencies are most effectively addressed.
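The paper ablates the exact form of the distribution (Dirac-delta, truncated Gaussian, and truncated Laplace variants appear in Figures 8-10). As a minimal sketch, assuming a truncated Gaussian centered at t = 0.5 with σ = 0.1 (the parameterization stated in Figure 11), the change amounts to swapping the time sampler in an otherwise standard MDM training step:

    import torch

    # Sketch of the two time samplers. The truncated-Gaussian parameters
    # (mu = 0.5, sigma = 0.1 on (0, 1)) are taken from Figure 11 and are
    # among the choices the paper ablates, not a single prescribed recipe.

    def sample_times_uniform(batch_size: int) -> torch.Tensor:
        # Standard MDM training: t ~ Uniform(0, 1).
        return torch.rand(batch_size)

    def sample_times_bell(batch_size: int, mu: float = 0.5, sigma: float = 0.1) -> torch.Tensor:
        # Bell-shaped alternative: rejection-sample a Gaussian truncated to (0, 1).
        out = torch.empty(0)
        while out.numel() < batch_size:
            cand = mu + sigma * torch.randn(batch_size)
            out = torch.cat([out, cand[(cand > 0) & (cand < 1)]])
        return out[:batch_size]

    def mask_tokens(tokens: torch.Tensor, t: torch.Tensor, mask_id: int) -> torch.Tensor:
        # Under a linear schedule alpha_t = 1 - t, each token is masked
        # independently with probability t, so t near 0.5 produces the
        # middle-context patterns the paper's analysis highlights.
        drop = torch.rand(tokens.shape) < t[:, None]
        return torch.where(drop, torch.full_like(tokens, mask_id), tokens)

With σ = 0.1, nearly all sampled times land near t = 0.5, so few steps are spent at the nearly clean or nearly fully masked extremes where, per the paper's analysis, MDM learning converges quickly anyway.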

If this is right

  • MDMs reach target validation NLL up to 4x faster on LM1B.
  • Generative perplexity, zero-shot perplexity, and downstream task performance improve more rapidly.
  • Final model performance remains comparable to standard training.
  • The method requires no architectural changes, only a modification to the time sampling distribution.
  • MDMs become more viable for scaling to larger model sizes due to reduced training compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adjusting the time sampling distribution could be a general technique for speeding up diffusion models where data has strong local structure.
  • This suggests that uniform time sampling may be suboptimal when the underlying data distribution has position-dependent predictability.
  • Future work might explore learned or adaptive time sampling distributions tailored to specific datasets.
  • Combining this with other efficiency techniques could compound the training speedups for large-scale language modeling.

Load-bearing premise

The locality bias of language is the primary reason for slow MDM training, and bell-shaped sampling mitigates it effectively without creating new training problems or reducing final performance.

What would settle it

A training run on the One Billion Word Benchmark where bell-shaped sampling shows no reduction in steps to reach the standard training's final validation NLL, or where it results in higher final NLL.

Figures

Figures reproduced from arXiv: 2605.13026 by Chieh-Hsin Lai, Chunsan Hong, Jong Chul Ye, Sanghyun Lee, Satoshi Hayakawa, Seungryong Kim, Yuhta Takida, Yuki Mitsufuji.

Figure 1. Locality bias in language makes MDMs slow learners. The red box illustrates the concept of locality bias: a token is influenced more strongly by nearby tokens and more weakly by distant ones. During training, ARMs learn to predict the target token from left-filled sequences, where an appropriate amount of local information is always available; in contrast, MDMs learn to predict the target token from arbitr…
Figure 2. Training curve on LM1B. The blue region for σ-ARM is the 1-sigma band over 5 distinct σ, and the solid line is the mean. Before analyzing why MDMs train slowly, we first ask a broader question: why is learning any-order language generation difficult in the first place? Among several possible factors, we focus on two main possibilities: whether the difficulty primarily arises from the large order-space complexity …
Figure 3. Left: NLL per number of context tokens on LM1B after 1M training steps. ARM shows an approximately uniform distribution, whereas MDM shows a clearly uneven one. Right: NLL drop across context token numbers for ARM and MDM over different training stages on LM1B. For Nc = 0, MDM has already converged to the optimal loss by 10K steps. Since MDM models the joint distribution as a product of independent conditi…
Figure 4. Left: probability that a uniformly sampled mask pattern with exactly L − Nc masks avoids the k-inefficient set Ek. Right: probability that uniform and Dirac-delta time sampling each avoid Ek.
Figure 5. Validation NLL curves on LM1B for models trained with various time distributions listed …
Figure 6. Validation NLL curves on various language modeling benchmarks. "Base" denotes base …
Figure 7. Training curve on LM1B with the middle-flat noise scheduler. One can flatten the noise scheduler around αt = 0.5 in L_MDM, so that the model naturally encounters more middle-context samples during training. The key difference from bell-shaped sampling is whether the target cross-entropy loss is reweighted. Under the theoretical NELBO, the scaling factor is smallest around t = 0.5 and increases as t moves away fro…
Figure 8. Additional ablation on Dirac-delta time sampling. We compare different choices of the delta …
Figure 9. Additional ablation on truncated Gaussian time sampling. We vary the mean and standard …
Figure 10. Additional ablation on truncated Laplace time sampling. We compare Laplace distributions …
Figure 11. Left: the middle-flat scheduler α_t^MF = F_TG^{-1}(1 − t) with ℓ = 0, h = 1, µ = 0.5, and σ = 0.1 is flatter around α = 0.5 than the linear scheduler, so a uniform draw of t produces many more middle-context corruption levels. Right: under t ∼ Uniform(0, 1), the induced marginal distribution of α_t^MF matches the target truncated-Gaussian density f_TG(a). Recall that the forward process of MDM is defined as f… (A sketch of this inverse-CDF construction follows the figure list.)
Figure 12. Qualitative comparisons on professional email generation tasks. Across both email-writing tasks, the baseline MDM frequently degenerates into malformed formatting and repetitive text fragments. In contrast, the Gaussian-time-trained MDM generates substantially more coherent responses with recognizable email structure, appropriate formatting, and improved instruction following, despite minor repetition ar…
Figure 13. Qualitative comparisons between the baseline MDM and the Gaussian-time-trained MDM. Across both structured extraction and open-ended generation tasks, the baseline model frequently degenerates into repetitive or malformed text, whereas the Gaussian-trained model produces substantially more coherent and instruction-following responses.
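
A concrete reading of Figure 11's construction: rather than changing the time sampler, one can keep t uniform and warp the noise schedule so that the marginal over corruption levels becomes bell-shaped. A minimal sketch, assuming the truncated-Gaussian parameters stated in the caption (support [0, 1], µ = 0.5, σ = 0.1) and SciPy's truncated normal:

    import numpy as np
    from scipy.stats import truncnorm

    # Middle-flat scheduler from Figure 11: alpha_t^MF = F_TG^{-1}(1 - t),
    # with the truncated Gaussian supported on [0, 1], mu = 0.5, sigma = 0.1.
    mu, sigma, lo, hi = 0.5, 0.1, 0.0, 1.0
    a, b = (lo - mu) / sigma, (hi - mu) / sigma  # truncnorm bounds in sigma units

    def alpha_middle_flat(t):
        # Inverse CDF of the truncated Gaussian at 1 - t: a uniform draw of t
        # then induces alpha ~ f_TG, i.e. corruption levels concentrated
        # around alpha = 0.5.
        return truncnorm.ppf(1.0 - t, a, b, loc=mu, scale=sigma)

    # Example: uniformly spaced times map to mostly middle corruption levels.
    t = np.linspace(0.01, 0.99, 9)
    print(alpha_middle_flat(t))

As the Figure 7 caption notes, the remaining difference from bell-shaped time sampling is whether the target cross-entropy loss is reweighted: under the theoretical NELBO the scaling factor is smallest around t = 0.5 and grows away from it, so the two recipes weight the same corruption levels differently.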
read the original abstract

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes why masked diffusion language models (MDMs) train more slowly than autoregressive models, attributing the slowdown primarily to the locality bias of natural language where predictive information is concentrated in nearby tokens. It proposes bell-shaped time sampling as a training modification and reports that MDMs trained this way reach the same validation negative log-likelihood up to ~4× faster than uniform sampling on the LM1B benchmark, with accompanying gains in generative perplexity, zero-shot perplexity, and downstream task performance.

Significance. If the empirical result holds, the work supplies a lightweight, practical change to the MDM training pipeline that reduces wall-clock time to target performance without altering the converged model quality. The inclusion of matched-compute learning curves and schedule ablations provides direct evidence for the speedup claim and strengthens the case that MDMs can become more competitive at scale.

major comments (1)
  1. [§4] §4 (LM1B experiments): the reported ~4× speedup to target validation NLL is presented without error bars, multiple random seeds, or statistical tests on the learning curves; this makes the precise magnitude of the acceleration difficult to assess as robust rather than run-specific.
minor comments (2)
  1. [§3] The definition and parameterization of the bell-shaped time distribution should be given explicitly as an equation (or pseudocode) rather than described only in prose, to allow exact reproduction.
  2. [Figures 2-4] Figure captions for the ablation plots would benefit from stating the exact hyper-parameters of the bell-shaped schedule used in each curve.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the constructive feedback on the robustness of the reported speedup and address the comment below.

read point-by-point responses
  1. Referee: [§4] §4 (LM1B experiments): the reported ~4× speedup to target validation NLL is presented without error bars, multiple random seeds, or statistical tests on the learning curves; this makes the precise magnitude of the acceleration difficult to assess as robust rather than run-specific.

    Authors: We agree that the absence of error bars and multiple seeds limits the ability to quantify robustness. In the revised manuscript we will add results from three independent random seeds for the LM1B experiments, reporting mean validation NLL curves with standard deviation bands. The original single-run curves were generated under a fixed compute budget, but the observed acceleration was large and aligned with the locality-bias analysis; the additional runs will confirm consistency across initializations. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical acceleration analysis

full rationale

The paper presents an empirical analysis of MDM training slowdown due to language locality bias, followed by a proposed bell-shaped time-sampling remedy whose benefits are validated directly on held-out benchmarks (LM1B NLL curves, generative perplexity, zero-shot tasks) via matched-compute ablations. No derivation reduces to a fitted parameter renamed as prediction, no self-citation chain supplies the central claim, and the locality-bias observation is extracted from data inspection rather than defined in terms of the proposed schedule. The result remains an externally falsifiable training modification with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that locality bias dominates training dynamics and on an unspecified parametric form for the bell-shaped sampler; no new physical entities are introduced.

free parameters (1)
  • bell-shape parameters
    The precise width, center, and height of the bell-shaped time distribution are almost certainly tuned to data but are not reported in the abstract.
axioms (1)
  • domain assumption: Locality bias of language is the primary factor slowing MDM training
    The abstract states this as the main finding from their analysis and the justification for the sampling change.

pith-pipeline@v0.9.0 · 5507 in / 1307 out tokens · 46097 ms · 2026-05-14T20:27:51.874375+00:00 · methodology

discussion (0)

