Understanding and Accelerating the Training of Masked Diffusion Language Models
Pith reviewed 2026-05-14 20:27 UTC · model grok-4.3
The pith
Bell-shaped time sampling accelerates masked diffusion language models to target performance up to four times faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that masked diffusion models learn slowly because language exhibits a strong locality bias: the predictive information for a token is concentrated in nearby positions. Bell-shaped time sampling, which draws diffusion time steps from a bell-shaped distribution rather than uniformly, concentrates training on the noise levels where this bias matters most. As a result, the models reach equivalent validation negative log-likelihood up to approximately four times faster than with standard training, while also improving more quickly on generative perplexity, zero-shot perplexity, and downstream performance metrics.
What carries the argument
Bell-shaped time sampling: a training strategy that draws diffusion time steps from a bell-shaped distribution to focus learning on intermediate noise levels where local dependencies are most effectively addressed.
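The mechanism can be made concrete with a small sketch. The paper's exact parameterization of the bell-shaped distribution is not given here, so the logit-normal below (and its mu/sigma values) is a hypothetical but representative bell-shaped choice on (0, 1), contrasted with the standard uniform sampler:

```python
import numpy as np

def sample_times_uniform(n, rng):
    # Standard MDM training: t ~ Uniform(0, 1).
    return rng.uniform(0.0, 1.0, size=n)

def sample_times_bell(n, rng, mu=0.0, sigma=1.0):
    # Hypothetical bell-shaped sampler: a logit-normal on (0, 1),
    # which concentrates mass at intermediate noise levels.
    # mu and sigma are illustrative, not the paper's values.
    z = rng.normal(mu, sigma, size=n)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid maps R -> (0, 1)

rng = np.random.default_rng(0)
t = sample_times_bell(8, rng)
assert np.all((t > 0) & (t < 1))
```

Under this choice, roughly 70% of sampled times fall in the middle half of the interval, versus 50% for the uniform sampler, so intermediate masking rates dominate training.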
If this is right
- MDMs reach target validation NLL up to 4x faster on LM1B.
- Generative perplexity, zero-shot perplexity, and downstream task performance improve more rapidly.
- Final model performance remains comparable to standard training.
- The method requires no architectural changes, only a modification to the time sampling distribution.
- MDMs become more viable for scaling to larger model sizes due to reduced training compute.
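The no-architectural-change claim can be illustrated with a toy training step: the only difference between the baseline and the proposed recipe is the function that draws t. This is a minimal runnable sketch, not the paper's code; `toy_model` stands in for a real transformer, the 1/t weighting is a simplified form of the standard MDM objective, and the logit-normal t-sampler is a hypothetical bell-shaped choice:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 32, 32  # MASK is an extra absorbing token id

def toy_model(x):
    # Stand-in denoiser: uniform logits over the vocabulary.
    # A real MDM would run a transformer here.
    return np.zeros(x.shape + (VOCAB,))

def mdm_loss_step(batch, sample_t, rng):
    """One step's loss; the only bell-shaped change is sample_t."""
    B, L = batch.shape
    t = sample_t(B, rng)                          # masking rate per sequence
    mask = rng.uniform(size=(B, L)) < t[:, None]  # mask tokens w.p. t
    noised = np.where(mask, MASK, batch)
    logits = toy_model(noised)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    token_nll = -np.take_along_axis(logp, batch[..., None], -1)[..., 0]
    # Simplified MDM objective: mean NLL on masked tokens, weighted 1/t.
    per_seq = (token_nll * mask).sum(-1) / np.maximum(mask.sum(-1), 1)
    return float((per_seq / t).mean())

batch = rng.integers(0, VOCAB, size=(4, 16))
uniform_t = lambda n, r: r.uniform(1e-3, 1.0, size=n)
bell_t = lambda n, r: 1.0 / (1.0 + np.exp(-r.normal(size=n)))  # hypothetical
loss = mdm_loss_step(batch, bell_t, rng)
```

Swapping `bell_t` for `uniform_t` is the entire intervention; the model, loss, and optimizer are untouched.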
Where Pith is reading between the lines
- Adjusting the time sampling distribution could be a general technique for speeding up diffusion models where data has strong local structure.
- This suggests that uniform time sampling may be suboptimal when the underlying data distribution has position-dependent predictability.
- Future work might explore learned or adaptive time sampling distributions tailored to specific datasets.
- Combining this with other efficiency techniques could compound the training speedups for large-scale language modeling.
Load-bearing premise
The locality bias of language is the primary reason for slow MDM training, and bell-shaped sampling mitigates it effectively without creating new training problems or reducing final performance.
What would settle it
A training run on the One Billion Word Benchmark where bell-shaped sampling shows no reduction in steps to reach the standard training's final validation NLL, or where it results in higher final NLL.
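The settling criterion above reduces to a steps-to-target comparison between two validation-NLL curves. A sketch of how the speedup factor would be computed from logged curves, using illustrative exponential-decay toys rather than the paper's data:

```python
import numpy as np

def steps_to_target(steps, nll, target):
    """First training step at which a validation-NLL curve reaches target.
    Returns None if the curve never gets there."""
    nll = np.asarray(nll)
    hit = nll <= target
    if not hit.any():
        return None
    return steps[int(np.argmax(hit))]

# Illustrative curves (not the paper's data): exponential-decay toys
# where the second curve decays four times faster.
steps = np.arange(0, 10001, 100)
baseline = 2.0 + 3.0 * np.exp(-steps / 4000)
bell = 2.0 + 3.0 * np.exp(-steps / 1000)

target = baseline[-1]  # the baseline run's final validation NLL
s_base = steps_to_target(steps, baseline, target)
s_bell = steps_to_target(steps, bell, target)
speedup = s_base / s_bell  # 4.0 for these toy curves
```

A refutation, in these terms, would be `speedup` near 1 (no reduction in steps) or `bell` never reaching `target` at all.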
Original abstract
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes why masked diffusion language models (MDMs) train more slowly than autoregressive models, attributing the slowdown primarily to the locality bias of natural language where predictive information is concentrated in nearby tokens. It proposes bell-shaped time sampling as a training modification and reports that MDMs trained this way reach the same validation negative log-likelihood up to ~4× faster than uniform sampling on the LM1B benchmark, with accompanying gains in generative perplexity, zero-shot perplexity, and downstream task performance.
Significance. If the empirical result holds, the work supplies a lightweight, practical change to the MDM training pipeline that reduces wall-clock time to target performance without altering the converged model quality. The inclusion of matched-compute learning curves and schedule ablations provides direct evidence for the speedup claim and strengthens the case that MDMs can become more competitive at scale.
major comments (1)
- [§4] LM1B experiments: the reported ~4× speedup to target validation NLL is presented without error bars, multiple random seeds, or statistical tests on the learning curves; this makes it difficult to assess whether the magnitude of the acceleration is robust rather than run-specific.
minor comments (2)
- [§3] The definition and parameterization of the bell-shaped time distribution should be given explicitly as an equation (or pseudocode) rather than described only in prose, to allow exact reproduction.
- [Figures 2-4] Figure captions for the ablation plots would benefit from stating the exact hyper-parameters of the bell-shaped schedule used in each curve.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the constructive feedback on the robustness of the reported speedup and address the comment below.
Point-by-point responses
Referee: [§4] LM1B experiments: the reported ~4× speedup to target validation NLL is presented without error bars, multiple random seeds, or statistical tests on the learning curves; this makes it difficult to assess whether the magnitude of the acceleration is robust rather than run-specific.
Authors: We agree that the absence of error bars and multiple seeds limits the ability to quantify robustness. In the revised manuscript we will add results from three independent random seeds for the LM1B experiments, reporting mean validation NLL curves with standard-deviation bands. The original single-run curves were generated under a fixed compute budget, but the observed acceleration was large and aligned with the locality-bias analysis; the additional runs will confirm consistency across initializations. Revision: yes.
Circularity Check
No significant circularity in empirical acceleration analysis
full rationale
The paper presents an empirical analysis of MDM training slowdown due to language locality bias, followed by a proposed bell-shaped time-sampling remedy whose benefits are validated directly on held-out benchmarks (LM1B NLL curves, generative perplexity, zero-shot tasks) via matched-compute ablations. No derivation reduces to a fitted parameter renamed as prediction, no self-citation chain supplies the central claim, and the locality-bias observation is extracted from data inspection rather than defined in terms of the proposed schedule. The result remains an externally falsifiable training modification with independent content.
Axiom & Free-Parameter Ledger
free parameters (1)
- bell-shape parameters
axioms (1)
- domain assumption: locality bias of language is the primary factor slowing MDM training