Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
Pith reviewed 2026-05-20 11:01 UTC · model grok-4.3
The pith
RePlaid shows that continuous diffusion language models scale competitively with discrete ones, narrowing the compute gap to autoregressive models to 20×.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RePlaid exhibits a compute gap of only 20× compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime while achieving a new state-of-the-art PPL bound of 22.1 among continuous DLMs on OpenWebText. This is enabled by aligning Plaid's architecture with modern discrete DLMs and using likelihood-based training, which optimizes the noise schedule to yield linear cross-entropy over time and creates structured geometries in embeddings.
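One way to read a compute-gap figure such as 20×, sketched here under assumptions that are not taken from the paper: suppose each model family's loss follows a power law L = a * C^(-b) in training compute C. The gap at a target loss is then the ratio of the compute each family needs to reach that loss. The function names and every fit coefficient below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def compute_to_reach(loss, a, b):
    """Invert the assumed power law loss = a * C**(-b) to get compute C."""
    if loss <= 0 or a <= 0 or b <= 0:
        raise ValueError("loss, a and b must all be positive")
    return (a / loss) ** (1.0 / b)


def compute_gap(loss, fit_model, fit_baseline):
    """Ratio of compute needed by the model vs. the baseline at the same loss.

    Each fit is a hypothetical (a, b) pair for loss = a * C**(-b).
    """
    return compute_to_reach(loss, *fit_model) / compute_to_reach(loss, *fit_baseline)


# Hypothetical fits: same exponent b, so the gap is the same at every loss.
fit_ar = (10.0, 0.05)                 # illustrative autoregressive fit
fit_dlm = (10.0 * 20 ** 0.05, 0.05)   # illustrative diffusion fit, built to give a 20x gap
gap = compute_gap(3.0, fit_dlm, fit_ar)
print(f"compute gap at loss 3.0: {gap:.2f}x")
```

When the two fitted exponents differ, the ratio depends on the target loss, so a single number such as 20× is only meaningful together with the loss range over which it was measured.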
What carries the argument
Aligning the architecture of a continuous diffusion language model with its discrete counterparts, combined with likelihood-based training in which the noise schedule is optimized to minimize ELBO variance, so that cross-entropy (information loss) grows linearly over time.
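To make the "linear cross-entropy over time" property concrete, the sketch below takes a measured, monotone cross-entropy curve over a noise-level grid and picks noise levels at evenly spaced times so that cross-entropy rises by the same amount at every step. This is only an illustration of the target property, not the paper's procedure (which, per the abstract, optimizes the schedule by minimizing ELBO variance). The noise coordinate `gamma`, the function name, and the toy sigmoid curve are all assumptions.

```python
import numpy as np


def linearizing_schedule(gamma_grid, ce_grid, num_steps):
    """Pick noise levels gamma(t) so cross-entropy is linear in t on [0, 1].

    gamma_grid: increasing noise levels (hypothetical coordinate).
    ce_grid: cross-entropy measured at each noise level; must be strictly increasing.
    """
    gamma_grid = np.asarray(gamma_grid, dtype=float)
    ce_grid = np.asarray(ce_grid, dtype=float)
    if np.any(np.diff(ce_grid) <= 0):
        raise ValueError("ce_grid must be strictly increasing")
    t = np.linspace(0.0, 1.0, num_steps)
    # Evenly spaced cross-entropy targets between the two endpoints.
    ce_targets = ce_grid[0] + t * (ce_grid[-1] - ce_grid[0])
    # Invert the curve by interpolating gamma as a function of cross-entropy.
    gamma_t = np.interp(ce_targets, ce_grid, gamma_grid)
    return t, gamma_t, ce_targets


# Toy curve (assumption): cross-entropy saturates at both ends of the noise range.
gamma_grid = np.linspace(-6.0, 6.0, 241)
ce_grid = 5.0 / (1.0 + np.exp(-gamma_grid))
t, gamma_t, ce_t = linearizing_schedule(gamma_grid, ce_grid, 11)
```

With the toy curve, the selected noise levels cluster where cross-entropy changes fastest, which is the sense in which denoising difficulty is spread evenly across time steps.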
If this is right
- Continuous DLMs become viable at scale with limited extra compute cost.
- Performance advantages appear in over-trained settings compared to some discrete models.
- New state-of-the-art perplexity achieved for continuous diffusion on standard benchmarks.
- Likelihood training distributes denoising difficulty evenly without custom time adjustments.
Where Pith is reading between the lines
- If continuous diffusion scales well, models may gain flexibility in generating text by operating in continuous space rather than discrete tokens.
- The linear cross-entropy from optimized schedules could simplify training procedures in other diffusion models.
- Structured embeddings might lead to improved performance in tasks requiring semantic understanding.
Load-bearing premise
Aligning the architecture of Plaid with modern discrete DLMs fairly isolates the continuous versus discrete difference without confounding effects from training or tuning differences.
What would settle it
Reproducing the OpenWebText experiments and finding RePlaid's perplexity bound significantly above 22.1 or the compute gap exceeding 20x with matched hyperparameter tuning.
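For anyone attempting that reproduction, the conversion between a per-token negative log-likelihood bound and a perplexity bound can be sketched as follows. Because the ELBO upper-bounds the negative log-likelihood, exponentiating it gives an upper bound on perplexity. The function names are hypothetical, and the sketch assumes the bound is summed in nats over tokens under the same tokenizer as the reported number.

```python
import math


def perplexity_from_nll(total_nll_nats, num_tokens):
    """Perplexity (or its upper bound) from a summed NLL bound in nats."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll_nats / num_tokens)


def nll_per_token_from_perplexity(ppl):
    """Per-token NLL in nats that corresponds to a given perplexity."""
    if ppl <= 0:
        raise ValueError("ppl must be positive")
    return math.log(ppl)


# Round trip on the reported bound of 22.1 over a hypothetical 1000 tokens.
nats = nll_per_token_from_perplexity(22.1)
ppl = perplexity_from_nll(nats * 1000, 1000)
print(f"{nats:.4f} nats/token <-> PPL {ppl:.1f}")
```

A reproduction that reports bits per token would need to multiply by ln 2 before applying the first function.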
Original abstract
While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits the Plaid continuous diffusion language model and constructs RePlaid by aligning its architecture with modern discrete DLMs. It reports the first scaling law for continuous DLMs, claiming a compute gap of only 20× relative to autoregressive models, outperformance of Duo (with fewer parameters) and MDLM (in the over-trained regime), and a new state-of-the-art PPL bound of 22.1 among continuous DLMs on OpenWebText, along with superior generation quality. The authors also provide theoretical insights showing that likelihood-based noise schedule optimization minimizes ELBO variance to produce linear cross-entropy over time, and that likelihood-trained embeddings induce structured geometries that drive likelihood gains.
Significance. If the empirical scaling results and isolation of the continuous formulation hold after addressing comparison details, the work would be significant for demonstrating that continuous diffusion can scale competitively with discrete approaches for language modeling. This challenges prevailing views on scalability, provides the first explicit scaling law for continuous DLMs, and offers theoretical grounding for likelihood training advantages, potentially broadening research into continuous diffusion models as practical alternatives.
major comments (2)
- §4 (Scaling Experiments): The central claim of a 20× compute gap and fair isolation of continuous vs. discrete effects via architectural alignment requires explicit reporting of matched training budgets, data splits, hyperparameter grids, and optimization trajectories for RePlaid versus Duo, MDLM, and autoregressive baselines; without these, residual differences in noise schedule optimization or embedding geometry could confound attribution to the continuous formulation.
- Table 1 / Figure 3 (PPL and scaling curves): The reported SOTA PPL bound of 22.1 and outperformance in the over-trained regime should include per-model training FLOPs or step counts alongside the curves to substantiate competitiveness; the current presentation leaves open whether gains stem from the continuous likelihood objective or unstated tuning advantages.
minor comments (3)
- The abstract and §3 would benefit from a brief explicit statement of the exact likelihood objective used for RePlaid to distinguish it from prior continuous DLMs.
- Figure legends for scaling plots should clarify axis scaling (e.g., compute in FLOPs vs. tokens) and include error bars or multiple seeds for the reported curves.
- Add a short reference or citation to the original Plaid work when describing the base architecture in §2 to aid readers unfamiliar with the lineage.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our scaling results and experimental controls. We address each major comment below and will incorporate the requested details into the revised manuscript.
Point-by-point responses
- Referee: §4 (Scaling Experiments): The central claim of a 20× compute gap and fair isolation of continuous vs. discrete effects via architectural alignment requires explicit reporting of matched training budgets, data splits, hyperparameter grids, and optimization trajectories for RePlaid versus Duo, MDLM, and autoregressive baselines; without these, residual differences in noise schedule optimization or embedding geometry could confound attribution to the continuous formulation.
  Authors: We agree that greater transparency on experimental controls is valuable. In the revision we will add an expanded experimental setup section and a supplementary table that lists training budgets (FLOPs and steps), data splits, hyperparameter grids, and optimizer settings for RePlaid alongside the corresponding values reported for Duo, MDLM, and the autoregressive baselines. Our architectural alignment fixes the transformer backbone, context length, and embedding dimension across models; noise schedules for all diffusion models were optimized under the same likelihood objective. We will explicitly note any unavoidable differences arising from model-specific implementations while arguing that these do not undermine the isolation of the continuous formulation.
  Revision: yes
- Referee: Table 1 / Figure 3 (PPL and scaling curves): The reported SOTA PPL bound of 22.1 and outperformance in the over-trained regime should include per-model training FLOPs or step counts alongside the curves to substantiate competitiveness; the current presentation leaves open whether gains stem from the continuous likelihood objective or unstated tuning advantages.
  Authors: We accept this point. The revised Table 1 and Figure 3 will include per-model training FLOPs (or equivalent step counts normalized by batch size and sequence length) for every reported result. This addition will allow readers to verify that RePlaid’s PPL of 22.1 and its scaling behavior are obtained under compute budgets comparable to or lower than the cited discrete baselines, supporting attribution to the continuous likelihood training rather than hidden tuning advantages.
  Revision: yes
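The normalization the authors describe can be sketched as below: step counts become token counts via batch size and sequence length, and token counts become an approximate FLOP budget via the widely used 6 × parameters × tokens rule of thumb for dense transformers. That rule ignores attention cost and is an assumption here; the paper may count FLOPs differently. All function names and numeric values are hypothetical.

```python
def training_tokens(steps, batch_size, seq_len):
    """Total tokens processed: optimizer steps x sequences per step x tokens per sequence."""
    return steps * batch_size * seq_len


def approx_training_flops(num_params, steps, batch_size, seq_len):
    """Rough training FLOPs using the 6 * N * D approximation (assumption).

    N is the parameter count and D the number of training tokens; the factor 6
    covers the forward and backward passes of a dense transformer.
    """
    return 6 * num_params * training_tokens(steps, batch_size, seq_len)


# Hypothetical run: 100M parameters, 1000 steps, batch 512, context 1024.
flops = approx_training_flops(100_000_000, 1000, 512, 1024)
print(f"approx. training FLOPs: {flops:.3e}")
```

Reporting this quantity next to each curve would let two runs with different batch sizes or context lengths be compared on one axis.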
Circularity Check
No significant circularity: the scaling laws are empirically benchmarked and the theoretical insights are derived from the ELBO, without reducing to their own inputs.
Full rationale
The paper's central results consist of empirical scaling comparisons on external benchmarks (OpenWebText, Duo, MDLM) and a derivation showing that noise-schedule optimization minimizing ELBO variance produces linear cross-entropy. These steps do not reduce by construction to fitted parameters, self-citations, or renamed inputs. The architecture alignment is presented as a methodological choice for fair comparison rather than a definitional equivalence. No load-bearing claim collapses to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- noise schedule parameters
axioms (1)
- standard math: The evidence lower bound (ELBO) is a valid surrogate for the true likelihood in continuous diffusion training.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: RePlaid exhibits a compute gap of only 20× compared to autoregressive models... new state-of-the-art PPL bound of 22.1
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Block diffusion: Interpolating between autoregressive and diffusion language models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x
work page 2025
-
[3]
Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In M. Ranzato, 11 A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 17981–17993. Curran Associates, Inc., 2021. URL ...
work page 2021
-
[4]
Dirichlet dif- fusion score model for biological sequence generation
Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet dif- fusion score model for biological sequence generation. In Andreas Krause, Emma Brun- skill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 ofPro- ceedings of...
work page 2023
-
[5]
Importance Weighted Autoencoders
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[6]
A continuous time framework for discrete denoising models
Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligianni- dis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 28266–28279. Curran Associates, Inc., 202...
work page 2022
-
[7]
Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st Internat...
work page 2024
-
[8]
One billion word benchmark for measuring progress in statistical language mod- eling, 2014
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language mod- eling, 2014. URL https://huggingface.co/datasets/billion-word-benchmark/ lm1b
work page 2014
-
[9]
Analog bits: Generating discrete data using diffusion models with self-conditioning
Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=3itjR9QxFw
work page 2023
-
[10]
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. LangFlow: Continuous diffusion rivals discrete in language modeling.arXiv preprint arXiv:2604.11748, 2026. URL https://arxiv.org/abs/2604.11748. Accessed: May 1, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
Categorical flow matching on statistical manifolds
Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,
-
[12]
URLhttps://openreview.net/forum?id=5fybcQZ0g4
-
[13]
Diffusion posterior sampling for general noisy inverse problems
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. InThe Eleventh Inter- national Conference on Learning Representations, 2023. URLhttps://openreview.net/ forum?id=OnD9zGAGT0k
work page 2023
-
[14]
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context.arXiv preprint arXiv:1901.02860, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[15]
Oscar Davis, Samuel Kessler, Mircea Petrache, Ismail Ilkan Ceylan, Michael M. Bronstein, and Joey Bose. Fisher flow matching for generative modeling over discrete data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=6jOScqwdHU. 12
work page 2024
-
[16]
Continuous diffusion for categorical data
Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continu- ous diffusion for categorical data.arXiv preprint arXiv:2211.15089, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id= GTDKo3Sv9p
work page 2024
-
[18]
Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. https://huggingface.co/datasets/Skylion007/openwebtext, 2019
work page 2019
-
[19]
Likelihood-based diffusion language models
Ishaan Gulrajani and Tatsunori Hashimoto. Likelihood-based diffusion language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https: //openreview.net/forum?id=e2MCL6hObn
work page 2023
-
[20]
Mutual information and MMSE in gaussian channels
Dongning Guo, Shlomo Shamai, and Sergio Verdu. Mutual information and MMSE in gaussian channels. InInternational Symposium onInformation Theory, 2004. ISIT 2004. Proceedings., pages 349–349, 2004. doi: 10.1109/ISIT.2004.1365386
-
[21]
Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex- based diffusion language model for text generation and modular control. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596, To...
-
[22]
DiffusionBERT: Improving generative masked language models with diffusion models
Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. DiffusionBERT: Improving generative masked language models with diffusion models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4521–4...
work page 2023
-
[23]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
spacy: Industrial- strength natural language processing in python
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spacy: Industrial- strength natural language processing in python. 2020
work page 2020
-
[25]
Argmax flows and multinomial diffusion: Learning categorical distributions
Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. URLhttps://openreview.net/forum?id=6nbpPqUCIi7
work page 2021
-
[26]
M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Lapla- cian smoothing splines.Communications in Statistics - Simulation and Computation, 18 (3):1059–1076, 1989. doi: 10.1080/03610918908812806. URL https://doi.org/10.1080/ 03610918908812806
-
[27]
Continuous diffusion model for language modeling
Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=VGv5y60sXC
work page 2025
-
[28]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7
work page 2022
-
[29]
Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021. URL https: //openreview.net/forum?id=2LdBqxc1Yv. 13
work page 2021
-
[30]
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising.arXiv preprint arXiv:2602.16813, 2026. Accessed: May 1, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
Diffusion-LM improves controllable text generation
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion-LM improves controllable text generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,
-
[32]
URLhttps://openreview.net/forum?id=3s9IrEsjLyk
-
[33]
Discrete diffusion modeling by estimating the ratios of the data distribution
Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InForty-first International Conference on Machine Learning,
-
[34]
URLhttps://openreview.net/forum?id=CNicRIVIPA
-
[35]
Latent diffusion for language generation
Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Seo Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=NKdtztladR
work page 2023
-
[36]
DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=2uAaGwlP_V
work page 2022
-
[37]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, 2025
work page 2025
-
[38]
Concrete score matching: Generalized score matching for discrete data
Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,
-
[39]
URLhttps://openreview.net/forum?id=_RL7wtHkPJK
-
[40]
SDEdit: Guided image synthesis and editing with stochastic differential equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=aBsCjcPu_tE
work page 2022
-
[41]
Cosmos: Compressed and smooth latent space for text diffusion modeling
Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, and Dmitry Vetrov. Cosmos: Compressed and smooth latent space for text diffusion modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Rv6Lz84FlZ
work page 2025
-
[42]
Scaling up masked diffusion models on text
Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=WNvvwK0tut
work page 2025
-
[43]
Large language diffusion models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025. URL https: //openreview.net/forum?id=KnqiC0znVF
work page 2025
-
[44]
Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference. InInternational Conference on Learning Representations,
-
[45]
URLhttps://openreview.net/forum?id=HyZoi-WRb
-
[46]
Your absorbing discrete diffusion secretly models the conditional distributions of clean data
Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sMyXP8Tanm
work page 2025
-
[47]
Sample4Geo : Hard negative sampling for cross-view geo-localisation
William Peebles and Saining Xie. Scalable diffusion models with transformers. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023. doi: 10.1109/ICCV51070.2023.00387. 14
-
[48]
MAUVE: Measuring the gap between neural text and human text using divergence frontiers
Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. URL https: //openreview...
work page 2021
-
[49]
Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S Albergo. Discrete flow maps.arXiv preprint arXiv:2604.09784, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[50]
Candi: Hybrid discrete-continuous diffusion models.arXiv preprint arXiv:2510.22510, 2025
Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models.arXiv preprint arXiv:2510.22510, 2025
-
[51]
Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
work page 2019
-
[52]
Categorical flow maps.arXiv preprint arXiv:2602.12233, 2026
Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, ˙Ismail ˙Ilkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps.arXiv preprint arXiv:2602.12233, 2026
-
[53]
Simple and effective masked diffusion language models
Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Mariano Marro- quin, Justin T Chiu, Alexander M Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=L4uaAR4ArM
work page 2024
-
[54]
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and V olodymyr Kuleshov. The diffusion duality. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=9P9Y8FOSOk
work page 2025
-
[55]
Esoteric language models.arXiv preprint arXiv:2506.01928, 2025
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, and Arash Vahdat. Esoteric language models.arXiv preprint arXiv:2506.01928, 2025
-
[56]
Scaling beyond masked diffusion language models.arXiv preprint arXiv:2602.15014, 2026
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models.arXiv preprint arXiv:2602.15014, 2026
-
[57]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=TIdIXIpzhoI
work page 2022
-
[58]
Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws
Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=0bmXrtTDUu
work page 2024
-
[59]
Simple guidance mechanisms for discrete diffusion models
Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla- torre, Bernardo P de Almeida, Alexander M Rush, Thomas Pierrot, and V olodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=i5MrJ6g5G1
work page 2025
-
[60]
Simplified and generalized masked diffusion for discrete data
Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id= xcqSOfHt4g
work page 2024
-
[61]
SlimPajama: A 627B token cleaned and dedupli- cated version of RedPajama, 2023
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hes- tness, and Nolan Dey. SlimPajama: A 627B token cleaned and dedupli- cated version of RedPajama, 2023. URL https://www.cerebras.ai/blog/ slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
work page 2023
-
[62]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=St1giarCHLP. 15
work page 2021
-
[63]
Maximum likelihood training of score-based diffusion models
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=AklttWFnxS9
work page 2021
-
[64]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2...
work page 2023
-
[65]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[66]
Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022
Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022
-
[67]
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2023.127063. URL https://www. sciencedirect.com/science/article/pii/S0925231223011864
-
[68]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
-
[69]
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html
-
[70]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL h...
-
[71]
Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Antoine Bosselut, Asli Celikyilmaz, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin, and Thomas Wolf, editors, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–...
-
[72]
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Associati...
-
[73]
xiaoju ye. calflops: a flops and params calculate tool for neural networks in pytorch framework,
-
[74]
URL https://github.com/MrYxJ/calculate-flops.pytorch
-
[75]
Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pa...
-
[76]
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025
-
[77]
Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P
-
[78]
Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In The Thirteenth International Conference on Learning Representations,
-
[79]
URL https://openreview.net/forum?id=CTC7CmirNr.
Contents
A Related Works
B Training Algorithm
C Derivation of the sequence-level NELBO for Plaid
D Sampler Update Formulas
E ODE-based Likelihood Estimation
F Constant Per-Timestep Diffusion Loss
G Linear Information Decay Under Optimality
H Per-Timestep CE Under Optimality
I Learni...
-
[80]
Tag. Run spaCy’s POS tagger on the decoded text to obtain word-level tags and their character offsets