
arxiv: 2605.18530 · v1 · pith:IC42QEPGnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG · stat.ML

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

Pith reviewed 2026-05-20 11:01 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI · cs.LG · stat.ML
keywords: continuous diffusion · diffusion language models · language modeling · scaling laws · likelihood training · perplexity · OpenWebText · noise schedule

The pith

RePlaid shows that continuous diffusion language models scale competitively with discrete ones, narrowing the gap to autoregressive models to a 20× compute difference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper challenges the notion that continuous diffusion is less scalable for language modeling than discrete methods by updating the Plaid model into RePlaid. The key step is aligning its architecture with current discrete diffusion language models while keeping the continuous diffusion process. In this setup, RePlaid demonstrates strong scaling behavior, a small compute overhead relative to autoregressive baselines, and better results than several other continuous and discrete models in specific regimes. Readers might care because it opens the possibility that continuous-valued diffusion could become a practical route for large language models, potentially with advantages in sampling or representation. The authors also provide theory showing that likelihood training leads to a linear loss of information over time, which evenly spreads the denoising task, and that it organizes embeddings into useful structures.

Core claim

RePlaid exhibits a compute gap of only 20× compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime while achieving a new state-of-the-art PPL bound of 22.1 among continuous DLMs on OpenWebText. This is enabled by aligning Plaid's architecture with modern discrete DLMs and using likelihood-based training, which optimizes the noise schedule to yield linear cross-entropy over time and creates structured geometries in embeddings.
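To make the 20× figure concrete, the following is a minimal sketch of how a compute gap can be read off two fitted scaling curves. It assumes a pure power law L(C) = a·C^(-b) with no irreducible-loss term, and the data are synthetic, constructed so that the gap is exactly 20×. The function names are illustrative and are not taken from the paper's code.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(intercept), -slope  # (a, b)


def compute_multiplier(fit_ref, fit_model, c_ref):
    """Compute the model needs, relative to c_ref, to match the reference loss."""
    a_ref, b_ref = fit_ref
    a_model, b_model = fit_model
    target_loss = a_ref * c_ref ** (-b_ref)
    # Invert the model's power law: a_model * C^(-b_model) = target_loss.
    c_model = (a_model / target_loss) ** (1.0 / b_model)
    return c_model / c_ref


# Synthetic compute-optimal losses (assumption: not the paper's measurements).
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
exponent = 0.05
loss_ar = 10.0 * compute ** (-exponent)
# Same curve shifted right by a factor of 20 in compute.
loss_dlm = loss_ar * 20.0 ** exponent

fit_ar = fit_power_law(compute, loss_ar)
fit_dlm = fit_power_law(compute, loss_dlm)
gap = compute_multiplier(fit_ar, fit_dlm, c_ref=1e19)
```

When the two fitted exponents are equal, the multiplier does not depend on `c_ref`; when they differ, the gap changes with the compute budget at which it is measured.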

What carries the argument

Aligning the architecture of a continuous diffusion language model with its discrete counterparts, combined with likelihood-based training whose learned noise schedule minimizes ELBO variance and thereby yields a linear loss of information over time.

If this is right

  • Continuous DLMs become viable at scale with limited extra compute cost.
  • Performance advantages appear in over-trained settings compared to some discrete models.
  • A new state-of-the-art perplexity bound (22.1 on OpenWebText) is achieved among continuous diffusion language models.
  • Likelihood training distributes denoising difficulty evenly without custom time adjustments.
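For contrast, here is a minimal sketch of the kind of hand-made time adjustment that the last point says is unnecessary. Given a monotone cross-entropy curve measured on a time grid under some fixed schedule, it builds a warp t(u) so that the cross-entropy becomes linear in u. The cubic curve and the function name are hypothetical; the paper reports that a learned noise schedule reaches near-linearity without such a step.

```python
import numpy as np


def linearizing_time_warp(t_grid, ce_grid, num_points=11):
    """Return times t(u) at which a monotone CE curve is linear in u."""
    # Normalize the curve to [0, 1] so it can be inverted by interpolation.
    ce_norm = (ce_grid - ce_grid[0]) / (ce_grid[-1] - ce_grid[0])
    u = np.linspace(0.0, 1.0, num_points)
    # Swapping the roles of x and y inverts the monotone curve.
    return np.interp(u, ce_norm, t_grid)


# Hypothetical convex CE curve: most information is lost late in time.
t = np.linspace(0.0, 1.0, 1001)
ce = 5.0 * t ** 3

warped_t = linearizing_time_warp(t, ce)
# Evaluating the same curve at the warped times gives equal CE increments.
ce_at_warped = 5.0 * warped_t ** 3
```

Equal increments between consecutive entries of `ce_at_warped` are what "evenly distributed denoising difficulty" means in this toy setting.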

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If continuous diffusion scales well, models may gain flexibility in generating text by operating in continuous space rather than discrete tokens.
  • The linear cross-entropy from optimized schedules could simplify training procedures in other diffusion models.
  • Structured embeddings might lead to improved performance in tasks requiring semantic understanding.

Load-bearing premise

Aligning the architecture of Plaid with modern discrete DLMs fairly isolates the continuous versus discrete difference without confounding effects from training or tuning differences.

What would settle it

Reproducing the OpenWebText experiments and finding RePlaid's perplexity bound significantly above 22.1 or the compute gap exceeding 20x with matched hyperparameter tuning.

Figures

Figures reproduced from arXiv: 2605.18530 by Arash Vahdat, John Thickstun, Morteza Mardani, Shuibai Zhang, Subham Sekhar Sahoo, Wei Guo, Yongxin Chen, Zhihan Yang.

Figure 1: (a-b) IsoFLOP curves identify optimal model sizes (black crosses) across fixed compute budgets. The optimal REPLAID (s.c.) loss exhibits power-law scaling, decreasing at a rate comparable to MDLM. To match AR loss, MDLM and REPLAID (s.c.) require 14× and 20× the compute, respectively. In the over-trained regime below the green line, REPLAID (s.c.) consistently outperforms MDLM (Sec. 3.4). (c) The effect of…
Figure 2: Scaling laws. (a) The compute-optimal REPLAID loss exhibits power-law scaling, decreasing at a rate comparable to AR, MDLM, and Duo. MDLM requires 14× FLOPs to match AR; Duo needs 22×; REPLAID consumes 20× with self-conditioning (s.c.) and 27× without it. (b) The compute-optimal REPLAID (s.c.) uses 1.8× fewer parameters than MDLM and Duo, while outperforming Duo's loss in (a), and uses 3.4× fewer than AR…
Figure 3: (a-b) GenPPL and MAUVE on OWT samples (L = 1024, N = 5120). No temperature is used. All models are trained for 1M steps on OWT. (c) GenPPL-entropy frontier as T increases (darker color: higher T). We observe that Duo and FLM have worse entropy than RePlaid (no s.c.) at comparable GenPPL levels.
Figure 4: MAUVE on OWT samples (L = 1024, N = 5120) versus sampling steps T, comparing the ancestral DDPM sampler, DDIM, DPM-Solver++(2M), and Heun on REPLAID, with LangFlow and Duo plotted as baselines. … steps (as with LangFlow) but further improves upon RePlaid (no s.c.) for T ≥ 64, outperforming discrete DLMs and other continuous DLMs with self-conditioning at high sampling steps.
Figure 5: Visualizing geometry of learned embeddings of REPLAID (s.c.) (OWT PPL: 22.1 at 1M). (a) 2D t-SNE plot with each subword colored by its most frequent POS tag. (b) PCA scree plot of E. (c) PCA scree plot of E when an auxiliary CE loss is added (Sec. 5.1), dispersing the embeddings (OWT PPL: 26.1 at 1M). … which disrupts the low-rank embedding structure and hurts the PPL; (ii) Making the noise schedule a learna…
Figure 6: Visualizing per-timestep diffusion loss, CE loss, and decoding error for REPLAID (s.c.) (OWT PPL: 24.9), an ablation that learns noise schedule shape but freezes embeddings (OWT PPL: 45.1), and an ablation that learns embeddings but freezes the noise schedule (OWT PPL: 28.0). Models are 250K and the two losses are length-normalized. Empirically, whenever the noise schedule is learned, the per-timestep diff…
Figure 7: IWAE K-curve of (23) on a fixed 1024-sequence OWT-valid subset (solid lines), with matched VDM NELBOs of (4) drawn as dashed references. Dotted curves are the leading-order a + b/K fits on K ≥ 4 with extrapolated asymptotes PPL∞ in the legend; the visible deviation at K ≤ 2 reflects O(1/K²) corrections to the Θ(1/K) IWAE bias of Nowozin [40].
Figure 8: The noise schedule endpoints (γ0, γ1) and the interior shape of γ(t) are separately parameterized; they are trained to minimize the diffusion loss and its estimator variance respectively.
Figure 9: IsoFLOP curves plot optimal model sizes under fixed compute budgets. The optimal REPLAID loss exhibits power-law scaling, decreasing at a rate comparable to AR and MDLM. MDLM (low var.), Duo, REPLAID (s.c.), and REPLAID (no s.c.) exhibit 14×, 22×, 20×, and 27× worse scaling than AR respectively. In the over-trained region below the green line, REPLAID (s.c.) beats MDLM (low var.).
Figure 10: Exchanging the embedding configurations of MDLM and RePlaid (s.c.) degrades both methods. For comparison, we unify the vertical range of all methods and do not show points outside of this range (e.g., (d)).
Figure 11: Quality-diversity trade-off of REPLAID against discrete and continuous DLM baselines for unconditional generation on OWT. Markers denote τ = 1. We use DFM numbers reported in Potaptchik et al. [44] (T = 512 not available). REPLAID is competitive with discrete DLMs and surpasses prior continuous DLMs.
Figure 12: Quality-diversity trade-off of REPLAID against LangFlow for unconditional generation on OWT. Markers denote τ = 1. High values of τ lead to degenerate samples for both LangFlow and REPLAID with self-conditioning on; parts of the curves corresponding to this degenerate behavior are omitted.
Figure 13: Comparing PCA scree plots for REPLAID (s.c.) and LangFlow, both using self-conditioning, d_e = 768, and length-normalized embeddings. REPLAID (s.c.) yields a lower-rank embedding geometry while achieving a better PPL bound.
Figure 14: PCA scree plot for MDLM (low var.) (d_e = 768 by default) trained on OWT for 1M steps.
Figure 15: Learning the noise schedule leads to a near-linear per-timestep cross-entropy loss regardless of the embedding geometry.
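The leading-order a + b/K extrapolation mentioned in the Figure 7 caption can be sketched as an ordinary least-squares fit. This is an illustration under stated assumptions, not the paper's procedure: it assumes the fitted quantity is a per-token bound in nats (so the asymptotic perplexity is exp(a)), and the data are synthetic with a small 1/K² term added to mimic the higher-order correction. The function name is hypothetical.

```python
import numpy as np


def fit_iwae_asymptote(ks, bounds, k_min=4):
    """Fit bounds ~ a + b / K on K >= k_min; a is the extrapolated asymptote."""
    ks = np.asarray(ks, dtype=float)
    bounds = np.asarray(bounds, dtype=float)
    mask = ks >= k_min
    # Design matrix with columns [1, 1/K] for linear least squares.
    design = np.column_stack([np.ones(mask.sum()), 1.0 / ks[mask]])
    (a, b), *_ = np.linalg.lstsq(design, bounds[mask], rcond=None)
    return a, b


# Synthetic per-token bounds in nats (assumption: not values from the paper).
ks = np.array([1, 2, 4, 8, 16, 32, 64])
bounds = 3.10 + 0.08 / ks + 0.05 / ks ** 2

a, b = fit_iwae_asymptote(ks, bounds)
ppl_inf = np.exp(a)  # extrapolated perplexity as K grows without bound
```

Restricting the fit to K ≥ 4 keeps the 1/K² term small enough that the two-parameter model recovers the asymptote closely, which is consistent with the caption's remark about deviations at K ≤ 2.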
Original abstract

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
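The "structured geometries" claim is examined in the figures through PCA scree plots of the embedding matrix. A minimal sketch of how such a scree curve can be computed is given below. The embedding matrix here is random and built to be approximately low-rank, purely for illustration; the sizes, the rank, and the function name are assumptions and do not come from the paper.

```python
import numpy as np


def explained_variance_ratio(embeddings):
    """Per-component explained variance ratio of a (vocab, dim) matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional
    # to the variances along the principal components.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(0)
vocab_size, d_e, rank = 2000, 64, 8

# Hypothetical embeddings: a rank-8 signal plus a small isotropic perturbation.
E = rng.normal(size=(vocab_size, rank)) @ rng.normal(size=(rank, d_e))
E += 0.01 * rng.normal(size=(vocab_size, d_e))

ratio = explained_variance_ratio(E)
cumulative = np.cumsum(ratio)  # the cumulative curve shown in a scree plot
```

A lower-rank geometry shows up as a cumulative curve that saturates after only a few components, which is the pattern the captions describe for RePlaid (s.c.).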

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper revisits the Plaid continuous diffusion language model and constructs RePlaid by aligning its architecture with modern discrete DLMs. It reports the first scaling law for continuous DLMs, claiming a compute gap of only 20× relative to autoregressive models, outperformance of Duo (with fewer parameters) and MDLM (in the over-trained regime), and a new state-of-the-art PPL bound of 22.1 among continuous DLMs on OpenWebText, along with superior generation quality. The authors also provide theoretical insights showing that likelihood-based noise schedule optimization minimizes ELBO variance to produce linear cross-entropy over time, and that likelihood-trained embeddings induce structured geometries that drive likelihood gains.

Significance. If the empirical scaling results and isolation of the continuous formulation hold after addressing comparison details, the work would be significant for demonstrating that continuous diffusion can scale competitively with discrete approaches for language modeling. This challenges prevailing views on scalability, provides the first explicit scaling law for continuous DLMs, and offers theoretical grounding for likelihood training advantages, potentially broadening research into continuous diffusion models as practical alternatives.

major comments (2)
  1. [§4] §4 (Scaling Experiments): The central claim of a 20× compute gap and fair isolation of continuous vs. discrete effects via architectural alignment requires explicit reporting of matched training budgets, data splits, hyperparameter grids, and optimization trajectories for RePlaid versus Duo, MDLM, and autoregressive baselines; without these, residual differences in noise schedule optimization or embedding geometry could confound attribution to the continuous formulation.
  2. [Table 1] Table 1 / Figure 3 (PPL and scaling curves): The reported SOTA PPL bound of 22.1 and outperformance in the over-trained regime should include per-model training FLOPs or step counts alongside the curves to substantiate competitiveness; current presentation leaves open whether gains stem from the continuous likelihood objective or unstated tuning advantages.
minor comments (3)
  1. [Abstract] The abstract and §3 would benefit from a brief explicit statement of the exact likelihood objective used for RePlaid to distinguish it from prior continuous DLMs.
  2. Figure legends for scaling plots should clarify axis scaling (e.g., compute in FLOPs vs. tokens) and include error bars or multiple seeds for the reported curves.
  3. [§2] Add a short reference or citation to the original Plaid work when describing the base architecture in §2 to aid readers unfamiliar with the lineage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our scaling results and experimental controls. We address each major comment below and will incorporate the requested details into the revised manuscript.

point-by-point responses
  1. Referee: [§4] §4 (Scaling Experiments): The central claim of a 20× compute gap and fair isolation of continuous vs. discrete effects via architectural alignment requires explicit reporting of matched training budgets, data splits, hyperparameter grids, and optimization trajectories for RePlaid versus Duo, MDLM, and autoregressive baselines; without these, residual differences in noise schedule optimization or embedding geometry could confound attribution to the continuous formulation.

    Authors: We agree that greater transparency on experimental controls is valuable. In the revision we will add an expanded experimental setup section and a supplementary table that lists training budgets (FLOPs and steps), data splits, hyperparameter grids, and optimizer settings for RePlaid alongside the corresponding values reported for Duo, MDLM, and the autoregressive baselines. Our architectural alignment fixes the transformer backbone, context length, and embedding dimension across models; noise schedules for all diffusion models were optimized under the same likelihood objective. We will explicitly note any unavoidable differences arising from model-specific implementations while arguing that these do not undermine the isolation of the continuous formulation. revision: yes

  2. Referee: [Table 1] Table 1 / Figure 3 (PPL and scaling curves): The reported SOTA PPL bound of 22.1 and outperformance in the over-trained regime should include per-model training FLOPs or step counts alongside the curves to substantiate competitiveness; current presentation leaves open whether gains stem from the continuous likelihood objective or unstated tuning advantages.

    Authors: We accept this point. The revised Table 1 and Figure 3 will include per-model training FLOPs (or equivalent step counts normalized by batch size and sequence length) for every reported result. This addition will allow readers to verify that RePlaid’s PPL of 22.1 and its scaling behavior are obtained under compute budgets comparable to or lower than the cited discrete baselines, supporting attribution to the continuous likelihood training rather than hidden tuning advantages. revision: yes
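As an illustration of the normalization described in the second response, the sketch below converts step counts into tokens and then into approximate training FLOPs. It uses the common 6 × parameters × tokens rule of thumb for dense transformers, which is an assumption here: the paper may count FLOPs differently, and the estimate ignores any extra forward passes such as those used for self-conditioning. The batch size and parameter count are hypothetical.

```python
def training_tokens(steps, batch_size, seq_len):
    """Total tokens processed during training."""
    return steps * batch_size * seq_len


def approx_training_flops(n_params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    return 6 * n_params * tokens


# 1M steps and L = 1024 appear in the figure captions; the batch size and
# parameter count below are hypothetical placeholders.
tokens = training_tokens(steps=1_000_000, batch_size=512, seq_len=1024)
flops = approx_training_flops(n_params=170_000_000, tokens=tokens)
```

Reporting both `tokens` and `flops` per model would let a reader check whether two curves are being compared at matched budgets.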

Circularity Check

0 steps flagged

No significant circularity; scaling laws and insights are empirically benchmarked and derived from ELBO without reduction to inputs

full rationale

The paper's central results consist of empirical scaling comparisons on external benchmarks (OpenWebText, Duo, MDLM) and a derivation showing that noise-schedule optimization minimizing ELBO variance produces linear cross-entropy. These steps do not reduce by construction to fitted parameters, self-citations, or renamed inputs. The architecture alignment is presented as a methodological choice for fair comparison rather than a definitional equivalence. No load-bearing claim collapses to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard diffusion-model assumptions (ELBO as training objective, Gaussian noise process) and empirical scaling practices; no new invented entities are introduced.

free parameters (1)
  • noise schedule parameters
    Optimized to minimize ELBO variance; treated as learned or tuned quantities rather than fixed by prior theory.
axioms (1)
  • [standard math] The evidence lower bound (ELBO) is a valid surrogate for the true log-likelihood in continuous diffusion training.
    Invoked when deriving the training objective and when analyzing variance minimization.
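For reference, the standard argument behind this axiom follows from Jensen's inequality, and it also explains why the reported perplexity is a bound. The notation below is generic rather than the paper's: z stands for the latent noised variables, q for the forward (noising) distribution, and L for the sequence length in tokens.

```latex
\log p_\theta(x)
  = \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q(z \mid x)}\right]
  \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right]
  = \mathrm{ELBO}(x),
\qquad
\mathrm{PPL}(x) = \exp\!\left(-\tfrac{1}{L}\log p_\theta(x)\right)
  \;\le\; \exp\!\left(-\tfrac{1}{L}\,\mathrm{ELBO}(x)\right).
```

Because the ELBO lower-bounds the log-likelihood, the perplexity computed from it is an upper bound on the true perplexity, which is why figures such as 22.1 are described as PPL bounds.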

pith-pipeline@v0.9.0 · 5799 in / 1362 out tokens · 49205 ms · 2026-05-20T11:01:50.578696+00:00 · methodology



Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 11 internal anchors

  1. [1]

    Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025

  2. [2]

    Block diffusion: Interpolating between autoregressive and diffusion language models

    Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x

  3. [3]

    Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg

    Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In M. Ranzato, 11 A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 17981–17993. Curran Associates, Inc., 2021. URL ...

  4. [4]

    Dirichlet dif- fusion score model for biological sequence generation

    Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet dif- fusion score model for biological sequence generation. In Andreas Krause, Emma Brun- skill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 ofPro- ceedings of...

  5. [5]

    Importance Weighted Autoencoders

    Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015

  6. [6]

    A continuous time framework for discrete denoising models

    Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligianni- dis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 28266–28279. Curran Associates, Inc., 202...

  7. [7]

    Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

    Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Gener- ative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st Internat...

  8. [8]

    One billion word benchmark for measuring progress in statistical language mod- eling, 2014

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language mod- eling, 2014. URL https://huggingface.co/datasets/billion-word-benchmark/ lm1b

  9. [9]

    Analog bits: Generating discrete data using diffusion models with self-conditioning

    Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=3itjR9QxFw

  10. [10]

    LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

    Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. LangFlow: Continuous diffusion rivals discrete in language modeling.arXiv preprint arXiv:2604.11748, 2026. URL https://arxiv.org/abs/2604.11748. Accessed: May 1, 2026

  11. [11]

    Categorical flow matching on statistical manifolds

    Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,

  12. [12]

    URLhttps://openreview.net/forum?id=5fybcQZ0g4

  13. [13]

    Diffusion posterior sampling for general noisy inverse problems

    Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. InThe Eleventh Inter- national Conference on Learning Representations, 2023. URLhttps://openreview.net/ forum?id=OnD9zGAGT0k

  14. [14]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context.arXiv preprint arXiv:1901.02860, 2019

  15. [15]

    Bronstein, and Joey Bose

    Oscar Davis, Samuel Kessler, Mircea Petrache, Ismail Ilkan Ceylan, Michael M. Bronstein, and Joey Bose. Fisher flow matching for generative modeling over discrete data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=6jOScqwdHU. 12

  16. [16]

    Continuous diffusion for categorical data

    Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continu- ous diffusion for categorical data.arXiv preprint arXiv:2211.15089, 2022

  17. [17]

    Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id= GTDKo3Sv9p

  18. [18]

    Openwebtext corpus

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. https://huggingface.co/datasets/Skylion007/openwebtext, 2019

  19. [19]

    Likelihood-based diffusion language models

    Ishaan Gulrajani and Tatsunori Hashimoto. Likelihood-based diffusion language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https: //openreview.net/forum?id=e2MCL6hObn

  20. [20]

    Mutual information and MMSE in gaussian channels

    Dongning Guo, Shlomo Shamai, and Sergio Verdu. Mutual information and MMSE in gaussian channels. InInternational Symposium onInformation Theory, 2004. ISIT 2004. Proceedings., pages 349–349, 2004. doi: 10.1109/ISIT.2004.1365386

  21. [21]

    URL https://proceedings.mlr

    Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex- based diffusion language model for text generation and modular control. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596, To...

  22. [22]

    DiffusionBERT: Improving generative masked language models with diffusion models

    Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. DiffusionBERT: Improving generative masked language models with diffusion models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4521–4...

  23. [23]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022

  24. [24]

    spacy: Industrial- strength natural language processing in python

    Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spacy: Industrial- strength natural language processing in python. 2020

  25. [25]

    Argmax flows and multinomial diffusion: Learning categorical distributions

    Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. URLhttps://openreview.net/forum?id=6nbpPqUCIi7

  26. [26]

    Hutchinson

    M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Lapla- cian smoothing splines.Communications in Statistics - Simulation and Computation, 18 (3):1059–1076, 1989. doi: 10.1080/03610918908812806. URL https://doi.org/10.1080/ 03610918908812806

  27. [27]

    Continuous diffusion model for language modeling

    Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=VGv5y60sXC

  28. [28]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7

  29. [29]

    Variational diffusion models

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021. URL https: //openreview.net/forum?id=2LdBqxc1Yv. 13

  30. [30]

    Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising.arXiv preprint arXiv:2602.16813, 2026. Accessed: May 1, 2026

  31. [31]

    Diffusion-LM improves controllable text generation

    Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion-LM improves controllable text generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,

  32. [32]

    URLhttps://openreview.net/forum?id=3s9IrEsjLyk

  33. [33]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InForty-first International Conference on Machine Learning,

  34. [34]

    URLhttps://openreview.net/forum?id=CNicRIVIPA

  35. [35]

    Latent diffusion for language generation

    Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Seo Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=NKdtztladR

  36. [36]

    DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=2uAaGwlP_V

  37. [37]

    DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, 2025

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, 2025

  38. [38]

    Concrete score matching: Generalized score matching for discrete data

    Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,

  39. [39]

    URLhttps://openreview.net/forum?id=_RL7wtHkPJK

  40. [40]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=aBsCjcPu_tE

  41. [41]

    Cosmos: Compressed and smooth latent space for text diffusion modeling

    Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, and Dmitry Vetrov. Cosmos: Compressed and smooth latent space for text diffusion modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Rv6Lz84FlZ

  42. [42]

    Scaling up masked diffusion models on text

    Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=WNvvwK0tut

  43. [43]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=KnqiC0znVF

  44. [44]

    Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference

    Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference. In International Conference on Learning Representations. URL https://openreview.net/forum?id=HyZoi-WRb

  46. [46]

    Your absorbing discrete diffusion secretly models the conditional distributions of clean data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sMyXP8Tanm

  47. [47]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023. doi: 10.1109/ICCV51070.2023.00387

  48. [48]

    MAUVE: Measuring the gap between neural text and human text using divergence frontiers

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview...

  49. [49]

    Discrete Flow Maps

    Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S Albergo. Discrete flow maps. arXiv preprint arXiv:2604.09784, 2026

  50. [50]

    Candi: Hybrid discrete-continuous diffusion models

    Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025

  51. [51]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  52. [52]

    Categorical flow maps

    Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps. arXiv preprint arXiv:2602.12233, 2026

  53. [53]

    Simple and effective masked diffusion language models

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Mariano Marroquin, Justin T Chiu, Alexander M Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM

  54. [54]

    The diffusion duality

    Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and Volodymyr Kuleshov. The diffusion duality. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=9P9Y8FOSOk

  55. [55]

    Esoteric language models

    Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, and Arash Vahdat. Esoteric language models. arXiv preprint arXiv:2506.01928, 2025

  56. [56]

    Scaling beyond masked diffusion language models

    Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models. arXiv preprint arXiv:2602.15014, 2026

  57. [57]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI

  58. [58]

    Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws

    Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=0bmXrtTDUu

  59. [59]

    Simple guidance mechanisms for discrete diffusion models

    Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander M Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=i5MrJ6g5G1

  60. [60]

    Simplified and generalized masked diffusion for discrete data

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=xcqSOfHt4g

  61. [61]

    SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

    Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama

  62. [62]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

  63. [63]

    Maximum likelihood training of score-based diffusion models

    Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AklttWFnxS9

  64. [64]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2...

  65. [65]

    Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193, 2025

  66. [66]

    Self-conditioned embedding diffusion for text generation

    Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022

  67. [67]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2023.127063. URL https://www.sciencedirect.com/science/article/pii/S0925231223011864

  68. [68]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  69. [69]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html

  70. [70]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL h...

  71. [71]

    BERT has a mouth, and it must speak: BERT as a Markov random field language model

    Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Antoine Bosselut, Asli Celikyilmaz, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin, and Thomas Wolf, editors, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–...

  72. [72]

    Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference

    Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Associati...

  73. [73]

    calflops: a flops and params calculate tool for neural networks in pytorch framework

    xiaoju ye. calflops: a flops and params calculate tool for neural networks in pytorch framework. URL https://github.com/MrYxJ/calculate-flops.pytorch

  75. [75]

    Tuning large neural networks via zero-shot hyperparameter transfer

    Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pa...

  76. [76]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

  77. [77]

    Fast sampling of diffusion models with exponential integrator

    Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P

  78. [78]

    Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

    Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In The Thirteenth International Conference on Learning Representations. URL https://openreview.net/forum?id=CTC7CmirNr

  80. [80]

    Tag. Run spaCy’s POS tagger on the decoded text to obtain word-level tags and their character offsets

Showing first 80 references.