pith. machine review for the scientific record.

arxiv: 2605.10938 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

ELF: Embedded Language Flows

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords diffusion models · flow matching · language modeling · continuous embeddings · discrete tokens · classifier-free guidance · sampling efficiency

The pith

Continuous embedding flows generate higher-quality language with fewer sampling steps than discrete diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion and flow models work well for continuous data such as images, yet diffusion language models have mostly operated in discrete token space, and continuous variants have so far seen limited success. This paper introduces Embedded Language Flows, which keep the process in continuous embedding space for most of the generation and switch to discrete tokens only at the final step using a shared network. This design lets the model borrow methods like classifier-free guidance from image diffusion directly. Experiments indicate that this approach produces better text than leading discrete and continuous diffusion language models while using fewer sampling steps. A sympathetic reader would see this as evidence that continuous formulations can simplify and improve diffusion for language.

Core claim

ELF demonstrates that continuous-time Flow Matching in embedding space, with a final shared-weight mapping to discrete tokens, substantially outperforms existing discrete and continuous diffusion language models in generation quality while requiring fewer sampling steps.

What carries the argument

Embedded Language Flows (ELF): continuous diffusion models based on Flow Matching that remain in embedding space until the last time step, where a shared-weight network produces discrete tokens.
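To make the mechanism concrete, here is a minimal sketch of what ELF-style sampling could look like under common Flow Matching conventions: Euler integration of an ODE in embedding space, with a single switch to decoding at the final step. The `model` interface, its `mode` argument, and the linear noise path are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of ELF-style sampling (assumptions, not the authors' code):
# Euler ODE integration in embedding space, discretizing only at t = 1.
import torch

@torch.no_grad()
def sample_elf(model, seq_len, emb_dim, num_steps=32, device="cpu"):
    z = torch.randn(1, seq_len, emb_dim, device=device)   # z_0 ~ N(0, I)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        x_hat = model(z, t, mode="denoise")                # predicted clean embeddings
        v = (x_hat - z) / (1.0 - t).clamp(min=1e-3)        # velocity under an assumed linear path
        z = z + (t_next - t) * v                           # Euler step toward t = 1
    logits = model(z, ts[-1], mode="decode")               # shared-weight decoding at t = 1
    return logits.argmax(dim=-1)                           # discrete tokens only at the end
```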

If this is right

  • ELF achieves better generation quality than leading discrete and continuous DLMs.
  • It requires fewer sampling steps to reach that quality.
  • Techniques such as classifier-free guidance from continuous domains apply straightforwardly to language.
  • This points to continuous DLMs as a viable direction for language generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar embedding-space strategies might improve diffusion models for other discrete sequences like code or music.
  • The reduced step count could lower inference costs in practical language applications.

Load-bearing premise

The assumption that keeping the model mostly in continuous embedding space with only a final mapping to tokens is sufficient for effective discrete language modeling.

What would settle it

A benchmark experiment on language generation tasks where ELF fails to match or exceed the quality of top discrete DLMs or requires more sampling steps.

Figures

Figures reproduced from arXiv: 2605.10938 by Hanhong Zhao, Jacob Andreas, Kaiming He, Keya Hu, Linlu Qiu, Tianhong Li, Yiyang Lu, Yoon Kim.

Figure 1
Figure 1: ELF achieves lower generative perplexity with fewer sampling steps than prior DLMs, without using distillation. ELF achieves this while using 10× fewer training tokens. (Model size: 105M for ELF and 170M for others; dataset: OWT.) view at source ↗
Figure 2
Figure 2: Conceptual illustration of ELF. Orange points denote data represented in continuous embedding space, and purple lines show denoising trajectories from Gaussian noise to clean embeddings. Discretization is applied only at the final time step (t = 1) using a shared-weight network. view at source ↗
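For orientation, under the common linear (rectified-flow) interpolation the trajectories in this figure correspond to the path and target velocity below; the linear schedule is assumed here for illustration and may not match the paper's exact choice.

```latex
% Linear Flow Matching path (a standard choice, assumed for illustration):
% z_t interpolates between Gaussian noise z_0 and the clean embedding x,
% and the target velocity is constant along the path.
z_t = (1 - t)\, z_0 + t\, x, \qquad z_0 \sim \mathcal{N}(0, I), \qquad
u_t = \frac{\mathrm{d} z_t}{\mathrm{d} t} = x - z_0, \qquad t \in [0, 1].
```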
Figure 3
Figure 3: During training, discrete tokens are encoded into clean embeddings x and corrupted to z_t, which ELF uses to predict x̂. The model is trained with either the denoising loss L_MSE or the token-wise cross-entropy loss L_CE. During inference, ELF starts from Gaussian noise z_0 and iteratively denoises embeddings from z_t to z_{t+1}. Only at the final step does ELF switch to decoding mode and project the final embeddings to discrete tokens. view at source ↗
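A hedged sketch of one training step matching this description, assuming a linear corruption schedule, a hypothetical `model(z_t, t, mode)` interface, and the denoising-mode probability that Figure 12 later ablates; the paper applies different noise schedules in the two modes (Figure 9), which is not reproduced here.

```python
# Illustrative ELF-style training step (assumptions, not the authors' code).
import torch
import torch.nn.functional as F

def elf_training_step(model, embed, tokens, p_denoise=0.8):
    x = embed(tokens)                                  # clean embeddings
    z0 = torch.randn_like(x)                           # Gaussian noise
    t = torch.rand(x.shape[0], 1, 1, device=x.device)  # per-sequence time
    z_t = (1.0 - t) * z0 + t * x                       # assumed linear corruption
    if torch.rand(()).item() < p_denoise:
        x_hat = model(z_t, t, mode="denoise")
        loss = F.mse_loss(x_hat, x)                    # denoising loss L_MSE
    else:
        logits = model(z_t, t, mode="decode")          # token logits
        loss = F.cross_entropy(logits.flatten(0, 1), tokens.flatten())  # token-wise L_CE
    return loss
```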
Figure 4
Figure 4: Ablations on guidance. We evaluate the generative perplexity–entropy trade-off across CFG scales: increasing the scale lowers generative perplexity but reduces entropy. Classifier-free guidance (CFG): our flow-based continuous formulation is naturally compatible with CFG, a highly effective technique in standard diffusion models, so we first study the effect of the CFG scale. view at source ↗
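Because the formulation stays continuous, CFG takes its usual form: extrapolate from the unconditional prediction toward the conditional one. A generic sketch follows; the conditioning interface is hypothetical, not taken from the paper.

```python
# Classifier-free guidance on predicted embeddings (generic form, not the
# authors' exact implementation).
def guided_prediction(model, z_t, t, cond, cfg_scale):
    x_cond = model(z_t, t, cond=cond, mode="denoise")    # conditional prediction
    x_uncond = model(z_t, t, cond=None, mode="denoise")  # unconditional prediction
    # scale > 1 sharpens toward the condition; Figure 4 reports lower generative
    # perplexity but lower entropy as the scale increases.
    return x_uncond + cfg_scale * (x_cond - x_uncond)
```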
Figure 5
Figure 5: Ablations on key design choices. (a) Embedding choices: we compare contextual vs. non-contextual embeddings, as well as frozen vs. learnable embeddings; pretrained contextual embeddings achieve the best trade-off. (b) Decoding strategies: we compare a shared-weight denoiser-decoder with a two-stage, separately trained decoder. Both strategies achieve similar trade-offs, but the shared-weight variant extends … view at source ↗
Figure 6
Figure 6: Scaling of ELF models. We compare ELF-B, ELF-M, and ELF-L. Scaling model size consistently improves the Gen. PPL–entropy frontier. Model scales: we study the scaling behavior of ELF across three model sizes, ELF-B (105M), ELF-M (342M), and ELF-L (652M) (detailed in Appendix Tab. 3), and evaluate each model using both ODE and SDE sampling. view at source ↗
Figure 7
Figure 7: System-level comparison. ELF-B outperforms both discrete and continuous DLMs trained under similar settings (a), rivals distilled variants of other baselines that require additional rounds of training (b), and uses substantially fewer training tokens (c). We first compare ELF-B against both discrete DLMs, including MDLM [56] and Duo [57], and continuous … view at source ↗
Figure 8
Figure 8: Qualitative examples of text generated by ELF-B. We show an unconditional sample, a German-to-English translation example, and a summarization example, along with their automatic evaluation metrics. Some text is omitted due to space limits; see Appendix E for more examples. view at source ↗
Figure 9
Figure 9: Illustration of our training pipeline. Starting from the clean embeddings x, we apply different noise schedules in the two modes to obtain corrupted embeddings z_t. We then apply self-conditioning by concatenating either 0 or the previous prediction x̂′ along the channel dimension, and project the concatenated embeddings back to the original dimension to form ẑ_t. Next, we prepend control tokens to the embeddings … view at source ↗
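A small sketch of the self-conditioning step this caption describes. The `proj` layer and the choice to drop the previous prediction roughly half the time are assumptions in the spirit of Analog Bits [9], not details confirmed by the figure.

```python
# Illustrative self-conditioning: concatenate the previous prediction (or zeros)
# along the channel dimension, then project back to the model dimension.
import torch
import torch.nn as nn

def self_condition(z_t, x_hat_prev, proj: nn.Linear, use_prev: bool):
    prev = x_hat_prev if use_prev else torch.zeros_like(z_t)
    z_cat = torch.cat([z_t, prev], dim=-1)   # channel-wise concatenation (d -> 2d)
    return proj(z_cat)                       # project back to d, yielding z_hat_t
```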
Figure 10
Figure 10: Effects of prediction targets. We vary the input dimension from 512 to 768 and 1024 by using T5-small, T5-base, and T5-large encoders, respectively. Across all input dimensions, x-prediction remains stable and performs well. In contrast, v-prediction performs well at 512 dimensions but degrades at higher dimensions, while ϵ-prediction collapses across all dimensions from 512 to 1024. view at source ↗
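For context, under a linear path z_t = (1 - t)·ε + t·x the three targets are algebraically interchangeable; what differs is how prediction errors are amplified when converting back to a clean-embedding estimate. The linear path is an assumption, and the division by t in the ε case hints at why that target can be fragile.

```python
# Converting any prediction target back to a clean-embedding estimate x_hat,
# assuming the linear path z_t = (1 - t) * eps + t * x (illustration only).
def to_x_prediction(pred, z_t, t, target="x"):
    if target == "x":    # x-prediction: the model outputs x directly
        return pred
    if target == "v":    # v-prediction: v = x - eps, so x = z_t + (1 - t) * v
        return z_t + (1.0 - t) * pred
    if target == "eps":  # eps-prediction: x = (z_t - (1 - t) * eps) / t
        return (z_t - (1.0 - t) * pred) / t   # error is amplified for small t
    raise ValueError(f"unknown target: {target}")
```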
Figure 11
Figure 11. view at source ↗
Figure 12
Figure 12: Effect of the denoising mode probability during training. This probability controls the allocation between denoising and decoding updates in the shared-weight denoiser-decoder model. A denoising mode probability of 0.8 provides the best generative perplexity–entropy trade-off across both ODE and SDE samplers. view at source ↗
Figure 13
Figure 13: Effect of conditioning strategies. We compare in-context conditioning with adaLN-Zero conditioning. In-context conditioning slightly improves performance while substantially reducing the number of model parameters. view at source ↗
Figure 15
Figure 15: Effect of time schedule and SDE noise re-injection scale. (a) Logit-normal time schedule consistently improves generative perplexity across different sampling budgets, especially in the few-step regime. (b) The SDE noise re-injection scale γ controls the generative perplexity–entropy trade-off by adjusting the amount of stochastic noise injected during sampling. view at source ↗
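A generic illustration of how a noise re-injection scale can enter an SDE-style update. This is an Euler-Maruyama-flavored form chosen for exposition, with an assumed linear path; it is not the paper's sampler.

```python
# Illustrative SDE-style update with a noise re-injection scale gamma
# (not the authors' sampler; a common Euler-Maruyama-flavored form).
import torch

def sde_step(model, z, t, dt, gamma):
    x_hat = model(z, t, mode="denoise")
    v = (x_hat - z) / max(1.0 - t, 1e-3)              # velocity under an assumed linear path
    noise = torch.randn_like(z)
    return z + dt * v + gamma * (dt ** 0.5) * noise   # larger gamma -> more stochasticity
```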
Figure 16
Figure 16. view at source ↗
Figure 17
Figure 17: Denoising trajectory of ELF-B. As t increases from 0 to 1, ungrammatical sentences are progressively refined into fluent and grammatical text. view at source ↗
read the original abstract

Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes Embedded Language Flows (ELF), a class of continuous-time Flow Matching models for language that operate primarily in continuous embedding space and only discretize to tokens at the final timestep via a shared-weight network. It claims this formulation enables straightforward adaptation of image-domain techniques such as classifier-free guidance and that experiments demonstrate ELF substantially outperforms leading discrete and continuous diffusion language models in generation quality while requiring fewer sampling steps.

Significance. If the experimental claims are substantiated, the work could provide a meaningful path for transferring continuous diffusion and flow-matching advances from images to discrete language modeling, potentially improving sampling efficiency and generation quality without heavy domain-specific redesign.

major comments (1)
  1. Abstract: The central claim that 'ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps' is presented without any quantitative metrics, baselines, datasets, sampling procedures, or statistical controls. This absence is load-bearing because the abstract supplies no evidence against which the superiority assertion can be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting an important point about the abstract. We address the major comment below.

read point-by-point responses
  1. Referee: Abstract: The central claim that 'ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps' is presented without any quantitative metrics, baselines, datasets, sampling procedures, or statistical controls. This absence is load-bearing because the abstract supplies no evidence against which the superiority assertion can be evaluated.

    Authors: We agree that the abstract, in its current form, presents the performance claim at a high level without supporting numbers. The full manuscript contains the requested details in the Experiments section, including quantitative comparisons against leading discrete and continuous DLMs on standard language modeling benchmarks, specific sampling step counts, and the evaluation protocol. To make the abstract more informative and self-contained, we will revise it to incorporate key quantitative highlights (e.g., relative improvements in generation quality metrics and the reduction in sampling steps) while retaining its concise nature. This change directly addresses the concern without misrepresenting the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; abstract contains no derivations

full rationale

The provided document is limited to the abstract, which introduces ELF as a continuous embedding-space model based on Flow Matching and reports empirical outperformance without any equations, parameter-fitting steps, self-citations, or derivation chains. No load-bearing claim reduces by construction to its inputs, and none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can be instantiated because no technical derivations are present. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger records high-level assumptions implied by the proposal; no explicit numerical parameters or new physical entities are described.

axioms (1)
  • domain assumption Continuous-time flow matching can be applied effectively to language data represented in continuous embedding space.
    Invoked when the paper proposes ELF as a diffusion model based on continuous-time Flow Matching in embedding space.
invented entities (1)
  • ELF · no independent evidence
    purpose: A class of continuous diffusion models for language that remain in embedding space until the final discretization step.
    New model class introduced in the paper.

pith-pipeline@v0.9.0 · 5460 in / 1339 out tokens · 80869 ms · 2026-05-12T03:22:33.995291+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 9 internal anchors

  1. [1]

    Joint distillation for fast likelihood evaluation and sampling in flow-based models

    Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models. InICLR, 2026. 6

  2. [2]

    Stochastic interpolants: A unifying framework for flows and diffusions.JMLR, 2025

    Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.JMLR, 2025. 2, 3, 15

  3. [3]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InICLR, 2023. 1, 2, 4

  4. [4]

Encoder-decoder diffusion language models for efficient training and inference

Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, and Volodymyr Kuleshov. Encoder-decoder diffusion language models for efficient training and inference. In NeurIPS, 2025. 3, 8, 9, 27

  5. [5]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. InNeurIPS, 2021. 1, 2, 3

  6. [6]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image ...

  7. [7]

    Findings of the 2014 workshop on statistical machine translation

    Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. Findings of the 2014 workshop on statistical machine translation. InACL Workshop on Statistical Machine Translation, 2014. 2, 6

  8. [8]

    Visual generation without guidance

    Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance. InICML, 2025. 6, 18

  9. [9]

    Analog bits: Generating discrete data using diffusion models with self-conditioning

    Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. InICLR, 2023. 5, 18

  10. [10]

    LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

    Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling.arXiv preprint arXiv:2604.11748, 2026. 2, 3, 6, 8, 15, 25

  11. [11]

    Beyond autoregression: Fast LLMs via self-distillation through time

    Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLMs via self-distillation through time. InICLR, 2025. 8

  12. [12]

    The diffusion duality, chapter ii:ψ-samplers and efficient curriculum

    Justin Deschenaux, Caglar Gulcehre, and Subham Sekhar Sahoo. The diffusion duality, chapter ii:ψ-samplers and efficient curriculum. InICLR, 2026. 3

  13. [13]

Continuous diffusion for categorical data

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022. 1, 2, 5, 9, 15, 27

  14. [14]

    Scaling rectified flow Transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow Transformers for high-resolution image synthesis. InICML, 2024. 2, 6

  15. [15]

    Empowering diffusion models on the embedding space for text generation

    Zhujin Gao, Junliang Guo, Xu Tan, Yongxin Zhu, Fang Zhang, Jiang Bian, and Linli Xu. Empowering diffusion models on the embedding space for text generation. InNAACL, 2024. 2, 15

  16. [16]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. InNeurIPS, 2025. 6, 18 10

  17. [17]

    Improved Mean Flows: On the Challenges of Fastforward Generative Models

    Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models.arXiv preprint arXiv:2512.02012, 2025. 6, 18

  18. [18]

    Openwebtext corpus, 2019

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus, 2019. 6, 25

  19. [19]

    Diffuseq: Sequence to sequence text generation with diffusion models

    Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. InICLR, 2023. 1, 2, 15

  20. [20]

    Diffucoder: Understanding and improving masked diffusion models for code generation

    Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation. InICLR, 2026. 3

  21. [21]

    Likelihood-based diffusion language models

    Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models. In NeurIPS, 2023. 2, 15

  22. [22]

SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In ACL, 2023. 2, 15

  23. [23]

    Diffusionbert: Improving generative masked language models with diffusion models

Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuan-Jing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. In ACL, 2023.

  24. [24]

    Query-key normalization for Transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for Transformers. InFindings of EMNLP, 2020. 24

  25. [25]

    Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021.

  26. [26]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1, 2, 3, 15, 16

  27. [27]

    Continuous diffusion model for language modeling

    Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. In NeurIPS, 2025. 2, 15

  28. [28]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Technical report, Keller Jordan blog, 2024. 6, 23

  29. [29]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022. 23

  30. [30]

    Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising.arXiv preprint arXiv:2602.16813, 2026. 2, 3, 5, 6, 8, 15, 25

  31. [31]

    Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion.arXiv preprint arXiv:2603.06577, 2026

    Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, and Chaoyou Fu. Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion.arXiv preprint arXiv:2603.06577, 2026. 3

  32. [32]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025. 2, 4, 6, 20, 21, 22

  33. [33]

    A survey on diffusion language models,

    Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. A survey on diffusion language models.arXiv preprint arXiv:2508.10875, 2025. 1, 3

  34. [34]

    Diffusion-LM improves controllable text generation

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-LM improves controllable text generation. InNeurIPS, 2022. 1, 2, 15

  35. [35]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InACL Workshop on Text Summarization Branches Out, 2004. 6 11

  36. [36]

    Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise

    Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, and Weizhu Chen. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. InICML, 2023. 2, 15

  37. [37]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InICLR, 2023. 1, 2, 3, 4, 15

  38. [38]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023. 1, 2, 3, 4, 15

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 23

  40. [40]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InICML, 2024. 1

  41. [41]

    Latent diffusion for language generation

    Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. InNeurIPS, 2023. 2, 3, 5, 15, 27

  42. [42]

    Diffusion guided language modeling

    Justin Lovelace, Varsha Kishore, Yiwei Chen, and Kilian Q Weinberger. Diffusion guided language modeling. InFindings of ACL, 2024. 3, 15

  43. [43]

    SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers. InECCV, 2024. 2, 5, 19

  44. [44]

    Tess: Text-to-text self-conditioned simplex diffusion

    Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. In EACL, 2024. 2, 5, 15

  45. [45]

    Cosmos: Compressed and smooth latent space for text diffusion modeling

    Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, and Dmitry Vetrov. Cosmos: Compressed and smooth latent space for text diffusion modeling. In NeurIPS, 2025. 2, 3, 15

  46. [46]

Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP, 2018.

  47. [47]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InICML, 2021. 2, 3, 16

  48. [48]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InNeurIPS, 2025. 1, 3

  49. [49]

    BLEU: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. InACL, 2002. 6

  50. [50]

    Scalable diffusion models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with Transformers. InICCV, 2023. 18, 24

  51. [51]

    Discrete Flow Maps

    Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S Albergo. Discrete flow maps.arXiv preprint arXiv:2604.09784, 2026. 3, 5, 15

  52. [52]

    Language models are unsupervised multitask learners.OpenAI blog, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.OpenAI blog, 2019. 6

  53. [53]

    Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR, 2020. 4, 6, 7, 25

  54. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 2

  55. [55]

    Categorical flow maps.arXiv preprint arXiv:2602.12233, 2026

Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps. arXiv preprint arXiv:2602.12233, 2026. 3, 15

  56. [56]

    Simple and effective masked diffusion language models

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In NeurIPS, 2024. 1, 2, 3, 6, 8, 9, 25

  57. [57]

    The diffusion duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. In ICML, 2025. 1, 2, 3, 6, 8, 9, 25, 27

  58. [58]

    Scaling beyond masked diffusion language models.arXiv preprint arXiv:2602.15014, 2026

    Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models.arXiv preprint arXiv:2602.15014, 2026. 1, 3

  59. [59]

    TEncDM: Understanding the properties of the diffusion model in the space of language model encodings

    Alexander Shabalin, Viacheslav Meshchaninov, Egor Chimbulatov, Vladislav Lapikov, Roman Kim, Grigory Bartosh, Dmitry Molchanov, Sergey Markov, and Dmitry Vetrov. TEncDM: Understanding the properties of the diffusion model in the space of language model encodings. InAAAI, 2025. 3, 5, 15

  60. [60]

Why Gaussian diffusion models fail on discrete data?

Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, and Dmitry Vetrov. Why Gaussian diffusion models fail on discrete data? arXiv preprint arXiv:2604.02028, 2026.

  61. [61]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve Transformer.arXiv preprint arXiv:2002.05202, 2020. 24

  62. [62]

    Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547, 2026

    Junzhe Shen, Jieru Zhao, Ziwei He, and Zhouhan Lin. Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547, 2026. 2, 3, 15

  63. [63]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. 1, 2

  64. [64]

    Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

  65. [65]

    Seed Diffusion:

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv prepri...

  66. [66]

    Self-conditioned embedding diffusion for text generation

    Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, and Rémi Leblond. Self- conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022. 2, 5, 15

  67. [67]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. 24

  68. [68]

    Tess 2: A large-scale generalist diffusion language model

    Jaesung Tae, Hamish Ivison, Sachin Kumar, and Arman Cohan. Tess 2: A large-scale generalist diffusion language model. InACL, 2025. 2, 15

  69. [69]

    Diffusion models without classifier- free guidance.arXiv preprint arXiv:2502.12154, 2025

    Zhicong Tang, Jianmin Bao, Dong Chen, and Baining Guo. Diffusion models without classifier- free guidance.arXiv preprint arXiv:2502.12154, 2025. 6, 18

  70. [70]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 2

  71. [71]

    Remasking discrete diffusion models with inference-time scaling

Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. In NeurIPS, 2025. 3

  72. [72]

    InfoDiffusion: Information entropy aware diffusion process for non-autoregressive text generation

    Renzhi Wang, Jing Li, and Piji Li. InfoDiffusion: Information entropy aware diffusion process for non-autoregressive text generation. InFindings of EMNLP, 2023. 2, 15 13

  73. [73]

    Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. InICLR, 2026. 3

  74. [74]

    AR-Diffusion: Auto-regressive diffusion model for text generation

    Tong Wu, Zhihao Fan, Xiao Liu, Hai-Tao Zheng, Yeyun Gong, Jian Jiao, Juntao Li, Jian Guo, Nan Duan, and Weizhu Chen. AR-Diffusion: Auto-regressive diffusion model for text generation. InNeurIPS, 2023. 2, 15

  75. [75]

    Mmada: Multimodal large diffusion language models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. InNeurIPS, 2025. 3

  76. [76]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025. 1, 3

  77. [77]

    DINOISER: Diffused conditional sequence learning by manipulating noises.Transactions of the Association for Computational Linguistics, 2024

    Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, and Mingxuan Wang. DINOISER: Diffused conditional sequence learning by manipulating noises.Transactions of the Association for Computational Linguistics, 2024. 2, 15

  78. [78]

    Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025. 3

  79. [79]

    Seqdiffuseq: Text diffusion with encoder-decoder transformers

    Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Fei Huang, and Songfang Huang. Seqdiffuseq: Text diffusion with encoder-decoder transformers. InNAACL, 2024. 2, 5, 9, 15

  80. [80]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. InNeurIPS, 2019. 24

Showing first 80 references.