pith. machine review for the scientific record.

arxiv: 2605.10938 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

ELF: Embedded Language Flows

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords diffusion models · flow matching · language modeling · continuous embeddings · discrete tokens · classifier-free guidance · sampling efficiency

The pith

Continuous embedding flows generate higher-quality language with fewer sampling steps than discrete diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion and flow models work well for continuous data such as images, yet diffusion language models have mostly operated in discrete token space, and continuous variants have so far seen limited success. This paper introduces Embedded Language Flows, which keep the process in continuous embedding space for most of the generation and switch to discrete tokens only at the final step using a shared network. This design lets the model borrow methods like classifier-free guidance from image diffusion directly. Experiments indicate that this approach produces better text than leading discrete and continuous diffusion language models while using fewer sampling steps. A sympathetic reader would see this as evidence that continuous formulations can simplify and improve diffusion for language.

Core claim

ELF demonstrates that continuous-time Flow Matching in embedding space, with a final shared-weight mapping to discrete tokens, substantially outperforms existing discrete and continuous diffusion language models in generation quality while requiring fewer sampling steps.

What carries the argument

Embedded Language Flows (ELF): continuous diffusion models based on Flow Matching that remain in embedding space until the last time step, where a shared-weight network produces discrete tokens.
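To make the mechanism concrete, here is a minimal sketch of what ELF-style sampling could look like under common Flow Matching conventions: Euler integration of an ODE in embedding space, with a single switch to decoding at the final step. The `model` interface, its `mode` argument, and the linear noise path are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of ELF-style sampling (assumptions, not the authors' code):
# Euler ODE integration in embedding space, discretizing only at t = 1.
import torch

@torch.no_grad()
def sample_elf(model, seq_len, emb_dim, num_steps=32, device="cpu"):
    z = torch.randn(1, seq_len, emb_dim, device=device)   # z_0 ~ N(0, I)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        x_hat = model(z, t, mode="denoise")                # predicted clean embeddings
        v = (x_hat - z) / (1.0 - t).clamp(min=1e-3)        # velocity under an assumed linear path
        z = z + (t_next - t) * v                           # Euler step toward t = 1
    logits = model(z, ts[-1], mode="decode")               # shared-weight decoding at t = 1
    return logits.argmax(dim=-1)                           # discrete tokens only at the end
```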

If this is right

  • ELF achieves better generation quality than leading discrete and continuous DLMs.
  • It requires fewer sampling steps to reach that quality.
  • Techniques such as classifier-free guidance from continuous domains apply straightforwardly to language.
  • This points to continuous DLMs as a viable direction for language generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar embedding-space strategies might improve diffusion models for other discrete sequences like code or music.
  • The reduced step count could lower inference costs in practical language applications.

Load-bearing premise

The assumption that keeping the model mostly in continuous embedding space with only a final mapping to tokens is sufficient for effective discrete language modeling.

What would settle it

A benchmark experiment on language generation tasks where ELF fails to match or exceed the quality of top discrete DLMs or requires more sampling steps.

Figures

Figures reproduced from arXiv: 2605.10938 by Hanhong Zhao, Jacob Andreas, Kaiming He, Keya Hu, Linlu Qiu, Tianhong Li, Yiyang Lu, Yoon Kim.

Figure 1
Figure 1: ELF achieves lower generative perplexity with fewer sampling steps than prior DLMs, without using distillation. ELF achieves this while using 10× fewer training tokens. (Model size: 105M for ELF and 170M for others; dataset: OWT.) view at source ↗
Figure 2
Figure 2: Conceptual illustration of ELF. Orange points denote data represented in continuous embedding space, and purple lines show denoising trajectories from Gaussian noise to clean embeddings. Discretization is applied only at the final time step (t = 1) using a shared-weight network. view at source ↗
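For orientation, under the common linear (rectified-flow) interpolation the trajectories in this figure correspond to the path and target velocity below; the linear schedule is assumed here for illustration and may not match the paper's exact choice.

```latex
% Linear Flow Matching path (a standard choice, assumed for illustration):
% z_t interpolates between Gaussian noise z_0 and the clean embedding x,
% and the target velocity is constant along the path.
z_t = (1 - t)\, z_0 + t\, x, \qquad z_0 \sim \mathcal{N}(0, I), \qquad
u_t = \frac{\mathrm{d} z_t}{\mathrm{d} t} = x - z_0, \qquad t \in [0, 1].
```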
Figure 3
Figure 3: During training, discrete tokens are encoded into clean embeddings x and corrupted to z_t, which ELF uses to predict x̂. The model is trained with either the denoising loss L_MSE or the token-wise cross-entropy loss L_CE. During inference, ELF starts from Gaussian noise z_0 and iteratively denoises embeddings from z_t to z_{t+1}. Only at the final step does ELF switch to decoding mode and project the final embeddings to discrete tokens. view at source ↗
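A hedged sketch of one training step matching this description, assuming a linear corruption schedule, a hypothetical `model(z_t, t, mode)` interface, and the denoising-mode probability that Figure 12 later ablates; the paper applies different noise schedules in the two modes (Figure 9), which is not reproduced here.

```python
# Illustrative ELF-style training step (assumptions, not the authors' code).
import torch
import torch.nn.functional as F

def elf_training_step(model, embed, tokens, p_denoise=0.8):
    x = embed(tokens)                                  # clean embeddings
    z0 = torch.randn_like(x)                           # Gaussian noise
    t = torch.rand(x.shape[0], 1, 1, device=x.device)  # per-sequence time
    z_t = (1.0 - t) * z0 + t * x                       # assumed linear corruption
    if torch.rand(()).item() < p_denoise:
        x_hat = model(z_t, t, mode="denoise")
        loss = F.mse_loss(x_hat, x)                    # denoising loss L_MSE
    else:
        logits = model(z_t, t, mode="decode")          # token logits
        loss = F.cross_entropy(logits.flatten(0, 1), tokens.flatten())  # token-wise L_CE
    return loss
```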
Figure 4
Figure 4: Ablations on guidance. We evaluate the generative perplexity–entropy trade-off across CFG scales: increasing the scale lowers generative perplexity but reduces entropy. Classifier-free guidance (CFG): our flow-based continuous formulation is naturally compatible with CFG, a highly effective technique in standard diffusion models, so we first study the effect of the CFG scale. view at source ↗
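Because the formulation stays continuous, CFG takes its usual form: extrapolate from the unconditional prediction toward the conditional one. A generic sketch follows; the conditioning interface is hypothetical, not taken from the paper.

```python
# Classifier-free guidance on predicted embeddings (generic form, not the
# authors' exact implementation).
def guided_prediction(model, z_t, t, cond, cfg_scale):
    x_cond = model(z_t, t, cond=cond, mode="denoise")    # conditional prediction
    x_uncond = model(z_t, t, cond=None, mode="denoise")  # unconditional prediction
    # scale > 1 sharpens toward the condition; Figure 4 reports lower generative
    # perplexity but lower entropy as the scale increases.
    return x_uncond + cfg_scale * (x_cond - x_uncond)
```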
Figure 5
Figure 5: Ablations on key design choices. (a) Embedding choices: we compare contextual vs. non-contextual embeddings, as well as frozen vs. learnable embeddings; pretrained contextual embeddings achieve the best trade-off. (b) Decoding strategies: we compare a shared-weight denoiser-decoder with a two-stage, separately trained decoder. Both strategies achieve similar trade-offs, but the shared-weight variant extends … view at source ↗
Figure 6
Figure 6: Scaling of ELF models. We compare ELF-B, ELF-M, and ELF-L. Scaling model size consistently improves the Gen. PPL–entropy frontier. Model scales: we study the scaling behavior of ELF across three model sizes, ELF-B (105M), ELF-M (342M), and ELF-L (652M) (detailed in Appendix Tab. 3), and evaluate each model using both ODE and SDE sampling. view at source ↗
Figure 7
Figure 7: System-level comparison. ELF-B outperforms both discrete and continuous DLMs trained under similar settings (a), rivals distilled variants of other baselines that require additional rounds of training (b), and uses substantially fewer training tokens (c). We first compare ELF-B against both discrete DLMs, including MDLM [56] and Duo [57], and continuous … view at source ↗
Figure 8
Figure 8: Qualitative examples of text generated by ELF-B. We show an unconditional sample, a German-to-English translation example, and a summarization example, along with their automatic evaluation metrics. Some text is omitted due to space limits; see Appendix E for more examples. view at source ↗
Figure 9
Figure 9: Illustration of our training pipeline. Starting from the clean embeddings x, we apply different noise schedules in the two modes to obtain corrupted embeddings z_t. We then apply self-conditioning by concatenating either 0 or the previous prediction x̂′ along the channel dimension, and project the concatenated embeddings back to the original dimension to form ẑ_t. Next, we prepend control tokens to the embeddings … view at source ↗
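A small sketch of the self-conditioning step this caption describes. The `proj` layer and the choice to drop the previous prediction roughly half the time are assumptions in the spirit of Analog Bits [9], not details confirmed by the figure.

```python
# Illustrative self-conditioning: concatenate the previous prediction (or zeros)
# along the channel dimension, then project back to the model dimension.
import torch
import torch.nn as nn

def self_condition(z_t, x_hat_prev, proj: nn.Linear, use_prev: bool):
    prev = x_hat_prev if use_prev else torch.zeros_like(z_t)
    z_cat = torch.cat([z_t, prev], dim=-1)   # channel-wise concatenation (d -> 2d)
    return proj(z_cat)                       # project back to d, yielding z_hat_t
```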
Figure 10
Figure 10: Effects of prediction targets. We vary the input dimension from 512 to 768 and 1024 by using T5-small, T5-base, and T5-large encoders, respectively. Across all input dimensions, x-prediction remains stable and performs well. In contrast, v-prediction performs well at 512 dimensions but degrades at higher dimensions, while ϵ-prediction collapses across all dimensions from 512 to 1024. view at source ↗
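For context, under a linear path z_t = (1 - t)·ε + t·x the three targets are algebraically interchangeable; what differs is how prediction errors are amplified when converting back to a clean-embedding estimate. The linear path is an assumption, and the division by t in the ε case hints at why that target can be fragile.

```python
# Converting any prediction target back to a clean-embedding estimate x_hat,
# assuming the linear path z_t = (1 - t) * eps + t * x (illustration only).
def to_x_prediction(pred, z_t, t, target="x"):
    if target == "x":    # x-prediction: the model outputs x directly
        return pred
    if target == "v":    # v-prediction: v = x - eps, so x = z_t + (1 - t) * v
        return z_t + (1.0 - t) * pred
    if target == "eps":  # eps-prediction: x = (z_t - (1 - t) * eps) / t
        return (z_t - (1.0 - t) * pred) / t   # error is amplified for small t
    raise ValueError(f"unknown target: {target}")
```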
Figure 11
Figure 11. view at source ↗
Figure 12
Figure 12: Effect of the denoising mode probability during training. This probability controls the allocation between denoising and decoding updates in the shared-weight denoiser-decoder model. A denoising mode probability of 0.8 provides the best generative perplexity–entropy trade-off across both ODE and SDE samplers. view at source ↗
Figure 13
Figure 13: Effect of conditioning strategies. We compare in-context conditioning with adaLN-Zero conditioning. In-context conditioning slightly improves performance while substantially reducing the number of model parameters. view at source ↗
Figure 15
Figure 15: Effect of time schedule and SDE noise re-injection scale. (a) Logit-normal time schedule consistently improves generative perplexity across different sampling budgets, especially in the few-step regime. (b) The SDE noise re-injection scale γ controls the generative perplexity–entropy trade-off by adjusting the amount of stochastic noise injected during sampling. view at source ↗
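A generic illustration of how a noise re-injection scale can enter an SDE-style update. This is an Euler-Maruyama-flavored form chosen for exposition, with an assumed linear path; it is not the paper's sampler.

```python
# Illustrative SDE-style update with a noise re-injection scale gamma
# (not the authors' sampler; a common Euler-Maruyama-flavored form).
import torch

def sde_step(model, z, t, dt, gamma):
    x_hat = model(z, t, mode="denoise")
    v = (x_hat - z) / max(1.0 - t, 1e-3)              # velocity under an assumed linear path
    noise = torch.randn_like(z)
    return z + dt * v + gamma * (dt ** 0.5) * noise   # larger gamma -> more stochasticity
```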
Figure 16
Figure 16. view at source ↗
Figure 17
Figure 17: Denoising trajectory of ELF-B. As t increases from 0 to 1, ungrammatical sentences are progressively refined into fluent and grammatical text. view at source ↗
read the original abstract

Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes Embedded Language Flows (ELF), a class of continuous-time Flow Matching models for language that operate primarily in continuous embedding space and only discretize to tokens at the final timestep via a shared-weight network. It claims this formulation enables straightforward adaptation of image-domain techniques such as classifier-free guidance and that experiments demonstrate ELF substantially outperforms leading discrete and continuous diffusion language models in generation quality while requiring fewer sampling steps.

Significance. If the experimental claims are substantiated, the work could provide a meaningful path for transferring continuous diffusion and flow-matching advances from images to discrete language modeling, potentially improving sampling efficiency and generation quality without heavy domain-specific redesign.

major comments (1)
  1. Abstract: The central claim that 'ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps' is presented without any quantitative metrics, baselines, datasets, sampling procedures, or statistical controls. This absence is load-bearing because the abstract supplies no evidence against which the superiority assertion can be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting an important point about the abstract. We address the major comment below.

read point-by-point responses
  1. Referee: Abstract: The central claim that 'ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps' is presented without any quantitative metrics, baselines, datasets, sampling procedures, or statistical controls. This absence is load-bearing because the abstract supplies no evidence against which the superiority assertion can be evaluated.

    Authors: We agree that the abstract, in its current form, presents the performance claim at a high level without supporting numbers. The full manuscript contains the requested details in the Experiments section, including quantitative comparisons against leading discrete and continuous DLMs on standard language modeling benchmarks, specific sampling step counts, and the evaluation protocol. To make the abstract more informative and self-contained, we will revise it to incorporate key quantitative highlights (e.g., relative improvements in generation quality metrics and the reduction in sampling steps) while retaining its concise nature. This change directly addresses the concern without misrepresenting the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; abstract contains no derivations

full rationale

The provided document is limited to the abstract, which introduces ELF as a continuous embedding-space model based on Flow Matching and reports empirical outperformance without any equations, parameter-fitting steps, self-citations, or derivation chains. No load-bearing claim reduces by construction to its inputs, and none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can be instantiated because no technical derivations are present. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger records high-level assumptions implied by the proposal; no explicit numerical parameters or new physical entities are described.

axioms (1)
  • domain assumption Continuous-time flow matching can be applied effectively to language data represented in continuous embedding space.
    Invoked when the paper proposes ELF as a diffusion model based on continuous-time Flow Matching in embedding space.
invented entities (1)
  • ELF · no independent evidence
    purpose: A class of continuous diffusion models for language that remain in embedding space until the final discretization step.
    New model class introduced in the paper.

pith-pipeline@v0.9.0 · 5460 in / 1339 out tokens · 80869 ms · 2026-05-12T03:22:33.995291+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 9 internal anchors

  1. [1]

    Joint distillation for fast likelihood evaluation and sampling in flow-based models

    Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models. InICLR, 2026. 6

  2. [2]

    Stochastic interpolants: A unifying framework for flows and diffusions.JMLR, 2025

    Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.JMLR, 2025. 2, 3, 15

  3. [3]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InICLR, 2023. 1, 2, 4

  4. [4]

Encoder-decoder diffusion language models for efficient training and inference

Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, and Volodymyr Kuleshov. Encoder-decoder diffusion language models for efficient training and inference. In NeurIPS, 2025. 3, 8, 9, 27

  5. [5]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. InNeurIPS, 2021. 1, 2, 3

  6. [6]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image ...

  7. [7]

    Findings of the 2014 workshop on statistical machine translation

    Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. Findings of the 2014 workshop on statistical machine translation. InACL Workshop on Statistical Machine Translation, 2014. 2, 6

  8. [8]

    Visual generation without guidance

    Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance. InICML, 2025. 6, 18

  9. [9]

    Analog bits: Generating discrete data using diffusion models with self-conditioning

    Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. InICLR, 2023. 5, 18

  10. [10]

    LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

    Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. Langflow: Continuous diffusion rivals discrete in language modeling.arXiv preprint arXiv:2604.11748, 2026. 2, 3, 6, 8, 15, 25

  11. [11]

    Beyond autoregression: Fast LLMs via self-distillation through time

    Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLMs via self-distillation through time. InICLR, 2025. 8

  12. [12]

    The diffusion duality, chapter ii:ψ-samplers and efficient curriculum

    Justin Deschenaux, Caglar Gulcehre, and Subham Sekhar Sahoo. The diffusion duality, chapter ii:ψ-samplers and efficient curriculum. InICLR, 2026. 3

  13. [13]

Continuous diffusion for categorical data

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022. 1, 2, 5, 9, 15, 27

  14. [14]

    Scaling rectified flow Transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow Transformers for high-resolution image synthesis. InICML, 2024. 2, 6

  15. [15]

    Empowering diffusion models on the embedding space for text generation

    Zhujin Gao, Junliang Guo, Xu Tan, Yongxin Zhu, Fang Zhang, Jiang Bian, and Linli Xu. Empowering diffusion models on the embedding space for text generation. InNAACL, 2024. 2, 15

  16. [16]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. InNeurIPS, 2025. 6, 18 10

  17. [17]

    Improved Mean Flows: On the Challenges of Fastforward Generative Models

    Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models.arXiv preprint arXiv:2512.02012, 2025. 6, 18

  18. [18]

    Openwebtext corpus, 2019

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus, 2019. 6, 25

  19. [19]

    Diffuseq: Sequence to sequence text generation with diffusion models

    Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. InICLR, 2023. 1, 2, 15

  20. [20]

    Diffucoder: Understanding and improving masked diffusion models for code generation

    Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation. InICLR, 2026. 3

  21. [21]

    Likelihood-based diffusion language models

    Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models. In NeurIPS, 2023. 2, 15

  22. [22]

SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In ACL, 2023. 2, 15

  23. [23]

    Diffusionbert: Improving generative masked language models with diffusion models

Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuan-Jing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. In ACL, 2023.

  24. [24]

    Query-key normalization for Transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for Transformers. InFindings of EMNLP, 2020. 24

  25. [25]

    Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021.

  26. [26]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1, 2, 3, 15, 16

  27. [27]

    Continuous diffusion model for language modeling

    Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. In NeurIPS, 2025. 2, 15

  28. [28]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Technical report, Keller Jordan blog, 2024. 6, 23

  29. [29]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022. 23

  30. [30]

    Flow Map Language Models: One-step Language Modeling via Continuous Denoising

    Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising.arXiv preprint arXiv:2602.16813, 2026. 2, 3, 5, 6, 8, 15, 25

  31. [31]

    Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion.arXiv preprint arXiv:2603.06577, 2026

    Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, and Chaoyou Fu. Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion.arXiv preprint arXiv:2603.06577, 2026. 3

  32. [32]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025. 2, 4, 6, 20, 21, 22

  33. [33]

    A survey on diffusion language models,

    Tianyi Li, Mingda Chen, Bowei Guo, and Zhiqiang Shen. A survey on diffusion language models.arXiv preprint arXiv:2508.10875, 2025. 1, 3

  34. [34]

    Diffusion-LM improves controllable text generation

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-LM improves controllable text generation. InNeurIPS, 2022. 1, 2, 15

  35. [35]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InACL Workshop on Text Summarization Branches Out, 2004. 6 11

  36. [36]

    Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise

    Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, and Weizhu Chen. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. InICML, 2023. 2, 15

  37. [37]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InICLR, 2023. 1, 2, 3, 4, 15

  38. [38]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023. 1, 2, 3, 4, 15

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 23

  40. [40]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InICML, 2024. 1

  41. [41]

    Latent diffusion for language generation

    Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. InNeurIPS, 2023. 2, 3, 5, 15, 27

  42. [42]

    Diffusion guided language modeling

    Justin Lovelace, Varsha Kishore, Yiwei Chen, and Kilian Q Weinberger. Diffusion guided language modeling. InFindings of ACL, 2024. 3, 15

  43. [43]

    SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers. InECCV, 2024. 2, 5, 19

  44. [44]

    Tess: Text-to-text self-conditioned simplex diffusion

    Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. In EACL, 2024. 2, 5, 15

  45. [45]

    Cosmos: Compressed and smooth latent space for text diffusion modeling

    Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, and Dmitry Vetrov. Cosmos: Compressed and smooth latent space for text diffusion modeling. In NeurIPS, 2025. 2, 3, 15

  46. [46]

Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP, 2018.

  47. [47]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InICML, 2021. 2, 3, 16

  48. [48]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InNeurIPS, 2025. 1, 3

  49. [49]

    BLEU: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. InACL, 2002. 6

  50. [50]

    Scalable diffusion models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with Transformers. InICCV, 2023. 18, 24

  51. [51]

    Discrete Flow Maps

    Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S Albergo. Discrete flow maps.arXiv preprint arXiv:2604.09784, 2026. 3, 5, 15

  52. [52]

    Language models are unsupervised multitask learners.OpenAI blog, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.OpenAI blog, 2019. 6

  53. [53]

    Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR, 2020. 4, 6, 7, 25

  54. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 2

  55. [55]

    Categorical flow maps.arXiv preprint arXiv:2602.12233, 2026

Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps. arXiv preprint arXiv:2602.12233, 2026. 3, 15

  56. [56]

    Simple and effective masked diffusion language models

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In NeurIPS, 2024. 1, 2, 3, 6, 8, 9, 25

  57. [57]

    The diffusion duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. In ICML, 2025. 1, 2, 3, 6, 8, 9, 25, 27

  58. [58]

    Scaling beyond masked diffusion language models.arXiv preprint arXiv:2602.15014, 2026

    Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models.arXiv preprint arXiv:2602.15014, 2026. 1, 3

  59. [59]

    TEncDM: Understanding the properties of the diffusion model in the space of language model encodings

    Alexander Shabalin, Viacheslav Meshchaninov, Egor Chimbulatov, Vladislav Lapikov, Roman Kim, Grigory Bartosh, Dmitry Molchanov, Sergey Markov, and Dmitry Vetrov. TEncDM: Understanding the properties of the diffusion model in the space of language model encodings. InAAAI, 2025. 3, 5, 15

  60. [60]

Why Gaussian diffusion models fail on discrete data?

Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, and Dmitry Vetrov. Why Gaussian diffusion models fail on discrete data? arXiv preprint arXiv:2604.02028, 2026.

  61. [61]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve Transformer.arXiv preprint arXiv:2002.05202, 2020. 24

  62. [62]

    Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547, 2026

    Junzhe Shen, Jieru Zhao, Ziwei He, and Zhouhan Lin. Codar: Continuous diffusion language models are more powerful than you think.arXiv preprint arXiv:2603.02547, 2026. 2, 3, 15

  63. [63]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. 1, 2

  64. [64]

    Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

  65. [65]

    Seed Diffusion:

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv prepri...

  66. [66]

    Self-conditioned embedding diffusion for text generation

    Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, and Rémi Leblond. Self- conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022. 2, 5, 15

  67. [67]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. 24

  68. [68]

    Tess 2: A large-scale generalist diffusion language model

    Jaesung Tae, Hamish Ivison, Sachin Kumar, and Arman Cohan. Tess 2: A large-scale generalist diffusion language model. InACL, 2025. 2, 15

  69. [69]

    Diffusion models without classifier- free guidance.arXiv preprint arXiv:2502.12154, 2025

    Zhicong Tang, Jianmin Bao, Dong Chen, and Baining Guo. Diffusion models without classifier- free guidance.arXiv preprint arXiv:2502.12154, 2025. 6, 18

  70. [70]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 2

  71. [71]

    Remasking discrete diffusion models with inference-time scaling

Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. In NeurIPS, 2025. 3

  72. [72]

    InfoDiffusion: Information entropy aware diffusion process for non-autoregressive text generation

    Renzhi Wang, Jing Li, and Piji Li. InfoDiffusion: Information entropy aware diffusion process for non-autoregressive text generation. InFindings of EMNLP, 2023. 2, 15 13

  73. [73]

    Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. InICLR, 2026. 3

  74. [74]

    AR-Diffusion: Auto-regressive diffusion model for text generation

    Tong Wu, Zhihao Fan, Xiao Liu, Hai-Tao Zheng, Yeyun Gong, Jian Jiao, Juntao Li, Jian Guo, Nan Duan, and Weizhu Chen. AR-Diffusion: Auto-regressive diffusion model for text generation. InNeurIPS, 2023. 2, 15

  75. [75]

    Mmada: Multimodal large diffusion language models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. InNeurIPS, 2025. 3

  76. [76]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025. 1, 3

  77. [77]

    DINOISER: Diffused conditional sequence learning by manipulating noises.Transactions of the Association for Computational Linguistics, 2024

    Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, and Mingxuan Wang. DINOISER: Diffused conditional sequence learning by manipulating noises.Transactions of the Association for Computational Linguistics, 2024. 2, 15

  78. [78]

    Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025. 3

  79. [79]

    Seqdiffuseq: Text diffusion with encoder-decoder transformers

    Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Fei Huang, and Songfang Huang. Seqdiffuseq: Text diffusion with encoder-decoder transformers. InNAACL, 2024. 2, 5, 9, 15

  80. [80]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. InNeurIPS, 2019. 24

Showing first 80 references.