Flow Map Language Models: One-step Language Modeling via Continuous Denoising

Aditi Raghunathan; Chanhyuk Lee; Jaehoon Yoo; Jerry Huang; Jinwoo Kim; Manan Agarwal; Nicholas M. Boffi; Seunghoon Hong; Sheel Shah

arxiv: 2602.16813 · v3 · pith:YYYCE7JAnew · submitted 2026-02-18 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-21 12:19 UTC · model grok-4.3


keywords: continuous flow · flow map · language modeling · one-step generation · discrete diffusion · flow matching · denoising · simplex geometry


The pith

Continuous flows over one-hot token embeddings enable one-step language generation that exceeds the quality of eight-step discrete diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models can use continuous flows over one-hot embeddings rather than discrete diffusion processes. This formulation creates a unique flow map that can be learned directly with cross-entropy objectives respecting simplex geometry. Distilling the flow model into a flow map model then supports single-step sampling. On the LM1B and OpenWebText datasets, the resulting one-step outputs surpass the quality of recent few-step discrete diffusion language models. The work directly challenges the assumption that discrete noising is required for generative modeling of discrete data.

Core claim

Language models built as continuous flows over one-hot token embeddings admit a unique flow map that can be learned directly and distilled. Both the flow and the distilled flow map are trained with simple cross-entropy losses that respect the probability simplex. The distilled flow map language model produces one-step generations whose quality exceeds the eight-step quality of recent discrete diffusion language models on LM1B and OWT.
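
The claim that the losses respect the probability simplex can be made concrete with a small sketch. It assumes, for illustration only, that a denoiser emits logits over a toy four-token vocabulary and that a softmax maps them onto the simplex; `softmax`, `cross_entropy`, and `kl_divergence` are hypothetical helpers, not the authors' code. Because the target is one-hot, the KL divergence from the target to the prediction coincides with the cross-entropy.

```python
import math


def softmax(logits):
    """Map raw scores to a point on the probability simplex."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true token index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q), skipping zero-mass entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits for one token position (assumed values).
logits = [2.0, -1.0, 0.5, 0.0]
target = 0

probs = softmax(logits)
one_hot = [1.0 if i == target else 0.0 for i in range(len(logits))]

ce = cross_entropy(probs, target)
kl = kl_divergence(one_hot, probs)
```

Here `ce` and `kl` agree up to floating-point error, which is why a plain cross-entropy loss suffices when the teacher distribution is the one-hot data itself.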

What carries the argument

The flow map induced by the continuous flow over one-hot embeddings, which provides a deterministic one-step mapping from noise to data that discrete methods lack.
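
To illustrate what a flow map buys, the sketch below uses a toy scalar ODE, dx/dt = -x, chosen only because its solution is known in closed form; it is an assumption for illustration and not the velocity field of the paper. The function `flow_map` is the exact solution operator, so a single call jumps from time 0 to time 1, while `euler_sample` needs many small steps to approach the same endpoint.

```python
import math


def velocity(x: float, t: float) -> float:
    """Toy velocity field dx/dt = -x (illustrative assumption)."""
    return -x


def euler_sample(x0: float, steps: int) -> float:
    """Integrate the ODE from t=0 to t=1 with fixed-step Euler."""
    x, h = x0, 1.0 / steps
    for k in range(steps):
        x += h * velocity(x, k * h)
    return x


def flow_map(x: float, s: float, t: float) -> float:
    """Exact map from the state at time s to the state at time t."""
    return x * math.exp(-(t - s))


x0 = 2.0
one_step = flow_map(x0, 0.0, 1.0)                       # single jump
composed = flow_map(flow_map(x0, 0.0, 0.4), 0.4, 1.0)   # two jumps
eight_step = euler_sample(x0, 8)
many_step = euler_sample(x0, 1024)
```

The two-jump result matches the single jump, which is the semigroup property a learned flow map is trained to satisfy; the Euler estimates only converge to it as the step count grows.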

If this is right

Both the continuous flow and its flow map can be trained end-to-end using cross-entropy objectives that respect simplex geometry.
A flow language model matches the performance of state-of-the-art discrete diffusion baselines on LM1B and OWT.
The distilled flow map language model achieves higher quality in one step than recent discrete diffusion models achieve in eight steps.
The approach questions the necessity of discrete noising processes for generative modeling over discrete modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the unique flow map property generalizes, it could support aggressive step reduction in very large models without separate retraining.
The same continuous-flow-plus-distillation pattern may apply to other discrete sequence domains such as code or biological sequences.
Comparing the three distillation choices identified in the paper could reveal which choice best preserves quality at extreme speedups.

Load-bearing premise

The continuous flow over one-hot embeddings admits a unique flow map that can be learned directly and distilled without losing the quality advantages of the multi-step flow.

What would settle it

Training the flow language model on LM1B or OWT, distilling it into a flow map model, and finding that one-step sample quality does not exceed the eight-step quality of discrete diffusion baselines on the same benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.16813 by Aditi Raghunathan, Chanhyuk Lee, Jaehoon Yoo, Jerry Huang, Jinwoo Kim, Manan Agarwal, Nicholas M. Boffi, Seunghoon Hong, Sheel Shah.

**Figure 1.** Flow map language models. Our FMLM outperforms discrete diffusion models (gray) and matches the 8-step generation performance of distilled discrete diffusion models (light purple) in only one step (dark purple). Today’s frontier language models (LMs) are based on an autoregressive process that produces one token per step [1–3]. While these models leverage parallelism during training through teacher forcin…

**Figure 2.** Overview. (Left) We leverage a simple continuous interpolation between Gaussian noise and a one-hot encoding of language data. (Middle) Our FLM learns a denoiser that predicts the posterior over clean data, which we convert into a flow for sampling. (Right) Our distilled FMLM directly transports states between distant timepoints, enabling few-step generation.

**Figure 3.** Factorization error in discrete diffusion. A toy dataset with two correlated modes new-york and san-diego. (Left) In many-step sampling, both continuous flows and discrete diffusion models generate valid data. (Right) With few-step sampling, the factorized transition of discrete diffusion yields a spurious mixture of all possible combinations (including the invalid pairings new-diego and san-york).
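
The factorization error described in the Figure 3 caption can be reproduced exactly on the two-sequence toy dataset. The sketch assumes a deliberately simplified one-step sampler that draws every position independently from its marginal; this stands in for the factorized transition and is not the sampler of any specific baseline. The helper names are hypothetical.

```python
from itertools import product

# Toy dataset from the caption: two equally likely, fully correlated pairs.
data = {("new", "york"): 0.5, ("san", "diego"): 0.5}


def marginals(joint):
    """Per-position token marginals of a joint distribution over sequences."""
    length = len(next(iter(joint)))
    out = [dict() for _ in range(length)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            out[i][tok] = out[i].get(tok, 0.0) + p
    return out


def factorized(joint):
    """Distribution obtained by sampling each position independently."""
    result = {}
    for combo in product(*[m.items() for m in marginals(joint)]):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, q in combo:
            prob *= q
        result[seq] = prob
    return result


one_step = factorized(data)
invalid_mass = sum(p for seq, p in one_step.items() if seq not in data)
```

Under this assumption half of the probability mass lands on the invalid pairings new-diego and san-york, even though every per-position marginal is correct.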

**Figure 4.** Semigroup on the simplex. $X_{s,u}(x)$ leaves the simplex, but $\delta_{s,u}(x)$ and $\delta_{u,t}(X_{s,u})$ always lie on it. $\delta_{s,t}(x)$ is their convex combination, providing a training signal for distillation. Replacing $D_s$ by the one-hot data $x_1$ in the diagonal term recovers the cross-entropy loss (12) for the single-time denoiser, since KL from a one-hot distribution reduces to cross-entropy. This yields a direct training algorithm …
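
The convex-combination statement in the Figure 4 caption can be checked numerically. The sketch relies on two assumptions: the two-time denoiser relation δ_{s,t}(x) = x + (1 - s) v_{s,t}(x) quoted elsewhere on this page, and the parameterization X_{s,t}(x) = x + (t - s) v_{s,t}(x) for the flow map, which is assumed here and not confirmed by the page. The weights in `semigroup_weights` follow from composing two such maps under those assumptions; all function names and the sample vectors are hypothetical.

```python
def flow_map_from_denoiser(x, delta, s, t):
    """X_{s,t}(x) = x + (t - s) * v with v = (delta - x) / (1 - s)."""
    return [xi + (t - s) * (di - xi) / (1.0 - s) for xi, di in zip(x, delta)]


def semigroup_weights(s, u, t):
    """Weights (a, b) such that delta_{s,t} = a*delta_{s,u} + b*delta_{u,t}."""
    denom = (1.0 - u) * (t - s)
    a = (1.0 - t) * (u - s) / denom
    b = (t - u) * (1.0 - s) / denom
    return a, b


s, u, t = 0.1, 0.5, 0.8
x = [1.3, -0.7, 0.2]          # noisy state, off the simplex
delta_su = [0.6, 0.3, 0.1]    # denoiser output at (s, u), on the simplex
delta_ut = [0.8, 0.1, 0.1]    # denoiser output at (u, t), evaluated at X_{s,u}(x)

# Compose two jumps, then read off the denoiser implied by the long jump.
y = flow_map_from_denoiser(x, delta_su, s, u)
z = flow_map_from_denoiser(y, delta_ut, u, t)
implied = [((1.0 - s) * zi - (1.0 - t) * xi) / (t - s) for zi, xi in zip(z, x)]

a, b = semigroup_weights(s, u, t)
teacher = [a * p + b * q for p, q in zip(delta_su, delta_ut)]
```

The weights are non-negative and sum to one whenever s ≤ u ≤ t ≤ 1, so `teacher` stays on the simplex even though the intermediate state `y` does not.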

**Figure 9.** Decoding error rate. Our time reparameterization $\tau(t)$ redistributes time so each step contributes uniformly to the denoising signal; time samples shown in ticks. $$\tau(t) = \frac{P_e(0) - P_e(t)}{P_e(0)} = 1 - \frac{|V|}{|V| - 1}\, P_e(t). \tag{25}$$ By construction, this reparameterization redistributes time so that each step contributes equally to reducing the decoding error. We find this choice critical for stable training and g…
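
The reparameterization in Eq. (25) can be explored with a small Monte Carlo sketch. It assumes a linear interpolant between standard Gaussian noise and a one-hot vector and decodes by argmax; both choices, the vocabulary size of 4, and the function names are illustrative assumptions, and the paper may compute the decoding error differently.

```python
import random


def decoding_error(t, vocab_size, n_samples=4000, seed=0):
    """Estimate P_e(t): how often argmax of the noisy state misses the token."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        target = rng.randrange(vocab_size)
        # Assumed interpolant: (1 - t) * noise + t * one_hot(target).
        state = [(1.0 - t) * rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        state[target] += t
        decoded = max(range(vocab_size), key=state.__getitem__)
        errors += decoded != target
    return errors / n_samples


def tau(p_err, vocab_size):
    """Eq. (25): tau = 1 - |V| / (|V| - 1) * P_e, using P_e(0) = (|V| - 1) / |V|."""
    p_err_at_zero = (vocab_size - 1) / vocab_size
    return 1.0 - p_err / p_err_at_zero


V = 4
times = (0.0, 0.25, 0.5, 0.75, 1.0)
taus = [tau(decoding_error(t, V), V) for t in times]
```

At t = 0 the state is pure noise, so the estimate of τ is close to 0; at t = 1 the state is the one-hot vector itself, so τ equals 1.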

**Figure 11.** FLM generation quality. Generation performance of FLM on LM1B (left) and OWT (right) compared to diffusion baselines. FLM outperforms baselines at large step counts. Its performance degrades at low step counts, as it has not yet been distilled into an FMLM. Training. We train our flow-based language model following Section 4 for 1M steps with a batch size of 512 using the Adam optimizer [54] with a learni…

**Figure 12.** FMLM few-step generation. Few-step generation performance of FMLM on LM1B (left) and OWT (right) compared to distilled discrete diffusion. FMLM maintains strong generative perplexity across step counts and achieves state-of-the-art performance in the very few-step regime. Performance degrades slightly as the step count decreases and can be improved with further distillation.

**Figure 14.** Qualitative one-step generation. One-step samples from FMLM and distilled discrete diffusion baselines trained on LM1B. FMLM produces coherent, grammatical text, while discrete diffusion baselines generate incoherent token sequences (red, Gen. PPL > 1000) or repetitive tokens with collapsed entropy (red, Entropy < 4).

**Figure 15.** Autoguidance stability. FLM maintains stable generation quality across guidance scales η up to 100, while discrete baselines fail at η ≥ 10. Shaded region shows Gen. PPL > 1000 or entropy < 3.9, indicating nonsensical or collapsed generation. Results shown on LM1B across 128–1024 sampling steps.

**Figure 18.** Valid one-step samples from FMLM.

**Figure 21.** Eulerian and Lagrangian objectives on the simplex. (Left) The Eulerian teacher $\bar{\delta}_{s,t}$ is constructed from $\hat{D}_s(I_s)$ and derivatives of $\hat{\delta}_{s,t}$. (Right) The Lagrangian teacher is constructed from $\hat{D}_t(\hat{X}_{s,t}(I_s))$, requiring an intermediate flow map evaluation off the simplex. In both cases, the teacher may transiently leave the simplex during training due to derivative correction terms, but the cross-entropy l…

**Figure 29.** Samples generated by FLM trained on LM1B with different sampling steps.

**Figure 30.** Samples generated by FLM trained on OWT with different sampling steps.

**Figure 31.** One-step samples generated by FMLM trained on LM1B.

**Figure 32.** One-step samples generated by FMLM trained on OWT.

**Figure 33.** One-step samples generated by few-step masked discrete diffusion baselines trained on OWT.

**Figure 34.** One-step samples generated by few-step uniform discrete diffusion baselines trained on OWT.

**Figure 35.** Samples from FMLM trained on LM1B from fixed starting noise and varying the number of steps.

**Figure 36.** Samples from FMLM trained on OWT from fixed starting noise and varying the number of steps.

**Figure 37.** Samples generated by MDLM + SDTT [10] trained on LM1B from fixed initial random seed and varying the number of sampling steps.

**Figure 38.** Samples generated by Duo + DCD [19] trained on LM1B from fixed initial random seed and varying the number of sampling steps.

**Figure 39.** A sample from FMLM+FMTG (Section 5.3), rewarded by safety (TweetVal-Offensive [63], Label=Non-offensive).

**Figure 40.** A sample from FMLM+FMTG (Section 5.3), rewarded by topic (AG News [60], Label=Sports).

Original abstract

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Continuous flows over simplex embeddings can be distilled to a one-step map that beats 8-step discrete diffusion on LM1B and OWT.


The main thing here is that the authors build a continuous flow language model over one-hot token embeddings that first matches recent discrete diffusion baselines, then distill it into a flow map model whose single-step samples exceed the quality of 8-step discrete diffusion on LM1B and OpenWebText. The new element is the explicit flow-map structure that comes from the continuous ODE on the simplex, together with three cross-entropy distillation objectives that stay inside the probability geometry. This is not just another discrete noising schedule; the continuous formulation lets them define and learn a direct map rather than iterate a few steps. They also release code, which is useful for checking the implementation details and the exact comparison setup. That part is done cleanly and gives a concrete alternative to the discrete-diffusion line of work.

The soft spot is whether the distilled one-step map actually tracks the original continuous trajectories or ends up learning a different shortcut that happens to score well after discretization. The abstract claims uniqueness under the flow, but high-dimensional optimization with cross-entropy can produce averaging behavior instead of faithful ODE paths, and the stress-test note on Lipschitz conditions is reasonable. The reported gains look plausible from the abstract, but the soundness score stays moderate until the full ablation tables and metric breakdowns are checked.

This paper is for researchers focused on few-step or non-autoregressive generation who want to explore continuous dynamics on discrete data. A reader already working on diffusion or flow-based language models would get the most out of the distillation comparisons and the simplex-aware losses. It deserves a serious referee because the core claim is testable, the formulation is distinct from prior discrete work, and the empirical direction matters for inference cost. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces continuous flow language models (FLM) defined over one-hot token embeddings, which admit a unique flow map that can be learned directly. It distills the multi-step FLM into a one-step flow map language model (FMLM) using cross-entropy objectives that respect simplex geometry, and reports that the resulting one-step FMLM exceeds the quality of recent 8-step discrete diffusion language models on the LM1B and OpenWebText datasets while matching state-of-the-art discrete baselines in the multi-step regime.

Significance. If the empirical claims hold, the work supplies a continuous alternative to discrete diffusion for discrete modalities, enabling substantially faster inference without the sharp quality drop typically observed in few-step discrete models. The explicit comparison of three distillation choices and the public code release are positive features that support reproducibility and further exploration.

major comments (2)

[Abstract] Abstract: The central claim that the continuous formulation 'defines a unique flow map that can be learned directly' and that distillation preserves (or exceeds) multi-step flow quality is load-bearing for the one-step advantage over discrete methods. The manuscript should state the regularity conditions (e.g., Lipschitz continuity of the velocity field) under which uniqueness is guaranteed and provide empirical diagnostics showing that the learned one-step map follows the underlying ODE trajectory rather than producing averaged or shortcut trajectories in the high-dimensional simplex.
[Results] Results section (comparison tables): The reported outperformance of one-step FMLM over 8-step discrete diffusion baselines must be accompanied by the precise metrics, data splits, and ablation tables that isolate the contribution of each distillation choice. Without these, it is difficult to attribute gains specifically to the flow-map structure rather than to differences in training regime or architecture.

minor comments (2)

[Methods] Notation: Define the velocity field and the flow map operator more explicitly early in the methods section to avoid ambiguity when moving between continuous dynamics and discrete token sampling.
[Introduction] References: Ensure that prior continuous-flow or ODE-based generative models for discrete data are cited to situate the novelty claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.


Referee: [Abstract] Abstract: The central claim that the continuous formulation 'defines a unique flow map that can be learned directly' and that distillation preserves (or exceeds) multi-step flow quality is load-bearing for the one-step advantage over discrete methods. The manuscript should state the regularity conditions (e.g., Lipschitz continuity of the velocity field) under which uniqueness is guaranteed and provide empirical diagnostics showing that the learned one-step map follows the underlying ODE trajectory rather than producing averaged or shortcut trajectories in the high-dimensional simplex.

Authors: We agree that the regularity conditions supporting uniqueness merit explicit statement. In the revised manuscript we will add a brief discussion in Section 3 noting that, under the standard assumption that the learned velocity field is Lipschitz continuous (which is satisfied by the neural-network parameterization with bounded weights), the Picard-Lindelöf theorem guarantees a unique flow map. We will also include new empirical diagnostics in the appendix that compare one-step predictions against multi-step ODE integration on held-out sequences, confirming trajectory alignment rather than averaging or shortcut behavior. revision: yes
Referee: [Results] Results section (comparison tables): The reported outperformance of one-step FMLM over 8-step discrete diffusion baselines must be accompanied by the precise metrics, data splits, and ablation tables that isolate the contribution of each distillation choice. Without these, it is difficult to attribute gains specifically to the flow-map structure rather than to differences in training regime or architecture.

Authors: We acknowledge the need for greater transparency in the results. The revised version will expand the main results table to report exact metric values (perplexity and bits-per-character) together with the precise train/validation/test splits used for LM1B and OpenWebText. We will also enlarge the ablation section with a dedicated table that isolates each of the three distillation objectives while controlling for architecture size and training compute, thereby clarifying the contribution of the flow-map structure itself. revision: yes
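
For readers following the uniqueness argument in the first response, the standard statement can be sketched as follows. The symbol $b_t$ for the velocity field and the constant $L$ are notation assumed here for illustration; the paper's own notation and exact hypotheses may differ.

```latex
% Assumed setting: b_t is continuous in t and Lipschitz in x, uniformly in t.
\[
  \dot{x}_t = b_t(x_t), \qquad
  \| b_t(x) - b_t(y) \| \le L \, \| x - y \|
  \quad \text{for all } x, y \text{ and } t \in [0, 1].
\]
% Picard--Lindel\"of: each initial condition x_s = x has exactly one solution,
% so the two-time flow map is well defined.
\[
  X_{s,t}(x) = x_t \quad \text{where } x_s = x .
\]
% Properties that follow from uniqueness.
\[
  X_{s,s}(x) = x, \qquad
  X_{u,t}\bigl( X_{s,u}(x) \bigr) = X_{s,t}(x), \qquad
  \partial_t X_{s,t}(x) = b_t\bigl( X_{s,t}(x) \bigr).
\]
```

The middle identity is the semigroup property; it is what allows a long jump to be supervised by two shorter ones during distillation.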

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent continuous dynamics and distillation

full rationale

The paper's central claims rest on defining a new continuous flow over one-hot embeddings, showing it admits a flow map learnable via cross-entropy, and empirically comparing distillation variants against discrete baselines on LM1B and OWT. No equation reduces a performance prediction to a fitted constant or prior result from the same authors; uniqueness is asserted as a property of the continuous ODE rather than imported via self-citation chain or ansatz smuggling. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the existence of a continuous flow on the probability simplex whose associated flow map can be distilled while preserving quality; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: A continuous flow over one-hot embeddings admits a unique flow map that can be learned directly for few-step inference. Stated in the abstract as the key structural property unavailable to discrete methods.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

UNCLEAR: relation between the paper passage and the cited Recognition theorem.

Paper passage: our continuous formulation defines a unique flow map that can be learned directly... two-time denoiser $\delta_{s,t}(x) := x + (1 - s)\, v_{s,t}(x)$ ... $\delta_{s,t}(x)^{l} \in \Delta^{|V|-1}$ ... semigroup condition

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: continuous flows over one-hot token embeddings... simplex geometry of the data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

Drifting Objectives for Refining Discrete Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0
TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.

Sampling from Flow Language Models via Marginal-Conditioned Bridges
cs.LG 2026-05 unverdicted novelty 7.0
Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and r...

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
cs.CL 2026-04 unverdicted novelty 7.0
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
cs.LG 2026-05 unverdicted novelty 6.0
DiLaDiff augments masked diffusion LMs with latent space modeling and consistency distillation to improve token correlation capture and inference speed.

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
cs.CL 2026-05 conditional novelty 6.0
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.

ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0
ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
cs.CL 2026-05 unverdicted novelty 6.0
Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.

Coupling Models for One-Step Discrete Generation
cs.LG 2026-05 unverdicted novelty 6.0
Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 9 Pith papers · 30 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. (page 1)
[2] Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. (page 1)
[3] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. (page 1)
[4] Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, et al. Mercury: Ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298, 1, 2025. (pages 1 and 2)
[5] Google DeepMind. Gemini diffusion. https://deepmind.google/models/gemini-diffusion/, 2025. Accessed: 2026-01-25. (page 1)
[6] Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed Diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193, 2025. (page 1)
[7] Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLMs via self-distillation through time. arXiv preprint arXiv:2410.21035, 2024. (pages 1, 3, 14, 26, and 45)
[8] Sander Dieleman. Diffusion language models. https://benanne.github.io/2023/01/09/diffusion-language.html, 2023. Accessed: 2026-01-25. (page 2)
[9] Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024. (pages 2, 12, and 26)
[10] Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding. arXiv preprint arXiv:2505.22618, 2025. (pages 2, 4, 44, 46, and 55)
[11] Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, et al. Parallelbench: Understanding the trade-offs of parallel decoding in diffusion LLMs. arXiv preprint arXiv:2510.04767, 2025. (pages 2, 4, and 26)
[12] Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-LM improves controllable text generation. Advances in neural information processing systems, 35:4328–4343. (pages 2, 4, and 26)
[14] Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022. (pages 2, 4, 5, 10, 16, 17, 26, and 40)
[15] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. (pages 2 and 4)
[16] Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023. (pages 2, 4, 5, and 34)
[17] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. (page 2)
[18] Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. arXiv preprint arXiv:2505.18825, 2025. (pages 2, 6, 7, 26, 27, 28, 41, and 42)
[19] Nicholas M. Boffi, Michael S. Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models. arXiv:2406.07507, 2025. (pages 2, 6, 7, 26, 27, 28, 31, and 41)
[20] Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. arXiv preprint arXiv:2506.10892, 2025. (pages 2, 10, 11, 13, 14, 26, 43, 44, 45, 46, and 56)
[21] Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025. (pages 2, 10, 13, 16, 17, 26, and 43)
[22] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025. (page 2)
[23] Michael I Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 8, 1986. (page 3)
[24] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990. (page 3)
[25] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003. (page 3)
[26] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017. (pages 3 and 26)
[27] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018. (page 3)
[28] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021. (pages 3, 6, and 26)
[29] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023. (page 3)
[30] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024. (pages 3, 11, 13, 26, 43, and 44)
[31] Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. arXiv preprint arXiv:2412.10193, 2024. (pages 3, 16, and 26)
[32] Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. Advances in Neural Information Processing Systems, 37:133345–133385. (pages 3, 6, and 26)
[34] Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022. (pages 4, 6, 8, and 26)
[35] Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36:16693–16715, 2023. (pages 4 and 26)
[36] Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022. (pages 4 and 26)
[37] Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. Advances in Neural Information Processing Systems, 36:56998–57025, 2023. (pages 4 and 26)
[38] Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022. (pages 4 and 26)
[39] Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596. (pages 4, 16, and 26)
[41] Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2347–2361, 2024. (pages 4, 16, and 26)
[42] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025. (pages 4 and 5)
[43] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025. (page 4)
[44] Floor Eijkelboom, Grigory Bartosh, Christian Andersson Naesseth, Max Welling, and Jan-Willem van de Meent. Variational flow matching for graph generation. Advances in Neural Information Processing Systems, 37:11735–11764, 2024. (page 5)
[45] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025. (pages 6, 7, 17, 18, 26, and 27)
[46] Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012, 2025. (pages 6, 7, 26, and 28)
[47] Linqi Zhou, Mathias Parger, Ayaan Haque, and Jiaming Song. Terminal velocity matching. arXiv preprint arXiv:2511.19797, 2025. (pages 7, 26, 27, and 28)
[48] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. (pages 7, 26, and 27)
[49] Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024. (pages 7, 10, 26, 27, 28, and 43)
[50] Dejan Stancevic, Florian Handke, and Luca Ambrogioni. Entropic time schedulers for generative diffusion models. arXiv preprint arXiv:2504.13612, 2025. (pages 10 and 40)
[51] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024. (pages 11 and 16)
[52] Jerry Huang, Justin Lin, Sheel Shah, Kartik Nair, and Nicholas M. Boffi. How to guide your flow: Steering flow maps for rapid test-time alignment, 2025. Forthcoming. (pages 11, 16, and 17)
[53] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013. (page 11)
[54] Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. (page 11)
[55] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. (page 11)
[56] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. (page 11)
[57] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. (pages 12 and 43)
[58] Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models. arXiv preprint arXiv:2512.02636, 2025. (page 12)
[59] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. (pages 12 and 45)
[60] Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. arXiv preprint arXiv:2502.11564, 2025. (pages 13, 16, 17, 26, and 43)
[61] Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. Distillation of discrete diffusion through dimensional correlations. arXiv preprint arXiv:2410.08709, 2024. (pages 14, 26, and 44)
[62] Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 1097–1100, 2018. (pages 15 and 44)
[63] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015. (pages 16, 46, and 58)
[64] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. (page 16)
[65] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011. (page 16)
[66] Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650, Online, November 2020. Association for Computational Lin...
[67] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. (page 16)
[68] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. (page 16)
[69] Jaesung Tae, Hamish Ivison, Sachin Kumar, and Arman Cohan. Tess 2: A large-scale generalist diffusion language model. arXiv preprint arXiv:2502.13917, 2025. (pages 16, 17, and 26)
[70] Mariia Drozdova. Can continuous-time diffusion models generate and solve globally constrained discrete problems? A study on sudoku. arXiv preprint arXiv:2601.20363, 2026. (page 18)
[71] Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do large language models latently perform multi-hop reasoning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10210–10229, 2024. (page 18)
[72] Jaehoon Yoo, Wonjung Kim, and Seunghoon Hong. Redi: Rectified discrete flow. arXiv preprint arXiv:2507.15897, 2025. (page 26)
[73] Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. Continuously augmented discrete diffusion model for categorical generative modeling. arXiv preprint arXiv:2510.01329, 2025. (page 26)
[74] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022. (page 26)
[75] Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. Advances in Neural Information Processing Systems, 37:54787–54819, 2024. (page 26)
[76] Oscar Davis, Samuel Kessler, Mircea Petrache, İsmail İ Ceylan, Michael Bronstein, and Avishek J Bose. Fisher flow matching for generative modeling over discrete data. Advances in Neural Information Processing Systems, 37:139054–139084, 2024. (page 26)
[77] Bernardo Williams, Victor M Yeom-Song, Marcelo Hartmann, and Arto Klami. Simplex-to-euclidean bijections for categorical flow matching. arXiv preprint arXiv:2510.27480, 2025. (page 26)
[78] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023. (pages 26 and 27)
[79] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. (page 26)
[80] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025. (page 26)


[14] [14]

Continuous diffusion for categorical data

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data.arXiv preprint arXiv:2211.15089, 2022. (pages 2, 4, 5, 10, 16, 17, 26, and 40)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[15] [15]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. (pages 2 and 4)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.arXiv preprint arXiv:2303.08797, 2023. (pages 2, 4, 5, and 34)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020. (page 2)

work page internal anchor Pith review Pith/arXiv arXiv 2011

[18] [18]

How to build a consistency model: Learning flow maps via self-distillation.arXiv preprint arXiv:2505.18825, 2025

Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation.arXiv preprint arXiv:2505.18825, 2025. (pages 2, 6, 7, 26, 27, 28, 41, and 42)

work page arXiv 2025

[19] [19]

Boffi, Michael S

Nicholas M. Boffi, Michael S. Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models.arXiv:2406.07507, 2025. (pages 2, 6, 7, 26, 27, 28, 31, and 41)

work page arXiv 2025

[20] [20]

The diffusion duality.arXiv preprint arXiv:2506.10892, 2025

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality.arXiv preprint arXiv:2506.10892, 2025. (pages 2, 10, 11, 13, 14, 26, 43, 44, 45, 46, and 56)

work page arXiv 2025

[21] [21]

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I

Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025. (pages 2, 10, 13, 16, 17, 26, and 43)

work page arXiv 2025

[22] [22]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025. (page 2) 20

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Attractor dynamics and parallelism in a connectionist sequential machine

Michael I Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 8, 1986. (page 3)

work page 1986

[24] [24]

Finding structure in time.Cognitive science, 14(2):179–211, 1990

Jeffrey L Elman. Finding structure in time.Cognitive science, 14(2):179–211, 1990. (page 3)

work page 1990

[25] [25]

A neural probabilistic language model.Journal of machine learning research, 3(Feb):1137–1155, 2003

Yoshua Bengio, R´ ejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.Journal of machine learning research, 3(Feb):1137–1155, 2003. (page 3)

work page 2003

[26] [26]

Non-Autoregressive Neural Machine Translation

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation.arXiv preprint arXiv:1711.02281, 2017. (pages 3 and 26)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[27] [27]

Blockwise parallel decoding for deep autoregressive models.Advances in Neural Information Processing Systems, 31, 2018

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models.Advances in Neural Information Processing Systems, 31, 2018. (page 3)

work page 2018

[28] [28]

Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 34:17981–17993, 2021. (pages 3, 6, and 26)

work page 2021

[29] [29]

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution.arXiv preprint arXiv:2310.16834, 2023. (page 3)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024. (pages 3, 11, 13, 26, 43, and 44)

work page 2024

[31] [31]

Simple guidance mechanisms for discrete diffusion models.arXiv preprint arXiv:2412.10193, 2024

Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models.arXiv preprint arXiv:2412.10193, 2024. (pages 3, 16, and 26)

work page arXiv 2024

[32] [32]

Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385,

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385,

work page

[33] [33]

(pages 3, 6, and 26)

work page

[34] [34]

A continuous time framework for discrete denoising models.Advances in Neural Information Processing Systems, 35:28266–28279, 2022

Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models.Advances in Neural Information Processing Systems, 35:28266–28279, 2022. (pages 4, 6, 8, and 26)

work page 2022

[35] [35]

Likelihood-based diffusion language models.Advances in Neural Information Processing Systems, 36:16693–16715, 2023

Ishaan Gulrajani and Tatsunori B Hashimoto. Likelihood-based diffusion language models.Advances in Neural Information Processing Systems, 36:16693–16715, 2023. (pages 4 and 26)

work page 2023

[36] [36]

Self- conditioned embedding diffusion for text generation,

Robin Strudel, Corentin Tallec, Florent Altch´ e, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation.arXiv preprint arXiv:2211.04236, 2022. (pages 4 and 26)

work page arXiv 2022

[37] [37]

Latent diffusion for language generation.Advances in Neural Information Processing Systems, 36:56998–57025, 2023

Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation.Advances in Neural Information Processing Systems, 36:56998–57025, 2023. (pages 4 and 26)

work page 2023

[38] [38]

Analog bits: Generating discrete data using diffusion models with self-conditioning

Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202, 2022. (pages 4 and 26)

work page arXiv 2022

[39] [39]

Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596,

work page

[40] [40]

(pages 4, 16, and 26) 21

work page

[41] Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E Peters, and Arman Cohan. Tess: Text-to-text self-conditioned simplex diffusion. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2347–2361, 2024. (pages 4, 16, and 26)

[42] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025. (pages 4 and 5)

[43] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025. (page 4)

[44] Floor Eijkelboom, Grigory Bartosh, Christian Andersson Naesseth, Max Welling, and Jan-Willem van de Meent. Variational flow matching for graph generation. Advances in Neural Information Processing Systems, 37:11735–11764, 2024. (page 5)

[45] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025. (pages 6, 7, 17, 18, 26, and 27)

[46] Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012, 2025. (pages 6, 7, 26, and 28)

[47] Linqi Zhou, Mathias Parger, Ayaan Haque, and Jiaming Song. Terminal velocity matching. arXiv preprint arXiv:2511.19797, 2025. (pages 7, 26, 27, and 28)

[48] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. (pages 7, 26, and 27)

[49] Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024. (pages 7, 10, 26, 27, 28, and 43)

[50] Dejan Stancevic, Florian Handke, and Luca Ambrogioni. Entropic time schedulers for generative diffusion models. arXiv preprint arXiv:2504.13612, 2025. (pages 10 and 40)

[51] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024. (pages 11 and 16)

[52] Jerry Huang, Justin Lin, Sheel Shah, Kartik Nair, and Nicholas M. Boffi. How to guide your flow: Steering flow maps for rapid test-time alignment, 2025. Forthcoming. (pages 11, 16, and 17)

[53] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013. (page 11)

[54] Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. (page 11)

[55] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. (page 11)

[56] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. (page 11)

[57] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. (pages 12 and 43)

[58] Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models. arXiv preprint arXiv:2512.02636, 2025. (page 12)

[59] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. (pages 12 and 45)

[60] Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. arXiv preprint arXiv:2502.11564, 2025. (pages 13, 16, 17, 26, and 43)

[61] Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. Distillation of discrete diffusion through dimensional correlations. arXiv preprint arXiv:2410.08709, 2024. (pages 14, 26, and 44)

[62] Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 1097–1100, 2018. (pages 15 and 44)

[63] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015. (pages 16, 46, and 58)

[64] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. (page 16)

[65] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011. (page 16)

[66] Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650, Online, November 2020. Association for Computational Lin...

[67] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. (page 16)

[68] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. (page 16)

[69] Jaesung Tae, Hamish Ivison, Sachin Kumar, and Arman Cohan. Tess 2: A large-scale generalist diffusion language model. arXiv preprint arXiv:2502.13917, 2025. (pages 16, 17, and 26)

[70] Mariia Drozdova. Can continuous-time diffusion models generate and solve globally constrained discrete problems? a study on sudoku. arXiv preprint arXiv:2601.20363, 2026. (page 18)

[71] Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do large language models latently perform multi-hop reasoning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10210–10229, 2024. (page 18)

[72] Jaehoon Yoo, Wonjung Kim, and Seunghoon Hong. Redi: Rectified discrete flow. arXiv preprint arXiv:2507.15897, 2025. (page 26)

[73] Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. Continuously augmented discrete diffusion model for categorical generative modeling. arXiv preprint arXiv:2510.01329, 2025. (page 26)

[74] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022. (page 26)

[75] Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. Advances in Neural Information Processing Systems, 37:54787–54819, 2024. (page 26)

[76] Oscar Davis, Samuel Kessler, Mircea Petrache, İsmail İ Ceylan, Michael Bronstein, and Avishek J Bose. Fisher flow matching for generative modeling over discrete data. Advances in Neural Information Processing Systems, 37:139054–139084, 2024. (page 26)

[77] Bernardo Williams, Victor M Yeom-Song, Marcelo Hartmann, and Arto Klami. Simplex-to-euclidean bijections for categorical flow matching. arXiv preprint arXiv:2510.27480, 2025. (page 26)

[78] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023. (pages 26 and 27)

[79] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. (page 26)

[80] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025. (page 26)