Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

Arash Vahdat; John Thickstun; Morteza Mardani; Shuibai Zhang; Subham Sekhar Sahoo; Wei Guo; Yongxin Chen; Zhihan Yang

arxiv: 2605.18530 · v1 · pith:IC42QEPGnew · submitted 2026-05-18 · cs.CL · cs.AI · cs.LG · stat.ML

Zhihan Yang, Wei Guo, Shuibai Zhang, Subham Sekhar Sahoo, Yongxin Chen, Arash Vahdat, Morteza Mardani, John Thickstun

Pith reviewed 2026-05-20 11:01 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG · stat.ML

keywords continuous diffusion · diffusion language models · language modeling · scaling laws · likelihood training · perplexity · OpenWebText · noise schedule

The pith

RePlaid shows that continuous diffusion language models scale competitively with discrete ones, narrowing the compute gap to autoregressive models to 20×.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper challenges the notion that continuous diffusion is less scalable for language modeling than discrete methods by updating the Plaid model into RePlaid. The key step is aligning its architecture with current discrete diffusion language models while keeping the continuous diffusion process. In this setup, RePlaid demonstrates power-law scaling comparable to discrete models, a 20× compute overhead relative to autoregressive baselines, and better results than several other continuous and discrete models in specific regimes. Readers might care because it opens the possibility that continuous-valued diffusion could become a practical route for large language models, potentially with advantages in sampling or representation. The authors also provide theory showing that likelihood training leads to a linear loss of information over time, which spreads the denoising task evenly, and that it organizes embeddings into useful structures.

Core claim

RePlaid exhibits a compute gap of only 20× compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime while achieving a new state-of-the-art PPL bound of 22.1 among continuous DLMs on OpenWebText. This is enabled by aligning Plaid's architecture with modern discrete DLMs and using likelihood-based training, which optimizes the noise schedule to yield linear cross-entropy over time and creates structured geometries in embeddings.
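A "compute gap" of this kind can be read as the factor by which one model family must increase its FLOPs to reach the loss of another. The sketch below illustrates that reading, assuming each family's compute-optimal loss follows a pure power law `loss = a * compute**(-b)`. The budgets, coefficients, and function names are hypothetical and chosen only for illustration; they are not the paper's data or fitting procedure.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

def compute_multiplier(ref_fit, other_fit, ref_compute):
    """Factor by which `other` must scale ref_compute to match the reference loss."""
    a_ref, b_ref = ref_fit
    a_oth, b_oth = other_fit
    target_loss = a_ref * ref_compute ** (-b_ref)
    # Solve a_oth * C**(-b_oth) = target_loss for C.
    other_compute = (a_oth / target_loss) ** (1.0 / b_oth)
    return other_compute / ref_compute

# Hypothetical budgets and losses (NOT values from the paper).
budgets = [1e18, 3e18, 1e19, 3e19, 1e20]
ar_loss = [40.0 * c ** (-0.06) for c in budgets]
# Same exponent as AR, but shifted so that 20x more compute is needed.
dlm_loss = [40.0 * (c / 20.0) ** (-0.06) for c in budgets]

gap = compute_multiplier(
    fit_power_law(budgets, ar_loss), fit_power_law(budgets, dlm_loss), 1e19
)
print(f"compute multiplier: {gap:.1f}x")
```

When the two fitted exponents are equal, as in this toy setup, the multiplier does not depend on the reference budget. When they differ, it does, so a single headline number implicitly fixes the budget at which it is measured.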

What carries the argument

Architecture alignment of continuous diffusion language models with discrete counterparts combined with likelihood optimization that minimizes ELBO variance for linear information loss.

If this is right

Continuous DLMs become viable at scale with limited extra compute cost.
Performance advantages appear in over-trained settings compared to some discrete models.
New state-of-the-art perplexity achieved for continuous diffusion on standard benchmarks.
Likelihood training distributes denoising difficulty evenly without custom time adjustments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If continuous diffusion scales well, models may gain flexibility in generating text by operating in continuous space rather than discrete tokens.
The linear cross-entropy from optimized schedules could simplify training procedures in other diffusion models.
Structured embeddings might lead to improved performance in tasks requiring semantic understanding.

Load-bearing premise

Aligning the architecture of Plaid with modern discrete DLMs fairly isolates the continuous versus discrete difference without confounding effects from training or tuning differences.

What would settle it

Reproducing the OpenWebText experiments and finding RePlaid's perplexity bound significantly above 22.1 or the compute gap exceeding 20× with matched hyperparameter tuning.
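For anyone attempting such a reproduction, it helps to recall how a perplexity bound relates to the training objective. A minimal sketch, assuming the bound is the exponential of the negative ELBO (in nats) averaged per token, which is the standard convention; the function names are illustrative and not taken from the paper's code.

```python
import math

def ppl_bound(total_nelbo_nats, num_tokens):
    """Upper bound on perplexity from a summed negative ELBO in nats."""
    return math.exp(total_nelbo_nats / num_tokens)

def nats_per_token(ppl):
    """Per-token loss in nats implied by a perplexity value."""
    return math.log(ppl)

def bits_per_token(ppl):
    """Per-token loss in bits implied by a perplexity value."""
    return math.log2(ppl)

# The reported bound of 22.1, expressed as per-token losses.
print(f"{nats_per_token(22.1):.3f} nats/token")
print(f"{bits_per_token(22.1):.3f} bits/token")
```

Because the negative ELBO upper-bounds the negative log-likelihood, the resulting number upper-bounds the true perplexity. A reproduction that lands above 22.1 under the same evaluation protocol would therefore count against the claim, not merely reflect a loose bound.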

Figures

Figures reproduced from arXiv: 2605.18530 by Arash Vahdat, John Thickstun, Morteza Mardani, Shuibai Zhang, Subham Sekhar Sahoo, Wei Guo, Yongxin Chen, Zhihan Yang.

**Figure 1.** (a-b) IsoFLOP curves identify optimal model sizes (black crosses) across fixed compute budgets. The optimal RePlaid (s.c.) loss exhibits power-law scaling, decreasing at a rate comparable to MDLM. To match AR loss, MDLM and RePlaid (s.c.) require 14× and 20× the compute, respectively. In the over-trained regime below the green line, RePlaid (s.c.) consistently outperforms MDLM (Sec. 3.4). (c) The effect of…

**Figure 2.** Scaling laws. (a) The compute-optimal RePlaid loss exhibits power-law scaling, decreasing at a rate comparable to AR, MDLM, and Duo. MDLM requires 14× FLOPs to match AR; Duo needs 22×; RePlaid consumes 20× with self-conditioning (s.c.) and 27× without it. (b) The compute-optimal RePlaid (s.c.) uses 1.8× fewer parameters than MDLM and Duo (while outperforming Duo's loss in (a)) and uses 3.4× fewer than AR…

**Figure 3.** (a-b) GenPPL and MAUVE on OWT samples (L = 1024, N = 5120). No temperature is used. All models are trained for 1M steps on OWT. (c) GenPPL-entropy frontier as T increases (darker color: higher T). We observe that Duo and FLM have worse entropy than RePlaid (no s.c.) at comparable GenPPL levels.

**Figure 4.** MAUVE on OWT samples (L = 1024, N = 5120) versus sampling steps T, comparing the ancestral DDPM sampler, DDIM, DPM-Solver++(2M), and Heun on RePlaid, with LangFlow and Duo plotted as baselines.

**Figure 5.** Visualizing geometry of learned embeddings of RePlaid (s.c.) (OWT PPL: 22.1 at 1M). (a) 2D t-SNE plot with each subword colored by its most frequent POS tag. (b) PCA scree plot of E. (c) PCA scree plot of E when an auxiliary CE loss is added (Sec. 5.1), dispersing the embeddings (OWT PPL: 26.1 at 1M).

**Figure 6.** Visualizing per-timestep diffusion loss, CE loss, and decoding error for RePlaid (s.c.) (OWT PPL: 24.9), an ablation that learns noise schedule shape but freezes embeddings (OWT PPL: 45.1), and an ablation that learns embeddings but freezes the noise schedule (OWT PPL: 28.0). Models are 250K and the two losses are length-normalized. Empirically, whenever the noise schedule is learned, the per-timestep diff…

**Figure 7.** IWAE K-curve of (23) on a fixed 1024-sequence OWT-valid subset (solid lines), with matched VDM NELBOs of (4) drawn as dashed references. Dotted curves are the leading-order a + b/K fits on K ≥ 4 with extrapolated asymptotes PPL∞ in the legend; the visible deviation at K ≤ 2 reflects O(1/K²) corrections to the Θ(1/K) IWAE bias of Nowozin [40].

**Figure 8.** The noise schedule endpoints (γ0, γ1) and the interior shape of γ(t) are separately parameterized; they are trained to minimize the diffusion loss and its estimator variance, respectively.

**Figure 9.** IsoFLOP curves plot optimal model sizes under fixed compute budgets. The optimal RePlaid loss exhibits power-law scaling, decreasing at a rate comparable to AR and MDLM. MDLM (low var.), Duo, RePlaid (s.c.), and RePlaid (no s.c.) exhibit 14×, 22×, 20×, and 27× worse scaling than AR, respectively. In the over-trained region below the green line, RePlaid (s.c.) beats MDLM (low var.).

**Figure 10.** Exchanging the embedding configurations of MDLM and RePlaid (s.c.) degrades both methods. For comparison, we unify the vertical range of all methods and do not show points outside of this range (e.g., (d)).

**Figure 11.** Quality-diversity trade-off of RePlaid against discrete and continuous DLM baselines for unconditional generation on OWT. Markers denote τ = 1. We use DFM numbers reported in Potaptchik et al. [44] (T = 512 not available). RePlaid is competitive with discrete DLMs and surpasses prior continuous DLMs.

**Figure 12.** Quality-diversity trade-off of RePlaid against LangFlow for unconditional generation on OWT. Markers denote τ = 1. High values of τ lead to degenerate samples for both LangFlow and RePlaid with self-conditioning on; the parts of the curves corresponding to this degenerate behavior are omitted.

**Figure 13.** Comparing PCA scree plots for RePlaid (s.c.) and LangFlow, both using self-conditioning, d_e = 768, and length-normalized embeddings. RePlaid (s.c.) yields a lower-rank embedding geometry while achieving a better PPL bound.

**Figure 14.** PCA scree plot for MDLM (low var.) (d_e = 768 by default) trained on OWT for 1M steps.

**Figure 15.** Learning the noise schedule leads to a near-linear per-timestep cross-entropy loss regardless of the embedding geometry.
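Several of the captions above refer to PCA scree plots of the embedding matrix E (Figures 5, 13, and 14). As a rough illustration of how such a plot is computed, the sketch below derives per-component and cumulative explained-variance ratios from the singular values of a centered matrix. The synthetic low-rank matrix, its dimensions, and the function name are assumptions made for this example; it does not use the paper's embeddings.

```python
import numpy as np

def scree(embeddings):
    """Return per-component and cumulative explained-variance ratios."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    ratio = variances / variances.sum()
    return ratio, np.cumsum(ratio)

# Synthetic stand-in for an embedding matrix: rank-8 signal plus small noise.
rng = np.random.default_rng(0)
vocab, dim, rank = 2000, 64, 8
signal = rng.normal(size=(vocab, rank)) @ rng.normal(size=(rank, dim))
E = signal + 0.01 * rng.normal(size=(vocab, dim))

ratio, cumulative = scree(E)
print("variance captured by first", rank, "components:", cumulative[rank - 1])
```

A "lower-rank embedding geometry" in the sense of Figure 13 corresponds to a cumulative curve that saturates after fewer components, as it does here after the first eight.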

Original abstract

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
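To make "linear cross-entropy over time" concrete, it may help to see what an explicit time reparameterization does when the curve is not linear. The sketch below takes a monotone cross-entropy curve measured on a time grid and builds a map u -> t under which the cross-entropy becomes linear in u. This is only an illustration of the hand-tuned alternative that the abstract says becomes unnecessary; the cubic curve, the constant `MAX_CE`, and the function names are hypothetical, and this is not the authors' procedure.

```python
from bisect import bisect_left

def linearizing_time_map(times, ce_values):
    """Build u -> t so that CE(t(u)) is linear in u on [0, 1].

    Assumes ce_values is strictly increasing along times.
    """
    lo, hi = ce_values[0], ce_values[-1]

    def t_of_u(u):
        target = lo + u * (hi - lo)
        # First grid index whose CE is >= target, clamped to a valid segment.
        i = min(max(bisect_left(ce_values, target), 1), len(ce_values) - 1)
        frac = (target - ce_values[i - 1]) / (ce_values[i] - ce_values[i - 1])
        # Linear interpolation of the inverse curve within the segment.
        return times[i - 1] + frac * (times[i] - times[i - 1])

    return t_of_u

MAX_CE = 10.0  # hypothetical cross-entropy (nats) at full noise

def measured_ce(t):
    """Hypothetical non-linear CE curve: most information is lost late."""
    return MAX_CE * t ** 3

grid = [i / 1000 for i in range(1001)]
t_of_u = linearizing_time_map(grid, [measured_ce(t) for t in grid])

for u in (0.25, 0.5, 0.75):
    print(u, round(measured_ce(t_of_u(u)), 3))
```

Sampling u uniformly and training at t(u) spends equal effort per unit of information lost. The abstract's claim is that a noise schedule learned to minimize ELBO variance reaches this state on its own, so no such case-specific map has to be constructed.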

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Continuous diffusion closes much of the scaling gap to discrete methods after architecture alignment, but the isolation of that effect still needs tighter checks.

The main thing to know is that this paper shows continuous diffusion language models can scale competitively with discrete ones once the architecture is aligned with modern discrete DLMs. It reports a new state-of-the-art perplexity bound of 22.1 among continuous DLMs on OpenWebText and a compute gap of only 20× relative to autoregressive models. It also outperforms Duo (with fewer parameters) and MDLM (in the over-trained regime).

What the paper does well is present the first explicit scaling law for continuous DLMs, along with empirical curves that support the competitiveness claim. The theoretical insight that optimizing the noise schedule to minimize ELBO variance leads to linear cross-entropy over time is a solid contribution, as is the observation that likelihood-based embedding optimization creates structured geometries. These elements provide both practical results and some explanatory power for why likelihood training helps.

The softer part is the isolation of the continuous versus discrete effect. While architecture alignment is a reasonable approach, the lack of detailed information on exact training budgets, data splits, and hyperparameter matching raises the possibility that some gains come from tuning differences or optimization trajectories rather than from the diffusion formulation itself. The stress-test concern about residual confounding holds some weight here, and it would be good to see more ablations confirming that the advantages are not due to those factors.

This work is aimed at researchers exploring alternatives to autoregressive language modeling, particularly those interested in diffusion models for text generation and their scaling properties. Readers focused on non-autoregressive methods or controllability in generation would find value in the benchmarks and scaling study. Overall, the paper shows clear thinking with empirical support and deserves a serious referee to evaluate the claims more thoroughly, especially around the fairness of the comparison.
I recommend sending this to peer review.

Referee Report

2 major / 3 minor

Summary. The paper revisits the Plaid continuous diffusion language model and constructs RePlaid by aligning its architecture with modern discrete DLMs. It reports the first scaling law for continuous DLMs, claiming a compute gap of only 20× relative to autoregressive models, outperformance of Duo (with fewer parameters) and MDLM (in the over-trained regime), and a new state-of-the-art PPL bound of 22.1 among continuous DLMs on OpenWebText, along with superior generation quality. The authors also provide theoretical insights showing that likelihood-based noise schedule optimization minimizes ELBO variance to produce linear cross-entropy over time, and that likelihood-trained embeddings induce structured geometries that drive likelihood gains.

Significance. If the empirical scaling results and isolation of the continuous formulation hold after addressing comparison details, the work would be significant for demonstrating that continuous diffusion can scale competitively with discrete approaches for language modeling. This challenges prevailing views on scalability, provides the first explicit scaling law for continuous DLMs, and offers theoretical grounding for likelihood training advantages, potentially broadening research into continuous diffusion models as practical alternatives.

major comments (2)

[§4] §4 (Scaling Experiments): The central claim of a 20× compute gap and fair isolation of continuous vs. discrete effects via architectural alignment requires explicit reporting of matched training budgets, data splits, hyperparameter grids, and optimization trajectories for RePlaid versus Duo, MDLM, and autoregressive baselines; without these, residual differences in noise schedule optimization or embedding geometry could confound attribution to the continuous formulation.
[Table 1] Table 1 / Figure 3 (PPL and scaling curves): The reported SOTA PPL bound of 22.1 and outperformance in the over-trained regime should include per-model training FLOPs or step counts alongside the curves to substantiate competitiveness; current presentation leaves open whether gains stem from the continuous likelihood objective or unstated tuning advantages.

minor comments (3)

[Abstract] The abstract and §3 would benefit from a brief explicit statement of the exact likelihood objective used for RePlaid to distinguish it from prior continuous DLMs.
Figure legends for scaling plots should clarify axis scaling (e.g., compute in FLOPs vs. tokens) and include error bars or multiple seeds for the reported curves.
[§2] Add a short reference or citation to the original Plaid work when describing the base architecture in §2 to aid readers unfamiliar with the lineage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our scaling results and experimental controls. We address each major comment below and will incorporate the requested details into the revised manuscript.

Referee: [§4] §4 (Scaling Experiments): The central claim of a 20× compute gap and fair isolation of continuous vs. discrete effects via architectural alignment requires explicit reporting of matched training budgets, data splits, hyperparameter grids, and optimization trajectories for RePlaid versus Duo, MDLM, and autoregressive baselines; without these, residual differences in noise schedule optimization or embedding geometry could confound attribution to the continuous formulation.

Authors: We agree that greater transparency on experimental controls is valuable. In the revision we will add an expanded experimental setup section and a supplementary table that lists training budgets (FLOPs and steps), data splits, hyperparameter grids, and optimizer settings for RePlaid alongside the corresponding values reported for Duo, MDLM, and the autoregressive baselines. Our architectural alignment fixes the transformer backbone, context length, and embedding dimension across models; noise schedules for all diffusion models were optimized under the same likelihood objective. We will explicitly note any unavoidable differences arising from model-specific implementations while arguing that these do not undermine the isolation of the continuous formulation.

Revision: yes

Referee: [Table 1] Table 1 / Figure 3 (PPL and scaling curves): The reported SOTA PPL bound of 22.1 and outperformance in the over-trained regime should include per-model training FLOPs or step counts alongside the curves to substantiate competitiveness; current presentation leaves open whether gains stem from the continuous likelihood objective or unstated tuning advantages.

Authors: We accept this point. The revised Table 1 and Figure 3 will include per-model training FLOPs (or equivalent step counts normalized by batch size and sequence length) for every reported result. This addition will allow readers to verify that RePlaid’s PPL of 22.1 and its scaling behavior are obtained under compute budgets comparable to or lower than the cited discrete baselines, supporting attribution to the continuous likelihood training rather than hidden tuning advantages.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; scaling laws and insights are empirically benchmarked and derived from ELBO without reduction to inputs

The paper's central results consist of empirical scaling comparisons on external benchmarks (OpenWebText, Duo, MDLM) and a derivation showing that noise-schedule optimization minimizing ELBO variance produces linear cross-entropy. These steps do not reduce by construction to fitted parameters, self-citations, or renamed inputs. The architecture alignment is presented as a methodological choice for fair comparison rather than a definitional equivalence. No load-bearing claim collapses to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard diffusion-model assumptions (ELBO as training objective, Gaussian noise process) and empirical scaling practices; no new invented entities are introduced.

free parameters (1)

noise schedule parameters
Optimized to minimize ELBO variance; treated as learned or tuned quantities rather than fixed by prior theory.

axioms (1)

[standard math] The evidence lower bound (ELBO) is a valid surrogate for the true likelihood in continuous diffusion training.
Invoked when deriving the training objective and when analyzing variance minimization.
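How tight the ELBO is can be probed with importance-weighted (IWAE) bounds, which the Figure 7 caption describes as fitted by a leading-order a + b/K curve on K ≥ 4 and extrapolated to an asymptote. The sketch below shows one way such a fit could be done with ordinary least squares in 1/K. The per-token bound values are synthetic (generated with a small 1/K² term to mimic the deviation noted at small K), and the function name is hypothetical; none of the numbers come from the paper.

```python
import math

def fit_inverse_k(ks, bounds):
    """Least-squares fit of bound(K) = a + b / K; `a` is the K -> infinity limit."""
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mx, my = sum(xs) / n, sum(bounds) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, bounds)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b

# Synthetic per-token bounds in nats (NOT the paper's measurements).
TRUE_A, TRUE_B, TRUE_C = 3.05, 0.08, 0.03
ks = [1, 2, 4, 8, 16, 32, 64]
bounds = [TRUE_A + TRUE_B / k + TRUE_C / k ** 2 for k in ks]

# Fit only on K >= 4, where the 1/K term dominates the 1/K**2 correction.
kept = [(k, v) for k, v in zip(ks, bounds) if k >= 4]
a, b = fit_inverse_k([k for k, _ in kept], [v for _, v in kept])
print(f"asymptote: {a:.4f} nats/token, extrapolated PPL: {math.exp(a):.2f}")
```

Restricting the fit to larger K keeps the higher-order correction from biasing the extrapolated asymptote. The extrapolated value is an estimate, though, and unlike each finite-K value it is not itself a guaranteed bound.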

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "RePlaid exhibits a compute gap of only 20× compared to autoregressive models... new state-of-the-art PPL bound of 22.1"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 11 internal anchors

[1] Mercury: Ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298, 2025

[2] Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x

[3] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 17981–17993. Curran Associates, Inc., 2021. URL ...

[4] Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet diffusion score model for biological sequence generation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of...

[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015

[6] Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 28266–28279. Curran Associates, Inc., 202...

[7] Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st Internat...

[8] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2014. URL https://huggingface.co/datasets/billion-word-benchmark/lm1b

[9] Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3itjR9QxFw

[10] Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, and Ge Liu. LangFlow: Continuous diffusion rivals discrete in language modeling. arXiv preprint arXiv:2604.11748, 2026. URL https://arxiv.org/abs/2604.11748. Accessed: May 1, 2026

[11] Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

[12] URL https://openreview.net/forum?id=5fybcQZ0g4

[13] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k

[14] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

[15] Oscar Davis, Samuel Kessler, Mircea Petrache, Ismail Ilkan Ceylan, Michael M. Bronstein, and Joey Bose. Fisher flow matching for generative modeling over discrete data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=6jOScqwdHU

[16] Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022

[17] Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=GTDKo3Sv9p

[18] Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. https://huggingface.co/datasets/Skylion007/openwebtext, 2019

[19] Ishaan Gulrajani and Tatsunori Hashimoto. Likelihood-based diffusion language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=e2MCL6hObn

[20] Dongning Guo, Shlomo Shamai, and Sergio Verdu. Mutual information and MMSE in gaussian channels. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pages 349–349, 2004. doi: 10.1109/ISIT.2004.1365386

[21]

Smith, and Mike Lewis

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex- based diffusion language model for text generation and modular control. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575–11596, To...

work page doi:10.18653/v1/2023 2023
[22]

DiffusionBERT: Improving generative masked language models with diffusion models

Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. DiffusionBERT: Improving generative masked language models with diffusion models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4521–4...

work page 2023
[23]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

spacy: Industrial- strength natural language processing in python

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spacy: Industrial- strength natural language processing in python. 2020

work page 2020
[25]

Argmax flows and multinomial diffusion: Learning categorical distributions

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. URLhttps://openreview.net/forum?id=6nbpPqUCIi7

work page 2021
[26]

Hutchinson

M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Lapla- cian smoothing splines.Communications in Statistics - Simulation and Computation, 18 (3):1059–1076, 1989. doi: 10.1080/03610918908812806. URL https://doi.org/10.1080/ 03610918908812806

work page doi:10.1080/03610918908812806 1989
[27]

Continuous diffusion model for language modeling

Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=VGv5y60sXC

work page 2025
[28]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7

work page 2022
[29]

Variational diffusion models

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021. URL https: //openreview.net/forum?id=2LdBqxc1Yv. 13

work page 2021
[30]

Flow Map Language Models: One-step Language Modeling via Continuous Denoising

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising. arXiv preprint arXiv:2602.16813, 2026. Accessed: May 1, 2026
[31]

Diffusion-LM improves controllable text generation

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion-LM improves controllable text generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems. URL https://openreview.net/forum?id=3s9IrEsjLyk
[33]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In Forty-first International Conference on Machine Learning. URL https://openreview.net/forum?id=CNicRIVIPA
[35]

Latent diffusion for language generation

Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Seo Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=NKdtztladR
[36]

DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=2uAaGwlP_V
[37]

DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 22(4):730–751, 2025
[38]

Concrete score matching: Generalized score matching for discrete data

Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems. URL https://openreview.net/forum?id=_RL7wtHkPJK
[40]

SDEdit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aBsCjcPu_tE
[41]

Cosmos: Compressed and smooth latent space for text diffusion modeling

Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, and Dmitry Vetrov. Cosmos: Compressed and smooth latent space for text diffusion modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Rv6Lz84FlZ

[42]

Scaling up masked diffusion models on text

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=WNvvwK0tut
[43]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=KnqiC0znVF
[44]

Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference

Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference. In International Conference on Learning Representations. URL https://openreview.net/forum?id=HyZoi-WRb
[46]

Your absorbing discrete diffusion secretly models the conditional distributions of clean data

Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sMyXP8Tanm
[47]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023. doi: 10.1109/ICCV51070.2023.00387
[48]

MAUVE: Measuring the gap between neural text and human text using divergence frontiers

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview...
[49]

Discrete Flow Maps

Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S Albergo. Discrete flow maps. arXiv preprint arXiv:2604.09784, 2026
[50]

Candi: Hybrid discrete-continuous diffusion models

Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025
[51]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
[52]

Categorical flow maps

Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps. arXiv preprint arXiv:2602.12233, 2026
[53]

Simple and effective masked diffusion language models

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Mariano Marroquin, Justin T Chiu, Alexander M Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM
[54]

The diffusion duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and Volodymyr Kuleshov. The diffusion duality. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=9P9Y8FOSOk
[55]

Esoteric language models

Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, and Arash Vahdat. Esoteric language models. arXiv preprint arXiv:2506.01928, 2025
[56]

Scaling beyond masked diffusion language models

Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models. arXiv preprint arXiv:2602.15014, 2026
[57]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI
[58]

Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws

Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=0bmXrtTDUu
[59]

Simple guidance mechanisms for discrete diffusion models

Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander M Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=i5MrJ6g5G1
[60]

Simplified and generalized masked diffusion for discrete data

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=xcqSOfHt4g
[61]

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
[62]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP
[63]

Maximum likelihood training of score-based diffusion models

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AklttWFnxS9
[64]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2...
[65]

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193, 2025
[66]

Self-conditioned embedding diffusion for text generation

Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022
[67]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2023.127063. URL https://www.sciencedirect.com/science/article/pii/S0925231223011864
[68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
[69]

Visualizing data using t-SNE

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html
[70]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL h...
[71]

BERT has a mouth, and it must speak: BERT as a Markov random field language model

Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Antoine Bosselut, Asli Celikyilmaz, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin, and Thomas Wolf, editors, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–...
[72]

Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Associati...
[73]

calflops: a flops and params calculate tool for neural networks in pytorch framework

xiaoju ye. calflops: a flops and params calculate tool for neural networks in pytorch framework. URL https://github.com/MrYxJ/calculate-flops.pytorch
[75]

Tuning large neural networks via zero-shot hyperparameter transfer

Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pa...
[76]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025
[77]

Fast sampling of diffusion models with exponential integrator

Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P
[78]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In The Thirteenth International Conference on Learning Representations. URL https://openreview.net/forum?id=CTC7CmirNr

Contents: A Related Works; B Training Algorithm; C Derivation of the sequence-level NELBO for Plaid; D Sampler Update Formulas; E ODE-based Likelihood Estimation; F Constant Per-Timestep Diffusion Loss; G Linear Information Decay Under Optimality; H Per-Timestep CE Under Optimality; I Learni...
[80]

Tag. Run spaCy’s POS tagger on the decoded text to obtain word-level tags and their character offsets.

[24] [24]

spacy: Industrial- strength natural language processing in python

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spacy: Industrial- strength natural language processing in python. 2020

work page 2020

[25] [25]

Argmax flows and multinomial diffusion: Learning categorical distributions

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021. URLhttps://openreview.net/forum?id=6nbpPqUCIi7

work page 2021

[26] [26]

Hutchinson

M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Lapla- cian smoothing splines.Communications in Statistics - Simulation and Computation, 18 (3):1059–1076, 1989. doi: 10.1080/03610918908812806. URL https://doi.org/10.1080/ 03610918908812806

work page doi:10.1080/03610918908812806 1989

[27] [27]

Continuous diffusion model for language modeling

Jaehyeong Jo and Sung Ju Hwang. Continuous diffusion model for language modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=VGv5y60sXC

work page 2025

[28] [28]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7

work page 2022

[29] [29]

Variational diffusion models

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021. URL https: //openreview.net/forum?id=2LdBqxc1Yv. 13

work page 2021

[30] [30]

Flow Map Language Models: One-step Language Modeling via Continuous Denoising

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M Boffi, and Jinwoo Kim. Flow map language models: One-step language modeling via continuous denoising.arXiv preprint arXiv:2602.16813, 2026. Accessed: May 1, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[31] [31]

Diffusion-LM improves controllable text generation

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion-LM improves controllable text generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,

work page

[32] [32]

URLhttps://openreview.net/forum?id=3s9IrEsjLyk

work page

[33] [33]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InForty-first International Conference on Machine Learning,

work page

[34] [34]

URLhttps://openreview.net/forum?id=CNicRIVIPA

work page

[35] [35]

Latent diffusion for language generation

Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Seo Shekhtman, and Kilian Q Weinberger. Latent diffusion for language generation. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=NKdtztladR

work page 2023

[36] [36]

DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=2uAaGwlP_V

work page 2022

[37] [37]

DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, 2025

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, 2025

work page 2025

[38] [38]

Concrete score matching: Generalized score matching for discrete data

Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems,

work page

[39] [39]

URLhttps://openreview.net/forum?id=_RL7wtHkPJK

work page

[40] [40]

SDEdit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=aBsCjcPu_tE

work page 2022

[41] [41]

Cosmos: Compressed and smooth latent space for text diffusion modeling

Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, and Dmitry Vetrov. Cosmos: Compressed and smooth latent space for text diffusion modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Rv6Lz84FlZ

work page 2025

[42] [42]

Scaling up masked diffusion models on text

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=WNvvwK0tut

work page 2025

[43] [43]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025. URL https: //openreview.net/forum?id=KnqiC0znVF

work page 2025

[44] [44]

Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference

Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and Jackknife variational inference. InInternational Conference on Learning Representations,

work page

[45] [45]

URLhttps://openreview.net/forum?id=HyZoi-WRb

work page

[46] [46]

Your absorbing discrete diffusion secretly models the conditional distributions of clean data

Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sMyXP8Tanm

work page 2025

[47] [47]

Sample4Geo : Hard negative sampling for cross-view geo-localisation

William Peebles and Saining Xie. Scalable diffusion models with transformers. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023. doi: 10.1109/ICCV51070.2023.00387. 14

work page doi:10.1109/iccv51070.2023.00387 2023

[48] Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview...

[49] Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, and Michael S Albergo. Discrete flow maps. arXiv preprint arXiv:2604.09784, 2026

[50] Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Candi: Hybrid discrete-continuous diffusion models. arXiv preprint arXiv:2510.22510, 2025

[51] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

[52] Daan Roos, Oscar Davis, Floor Eijkelboom, Michael Bronstein, Max Welling, İsmail İlkan Ceylan, Luca Ambrogioni, and Jan-Willem van de Meent. Categorical flow maps. arXiv preprint arXiv:2602.12233, 2026

[53] Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Mariano Marroquin, Justin T Chiu, Alexander M Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=L4uaAR4ArM

[54] Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T Chiu, and Volodymyr Kuleshov. The diffusion duality. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=9P9Y8FOSOk

[55] Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, and Arash Vahdat. Esoteric language models. arXiv preprint arXiv:2506.01928, 2025

[56] Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. Scaling beyond masked diffusion language models. arXiv preprint arXiv:2602.15014, 2026

[57] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI

[58] Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=0bmXrtTDUu

[59] Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander M Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=i5MrJ6g5G1

[60] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=xcqSOfHt4g

[61] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama

[62] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

[63] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AklttWFnxS9

[64] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2...

[65] Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193, 2025

[66] Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022

[67] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2023.127063. URL https://www.sciencedirect.com/science/article/pii/S0925231223011864

[68] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[69] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html

[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL h...

[71] Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Antoine Bosselut, Asli Celikyilmaz, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin, and Thomas Wolf, editors, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–... doi: 10.18653/v1/w19-2304

[72] Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Associati...

[73] xiaoju ye. calflops: a flops and params calculate tool for neural networks in pytorch framework. URL https://github.com/MrYxJ/calculate-flops.pytorch

[75] Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pa...

[76] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

[77] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P

[78] Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=CTC7CmirNr

Contents

A Related Works
B Training Algorithm
C Derivation of the sequence-level NELBO for Plaid
D Sampler Update Formulas
E ODE-based Likelihood Estimation
F Constant Per-Timestep Diffusion Loss
G Linear Information Decay Under Optimality
H Per-Timestep CE Under Optimality
I Learni...

Tag. Run spaCy's POS tagger on the decoded text to obtain word-level tags and their character offsets.