Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Alain Durmus; Dario Shariatian; Eric Moulines; Eric P. Xing; Samson Gourevitch; Umut Simsekli; Yazid Janati

arxiv: 2605.22765 · v1 · pith:NIZTTQOXnew · submitted 2026-05-21 · 💻 cs.LG · stat.ML

Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus

Pith reviewed 2026-05-22 06:40 UTC · model grok-4.3

classification: cs.LG, stat.ML

keywords: discrete diffusion, uniform diffusion, masked diffusion, leave-one-out posterior, absorbing state, denoising, language modeling, reverse dynamics


The pith

In uniform diffusion models, the standard plug-in bridge objective is optimized by a leave-one-out posterior rather than the denoising posterior, and an absorbing-state reformulation matches or surpasses masked diffusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the usual plug-in parameterization in uniform diffusion models actually trains a leave-one-out predictor that ignores each token's own noisy observation. This creates a mismatch with the standard cross-entropy training objective. By deriving exact conversions between the denoiser, leave-one-out posterior, and score, the authors separate parameterization choices from the training target. They also present an absorbing-state reformulation that keeps the original joint distribution but breaks sampling into simpler masked-diffusion-style steps with carry-over unmasking and remasking. Experiments on language modeling show consistent gains from the leave-one-out approach and performance that matches or exceeds masked diffusion.

Core claim

In uniform diffusion models the standard plug-in bridge is optimized by a leave-one-out posterior that predicts each clean token without using its own noisy version. Exact conversions exist between the denoiser output, this leave-one-out posterior, and the score function. These conversions allow an informed predictor-corrector sampler and improved temperature sampling at inference time with no retraining. An absorbing-state reformulation preserves the original UDM joint law while reducing sampling to masked-diffusion-like operations that have simpler posteriors, natural carry-over unmasking, and a built-in remasking mechanism.

What carries the argument

The leave-one-out posterior, which predicts each clean token from all other noisy observations while ignoring its own.
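
For orientation, the two posteriors can be written side by side in the notation of the appendix fragment reproduced near the end of this page (Proposition 5). This is a sketch assembled from that fragment, so the paper's main-text symbols may differ; the proportionality in the last line is simply Eq. (24) with the factors that do not depend on $x_0^{\ell}$ dropped.

```latex
\begin{align*}
\text{denoising posterior:}\quad
  & p^{\ell}_{0|t}\bigl(x_0^{\ell} \mid x_t\bigr), \\
\text{leave-one-out posterior:}\quad
  & p^{\mathrm{loo},\ell}_{0|t}\bigl(x_0^{\ell} \mid x_t^{-\ell}\bigr),
    \qquad x_t^{-\ell} = (x_t^{j})_{j \neq \ell}, \\
\text{relation:}\quad
  & p^{\ell}_{0|t}\bigl(x_0^{\ell} \mid x_t\bigr)
    \;\propto\;
    q^{\ell}_{t|0}\bigl(x_t^{\ell} \mid x_0^{\ell}\bigr)\,
    p^{\mathrm{loo},\ell}_{0|t}\bigl(x_0^{\ell} \mid x_t^{-\ell}\bigr).
\end{align*}
```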

If this is right

Leave-one-out parameterizations improve generation quality for uniform diffusion on language modeling tasks.
The absorbing-state construction achieves performance that matches or exceeds masked diffusion models.
An informed predictor-corrector sampler and temperature sampling based on the leave-one-out predictor improve inference without any retraining.
The conversions between denoiser, leave-one-out posterior, and score disentangle parameterization choices from the training objective.
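
The denoiser/leave-one-out conversion mentioned in the last item can be sketched numerically. The example below is not the authors' code: it assumes the common uniform forward kernel (keep the clean token with probability `alpha`, otherwise resample uniformly), independent noise per position, and a toy two-token sequence, and all function names are hypothetical. It applies the relation from Proposition 5 (denoiser proportional to own-token likelihood times leave-one-out posterior) and checks it against brute-force posteriors.

```python
import numpy as np


def uniform_likelihood(noisy_token, alpha, vocab_size):
    """Return q(noisy_token | x0) for every candidate clean token x0.

    Assumed kernel: keep the clean token with probability alpha,
    otherwise draw a token uniformly from the vocabulary.
    """
    lik = np.full(vocab_size, (1.0 - alpha) / vocab_size)
    lik[noisy_token] += alpha
    return lik


def denoiser_to_loo(denoiser_probs, noisy_token, alpha):
    """Divide out the position's own likelihood, then renormalize."""
    lik = uniform_likelihood(noisy_token, alpha, denoiser_probs.shape[0])
    unnorm = denoiser_probs / lik  # needs alpha < 1 so that lik > 0 everywhere
    return unnorm / unnorm.sum()


def loo_to_denoiser(loo_probs, noisy_token, alpha):
    """Multiply the own-token likelihood back in, then renormalize."""
    lik = uniform_likelihood(noisy_token, alpha, loo_probs.shape[0])
    unnorm = loo_probs * lik
    return unnorm / unnorm.sum()


def exact_posteriors(p0, noisy, alpha):
    """Brute-force posteriors for position 0 of a two-token sequence.

    p0[i, j] is the clean joint probability of the token pair (i, j).
    """
    vocab_size = p0.shape[0]
    lik_own = uniform_likelihood(noisy[0], alpha, vocab_size)
    lik_other = uniform_likelihood(noisy[1], alpha, vocab_size)
    loo = p0 @ lik_other  # sums over the other position's clean token
    den = loo * lik_own   # additionally conditions on the own noisy token
    return den / den.sum(), loo / loo.sum()


# Toy check on a random 4-token vocabulary (illustrative values only).
rng = np.random.default_rng(0)
p0 = rng.random((4, 4))
p0 /= p0.sum()
den, loo = exact_posteriors(p0, noisy=(2, 1), alpha=0.4)
print(np.allclose(denoiser_to_loo(den, 2, 0.4), loo))
print(np.allclose(loo_to_denoiser(loo, 2, 0.4), den))
```

Both checks should print `True`. The division in `denoiser_to_loo` is only defined when the likelihood is strictly positive, which is the full-support condition the appendix fragment on this page attributes to UDM for every t > 0.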

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mismatch between plug-in parameterization and true denoising posterior may appear in other discrete diffusion settings beyond language.
The absorbing reformulation could reduce implementation complexity when porting uniform diffusion code to new data types.
Improved sampling from the leave-one-out predictor might generalize to continuous diffusion models that use analogous bridge constructions.

Load-bearing premise

The absorbing-state reformulation keeps exactly the same joint probability law over sequences as the original uniform diffusion process.

What would settle it

Train a standard UDM and a leave-one-out version on the same language dataset, then measure whether the leave-one-out version produces lower perplexity or higher-quality generated text under the same sampling budget.
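
As a minimal sketch of the measurement side of this test: perplexity is the exponential of the mean negative log-likelihood per token. The helper below is hypothetical and assumes natural-log token probabilities have already been obtained from some scoring model; which reference model the paper uses for its Gen-PPL numbers is not stated on this page.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Eight tokens, each assigned probability 0.25 by the scoring model.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))
```

Comparing two models this way is only meaningful when both are scored on the same text (or their samples are scored by the same reference model) under the same sampling budget, as the test above stipulates.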

Figures

Figures reproduced from arXiv: 2605.22765 by Alain Durmus, Dario Shariatian, Eric Moulines, Eric P. Xing, Samson Gourevitch, Umut Simsekli, Yazid Janati.

**Figure 2.** Top-p sampling Gen-PPL frontier, obtained by sweeping p ∈ [0.8, 1.0]. Top-p sampling is applied to the denoiser, the denoiser converted into an LOO denoiser, and the LOO denoiser.
Accompanying text: Predictor-corrector. We next evaluate the predictor-corrector sampler (Algorithm 5) described at the end of Section 3.2. As detailed in Appendix E, the LOO parameterization gives access to a corrector step without training an auxiliary mode…

**Figure 3.** Gen-PPL frontiers. Temperature sampling Gen-PPL frontier, obtained by sweeping the temperature …

**Figure 4.** Left: Sudoku solve rate as a function of NFE for the best learning rate of each method. Right: AUDM, resampled AUDM, UDM, and MDM Gen-PPL frontier. The vertical line corresponds to the dataset entropy. The Sudoku results in …

**Figure 5.** Comparison between the leave-one-out posterior …

**Figure 6.** Sensitivity of the leave-one-out prediction to the local observation …

**Figure 7.** Generative frontiers with varying temperature across NFEs for the denoiser and leave-one…

**Figure 8.** Top-p frontiers across all NFEs for the denoiser and leave-one-out parameterizations …

**Figure 9.** Comparison between predictor-corrector sampling and top-…

**Figure 10.** Full frontier comparison for AUDM, resampled AUDM, UDM, and MDM across all …

Original abstract

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper pins down a real mismatch between the plug-in bridge and the usual denoising objective in uniform diffusion, then gives conversions and an absorbing reformulation that let you close much of the gap with masked models through better parameterization and sampling.


The main thing to know is that the standard plug-in bridge in uniform diffusion models is not optimized by the denoising posterior but by a leave-one-out posterior, which predicts the clean token without seeing its own noisy version. This creates a mismatch with the cross-entropy objective people actually train on, and the paper derives the exact conversions between denoiser, leave-one-out posterior, and score to fix it. From there the authors build an informed predictor-corrector sampler and temperature adjustments that improve generation on language modeling without retraining.

The absorbing-state reformulation is the other concrete move: it rewrites the process so that sampling uses masked-style operations with carry-over unmasking and remasking while claiming to keep the original joint law unchanged. Experiments show these changes make uniform models competitive with or better than masked ones, which supports the authors' point that the practical gap comes more from parameterization and sampling design than from the choice of marginals itself. Code is released, which is helpful for checking the details.

The soft spot is the absorbing reformulation. The abstract states that the reformulation preserves the UDM joint law, but the equivalence of marginals at every time step needs a clear derivation to rule out small divergences in the continuous-time limit. If that lemma is fully worked out in the paper, the claim holds; otherwise it is the part that would need the most scrutiny.

This is for people working on discrete diffusion for sequences who want to understand why one setup outperforms another and how to adjust sampling without starting over. Readers focused on language modeling or practical reverse-process design will get usable tools and results. The technical content and empirical signals are strong enough that it deserves a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper claims that in Uniform Diffusion Models (UDM) the standard plug-in bridge parameterization is optimized by a leave-one-out posterior rather than the denoising posterior, identifying a mismatch with the cross-entropy objective. It derives exact conversions between the denoiser, leave-one-out posterior, and score. It further introduces an absorbing-state reformulation that preserves the original UDM joint law while enabling masked-diffusion-like sampling operations with simpler posteriors, carry-over unmasking, and natural remasking. Empirical results on language modeling show consistent improvements from leave-one-out parameterizations and that the absorbing construction matches or surpasses masked diffusion.

Significance. If the derivations hold and the reformulation exactly preserves the joint law, the work clarifies why masked and uniform diffusion differ in practice, attributing gaps more to parameterization and sampling than to marginal choices. The leave-one-out predictor, informed predictor-corrector sampler, and temperature sampling offer training-free inference gains. Code and model release aids reproducibility.

major comments (2)

Abstract, second paragraph: the central claim that the absorbing-state reformulation 'preserves the UDM joint law' while decomposing sampling into masked-diffusion-like operations with carry-over unmasking and remasking lacks an explicit derivation of forward-kernel equivalence or finite-time transition-matrix equality. This equivalence is load-bearing; without a lemma showing identical marginals at every t (or infinitesimal-generator agreement), it is unclear whether the processes remain equivalent or diverge at O(dt) when remasking probability is nonzero in the continuous-time limit.
Derivations of conversions (referenced in abstract): the exact conversions between denoiser, leave-one-out posterior, and score are used to disentangle parameterization from objective. A concrete check is needed that these conversions do not reduce by construction to quantities already fitted by the training objective, to confirm they provide new information rather than tautological reparameterization.
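
The kind of transition-matrix check requested in the first major comment can be illustrated on a single token. The sketch below is an assumption about the simplest possible reading of an absorbing-state decomposition (mask a token with probability `1 - alpha`, then fill the mask uniformly); it is not the paper's construction, and the function names are hypothetical. It verifies that marginalizing out the mask recovers the uniform kernel, and that uniform kernels compose consistently over time.

```python
import numpy as np


def uniform_kernel(alpha, vocab_size):
    """Row-stochastic matrix: keep the token w.p. alpha, else resample uniformly."""
    uniform = np.full((vocab_size, vocab_size), 1.0 / vocab_size)
    return alpha * np.eye(vocab_size) + (1.0 - alpha) * uniform


def mask_then_fill_kernel(alpha, vocab_size):
    """Same transition routed through an explicit mask state, then marginalized."""
    # Step 1: token -> itself w.p. alpha, -> mask (last column) w.p. 1 - alpha.
    to_mask = np.hstack(
        [alpha * np.eye(vocab_size), np.full((vocab_size, 1), 1.0 - alpha)]
    )
    # Step 2: ordinary tokens are kept, the mask is replaced by a uniform token.
    fill = np.vstack(
        [np.eye(vocab_size), np.full((1, vocab_size), 1.0 / vocab_size)]
    )
    return to_mask @ fill


vocab_size = 5
direct = uniform_kernel(0.7, vocab_size)
via_mask = mask_then_fill_kernel(0.7, vocab_size)
print(np.allclose(direct, via_mask))

# Composition over two time intervals: keep-probabilities multiply.
two_steps = uniform_kernel(0.7, vocab_size) @ uniform_kernel(0.5, vocab_size)
print(np.allclose(two_steps, uniform_kernel(0.35, vocab_size)))
```

This only covers the one-token forward marginal. The referee's concern is about the joint law over whole sequences and over time once remasking is included, which a check of this size does not settle.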

minor comments (2)

Notation for the leave-one-out posterior should be introduced with an explicit equation early in the main text rather than only in the abstract.
The experimental section would benefit from reporting variance across multiple runs or statistical tests for the reported generation improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We address each major comment in detail below. Where the comments identify opportunities for greater clarity or additional supporting material, we have revised the paper accordingly.


Referee: Abstract, second paragraph: the central claim that the absorbing-state reformulation 'preserves the UDM joint law' while decomposing sampling into masked-diffusion-like operations with carry-over unmasking and remasking lacks an explicit derivation of forward-kernel equivalence or finite-time transition-matrix equality. This equivalence is load-bearing; without a lemma showing identical marginals at every t (or infinitesimal-generator agreement), it is unclear whether the processes remain equivalent or diverge at O(dt) when remasking probability is nonzero in the continuous-time limit.

Authors: We thank the referee for this important observation. Section 4 of the manuscript constructs the absorbing-state reformulation by re-expressing the uniform forward process as a mixture of an absorbing state and a masked diffusion process, with the reverse process using carry-over unmasking and a natural remasking step. The construction is designed so that the joint law over clean and noisy sequences remains identical to the original UDM at every finite time. To make the equivalence fully rigorous in the continuous-time setting, we will add a new lemma (Lemma 4.1) that explicitly equates the infinitesimal generators of the two processes and proves that the finite-time marginal distributions coincide for all t, including when the remasking probability is strictly positive. The proof proceeds by showing that the transition rates match exactly and that the resulting Kolmogorov forward equations yield identical solutions. We agree this lemma strengthens the presentation and removes any ambiguity about O(dt) discrepancies.

revision: yes

Referee: Derivations of conversions (referenced in abstract): the exact conversions between denoiser, leave-one-out posterior, and score are used to disentangle parameterization from objective. A concrete check is needed that these conversions do not reduce by construction to quantities already fitted by the training objective, to confirm they provide new information rather than tautological reparameterization.

Authors: We appreciate the request for an explicit non-tautological check. The model is trained by minimizing cross-entropy against the denoising posterior. The leave-one-out posterior, however, is obtained by analytically removing the contribution of the token’s own noisy observation from the denoiser output (see Equations 3–5). This adjustment is not part of the training loss and therefore yields a distinct predictor. In the revised manuscript we will insert a short remark and a small numerical illustration in Section 3.3: we apply the conversion formulas to a trained denoiser on a validation batch and show that the resulting leave-one-out probabilities differ measurably from the raw denoising probabilities; we further demonstrate that substituting the leave-one-out predictor into the informed predictor-corrector sampler produces the reported generation improvements. These results confirm that the conversions extract usable information beyond what is directly optimized by the training objective.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper derives explicit conversions between the denoiser, leave-one-out posterior, and score, then introduces an absorbing-state reformulation asserted to preserve the original UDM joint law while enabling masked-diffusion-style operations. These steps are presented as independent technical results with stated characterizations and empirical outcomes on language modeling, rather than tautological reductions to fitted inputs, self-definitions, or load-bearing self-citations. No equations or claims in the abstract reduce a prediction or central result to its own construction by definition; the derivations appear self-contained against external benchmarks and falsifiable via the reported generation improvements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

From the abstract alone, the central claims rest on the existence of exact conversions between denoiser, leave-one-out posterior and score, and on the absorbing reformulation preserving the UDM joint law; no explicit free parameters, standard mathematical axioms, or new invented entities with independent evidence are detailed.

invented entities (1)

absorbing-state reformulation (no independent evidence)
purpose: Decompose uniform diffusion into masked-diffusion-like sampling operations while preserving the original joint law.
Introduced to enable simpler denoising posteriors, carry-over unmasking, and natural remasking; no external falsifiable evidence is provided in the abstract.

pith-pipeline@v0.9.0 · 5836 in / 1379 out tokens · 48965 ms · 2026-05-22T06:40:50.499383+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

"We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior... We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear

"the absorbing-state reformulation... preserves the UDM joint law"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation, 2024. URLhttps://arxiv.org/abs/2302.05737. 15 Appendix Outline The appendix is organized as follows. Appendix A proves the leave-one-out optimality result, gives the conversion formulas between the leave-one-out denoiser, denoiser and score, ...

work page arXiv 2024
[43]

Outside this support, the conditional densities may be defined arbitrarily. Proposition 5.It holds for anyx t such thatp t(xt)>0, pℓ 0|t(xℓ 0|xt) =p t(x−ℓ t )qℓ t|0(xℓ t|xℓ 0)p loo,ℓ 0|t (xℓ 0|x−ℓ t )/pt(xt).(24) Conversely, suppose thatq ℓ t|0(xℓ t|xℓ 0)>0for anyx 0 andx t, it holds for anyx −ℓ t ,p t(x−ℓ t )>0, ploo,ℓ 0|t (xℓ 0|x−ℓ t ) = pt(xt)pℓ 0|t(xℓ...

work page
[44]

In UDM this condition is satisfied for everyt >0, so the two repre- sentations can always be converted into one another

(26) Therefore, the conversion from the denoiser to the leave-one-out posterior is available exactly when the forward has full support. In UDM this condition is satisfied for everyt >0, so the two repre- sentations can always be converted into one another. In MDM, by contrast, the condition fails for unmasked positions. Ifx ℓ t =m, the likelihood is const...

work page
[45]

For MDM, the conversion is explicit only on the support of the forward process

=α t⟨xℓ t,x ℓ 0⟩+ 1−α t K , the first identity above yields pℓ 0|t(·|xt) = Cat (1−α t)ˆxloo 0|t (xt)ℓ +Kα t⟨xℓ t, ˆxloo 0|t (xt)ℓ⟩xℓ t 1−α t +Kα t⟨xℓ t, ˆxloo 0|t (xt)ℓ⟩ .(27) Conversely, the inverse relation can be written as ˆxloo 0|t (xt)ℓ = (1 + (K−1)α t) ˆx0|t(xt)ℓ −Kα t⟨xℓ t, ˆx0|t(xt)ℓ⟩xℓ t 1 + (K−1)α t −Kα t⟨xℓ t, ˆx0|t(xt)ℓ⟩ .(28) Where ˆx0|t(xt)...

work page
[46]

If insteadx ℓ t ̸=m, thenq ℓ t|0(xℓ t|xℓ

= 1−α t for everyx ℓ 0 ∈V, so pℓ 0|t(·|xt) = Cat(ˆxloo 0|t (xt)ℓ). If insteadx ℓ t ̸=m, thenq ℓ t|0(xℓ t|xℓ

work page
[47]

Therefore, on unmasked positions, the denoiser no longer contains enough information to recon- struct the leave-one-out posterior, so the inverse formula is not available

=α t1{xℓ 0 =x ℓ t}, hence pℓ 0|t(·|xt) = Cat(xℓ t). Therefore, on unmasked positions, the denoiser no longer contains enough information to recon- struct the leave-one-out posterior, so the inverse formula is not available. Remark 2.When these conversion formulas hold in both directions, as they do for UDM, the uniqueness of the denoiser as a minimizer of...

work page
[48]

pℓ 0|t(xℓ 0|xt).(29) This already shows that the score can be parameterized from the denoiser. The same quantity can also be expressed directly in terms of the leave-one-out denoiser: ⟨yℓ,s t(xt)ℓ⟩= qℓ t|0(yℓ|ˆxloo 0|t (xt)ℓ) qℓt|0(xℓ t|ˆxloo 0|t (xt)ℓ) ,(30) for everyy∈Xsuch thaty −ℓ =x −ℓ t . Proof.By definition of the score, for everyy∈Xsuch thaty −ℓ =...

work page
[49]

Therefore, pt(y) =p t(x−ℓ t ) X xℓ 0∈V qℓ t|0(yℓ|xℓ 0)ploo,ℓ 0|t (xℓ 0|x−ℓ t )

=p t(x−ℓ t )ploo,ℓ 0|t (xℓ 0|x−ℓ t ). Therefore, pt(y) =p t(x−ℓ t ) X xℓ 0∈V qℓ t|0(yℓ|xℓ 0)ploo,ℓ 0|t (xℓ 0|x−ℓ t ). Since the mapν7→q ℓ t|0(yℓ|ν)is affine in its first argument, this becomes pt(y) =p t(x−ℓ t )qℓ t|0(yℓ|ˆxloo 0|t (xt)ℓ). Takingy=x t yields in the same way pt(xt) =p t(x−ℓ t )qℓ t|0(xℓ t|ˆxloo 0|t (xt)ℓ). Dividing the two equations gives ⟨...

work page
[50]

also considers a LOO denoiser parameterization which follows from Proposition 6. Therefore, if ˆxθ 0(xt, t)is a model for the leave-one-out denoiser, one may use the parameterization pθ(xt, t) =α tˆxθ 0(xt, t) + (1−α t)πℓ .(35) The same restriction is still needed after this reparameterization. If ˆxθ 0(xt, t)ℓ may depend onx ℓ t, the minimizer of (34) ma...

work page
[51]

Therefore pt(y) =p t(x−ℓ t ) X xℓ 0∈V qℓ t|0(yℓ|xℓ 0)ploo,ℓ 0|t (xℓ 0|x−ℓ t )

=p t(x−ℓ t )ploo,ℓ 0|t (xℓ 0|x−ℓ t ). Therefore pt(y) =p t(x−ℓ t ) X xℓ 0∈V qℓ t|0(yℓ|xℓ 0)ploo,ℓ 0|t (xℓ 0|x−ℓ t ). 21 Figure 5: Comparison between the leave-one-out posterior ˆxloo 0|t (xt)and the denoising posterior ˆx0|t(xt). Using the affine extension of the forward kernel in its first argument, this becomes pt(y) =p t(x−ℓ t )qℓ t|0(yℓ|ˆxloo 0|t (xt)...

work page
[52]

Cat(xℓ ti+1;1/K) +1 τ ℓ>ti+1 X xℓ ti+1 Cat(xℓ ti;x ℓ ti+1) Cat(xℓ ti+1;x ℓ 0) =1 τ ℓ≤tiCat(xℓ ti;1/K) +1 ti<τ ℓ≤ti+1Cat(xℓ ti;x ℓ

work page
[53]

This proves the induction step

+1 τ ℓ>ti+1Cat(xℓ ti;x ℓ 0) = Cat xℓ ti;1 τ ℓ>ti xℓ 0 +1 τ ℓ≤ti 1/K . This proves the induction step. Hence (50) holds for every time of the grid, and therefore for every t. Lemma 3.If ˜xt(τ) ℓ := xℓ t ifτ ℓ > t, mifτ ℓ ≤t, then p0|t(x0|xt,τ) =p mask 0|t (x0|˜xt(τ)). Proof.By Bayes’ rule and (50), p0|t(x0|xt,τ) = p0(x0)qt|0(xt|x0,τ)P ˜x0 p0(˜x0)qt|0(xt|˜x...

work page
[54]

canonical

The lawj s|0 exactly reweights these two possibilities. The lifted reverse transition is then ¯ps|t(xs,τ s|xt,τ t) := X x0 js|0(τs|x0,x s)q s|0,t(xs|x0,x t)p 0|t(x0|xt,τ t).(53) For a grid0 =t 0 <· · ·< t n = 1, let¯p0:n denote the corresponding path law, with initialization ¯ptn(xtn ,τ tn) :=p tn(xtn)jtn(τtn |xtn). Ifα tn = 0, this reduces toX tn ∼υ ⊗L a...

work page
[55]

and later used by [38]. The idea is to combine two autoregressive streams, one left-to-right and one right-to-left, and to offset the representations so that the output at positionℓnever attends to the input token at the same position, while still depending on all the other positions. Equivalently, in the continuous relaxation used to describe the archite...

work page
[56]

contract

at fixed checkpoint, and plotting generative perplexity against the resulting entropy. For top-pfrontiers we usep∈ {0.80,0.85,0.90,0.92,0.94,0.96,0.98,1.00}andNFE∈ {8,16,32,64,128,256,512,1024}. For temperature frontiers we useT∈ {0.80,0.82, . . . ,1.10} over the same NFE grid. The predictor-corrector experiments are run on OWT with the confidence- based ...

work page 2021

[1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

[2] Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022.

[3] Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design, 2024. URL https://arxiv.org/abs/2402.04997

[4] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2014. URL https://arxiv.org/abs/1312.3005

[5] Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu. Fast sampling via discrete non-Markov diffusion models with predetermined transition time. Advances in Neural Information Processing Systems, 37:106870–106905, 2024.

[6] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents, 2018. URL https://arxiv.org/abs/1804.05685

[7] Justin Deschenaux, Caglar Gulcehre, and Subham Sekhar Sahoo. The diffusion duality, chapter II: ψ-samplers and efficient curriculum, 2026. URL https://arxiv.org/abs/2602.21185

[8] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

[9] Google DeepMind. Gemini Diffusion: Google DeepMind’s experimental research model. https://blog.google/technology/google-deepmind/gemini-diffusion/, May 2025. Accessed: 2026-05-06.

[10] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis, 2022. URL https://arxiv.org/abs/2111.14822

[11] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH

[12] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows: Learning categorical distributions with normalizing flows. In Third Symposium on Advances in Approximate Bayesian Inference, 2021.

[13] Matthew J Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel Gaussian Gibbs sampling. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/b51a15f38...

[14] Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov. Mercury: Ultra-fast language models based on diffusion. URL https://arxiv.org/abs/2506.17298

[16] Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, and Rafael Gómez-Bombarelli. Think while you generate: Discrete diffusion with planned denoising. arXiv preprint arXiv:2410.06264, 2024.

[17] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, pages 32819–32848. PMLR, 2024.

[18] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. URL https://aclanthology.org/J93-2004/

[20] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843

[21] Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=WNvvwK0tut

[22] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models, 2025. URL https://arxiv.org/abs/2502.09992

[23] Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.

[24] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

[25] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

[26] Le-Tuyet-Nhi Pham, Dario Shariatian, Antonio Ocello, Giovanni Conforti, and Alain Oliviero Durmus. Discrete Markov probabilistic models: An improved discrete score-based framework with sharp convergence bounds under minimal assumptions. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=biJiSMLGOV

[27] Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Generative frontiers: Why evaluation matters for diffusion language models, 2026. URL https://arxiv.org/abs/2604.02718

[28] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.

[29] Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and Volodymyr Kuleshov. The diffusion duality. arXiv preprint arXiv:2506.10892, 2025.

[30] Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. arXiv preprint arXiv:2412.10193, 2024.

[31] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024.

[32] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

[33] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

[34] Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=BYWWwSY2G5s

[35] Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, and Antonio Orvieto. Scaling behavior of discrete diffusion language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GDYaNzxt9T

[36] Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion, 2025. URL https://arxiv.org/abs/2503.04482

[37] Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling, 2026. URL https://arxiv.org/abs/2503.00307

[38] Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning, 2025. URL https://arxiv.org/abs/2410.14157

[39] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626

[40] Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, and Scott Linderman. Informed correctors for discrete diffusion models, 2025. URL https://arxiv.org/abs/2407.21243

[41] Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024.

[42] Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation, 2024. URL https://arxiv.org/abs/2302.05737

Appendix Outline

The appendix is organized as follows. Appendix A proves the leave-one-out optimality result, gives the conversion formulas between the leave-one-out denoiser, denoiser and score, ...

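[43] Outside this support, the conditional densities may be defined arbitrarily.

Proposition 5. For any $x_t$ such that $p_t(x_t) > 0$,
$$p^{\ell}_{0|t}(x^{\ell}_0 \mid x_t) = \frac{p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)}{p_t(x_t)}. \tag{24}$$
Conversely, suppose that $q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0) > 0$ for any $x_0$ and $x_t$. Then, for any $x^{-\ell}_t$ with $p_t(x^{-\ell}_t) > 0$, $p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t) = p_t(x_t)\, p^{\ell}_{0|t}(x^{\ell}$...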

[44] (26) Therefore, the conversion from the denoiser to the leave-one-out posterior is available exactly when the forward kernel has full support. In UDM this condition is satisfied for every $t > 0$, so the two representations can always be converted into one another. In MDM, by contrast, the condition fails for unmasked positions. If $x^{\ell}_t = m$, the likelihood is const...

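[45] $= \alpha_t \langle x^{\ell}_t, x^{\ell}_0 \rangle + \frac{1-\alpha_t}{K}$, the first identity above yields
$$p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}\!\left( \frac{(1-\alpha_t)\, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} + K\alpha_t \langle x^{\ell}_t, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} \rangle\, x^{\ell}_t}{1-\alpha_t + K\alpha_t \langle x^{\ell}_t, \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} \rangle} \right). \tag{27}$$
Conversely, the inverse relation can be written as
$$\hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell} = \frac{(1 + (K-1)\alpha_t)\, \hat{x}_{0|t}(x_t)^{\ell} - K\alpha_t \langle x^{\ell}_t, \hat{x}_{0|t}(x_t)^{\ell} \rangle\, x^{\ell}_t}{1 + (K-1)\alpha_t - K\alpha_t \langle x^{\ell}_t, \hat{x}_{0|t}(x_t)^{\ell} \rangle}, \tag{28}$$
where $\hat{x}_{0|t}(x_t)$...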
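The two conversions (27) and (28) can be checked numerically at a single position. The following is a minimal sketch, assuming the UDM kernel $q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}_0) = \alpha_t \langle x^{\ell}_t, x^{\ell}_0 \rangle + (1-\alpha_t)/K$ stated above; the function names `loo_to_denoiser` and `denoiser_to_loo`, the integer token encoding, and the toy numbers are illustrative choices and do not come from the paper.

```python
import numpy as np

def loo_to_denoiser(p_loo, x_t, alpha):
    """Eq. (27): leave-one-out posterior -> denoising posterior at one position."""
    K = p_loo.shape[0]
    one_hot = np.eye(K)[x_t]
    w = K * alpha * p_loo[x_t]  # K * alpha_t * <x_t, p_loo>
    return ((1.0 - alpha) * p_loo + w * one_hot) / (1.0 - alpha + w)

def denoiser_to_loo(p_den, x_t, alpha):
    """Eq. (28): denoising posterior -> leave-one-out posterior (needs alpha < 1)."""
    K = p_den.shape[0]
    one_hot = np.eye(K)[x_t]
    w = K * alpha * p_den[x_t]  # K * alpha_t * <x_t, p_den>
    c = 1.0 + (K - 1) * alpha
    return (c * p_den - w * one_hot) / (c - w)

# Toy example (hypothetical numbers): K = 4 tokens, noisy token x_t = 2.
p_loo = np.array([0.1, 0.2, 0.3, 0.4])
p_den = loo_to_denoiser(p_loo, x_t=2, alpha=0.5)
p_back = denoiser_to_loo(p_den, x_t=2, alpha=0.5)
```

In this toy case the denoising posterior puts more mass on the observed token than the leave-one-out posterior does, and applying (28) after (27) recovers the original vector.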

[46] $= 1 - \alpha_t$ for every $x^{\ell}_0 \in V$, so $p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}(\hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})$. If instead $x^{\ell}_t \neq m$, then $q^{\ell}_{t|0}(x^{\ell}_t \mid x^{\ell}$

[47] $= \alpha_t \mathbb{1}\{x^{\ell}_0 = x^{\ell}_t\}$, hence $p^{\ell}_{0|t}(\cdot \mid x_t) = \mathrm{Cat}(x^{\ell}_t)$. Therefore, on unmasked positions, the denoiser no longer contains enough information to reconstruct the leave-one-out posterior, so the inverse formula is not available.

Remark 2. When these conversion formulas hold in both directions, as they do for UDM, the uniqueness of the denoiser as a minimizer of...
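The two MDM cases can be illustrated with a small numerical sketch. It assumes only the Bayes relation (24), in which the denoising posterior is proportional to the likelihood times the leave-one-out posterior; the helper name `denoiser_from_loo` and the toy vectors are hypothetical.

```python
import numpy as np

def denoiser_from_loo(p_loo, lik):
    """Bayes step of Eq. (24): lik[v] stands for q(x_t | x_0 = v) at one position."""
    w = lik * p_loo
    return w / w.sum()

K, alpha = 4, 0.6
p_loo = np.array([0.1, 0.2, 0.3, 0.4])
p_loo_other = np.array([0.4, 0.3, 0.2, 0.1])

# Masked position: the likelihood is the constant 1 - alpha, so nothing changes.
masked = denoiser_from_loo(p_loo, np.full(K, 1.0 - alpha))

# Unmasked position with x_t = 2: the likelihood is alpha * 1{x_0 = x_t}.
lik_unmasked = alpha * np.eye(K)[2]
unmasked = denoiser_from_loo(p_loo, lik_unmasked)
unmasked_other = denoiser_from_loo(p_loo_other, lik_unmasked)
```

Two different leave-one-out posteriors give the same one-hot denoiser on the unmasked position, which is why the inverse map cannot be recovered there.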

[48] $p^{\ell}_{0|t}(x^{\ell}_0 \mid x_t)$. (29)

This already shows that the score can be parameterized from the denoiser. The same quantity can also be expressed directly in terms of the leave-one-out denoiser:
$$\langle y^{\ell}, s_t(x_t)^{\ell} \rangle = \frac{q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})}{q^{\ell}_{t|0}(x^{\ell}_t \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell})}, \tag{30}$$
for every $y \in X$ such that $y^{-\ell} = x^{-\ell}_t$.

Proof. By definition of the score, for every $y \in X$ such that $y^{-\ell} =$...
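Formula (30) is easy to evaluate once the forward kernel is extended affinely to probability vectors. The sketch below assumes the UDM uniform kernel, so that $q^{\ell}_{t|0}(y^{\ell} \mid \nu) = \alpha_t \nu[y^{\ell}] + (1-\alpha_t)/K$ for a probability vector $\nu$; the function name `score_ratios_from_loo` and the toy numbers are hypothetical.

```python
import numpy as np

def score_ratios_from_loo(p_loo, x_t, alpha):
    """Eq. (30) at one position: returns the ratio for every candidate token y."""
    K = p_loo.shape[0]
    # Affine extension of the uniform kernel, evaluated at nu = p_loo for all y.
    q_of_y = alpha * p_loo + (1.0 - alpha) / K
    return q_of_y / q_of_y[x_t]

p_loo = np.array([0.1, 0.2, 0.3, 0.4])
ratios = score_ratios_from_loo(p_loo, x_t=2, alpha=0.5)
```

The entry at the current token equals one by construction, and the other entries compare each single-token substitution with the current state.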

[49] $= p_t(x^{-\ell}_t)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)$. Therefore,
$$p_t(y) = p_t(x^{-\ell}_t) \sum_{x^{\ell}_0 \in V} q^{\ell}_{t|0}(y^{\ell} \mid x^{\ell}_0)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t).$$
Since the map $\nu \mapsto q^{\ell}_{t|0}(y^{\ell} \mid \nu)$ is affine, this becomes
$$p_t(y) = p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell}).$$
Taking $y = x_t$ yields in the same way
$$p_t(x_t) = p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(x^{\ell}_t \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)^{\ell}).$$
Dividing the two equations gives $\langle$...

[50] also considers a LOO denoiser parameterization which follows from Proposition 6. Therefore, if $\hat{x}^{\theta}_0(x_t, t)$ is a model for the leave-one-out denoiser, one may use the parameterization
$$p^{\theta}(x_t, t) = \alpha_t\, \hat{x}^{\theta}_0(x_t, t) + (1-\alpha_t)\, \pi^{\ell}. \tag{35}$$
The same restriction is still needed after this reparameterization. If $\hat{x}^{\theta}_0(x_t, t)^{\ell}$ may depend on $x^{\ell}_t$, the minimizer of (34) ma...

[51] $= p_t(x^{-\ell}_t)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t)$. Therefore
$$p_t(y) = p_t(x^{-\ell}_t) \sum_{x^{\ell}_0 \in V} q^{\ell}_{t|0}(y^{\ell} \mid x^{\ell}_0)\, p^{\mathrm{loo},\ell}_{0|t}(x^{\ell}_0 \mid x^{-\ell}_t).$$

Figure 5: Comparison between the leave-one-out posterior $\hat{x}^{\mathrm{loo}}_{0|t}(x_t)$ and the denoising posterior $\hat{x}_{0|t}(x_t)$.

Using the affine extension of the forward kernel in its first argument, this becomes $p_t(y) = p_t(x^{-\ell}_t)\, q^{\ell}_{t|0}(y^{\ell} \mid \hat{x}^{\mathrm{loo}}_{0|t}(x_t)$...

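[52] $\mathrm{Cat}(x^{\ell}_{t_{i+1}}; \mathbf{1}/K) + \mathbb{1}_{\tau^{\ell} > t_{i+1}} \sum_{x^{\ell}_{t_{i+1}}} \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}_{t_{i+1}})\, \mathrm{Cat}(x^{\ell}_{t_{i+1}}; x^{\ell}_0) = \mathbb{1}_{\tau^{\ell} \le t_i}\, \mathrm{Cat}(x^{\ell}_{t_i}; \mathbf{1}/K) + \mathbb{1}_{t_i < \tau^{\ell} \le t_{i+1}}\, \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}$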

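[53] $+\, \mathbb{1}_{\tau^{\ell} > t_{i+1}}\, \mathrm{Cat}(x^{\ell}_{t_i}; x^{\ell}_0) = \mathrm{Cat}\big(x^{\ell}_{t_i};\, \mathbb{1}_{\tau^{\ell} > t_i}\, x^{\ell}_0 + \mathbb{1}_{\tau^{\ell} \le t_i}\, \mathbf{1}/K\big)$. This proves the induction step. Hence (50) holds for every time of the grid, and therefore for every $t$.

Lemma 3. If
$$\tilde{x}_t(\tau)^{\ell} := \begin{cases} x^{\ell}_t & \text{if } \tau^{\ell} > t, \\ m & \text{if } \tau^{\ell} \le t, \end{cases}$$
then $p_{0|t}(x_0 \mid x_t, \tau) = p^{\mathrm{mask}}_{0|t}(x_0 \mid \tilde{x}_t(\tau))$.

Proof. By Bayes’ rule and (50), $p_{0|t}(x_0 \mid x_t, \tau) = p_0(x_0)\, q_{t|0}(x_t \mid x_0, \tau) \,/\, \sum_{\tilde{x}_0} p_0(\tilde{x}_0)\, q_{t|0}(x_t \mid \tilde{x}$...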
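The map $\tilde{x}_t(\tau)$ in Lemma 3 is a position-wise masking operation. A minimal sketch follows, assuming tokens are stored as integers and the mask symbol $m$ is represented by a reserved id; the name `mask_noised_positions`, the value `MASK_ID = -1`, and the example arrays are hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical integer id standing in for the mask symbol m

def mask_noised_positions(x_t, tau, t, mask_id=MASK_ID):
    """Keep x_t where tau > t (still clean under (50)); mask where tau <= t."""
    x_t = np.asarray(x_t)
    tau = np.asarray(tau)
    return np.where(tau > t, x_t, mask_id)

x_t = np.array([5, 2, 7, 1])
tau = np.array([0.9, 0.3, 0.5, 0.7])
x_tilde = mask_noised_positions(x_t, tau, t=0.5)
```

Positions whose jump time is at most $t$ carry uniform noise under (50), so replacing them with the mask symbol discards no information about $x_0$; this is the content of the lemma.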

[54] The law $j_{s|0}$ exactly reweights these two possibilities. The lifted reverse transition is then
$$\bar{p}_{s|t}(x_s, \tau_s \mid x_t, \tau_t) := \sum_{x_0} j_{s|0}(\tau_s \mid x_0, x_s)\, q_{s|0,t}(x_s \mid x_0, x_t)\, p_{0|t}(x_0 \mid x_t, \tau_t). \tag{53}$$
For a grid $0 = t_0 < \cdots < t_n = 1$, let $\bar{p}_{0:n}$ denote the corresponding path law, with initialization $\bar{p}_{t_n}(x_{t_n}, \tau_{t_n}) := p_{t_n}(x_{t_n})\, j_{t_n}(\tau_{t_n} \mid x_{t_n})$. If $\alpha_{t_n} = 0$, this reduces to $X_{t_n} \sim \upsilon^{\otimes L}$ a...

[55] and later used by [38]. The idea is to combine two autoregressive streams, one left-to-right and one right-to-left, and to offset the representations so that the output at position $\ell$ never attends to the input token at the same position, while still depending on all the other positions. Equivalently, in the continuous relaxation used to describe the archite...
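The offset trick can be illustrated with a toy stand-in for the two streams. The sketch below is not the architecture from the paper: it assumes cumulative sums in place of causal attention, purely to show how shifting each stream by one position removes the dependence of output $\ell$ on input $\ell$. The function name `loo_features` and the test arrays are hypothetical.

```python
import numpy as np

def loo_features(x):
    """x: (L, d) per-position inputs. Returns (L, d) features in which row i
    depends on every input row except row i."""
    L, d = x.shape
    fwd = np.cumsum(x, axis=0)              # fwd[i] summarizes x[0..i] (left-to-right)
    bwd = np.cumsum(x[::-1], axis=0)[::-1]  # bwd[i] summarizes x[i..L-1] (right-to-left)
    zero = np.zeros((1, d))
    left = np.vstack([zero, fwd[:-1]])      # left[i] = fwd[i-1], covers x[0..i-1]
    right = np.vstack([bwd[1:], zero])      # right[i] = bwd[i+1], covers x[i+1..L-1]
    return left + right

x = np.arange(12.0).reshape(4, 3)
x_perturbed = x.copy()
x_perturbed[2] += 100.0                     # change only the input at position 2

out = loo_features(x)
out_perturbed = loo_features(x_perturbed)
```

Perturbing the input at position 2 leaves the output at position 2 unchanged while every other output row moves, which is the leave-one-out property the offset is meant to enforce.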

[56] at a fixed checkpoint, and plotting generative perplexity against the resulting entropy. For top-$p$ frontiers we use $p \in \{0.80, 0.85, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00\}$ and $\mathrm{NFE} \in \{8, 16, 32, 64, 128, 256, 512, 1024\}$. For temperature frontiers we use $T \in \{0.80, 0.82, \ldots, 1.10\}$ over the same NFE grid. The predictor-corrector experiments are run on OWT with the confidence-based ...
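The sweep described above can be enumerated as a plain configuration grid. This is only a sketch: it assumes the temperature grid advances in steps of 0.02, as the listed values 0.80, 0.82 suggest, and the variable names are hypothetical.

```python
from itertools import product

TOP_P = [0.80, 0.85, 0.90, 0.92, 0.94, 0.96, 0.98, 1.00]
NFE = [8, 16, 32, 64, 128, 256, 512, 1024]

# Assumed step of 0.02 from 0.80 up to 1.10; rounding avoids float drift.
TEMPERATURES = [round(0.80 + 0.02 * i, 2) for i in range(16)]

# One (sampling parameter, NFE) pair per point on each frontier.
top_p_runs = list(product(TOP_P, NFE))
temperature_runs = list(product(TEMPERATURES, NFE))
```

Under this assumption the top-$p$ frontier has 64 configurations and the temperature frontier has 128.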